Re: [DISCUSS] SPIP: Lazy Materialization for Parquet Read Performance Improvement

2023-02-13 Thread L. C. Hsieh
Hi Mich,

The title of this thread is "[DISCUSS]". We need to have a public
discussion on a SPIP proposal collecting comments before we can move
forward to call for a vote on it.


On Mon, Feb 13, 2023 at 2:35 PM Mich Talebzadeh 
wrote:

> Hi,
>
> I thought we already voted to go ahead with this proposal!
>
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Mon, 13 Feb 2023 at 20:41, kazuyuki tanimura 
> wrote:
>
>> Thank you Liang-Chi!
>>
>> Kazu
>>
>> On Feb 11, 2023, at 7:12 PM, L. C. Hsieh  wrote:
>>
>> Thanks all for your feedback.
>>
>> Given this positive feedback, if there is no other comments/discussion, I
>> will go to start a vote in the next few days.
>>
>> Thank you again!
>>
>> On Thu, Feb 2, 2023 at 10:12 AM kazuyuki tanimura <
>> ktanim...@apple.com.invalid> wrote:
>>
>>> Thank you all for +1s and reviewing the SPIP doc.
>>>
>>> Kazu
>>>
>>> On Feb 1, 2023, at 1:28 AM, Dongjoon Hyun 
>>> wrote:
>>>
>>> +1
>>>

Re: [DISCUSS] SPIP: Lazy Materialization for Parquet Read Performance Improvement

2023-02-13 Thread Mich Talebzadeh
Hi,

I thought we already voted to go ahead with this proposal!




On Mon, 13 Feb 2023 at 20:41, kazuyuki tanimura  wrote:

> Thank you Liang-Chi!
>
> Kazu
>
> On Feb 11, 2023, at 7:12 PM, L. C. Hsieh  wrote:
>
> Thanks all for your feedback.
>
> Given this positive feedback, if there is no other comments/discussion, I
> will go to start a vote in the next few days.
>
> Thank you again!
>
> On Thu, Feb 2, 2023 at 10:12 AM kazuyuki tanimura <
> ktanim...@apple.com.invalid> wrote:
>
>> Thank you all for +1s and reviewing the SPIP doc.
>>
>> Kazu
>>
>> On Feb 1, 2023, at 1:28 AM, Dongjoon Hyun 
>> wrote:
>>
>> +1
>>
>> On Wed, Feb 1, 2023 at 12:52 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> +1

Re: [DISCUSS] SPIP: Lazy Materialization for Parquet Read Performance Improvement

2023-02-13 Thread kazuyuki tanimura
Thank you Liang-Chi!

Kazu

> On Feb 11, 2023, at 7:12 PM, L. C. Hsieh  wrote:
> 
> Thanks all for your feedback.
> 
> Given this positive feedback, if there is no other comments/discussion, I 
> will go to start a vote in the next few days.
> 
> Thank you again!
> 
> On Thu, Feb 2, 2023 at 10:12 AM kazuyuki tanimura 
>  wrote:
> Thank you all for +1s and reviewing the SPIP doc.
> 
> Kazu
> 

Re: [DISCUSS] SPIP: Lazy Materialization for Parquet Read Performance Improvement

2023-02-11 Thread L. C. Hsieh
Thanks all for your feedback.

Given this positive feedback, if there is no other comments/discussion, I
will go to start a vote in the next few days.

Thank you again!

On Thu, Feb 2, 2023 at 10:12 AM kazuyuki tanimura
 wrote:

> Thank you all for +1s and reviewing the SPIP doc.
>
> Kazu
>
> On Feb 1, 2023, at 1:28 AM, Dongjoon Hyun  wrote:
>
> +1
>
> On Wed, Feb 1, 2023 at 12:52 AM Mich Talebzadeh 
> wrote:
>
>> +1
>>
>>
>> On Wed, 1 Feb 2023 at 02:23, huaxin gao  wrote:
>>
>>> +1
>>>

Re: [DISCUSS] SPIP: Lazy Materialization for Parquet Read Performance Improvement

2023-02-02 Thread kazuyuki tanimura
Thank you all for +1s and reviewing the SPIP doc.

Kazu

> On Feb 1, 2023, at 1:28 AM, Dongjoon Hyun  wrote:
> 
> +1
> 

Re: [DISCUSS] SPIP: Lazy Materialization for Parquet Read Performance Improvement

2023-02-02 Thread kazuyuki tanimura
Thank you Mich. I addressed your point on the SPIP doc.

Kazu

> On Feb 1, 2023, at 2:04 AM, Mich Talebzadeh  wrote:
> 
> 
> In your statement on Q2 in SPIP, you mention and I quote
> 
> "... File formats other than Parquet are beyond the scope of this SPIP.."
> 
> It is important that you explain why you chose Parquet for this work. Apache 
> Parquet is an open source column-oriented data format that is widely used in 
> the Apache Hadoop ecosystem and beyond. It is designed for efficient data 
> storage and retrieval. Many data warehouses prefer to store data in external 
> storage in Parquet format. For ETL workloads on Spark, it makes sense to 
> optimise data retrieval as much as possible.
> 
> HTH
> 
> On Tue, 31 Jan 2023 at 17:35, kazuyuki tanimura  
> wrote:
> Hi everyone,
> 
> I would like to start a discussion on “Lazy Materialization for Parquet Read 
> Performance Improvement”.
> 
> Chao and I propose a Parquet reader with lazy materialization. For Spark SQL 
> filter operations, evaluating the filters first and lazily materializing only 
> the values that are used can avoid wasted computation and improve read 
> performance.
> The current implementation of Spark requires the values being read to be 
> materialized (i.e. decompressed, decoded, etc.) in memory before the filters 
> are applied, even though the filters may eventually throw away many of them.
> 
> Our design doc and Jira ticket are linked below.
> SPIP Jira: https://issues.apache.org/jira/browse/SPARK-42256
> SPIP Doc: 
> https://docs.google.com/document/d/1Kr3y2fVZUbQXGH0y8AvdCAeWC49QJjpczapiaDvFzME
> 
> Liang-Chi was kind enough to shepherd this effort. 
> 
> Thank you
> Kazu
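
The idea in the proposal quoted above (evaluate the filter first, then
materialize only what survives) can be illustrated with a toy model. The
following is a minimal sketch in plain Python, not Spark's Parquet reader and
not the design in the SPIP doc. The names encode, decode, eager_read,
lazy_read and make_row_groups are hypothetical, and zlib-compressed JSON is
only a stand-in for Parquet's encoding and compression.

```python
import json
import zlib


def encode(values):
    """Stand-in for encoding + compressing one column chunk (assumption)."""
    return zlib.compress(json.dumps(values).encode("utf-8"))


def decode(blob, stats):
    """Stand-in for decompress + decode; counts how many chunks were decoded."""
    stats["decodes"] += 1
    return json.loads(zlib.decompress(blob).decode("utf-8"))


def eager_read(row_groups, filter_col, predicate, stats):
    """Decode every column of every row group, then apply the filter."""
    rows = []
    for group in row_groups:
        cols = {name: decode(blob, stats) for name, blob in group.items()}
        for i, value in enumerate(cols[filter_col]):
            if predicate(value):
                rows.append({name: vals[i] for name, vals in cols.items()})
    return rows


def lazy_read(row_groups, filter_col, predicate, stats):
    """Decode the filter column first; decode the rest only if rows survive."""
    rows = []
    for group in row_groups:
        filter_vals = decode(group[filter_col], stats)
        selected = [i for i, v in enumerate(filter_vals) if predicate(v)]
        if not selected:
            continue  # no row passes: the other columns are never decoded
        cols = {filter_col: filter_vals}
        for name, blob in group.items():
            if name != filter_col:
                cols[name] = decode(blob, stats)
        # Build output rows only for the selected positions.
        rows.extend({name: cols[name][i] for name in group} for i in selected)
    return rows


def make_row_groups():
    """Two tiny hypothetical row groups with an id and a payload column."""
    return [
        {"id": encode([1, 2, 3]), "payload": encode(["a", "b", "c"])},
        {"id": encode([10, 11, 12]), "payload": encode(["x", "y", "z"])},
    ]


if __name__ == "__main__":
    groups = make_row_groups()
    eager_stats, lazy_stats = {"decodes": 0}, {"decodes": 0}
    eager_rows = eager_read(groups, "id", lambda v: v >= 10, eager_stats)
    lazy_rows = lazy_read(groups, "id", lambda v: v >= 10, lazy_stats)
    print(eager_rows == lazy_rows, eager_stats, lazy_stats)
```

With this toy data and the filter id >= 10, both readers return the same
rows, but the eager path decodes all four column chunks while the lazy path
decodes three, because it skips the payload chunk of the row group in which
no row passes the filter. A real reader would likely work at a finer
granularity than whole row groups; the sketch only shows the ordering of
filtering and materialization.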



Re: [DISCUSS] SPIP: Lazy Materialization for Parquet Read Performance Improvement

2023-02-01 Thread Mich Talebzadeh
In your statement on Q2 in SPIP, you mention and I quote


"... File formats other than Parquet are beyond the scope of this SPIP.."


It is important that you explain why you chose Parquet for this work. Apache
Parquet is an open source *column-oriented data format* that is widely used
in the Apache Hadoop ecosystem and beyond. It is designed for efficient data
storage and retrieval. Many data warehouses prefer to store data in external
storage in Parquet format. For ETL workloads on Spark, it makes sense to
optimise data retrieval as much as possible.

HTH



On Tue, 31 Jan 2023 at 17:35, kazuyuki tanimura 
wrote:

> Hi everyone,
>
> I would like to start a discussion on “Lazy Materialization for Parquet
> Read Performance Improvement”.
>
> Chao and I propose a Parquet reader with lazy materialization. For
> Spark SQL filter operations, evaluating the filters first and lazily
> materializing only the values that are used can avoid wasted computation
> and improve read performance.
> The current implementation of Spark requires the values being read to be
> materialized (i.e. decompressed, decoded, etc.) in memory before the
> filters are applied, even though the filters may eventually throw away
> many of them.
>
> We made our design doc as follows.
> SPIP Jira: https://issues.apache.org/jira/browse/SPARK-42256
> SPIP Doc:
> https://docs.google.com/document/d/1Kr3y2fVZUbQXGH0y8AvdCAeWC49QJjpczapiaDvFzME
>
> Liang-Chi was kind enough to shepherd this effort.
>
> Thank you
> Kazu
>


Re: [DISCUSS] SPIP: Lazy Materialization for Parquet Read Performance Improvement

2023-02-01 Thread Dongjoon Hyun
+1

On Wed, Feb 1, 2023 at 12:52 AM Mich Talebzadeh 
wrote:

> +1
>
>
>
>
> On Wed, 1 Feb 2023 at 02:23, huaxin gao  wrote:
>
>> +1
>>
>> On Tue, Jan 31, 2023 at 6:10 PM DB Tsai  wrote:
>>
>>> +1
>>>
>>> Sent from my iPhone
>>>
>>> On Jan 31, 2023, at 4:16 PM, Yuming Wang  wrote:
>>>
>>> 
>>> +1.
>>>

Re: [DISCUSS] SPIP: Lazy Materialization for Parquet Read Performance Improvement

2023-02-01 Thread Mich Talebzadeh
+1




On Wed, 1 Feb 2023 at 02:23, huaxin gao  wrote:

> +1
>
> On Tue, Jan 31, 2023 at 6:10 PM DB Tsai  wrote:
>
>> +1
>>
>> Sent from my iPhone
>>
>> On Jan 31, 2023, at 4:16 PM, Yuming Wang  wrote:
>>
>> 
>> +1.
>>
>> On Wed, Feb 1, 2023 at 7:42 AM kazuyuki tanimura
>>  wrote:
>>
>>> Great! Much appreciated, Mitch!
>>>
>>> Kazu
>>>
>>> On Jan 31, 2023, at 3:07 PM, Mich Talebzadeh 
>>> wrote:
>>>
>>> Thanks, Kazu.
>>>
>>> I followed that template link and indeed as you pointed out it is a
>>> common template. If it works then it is what it is.
>>>
>>> I will be going through your design proposals and hopefully we can
>>> review it.
>>>
>>> Regards,
>>>
>>> Mich
>>>
>>>
>>>
>>> On Tue, 31 Jan 2023 at 22:34, kazuyuki tanimura 
>>> wrote:
>>>
>>>> Thank you Mich. I followed the instruction at
>>>> https://spark.apache.org/improvement-proposals.html and used its
>>>> template.
>>>> While we are open to revise our design doc, it seems more like you are
>>>> proposing the community to change the instruction per se?
>>>>
>>>> Kazu
>>>>
>>>> On Jan 31, 2023, at 11:24 AM, Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Thanks for these proposals. good suggestions. Is this style of breaking
>>>> down your approach standard?
>>>>
>>>> My view would be that perhaps it makes more sense to follow the
>>>> industry established approach of breaking down
>>>> your technical proposal  into:
>>>>
>>>>
>>>>    1. Background
>>>>    2. Objective
>>>>    3. Scope
>>>>    4. Constraints
>>>>    5. Assumptions
>>>>    6. Reporting
>>>>    7. Deliverables
>>>>    8. Timelines
>>>>    9. Appendix
>>>>
>>>> Your current approach using below
>>>>
>>>> Q1. What are you trying to do? Articulate your objectives using
>>>> absolutely no jargon. What are you trying to achieve?
>>>> Q2. What problem is this proposal NOT designed to solve? What issues
>>>> the suggested proposal is not going to address
>>>> Q3. How is it done today, and what are the limits of current practice?
>>>> Q4. What is new in your approach approach and why do you think it will be
>>>> successful succeed?
>>>> Q5. Who cares? If you are successful, what difference will it make? If
>>>> your proposal succeeds, what tangible benefits will it add?
>>>> Q6. What are the risks?
>>>> Q7. How long will it take?
>>>> Q8. What are the midterm and final “exams” to check for success?
>>>>
>>>>
>>>> May not do  justice to your proposal.
>>>>
>>>> HTH
>>>>
>>>> Mich
>>>>
view my Linkedin profile
 


  https://en.everybodywiki.com/Mich_Talebzadeh


 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.




 On Tue, 31 Jan 2023 at 17:35, kazuyuki tanimura <
 ktanim...@apple.com.invalid> wrote:

> Hi everyone,
>
> I would like to start a discussion on “Lazy Materialization for
> Parquet Read Performance Improvement"
>
> Chao and I propose a Parquet reader with lazy materialization. For
> Spark-SQL filter operations, evaluating the filters first and lazily
> materializing only the used values can save computation wastes and improve
> the read performance.
> The current implementation of Spark requires the read values to
> materialize (i.e. decompress, de-code, etc...) onto memory first before
> applying the filters even though the filters may eventually throw away 
> many
> values.
>
> We made our design doc as follows.
> SPIP Jira: https://issues.apache.org/jira/browse/SPARK-42256
> SPIP Doc:
> https://docs.google.com/document/d/1Kr3y2fVZUbQXGH0y8AvdCAeWC49QJjpczapiaDvFzME
>
> Lia

Re: [DISCUSS] SPIP: Lazy Materialization for Parquet Read Performance Improvement

2023-01-31 Thread huaxin gao
+1

On Tue, Jan 31, 2023 at 6:10 PM DB Tsai  wrote:

> +1
>
> Sent from my iPhone
>
> On Jan 31, 2023, at 4:16 PM, Yuming Wang  wrote:
>
> 
> +1.
>
> On Wed, Feb 1, 2023 at 7:42 AM kazuyuki tanimura
>  wrote:
>
>> Great! Much appreciated, Mich!
>>
>> Kazu
>>
>> On Jan 31, 2023, at 3:07 PM, Mich Talebzadeh 
>> wrote:
>>
>> Thanks, Kazu.
>>
>> I followed that template link and indeed as you pointed out it is a
>> common template. If it works then it is what it is.
>>
>> I will be going through your design proposals and hopefully we can review
>> it.
>>
>> Regards,
>>
>> Mich
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Tue, 31 Jan 2023 at 22:34, kazuyuki tanimura 
>> wrote:
>>
>>> Thank you Mich. I followed the instruction at
>>> https://spark.apache.org/improvement-proposals.html and used its
>>> template.
>>> While we are open to revise our design doc, it seems more like you are
>>> proposing the community to change the instruction per se?
>>>
>>> Kazu
>>>
>>> On Jan 31, 2023, at 11:24 AM, Mich Talebzadeh 
>>> wrote:
>>>
>>> Hi,
>>>
>>> Thanks for these proposals. good suggestions. Is this style of breaking
>>> down your approach standard?
>>>
>>> My view would be that perhaps it makes more sense to follow the industry
>>> established approach of breaking down your technical proposal  into:
>>>
>>>
>>>1. Background
>>>2. Objective
>>>3. Scope
>>>4. Constraints
>>>5. Assumptions
>>>6. Reporting
>>>7. Deliverables
>>>8. Timelines
>>>9. Appendix
>>>
>>> Your current approach using below
>>>
>>> Q1. What are you trying to do? Articulate your objectives using
>>> absolutely no jargon. What are you trying to achieve?
>>> Q2. What problem is this proposal NOT designed to solve? What issues
>>> the suggested proposal is not going to address
>>> Q3. How is it done today, and what are the limits of current practice?
>>> Q4. What is new in your approach and why do you think it will be
>>> successful?
>>> Q5. Who cares? If you are successful, what difference will it make? If
>>> your proposal succeeds, what tangible benefits will it add?
>>> Q6. What are the risks?
>>> Q7. How long will it take?
>>> Q8. What are the midterm and final “exams” to check for success?
>>>
>>>
>>> May not do  justice to your proposal.
>>>
>>> HTH
>>>
>>> Mich
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Tue, 31 Jan 2023 at 17:35, kazuyuki tanimura <
>>> ktanim...@apple.com.invalid> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> I would like to start a discussion on “Lazy Materialization for Parquet
>>>> Read Performance Improvement"
>>>>
>>>> Chao and I propose a Parquet reader with lazy materialization. For
>>>> Spark-SQL filter operations, evaluating the filters first and lazily
>>>> materializing only the used values can save computation wastes and improve
>>>> the read performance.
>>>> The current implementation of Spark requires the read values to
>>>> materialize (i.e. decompress, de-code, etc...) onto memory first before
>>>> applying the filters even though the filters may eventually throw away many
>>>> values.
>>>>
>>>> We made our design doc as follows.
>>>> SPIP Jira: https://issues.apache.org/jira/browse/SPARK-42256
>>>> SPIP Doc:
>>>> https://docs.google.com/document/d/1Kr3y2fVZUbQXGH0y8AvdCAeWC49QJjpczapiaDvFzME
>>>>
>>>> Liang-Chi was kind enough to shepherd this effort.
>>>>
>>>> Thank you
>>>> Kazu

>>>
>>>
>>


Re: [DISCUSS] SPIP: Lazy Materialization for Parquet Read Performance Improvement

2023-01-31 Thread DB Tsai
+1

Sent from my iPhone

On Jan 31, 2023, at 4:16 PM, Yuming Wang  wrote:

+1.

On Wed, Feb 1, 2023 at 7:42 AM kazuyuki tanimura  wrote:

> Great! Much appreciated, Mich!
>
> Kazu
>
> On Jan 31, 2023, at 3:07 PM, Mich Talebzadeh  wrote:
>
> Thanks, Kazu.
>
> I followed that template link and indeed as you pointed out it is a common
> template. If it works then it is what it is.
>
> I will be going through your design proposals and hopefully we can review
> it.
>
> Regards,
>
> Mich
>
> view my Linkedin profile
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
> On Tue, 31 Jan 2023 at 22:34, kazuyuki tanimura  wrote:
>
>> Thank you Mich. I followed the instruction at
>> https://spark.apache.org/improvement-proposals.html and used its
>> template.
>> While we are open to revise our design doc, it seems more like you are
>> proposing the community to change the instruction per se?
>>
>> Kazu
>>
>> On Jan 31, 2023, at 11:24 AM, Mich Talebzadeh  wrote:
>>
>> Hi,
>>
>> Thanks for these proposals. good suggestions. Is this style of breaking
>> down your approach standard?
>>
>> My view would be that perhaps it makes more sense to follow the industry
>> established approach of breaking down your technical proposal into:
>>
>>    1. Background
>>    2. Objective
>>    3. Scope
>>    4. Constraints
>>    5. Assumptions
>>    6. Reporting
>>    7. Deliverables
>>    8. Timelines
>>    9. Appendix
>>
>> Your current approach using below
>>
>> Q1. What are you trying to do? Articulate your objectives using
>> absolutely no jargon. What are you trying to achieve?
>> Q2. What problem is this proposal NOT designed to solve? What issues the
>> suggested proposal is not going to address
>> Q3. How is it done today, and what are the limits of current practice?
>> Q4. What is new in your approach and why do you think it will be
>> successful?
>> Q5. Who cares? If you are successful, what difference will it make? If
>> your proposal succeeds, what tangible benefits will it add?
>> Q6. What are the risks?
>> Q7. How long will it take?
>> Q8. What are the midterm and final “exams” to check for success?
>>
>> May not do justice to your proposal.
>>
>> HTH
>>
>> Mich
>>
>> view my Linkedin profile
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>> Disclaimer: Use it at your own risk. Any and all responsibility for any
>> loss, damage or destruction of data or any other property which may arise
>> from relying on this email's technical content is explicitly disclaimed.
>> The author will in no case be liable for any monetary damages arising from
>> such loss, damage or destruction.
>>
>> On Tue, 31 Jan 2023 at 17:35, kazuyuki tanimura  wrote:
>>
>>> Hi everyone,
>>>
>>> I would like to start a discussion on “Lazy Materialization for Parquet
>>> Read Performance Improvement"
>>>
>>> Chao and I propose a Parquet reader with lazy materialization. For
>>> Spark-SQL filter operations, evaluating the filters first and lazily
>>> materializing only the used values can save computation wastes and improve
>>> the read performance.
>>> The current implementation of Spark requires the read values to
>>> materialize (i.e. decompress, de-code, etc...) onto memory first before
>>> applying the filters even though the filters may eventually throw away many
>>> values.
>>>
>>> We made our design doc as follows.
>>> SPIP Jira: https://issues.apache.org/jira/browse/SPARK-42256
>>> SPIP Doc:
>>> https://docs.google.com/document/d/1Kr3y2fVZUbQXGH0y8AvdCAeWC49QJjpczapiaDvFzME
>>>
>>> Liang-Chi was kind enough to shepherd this effort.
>>>
>>> Thank you
>>> Kazu




Re: [DISCUSS] SPIP: Lazy Materialization for Parquet Read Performance Improvement

2023-01-31 Thread Yuming Wang
+1.

On Wed, Feb 1, 2023 at 7:42 AM kazuyuki tanimura
 wrote:

> Great! Much appreciated, Mich!
>
> Kazu
>
> On Jan 31, 2023, at 3:07 PM, Mich Talebzadeh 
> wrote:
>
> Thanks, Kazu.
>
> I followed that template link and indeed as you pointed out it is a common
> template. If it works then it is what it is.
>
> I will be going through your design proposals and hopefully we can review
> it.
>
> Regards,
>
> Mich
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 31 Jan 2023 at 22:34, kazuyuki tanimura 
> wrote:
>
>> Thank you Mich. I followed the instruction at
>> https://spark.apache.org/improvement-proposals.html and used its
>> template.
>> While we are open to revise our design doc, it seems more like you are
>> proposing the community to change the instruction per se?
>>
>> Kazu
>>
>> On Jan 31, 2023, at 11:24 AM, Mich Talebzadeh 
>> wrote:
>>
>> Hi,
>>
>> Thanks for these proposals. good suggestions. Is this style of breaking
>> down your approach standard?
>>
>> My view would be that perhaps it makes more sense to follow the industry
>> established approach of breaking down your technical proposal  into:
>>
>>
>>1. Background
>>2. Objective
>>3. Scope
>>4. Constraints
>>5. Assumptions
>>6. Reporting
>>7. Deliverables
>>8. Timelines
>>9. Appendix
>>
>> Your current approach using below
>>
>> Q1. What are you trying to do? Articulate your objectives using
>> absolutely no jargon. What are you trying to achieve?
>> Q2. What problem is this proposal NOT designed to solve? What issues the
>> suggested proposal is not going to address
>> Q3. How is it done today, and what are the limits of current practice?
>> Q4. What is new in your approach and why do you think it will be
>> successful?
>> Q5. Who cares? If you are successful, what difference will it make? If
>> your proposal succeeds, what tangible benefits will it add?
>> Q6. What are the risks?
>> Q7. How long will it take?
>> Q8. What are the midterm and final “exams” to check for success?
>>
>>
>> May not do  justice to your proposal.
>>
>> HTH
>>
>> Mich
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Tue, 31 Jan 2023 at 17:35, kazuyuki tanimura <
>> ktanim...@apple.com.invalid> wrote:
>>
>>> Hi everyone,
>>>
>>> I would like to start a discussion on “Lazy Materialization for Parquet
>>> Read Performance Improvement"
>>>
>>> Chao and I propose a Parquet reader with lazy materialization. For
>>> Spark-SQL filter operations, evaluating the filters first and lazily
>>> materializing only the used values can save computation wastes and improve
>>> the read performance.
>>> The current implementation of Spark requires the read values to
>>> materialize (i.e. decompress, de-code, etc...) onto memory first before
>>> applying the filters even though the filters may eventually throw away many
>>> values.
>>>
>>> We made our design doc as follows.
>>> SPIP Jira: https://issues.apache.org/jira/browse/SPARK-42256
>>> SPIP Doc:
>>> https://docs.google.com/document/d/1Kr3y2fVZUbQXGH0y8AvdCAeWC49QJjpczapiaDvFzME
>>>
>>> Liang-Chi was kind enough to shepherd this effort.
>>>
>>> Thank you
>>> Kazu
>>>
>>
>>
>


Re: [DISCUSS] SPIP: Lazy Materialization for Parquet Read Performance Improvement

2023-01-31 Thread kazuyuki tanimura
Great! Much appreciated, Mich!

Kazu

> On Jan 31, 2023, at 3:07 PM, Mich Talebzadeh  
> wrote:
> 
> Thanks, Kazu.
> 
> I followed that template link and indeed as you pointed out it is a common 
> template. If it works then it is what it is.
> 
> I will be going through your design proposals and hopefully we can review it.
> 
> Regards,
> 
> Mich
> 
> 
>view my Linkedin profile 
> 
> 
>  https://en.everybodywiki.com/Mich_Talebzadeh 
> 
>  
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> 
> On Tue, 31 Jan 2023 at 22:34, kazuyuki tanimura wrote:
> Thank you Mich. I followed the instruction at 
> https://spark.apache.org/improvement-proposals.html 
>  and used its template.
> While we are open to revise our design doc, it seems more like you are 
> proposing the community to change the instruction per se?
> 
> Kazu
> 
>> On Jan 31, 2023, at 11:24 AM, Mich Talebzadeh wrote:
>> 
>> Hi,
>> 
>> Thanks for these proposals. good suggestions. Is this style of breaking down 
>> your approach standard?
>> 
>> My view would be that perhaps it makes more sense to follow the industry 
>> established approach of breaking down your technical proposal  into:
>> 
>> Background
>> Objective
>> Scope
>> Constraints
>> Assumptions
>> Reporting
>> Deliverables
>> Timelines
>> Appendix
>> Your current approach using below 
>> 
>> Q1. What are you trying to do? Articulate your objectives using absolutely 
>> no jargon. What are you trying to achieve?
>> Q2. What problem is this proposal NOT designed to solve? What issues the 
>> suggested proposal is not going to address
>> Q3. How is it done today, and what are the limits of current practice?
>> Q4. What is new in your approach and why do you think it will be
>> successful?
>> Q5. Who cares? If you are successful, what difference will it make? If your 
>> proposal succeeds, what tangible benefits will it add?
>> Q6. What are the risks?
>> Q7. How long will it take?
>> Q8. What are the midterm and final “exams” to check for success?
>>  
>> May not do  justice to your proposal.
>> 
>> HTH
>> 
>> Mich
>> 
>> 
>>view my Linkedin profile 
>> 
>> 
>>  https://en.everybodywiki.com/Mich_Talebzadeh 
>> 
>>  
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction.
>>  
>> 
>> 
>> On Tue, 31 Jan 2023 at 17:35, kazuyuki tanimura wrote:
>> Hi everyone,
>> 
>> I would like to start a discussion on “Lazy Materialization for Parquet Read 
>> Performance Improvement"
>> 
>> Chao and I propose a Parquet reader with lazy materialization. For Spark-SQL 
>> filter operations, evaluating the filters first and lazily materializing 
>> only the used values can save computation wastes and improve the read 
>> performance.
>> The current implementation of Spark requires the read values to materialize 
>> (i.e. decompress, de-code, etc...) onto memory first before applying the 
>> filters even though the filters may eventually throw away many values.
>> 
>> We made our design doc as follows.
>> SPIP Jira: https://issues.apache.org/jira/browse/SPARK-42256 
>>  
>> SPIP Doc: 
>> https://docs.google.com/document/d/1Kr3y2fVZUbQXGH0y8AvdCAeWC49QJjpczapiaDvFzME
>>  
>> 
>> 
>> Liang-Chi was kind enough to shepherd this effort. 
>> 
>> Thank you
>> Kazu
> 



Re: [DISCUSS] SPIP: Lazy Materialization for Parquet Read Performance Improvement

2023-01-31 Thread Mich Talebzadeh
Thanks, Kazu.

I followed that template link and indeed as you pointed out it is a common
template. If it works then it is what it is.

I will be going through your design proposals and hopefully we can review
it.

Regards,

Mich



   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 31 Jan 2023 at 22:34, kazuyuki tanimura  wrote:

> Thank you Mich. I followed the instruction at
> https://spark.apache.org/improvement-proposals.html and used its template.
> While we are open to revise our design doc, it seems more like you are
> proposing the community to change the instruction per se?
>
> Kazu
>
> On Jan 31, 2023, at 11:24 AM, Mich Talebzadeh 
> wrote:
>
> Hi,
>
> Thanks for these proposals. good suggestions. Is this style of breaking
> down your approach standard?
>
> My view would be that perhaps it makes more sense to follow the industry
> established approach of breaking down your technical proposal  into:
>
>
>1. Background
>2. Objective
>3. Scope
>4. Constraints
>5. Assumptions
>6. Reporting
>7. Deliverables
>8. Timelines
>9. Appendix
>
> Your current approach using below
>
> Q1. What are you trying to do? Articulate your objectives using
> absolutely no jargon. What are you trying to achieve?
> Q2. What problem is this proposal NOT designed to solve? What issues the
> suggested proposal is not going to address
> Q3. How is it done today, and what are the limits of current practice?
> Q4. What is new in your approach and why do you think it will be
> successful?
> Q5. Who cares? If you are successful, what difference will it make? If
> your proposal succeeds, what tangible benefits will it add?
> Q6. What are the risks?
> Q7. How long will it take?
> Q8. What are the midterm and final “exams” to check for success?
>
>
> May not do  justice to your proposal.
>
> HTH
>
> Mich
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 31 Jan 2023 at 17:35, kazuyuki tanimura <
> ktanim...@apple.com.invalid> wrote:
>
>> Hi everyone,
>>
>> I would like to start a discussion on “Lazy Materialization for Parquet
>> Read Performance Improvement"
>>
>> Chao and I propose a Parquet reader with lazy materialization. For
>> Spark-SQL filter operations, evaluating the filters first and lazily
>> materializing only the used values can save computation wastes and improve
>> the read performance.
>> The current implementation of Spark requires the read values to
>> materialize (i.e. decompress, de-code, etc...) onto memory first before
>> applying the filters even though the filters may eventually throw away many
>> values.
>>
>> We made our design doc as follows.
>> SPIP Jira: https://issues.apache.org/jira/browse/SPARK-42256
>> SPIP Doc:
>> https://docs.google.com/document/d/1Kr3y2fVZUbQXGH0y8AvdCAeWC49QJjpczapiaDvFzME
>>
>> Liang-Chi was kind enough to shepherd this effort.
>>
>> Thank you
>> Kazu
>>
>
>


Re: [DISCUSS] SPIP: Lazy Materialization for Parquet Read Performance Improvement

2023-01-31 Thread kazuyuki tanimura
Thank you Mich. I followed the instructions at
https://spark.apache.org/improvement-proposals.html and used its template.
While we are open to revising our design doc, it seems more like you are
proposing that the community change the instructions per se?

Kazu

> On Jan 31, 2023, at 11:24 AM, Mich Talebzadeh wrote:
> 
> Hi,
> 
> Thanks for these proposals. good suggestions. Is this style of breaking down 
> your approach standard?
> 
> My view would be that perhaps it makes more sense to follow the industry 
> established approach of breaking down your technical proposal  into:
> 
> Background
> Objective
> Scope
> Constraints
> Assumptions
> Reporting
> Deliverables
> Timelines
> Appendix
> Your current approach using below 
> 
> Q1. What are you trying to do? Articulate your objectives using absolutely no 
> jargon. What are you trying to achieve?
> Q2. What problem is this proposal NOT designed to solve? What issues the 
> suggested proposal is not going to address
> Q3. How is it done today, and what are the limits of current practice?
> Q4. What is new in your approach and why do you think it will be
> successful?
> Q5. Who cares? If you are successful, what difference will it make? If your 
> proposal succeeds, what tangible benefits will it add?
> Q6. What are the risks?
> Q7. How long will it take?
> Q8. What are the midterm and final “exams” to check for success?
>  
> May not do  justice to your proposal.
> 
> HTH
> 
> Mich
> 
> 
>view my Linkedin profile 
> 
> 
>  https://en.everybodywiki.com/Mich_Talebzadeh 
> 
>  
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> 
> On Tue, 31 Jan 2023 at 17:35, kazuyuki tanimura wrote:
> Hi everyone,
> 
> I would like to start a discussion on “Lazy Materialization for Parquet Read 
> Performance Improvement"
> 
> Chao and I propose a Parquet reader with lazy materialization. For Spark-SQL 
> filter operations, evaluating the filters first and lazily materializing only 
> the used values can save computation wastes and improve the read performance.
> The current implementation of Spark requires the read values to materialize 
> (i.e. decompress, de-code, etc...) onto memory first before applying the 
> filters even though the filters may eventually throw away many values.
> 
> We made our design doc as follows.
> SPIP Jira: https://issues.apache.org/jira/browse/SPARK-42256 
>  
> SPIP Doc: 
> https://docs.google.com/document/d/1Kr3y2fVZUbQXGH0y8AvdCAeWC49QJjpczapiaDvFzME
>  
> 
> 
> Liang-Chi was kind enough to shepherd this effort. 
> 
> Thank you
> Kazu



Re: [DISCUSS] SPIP: Lazy Materialization for Parquet Read Performance Improvement

2023-01-31 Thread Mich Talebzadeh
Hi,

Thanks for these proposals. Good suggestions. Is this style of breaking
down your approach standard?

My view would be that perhaps it makes more sense to follow the
industry-established approach of breaking down your technical proposal into:


   1. Background
   2. Objective
   3. Scope
   4. Constraints
   5. Assumptions
   6. Reporting
   7. Deliverables
   8. Timelines
   9. Appendix

Your current approach, which uses the questions below:

Q1. What are you trying to do? Articulate your objectives using absolutely
no jargon. What are you trying to achieve?

Q2. What problem is this proposal NOT designed to solve? What issues is the
suggested proposal not going to address?

Q3. How is it done today, and what are the limits of current practice?

Q4. What is new in your approach and why do you think it will be
successful?

Q5. Who cares? If you are successful, what difference will it make? If your
proposal succeeds, what tangible benefits will it add?

Q6. What are the risks?

Q7. How long will it take?

Q8. What are the midterm and final “exams” to check for success?


may not do justice to your proposal.

HTH

Mich


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 31 Jan 2023 at 17:35, kazuyuki tanimura 
wrote:

> Hi everyone,
>
> I would like to start a discussion on “Lazy Materialization for Parquet
> Read Performance Improvement"
>
> Chao and I propose a Parquet reader with lazy materialization. For
> Spark-SQL filter operations, evaluating the filters first and lazily
> materializing only the used values can save computation wastes and improve
> the read performance.
> The current implementation of Spark requires the read values to
> materialize (i.e. decompress, de-code, etc...) onto memory first before
> applying the filters even though the filters may eventually throw away many
> values.
>
> We made our design doc as follows.
> SPIP Jira: https://issues.apache.org/jira/browse/SPARK-42256
> SPIP Doc:
> https://docs.google.com/document/d/1Kr3y2fVZUbQXGH0y8AvdCAeWC49QJjpczapiaDvFzME
>
> Liang-Chi was kind enough to shepherd this effort.
>
> Thank you
> Kazu
>


[DISCUSS] SPIP: Lazy Materialization for Parquet Read Performance Improvement

2023-01-31 Thread kazuyuki tanimura
Hi everyone,

I would like to start a discussion on "Lazy Materialization for Parquet Read
Performance Improvement".

Chao and I propose a Parquet reader with lazy materialization. For Spark-SQL
filter operations, evaluating the filters first and lazily materializing only
the values that are actually used can avoid wasted computation and improve
read performance.
The current implementation of Spark requires the values it reads to be
materialized (i.e. decompressed, decoded, etc.) in memory before the filters
are applied, even though the filters may eventually throw away many of those
values.
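
To make the idea concrete, here is a toy sketch in plain Python. It is not
Spark code and not the design in the SPIP doc; EncodedColumn, eager_scan,
lazy_scan and make_columns are hypothetical names used only for illustration,
and "decoding" is faked as a subtraction so that decode calls can be counted.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EncodedColumn:
    """Hypothetical stand-in for an encoded column chunk (not a Spark class)."""
    name: str
    encoded: List[int]
    decode_calls: int = 0

    def decode(self, i: int) -> int:
        # Pretend this is the expensive decompress/decode step.
        self.decode_calls += 1
        return self.encoded[i] - 1000


def eager_scan(filter_col: EncodedColumn,
               payload_cols: List[EncodedColumn],
               predicate: Callable[[int], bool]) -> List[Dict[str, int]]:
    """Decode every value of every column first, then apply the filter."""
    n = len(filter_col.encoded)
    cols = [filter_col] + payload_cols
    decoded = {c.name: [c.decode(i) for i in range(n)] for c in cols}
    keep = [i for i in range(n) if predicate(decoded[filter_col.name][i])]
    return [{name: vals[i] for name, vals in decoded.items()} for i in keep]


def lazy_scan(filter_col: EncodedColumn,
              payload_cols: List[EncodedColumn],
              predicate: Callable[[int], bool]) -> List[Dict[str, int]]:
    """Decode only the filter column, then decode other columns for surviving rows."""
    n = len(filter_col.encoded)
    filter_vals = [filter_col.decode(i) for i in range(n)]
    keep = [i for i in range(n) if predicate(filter_vals[i])]
    rows = []
    for i in keep:
        row = {filter_col.name: filter_vals[i]}
        for c in payload_cols:
            row[c.name] = c.decode(i)  # materialized only for rows that passed
        rows.append(row)
    return rows


def make_columns():
    """Build a fresh 10-row example: an 'id' filter column and one payload column."""
    ids = EncodedColumn("id", [v + 1000 for v in range(10)])
    payload = EncodedColumn("payload", [v * 2 + 1000 for v in range(10)])
    return ids, [payload]


if __name__ == "__main__":
    f, p = make_columns()
    eager_scan(f, p, lambda x: x >= 8)
    print("eager decode calls:", f.decode_calls + sum(c.decode_calls for c in p))

    f, p = make_columns()
    lazy_scan(f, p, lambda x: x >= 8)
    print("lazy decode calls:", f.decode_calls + sum(c.decode_calls for c in p))
```

Both functions return the same rows; the only difference is how many payload
values get decoded, which is the saving described above when the filter is
selective.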

Our design doc is available at the links below.
SPIP Jira: https://issues.apache.org/jira/browse/SPARK-42256
SPIP Doc:
https://docs.google.com/document/d/1Kr3y2fVZUbQXGH0y8AvdCAeWC49QJjpczapiaDvFzME


Liang-Chi was kind enough to shepherd this effort. 

Thank you
Kazu