Thank you, Mich. I addressed your point in the SPIP doc.

Kazu

> On Feb 1, 2023, at 2:04 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> In your statement on Q2 in the SPIP, you mention, and I quote:
>
> "... File formats other than Parquet are beyond the scope of this SPIP.."
>
> It is important that you explain why you chose Parquet for this work. Apache
> Parquet <https://parquet.apache.org/> is an open source column-oriented data
> format that is widely used in the Apache Hadoop ecosystem and beyond. It is
> designed for efficient data storage and retrieval. Many data warehouses
> prefer to store data in external storage in Parquet format. As an ETL
> workload for Spark, it makes sense to optimise data retrieval as much as
> possible.
>
> HTH
>
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss,
> damage or destruction of data or any other property which may arise from
> relying on this email's technical content is explicitly disclaimed. The
> author will in no case be liable for any monetary damages arising from such
> loss, damage or destruction.
>
> On Tue, 31 Jan 2023 at 17:35, kazuyuki tanimura <ktanim...@apple.com.invalid> wrote:
>
> Hi everyone,
>
> I would like to start a discussion on "Lazy Materialization for Parquet Read
> Performance Improvement".
>
> Chao and I propose a Parquet reader with lazy materialization. For Spark-SQL
> filter operations, evaluating the filters first and lazily materializing only
> the values that are actually used can avoid wasted computation and improve
> read performance. The current implementation of Spark requires the read
> values to be materialized (i.e. decompressed, decoded, etc.) into memory
> before the filters are applied, even though the filters may eventually throw
> away many of those values.
>
> Our design doc is linked below.
>
> SPIP Jira: https://issues.apache.org/jira/browse/SPARK-42256
> SPIP Doc:
> https://docs.google.com/document/d/1Kr3y2fVZUbQXGH0y8AvdCAeWC49QJjpczapiaDvFzME
>
> Liang-Chi was kind enough to shepherd this effort.
>
> Thank you
> Kazu
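
For readers new to the idea, the difference between eager and lazy materialization described in the proposal can be illustrated with a small toy sketch. This is not the SPIP's design and not Spark's Parquet reader: the names (`encode_column`, `PageReader`, `eager_scan`, `lazy_scan`, `PAGE_SIZE`) are hypothetical, zlib-compressed JSON stands in for Parquet's real encodings, and pages are assumed to be row-aligned across columns, which real Parquet files do not guarantee.

```python
import json
import zlib

PAGE_SIZE = 4  # rows per page; hypothetical and tiny, for illustration only


def encode_column(values):
    """Split a column into pages and compress each page independently."""
    return [
        zlib.compress(json.dumps(values[i:i + PAGE_SIZE]).encode())
        for i in range(0, len(values), PAGE_SIZE)
    ]


class PageReader:
    """Decodes pages on demand and counts how many were materialized."""

    def __init__(self, pages):
        self.pages = pages
        self.decoded = 0

    def read_page(self, page_idx):
        self.decoded += 1
        return json.loads(zlib.decompress(self.pages[page_idx]))


def eager_scan(filter_col, payload_col, predicate):
    """Materialize both columns fully, then apply the filter."""
    keys = [v for p in range(len(filter_col.pages)) for v in filter_col.read_page(p)]
    vals = [v for p in range(len(payload_col.pages)) for v in payload_col.read_page(p)]
    return [(k, v) for k, v in zip(keys, vals) if predicate(k)]


def lazy_scan(filter_col, payload_col, predicate):
    """Evaluate the filter first; decode only payload pages holding selected rows."""
    result = []
    for p in range(len(filter_col.pages)):
        keys = filter_col.read_page(p)
        hits = [i for i, k in enumerate(keys) if predicate(k)]
        if not hits:
            continue  # the matching payload page is never decompressed
        vals = payload_col.read_page(p)
        result.extend((keys[i], vals[i]) for i in hits)
    return result
```

With 16 rows (`ids = list(range(16))`, a payload column of strings) and the predicate `lambda k: k >= 13`, both scans return the same three rows, but `eager_scan` decodes all 4 payload pages while `lazy_scan` decodes only 1. The more selective the filter, the more decompression and decoding work the lazy path skips.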