[DISCUSS] SPIP: Lazy Materialization for Parquet Read Performance Improvement

kazuyuki tanimura Tue, 31 Jan 2023 09:35:03 -0800

Hi everyone,

I would like to start a discussion on "Lazy Materialization for Parquet Read 
Performance Improvement".


Chao and I propose a Parquet reader with lazy materialization. For Spark SQL 
filter operations, evaluating the filters first and lazily materializing only 
the values that are actually used avoids wasted computation and improves read 
performance. The current Spark implementation must first materialize (i.e. 
decompress, decode, etc.) the read values in memory before applying the 
filters, even though the filters may eventually discard many of those values.
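
To make the idea concrete, here is a minimal toy sketch in Python. It is not 
the Spark or Parquet reader code and not the design in the SPIP doc; every 
name in it (PAGE_SIZE, encode_column, eager_read, lazy_read, DecodeCounter) is 
hypothetical, zlib-compressed JSON merely stands in for Parquet encoding, and 
skipping at page granularity is an assumption made for illustration. It only 
shows why decoding the filter column first can reduce the decode work.

```python
import json
import zlib

PAGE_SIZE = 4  # rows per page; hypothetical and tiny, for illustration only


def encode_column(values):
    """Split a column into pages and compress each one (stand-in for Parquet encoding)."""
    return [
        zlib.compress(json.dumps(values[i:i + PAGE_SIZE]).encode())
        for i in range(0, len(values), PAGE_SIZE)
    ]


class DecodeCounter:
    """Counts how many pages were decompressed and decoded."""

    def __init__(self):
        self.pages_decoded = 0


def decode_page(page, counter):
    counter.pages_decoded += 1
    return json.loads(zlib.decompress(page).decode())


def eager_read(columns, filter_col, predicate, counter):
    """Materialize every page of every column, then apply the filter."""
    decoded = {
        name: [v for page in pages for v in decode_page(page, counter)]
        for name, pages in columns.items()
    }
    keep = [i for i, v in enumerate(decoded[filter_col]) if predicate(v)]
    return [{name: decoded[name][i] for name in columns} for i in keep]


def lazy_read(columns, filter_col, predicate, counter):
    """Decode the filter column first; decode other columns only where rows survive."""
    filter_values = [
        v for page in columns[filter_col] for v in decode_page(page, counter)
    ]
    keep = [i for i, v in enumerate(filter_values) if predicate(v)]
    needed_pages = sorted({i // PAGE_SIZE for i in keep})
    rows = [{filter_col: filter_values[i]} for i in keep]
    for name, pages in columns.items():
        if name == filter_col:
            continue
        # Pages with no surviving row are never decompressed or decoded.
        cache = {p: decode_page(pages[p], counter) for p in needed_pages}
        for row, i in zip(rows, keep):
            row[name] = cache[i // PAGE_SIZE][i % PAGE_SIZE]
    return rows


# Toy table: 16 rows, 2 columns, 4 pages per column.
columns = {
    "id": encode_column(list(range(16))),
    "name": encode_column([f"n{i}" for i in range(16)]),
}


def is_selected(v):
    return v >= 14  # a selective filter: only 2 of 16 rows pass


eager_counter, lazy_counter = DecodeCounter(), DecodeCounter()
eager_rows = eager_read(columns, "id", is_selected, eager_counter)
lazy_rows = lazy_read(columns, "id", is_selected, lazy_counter)
print(eager_counter.pages_decoded, lazy_counter.pages_decoded)  # prints "8 5"
```

Both paths return the same two rows, but the lazy path decodes 5 pages instead 
of 8 in this toy setup. The saving depends on how selective the filter is and 
on how the surviving rows are distributed across pages.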

The SPIP Jira and our design doc are linked below.
SPIP Jira: https://issues.apache.org/jira/browse/SPARK-42256
SPIP Doc: 
https://docs.google.com/document/d/1Kr3y2fVZUbQXGH0y8AvdCAeWC49QJjpczapiaDvFzME

Liang-Chi was kind enough to shepherd this effort. 

Thank you
Kazu
