Hi everyone,

I would like to start a discussion on "Lazy Materialization for Parquet Read
Performance Improvement".

Chao and I propose a Parquet reader with lazy materialization. For Spark SQL
filter operations, evaluating the filters first and then materializing only the
values that survive can avoid wasted computation and improve read performance.
The current Spark implementation materializes (i.e., decompresses, decodes,
etc.) all read values into memory before applying the filters, even though the
filters may eventually throw away many of those values.
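
To make the idea concrete, here is a minimal Scala sketch of the difference
(the ColumnChunk type and decode helper are hypothetical, for illustration
only, and not the actual reader code):

  // A hypothetical column chunk: values stay encoded until decode(i) is called.
  case class ColumnChunk[T](numValues: Int, decode: Int => T)

  // Eager approach (roughly what Spark does today):
  // decode every value, then apply the filter.
  def eagerRead[T](col: ColumnChunk[T], pred: T => Boolean): Seq[T] =
    (0 until col.numValues).map(col.decode).filter(pred)

  // Lazy materialization: evaluate the filter on the filter column first,
  // then decode only the surviving rows of the projected column.
  def lazyRead[T, F](filterCol: ColumnChunk[F], pred: F => Boolean,
                     projectedCol: ColumnChunk[T]): Seq[T] = {
    val survivors =
      (0 until filterCol.numValues).filter(i => pred(filterCol.decode(i)))
    survivors.map(projectedCol.decode) // decode only what the filter kept
  }

With a selective filter, the lazy version decodes only the surviving rows of
the projected columns, which is where the savings come from.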

Please find our design doc below.
SPIP Jira: https://issues.apache.org/jira/browse/SPARK-42256
SPIP Doc: https://docs.google.com/document/d/1Kr3y2fVZUbQXGH0y8AvdCAeWC49QJjpczapiaDvFzME

Liang-Chi was kind enough to shepherd this effort. 

Thank you
Kazu
