Hi everyone, I would like to start a discussion on "Lazy Materialization for Parquet Read Performance Improvement".
Chao and I propose a Parquet reader with lazy materialization. For Spark SQL filter operations, evaluating the filters first and lazily materializing only the values that are actually used can avoid wasted computation and improve read performance. The current Spark implementation requires the read values to be materialized (i.e., decompressed, decoded, etc.) into memory before the filters are applied, even though the filters may eventually discard many of those values.

Our design doc is as follows:

SPIP Jira: https://issues.apache.org/jira/browse/SPARK-42256
SPIP Doc: https://docs.google.com/document/d/1Kr3y2fVZUbQXGH0y8AvdCAeWC49QJjpczapiaDvFzME

Liang-Chi was kind enough to shepherd this effort.

Thank you,
Kazu
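To make the intuition concrete, here is a minimal, hypothetical sketch (not the proposed implementation, and not any Spark or Parquet API) that counts decode work under the two strategies: the eager reader decodes every column of every row before filtering, while the lazy reader decodes only the filter column first and materializes the remaining columns just for rows that pass the predicate. All function names here are illustrative.

```python
def decode(value, counter):
    # Stand-in for decompressing/decoding one stored value; counts the work done.
    counter[0] += 1
    return value

def eager_read(rows, predicate):
    # Current behavior: materialize all columns of all rows, then apply the filter.
    counter = [0]
    out = []
    for a, b in rows:
        row = {"a": decode(a, counter), "b": decode(b, counter)}
        if predicate(row["a"]):
            out.append(row)
    return out, counter[0]

def lazy_read(rows, predicate):
    # Lazy materialization: decode only the filter column up front; decode the
    # remaining columns only for rows that survive the predicate.
    counter = [0]
    out = []
    for a, b in rows:
        if predicate(decode(a, counter)):
            out.append({"a": a, "b": decode(b, counter)})
    return out, counter[0]

rows = [(i, i * 10) for i in range(100)]
pred = lambda a: a < 5  # selective filter: keeps 5 of 100 rows

_, eager_cost = eager_read(rows, pred)  # 100 rows x 2 columns = 200 decodes
_, lazy_cost = lazy_read(rows, pred)    # 100 (filter column) + 5 (payload) = 105 decodes
```

With a selective predicate the decode count drops from 200 to 105 in this toy example; the real savings depend on filter selectivity and on how many non-filter columns are read.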