In your answer to Q2 in the SPIP, you state, and I quote:

"... File formats other than Parquet are beyond the scope of this SPIP.."

It is important that you explain why you chose Parquet for this work. Apache
Parquet is an open source *column-oriented data format* that is widely used
in the Apache Hadoop ecosystem and beyond. It is designed for efficient data
storage and retrieval. Many data warehouses prefer to store data in external
storage in Parquet format. Since Spark is commonly used for ETL workloads
over such data, it makes sense to optimise data retrieval as much as
possible.



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.

On Tue, 31 Jan 2023 at 17:35, kazuyuki tanimura wrote:

> Hi everyone,
> I would like to start a discussion on "Lazy Materialization for Parquet
> Read Performance Improvement".
> Chao and I propose a Parquet reader with lazy materialization. For
> Spark-SQL filter operations, evaluating the filters first and lazily
> materializing only the values that are actually used can avoid wasted
> computation and improve read performance.
> The current implementation of Spark requires the values it reads to be
> materialized (i.e. decompressed, decoded, etc.) in memory before the
> filters are applied, even though the filters may eventually throw away
> many of those values.
> Our design doc is linked below.
> SPIP Jira:
> SPIP Doc:
> Liang-Chi was kind enough to shepherd this effort.
> Thank you
> Kazu
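
To make the idea in the quoted proposal concrete, here is a minimal sketch in
plain Python. It is not Spark or Parquet code: EncodedColumn, eager_read,
lazy_read, make_columns and keep are hypothetical names invented for
illustration, and "decoding" is just a UTF-8 decode standing in for the real
decompress/decode work. It only shows that evaluating the filter column first
lets the reader skip decoding the other column for rows the filter rejects.

```python
class EncodedColumn:
    """Toy stand-in for an encoded column chunk (hypothetical, not a Parquet API)."""

    def __init__(self, values):
        # Store each value as bytes so that reading it requires a decode step.
        self._raw = [str(v).encode("utf-8") for v in values]
        self.decode_count = 0  # how many values have been materialized

    def __len__(self):
        return len(self._raw)

    def decode(self, i):
        """Materialize one value; stands in for decompress + decode."""
        self.decode_count += 1
        return self._raw[i].decode("utf-8")


def eager_read(filter_col, payload_col, predicate):
    """Materialize every value of both columns, then apply the filter."""
    keys = [filter_col.decode(i) for i in range(len(filter_col))]
    payload = [payload_col.decode(i) for i in range(len(payload_col))]
    return [(k, p) for k, p in zip(keys, payload) if predicate(k)]


def lazy_read(filter_col, payload_col, predicate):
    """Decode the filter column first, then only the surviving payload rows."""
    keys = [filter_col.decode(i) for i in range(len(filter_col))]
    selected = [i for i, k in enumerate(keys) if predicate(k)]
    return [(keys[i], payload_col.decode(i)) for i in selected]


def make_columns(n):
    """Build a toy id column and a toy payload column with n rows each."""
    return EncodedColumn(range(n)), EncodedColumn(f"row-{i}" for i in range(n))


def keep(key):
    """Example filter: keep one row in every hundred."""
    return int(key) % 100 == 0


if __name__ == "__main__":
    ids, payload = make_columns(1000)
    eager_read(ids, payload, keep)
    print("eager payload decodes:", payload.decode_count)

    ids, payload = make_columns(1000)
    lazy_read(ids, payload, keep)
    print("lazy payload decodes:", payload.decode_count)
```

Both functions return the same rows; the difference is only in how many
payload values get decoded. The more selective the filter, the larger the
saving, which is the case the proposal targets.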
