Re: [I] Add an example of embedding indexes inside a parquet file [datafusion]

via GitHub Sat, 21 Jun 2025 05:57:38 -0700


JigaoLuo commented on issue #16374:
URL: https://github.com/apache/datafusion/issues/16374#issuecomment-2993567391

@alamb @zhuqi-lucas Thank you for this issue and the PR. This could
significantly aid query processing on Parquet.

I was **never** aware of `key_value_metadata` before and am grateful for
the insight: today I learned that it is present in both
[ColumnMetaData](https://github.com/apache/parquet-format/blob/87f2c8bf77eefb4c43d0ebaeea1778bd28ac3609/src/main/thrift/parquet.thrift#L900)
and
[FileMetaData](https://github.com/apache/parquet-format/blob/87f2c8bf77eefb4c43d0ebaeea1778bd28ac3609/src/main/thrift/parquet.thrift#L1267).
@alamb's argument also reminded me of a paper from the German DB
Conference:
https://dl.gi.de/server/api/core/bitstreams/9c8435ee-d478-4b0e-9e3f-94f39a9e7090/content
For reference, the relevant passage is quoted here:

<details>
<summary>Quote from the end of Section 2.3:</summary>

> The only statistics available in Parquet files are the cardinality of the
contained dataset and each page’s minimum and maximum values. Unfortunately,
the minimum and maximum values are optional fields, so Parquet writers are not
forced to use them. ... These minimum and maximum values, as well as the
cardinality of the datasets, are the only sources available for performing
cardinality estimates. Therefore, we get imprecise results since we do not
know how the data is distributed within the given boundaries. As a
consequence, we get erroneous cardinality estimates and suboptimal query plans.

> ... This shows how crucial a good cardinality estimate is for a Parquet
scan to be an acceptable alternative to database relations. The Parquet scan
cannot get close to the execution times of database relations as long as the
query optimizer cannot choose the same query plans for the Parquet files

</details>
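
To make the quoted point concrete, here is a minimal Python sketch of what an optimizer can do when min, max, and row count are the only statistics it has. All names and data below are hypothetical illustrations; none of this is taken from the paper, from Parquet, or from DataFusion. It assumes the estimator treats values as uniformly spread between min and max, which is a common simplification, and shows how far off that can be on skewed data.

```python
# Hypothetical illustration: function names and data are made up for this sketch.

def estimate_rows_uniform(min_val, max_val, num_rows, lo, hi):
    """Estimate how many rows satisfy lo <= value <= hi using only
    min/max and the row count, assuming a uniform distribution."""
    if max_val == min_val:
        # Single distinct value: either every row matches or none does.
        return float(num_rows) if lo <= min_val <= hi else 0.0
    # Width of the overlap between the predicate range and [min_val, max_val].
    overlap = min(hi, max_val) - max(lo, min_val)
    if overlap < 0:
        return 0.0
    return num_rows * overlap / (max_val - min_val)


def actual_rows(values, lo, hi):
    """Count the rows that really satisfy lo <= value <= hi."""
    return sum(1 for v in values if lo <= v <= hi)


# Skewed column: 990 values in 0..9 plus 10 outliers at 1000.
values = [i % 10 for i in range(990)] + [1000] * 10
stats = {"min": min(values), "max": max(values), "num_rows": len(values)}

estimate = estimate_rows_uniform(
    stats["min"], stats["max"], stats["num_rows"], lo=0, hi=9
)
actual = actual_rows(values, lo=0, hi=9)

print(f"estimated rows: {estimate}, actual rows: {actual}")
```

The estimate and the true count differ by about two orders of magnitude here, purely because the outliers stretch the min/max range. Extra statistics stored alongside the data (for example in `key_value_metadata`) are one way a writer could give the optimizer more to work with.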

In my experience, the **configurability of Parquet files is widely
underappreciated**. Many practitioners default to blaming Parquet’s
performance or feature limitations, such as the lack of HLL (HyperLogLog)
sketches. This often leads to unfair comparisons with proprietary formats,
which are fine-tuned and cherry-picked.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
