Re: Does Parquet format provide indexing for quick retrieval based on column filters?

Micah Kornfield Fri, 10 Jul 2020 21:11:32 -0700

Hi Yash,
there are a few mechanisms in Parquet that can help with this.  Not all of
them will be present in every Parquet file, and not all implementations
populate or make use of them (e.g. C++ lacks a few):
1.  Per-column statistics for each row group and data page [1], including
min/max values.
2.  Column indexes [2].
3.  Bloom filters [3].
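
To illustrate how the min/max statistics in item 1 can speed up a filter
such as income > 1000, here is a minimal sketch in plain Python. It does not
use any Parquet library; RowGroupStats, may_contain_greater_than and
prune_row_groups are hypothetical names made up for this example, and each
"row group" is just a list of income values. A real reader would take the
statistics from the file metadata rather than compute them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroupStats:
    """Hypothetical stand-in for the min/max statistics kept per row group."""
    min_value: int
    max_value: int


def may_contain_greater_than(stats: RowGroupStats, threshold: int) -> bool:
    # A row group can hold a match for `col > threshold` only if its
    # maximum exceeds the threshold; otherwise it can be skipped unread.
    return stats.max_value > threshold


def prune_row_groups(row_groups: List[List[int]], threshold: int) -> List[int]:
    # Assumption: stats are computed here for the demo; a writer would
    # normally have stored them alongside the data.
    stats = [RowGroupStats(min(rg), max(rg)) for rg in row_groups]
    return [i for i, s in enumerate(stats)
            if may_contain_greater_than(s, threshold)]


# Three toy row groups of "income" values.
row_groups = [[100, 250, 900], [400, 1500, 800], [2000, 3000, 2500]]

# Only row groups whose max is above 1000 need to be read at all.
survivors = prune_row_groups(row_groups, 1000)

# Statistics only rule row groups out; surviving rows are still filtered.
matches = [v for i in survivors for v in row_groups[i] if v > 1000]
```

Note that the statistics can only exclude data: a row group that survives
pruning may still contain non-matching rows, so the filter is applied again
to the rows that are actually read. Pruning works best when the data is
sorted or clustered on the filtered column, since min/max ranges that
overlap heavily exclude very little.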


Thanks,
Micah


[1]
https://github.com/apache/parquet-format/blob/232e23a68ab45be0db2cca5d0991613c9f350f8c/src/main/thrift/parquet.thrift#L197
[2] https://github.com/apache/parquet-format/blob/master/PageIndex.md
[3]
https://github.com/apache/parquet-format/blob/e1dca742bbd0e1eec3a07c70ca53535d678b20dc/BloomFilter.md

On Fri, Jul 10, 2020 at 12:04 PM Yash Ganthe <[email protected]> wrote:

> Hi,
>
> If I want to query a Parquet file with a criterion such as income > 1000,
> does Parquet support indexing of the columns to make it faster to identify
> the records matching that criterion? I know we can partition the file on a
> column. But in my case, assume it is already partitioned on a single column,
> Date, and I want to use other criteria for filtering the records.
>
> Regards,
> Yash
>
