Hi Yash,

There are a few mechanisms in Parquet that can help with this. Not all of them will be present in every Parquet file, and not all implementations populate them or make use of them (e.g. C++ lacks a few):

1. Per-column statistics, kept per row group and per data page [1]. These include min/max values.
2. Column indexes [2].
3. Bloom filters [3].
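
To make the first item concrete, here is a minimal Python sketch of how a reader could use min/max statistics to skip row groups for a filter like income > 1000. It is only an illustration of the idea, not how any Parquet library is implemented: RowGroup and scan_greater_than are hypothetical names, and the data is made up. In a real file the statistics are written into the metadata once, rather than recomputed from the values as done here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """Hypothetical stand-in for one Parquet row group with a single column."""
    values: List[int]

    @property
    def min_value(self) -> int:
        # A real file stores this in its metadata; here it is recomputed.
        return min(self.values)

    @property
    def max_value(self) -> int:
        return max(self.values)


def scan_greater_than(row_groups: List[RowGroup], threshold: int) -> Tuple[List[int], int]:
    """Return the matching values and the number of row groups actually read."""
    matches: List[int] = []
    groups_read = 0
    for rg in row_groups:
        # If even the largest value fails the predicate, no row can match,
        # so the whole row group is skipped without looking at its values.
        if rg.max_value <= threshold:
            continue
        groups_read += 1
        matches.extend(v for v in rg.values if v > threshold)
    return matches, groups_read


# Made-up "income" column split into three row groups.
row_groups = [
    RowGroup([100, 250, 900]),
    RowGroup([950, 1200, 4000]),
    RowGroup([10, 20, 30]),
]

matches, groups_read = scan_greater_than(row_groups, 1000)
print(matches, groups_read)  # [1200, 4000] 1
```

Only one of the three row groups has to be read in this example. The benefit depends on how the data is laid out: if income values are spread evenly across all row groups, the min/max ranges overlap the threshold everywhere and little can be skipped.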

Thanks,
Micah

[1] https://github.com/apache/parquet-format/blob/232e23a68ab45be0db2cca5d0991613c9f350f8c/src/main/thrift/parquet.thrift#L197
[2] https://github.com/apache/parquet-format/blob/master/PageIndex.md
[3] https://github.com/apache/parquet-format/blob/e1dca742bbd0e1eec3a07c70ca53535d678b20dc/BloomFilter.md

On Fri, Jul 10, 2020 at 12:04 PM Yash Ganthe <[email protected]> wrote:

> Hi,
>
> If I want to query a parquet file with a criteria such as income > 1000,
> does Parquet support indexing of the columns to make it faster to identify
> the records with the criteria? I know we can partition the file on a
> column. But in my case assume it is already partitioned on a single column
> that is Date and I want to use other criteria for filtering the records.
>
> Regards,
> Yash
