Parquet supports predicate pushdown at the RowGroup and Page level. While
page-level filtering is not yet available and will be a bit more complex to
implement, you can already do basic RowGroup filtering using `pyarrow`. With
`pyarrow.parquet.ParquetFile(…).metadata.row_group(…).column(…).statistics.{min,max}`
you get the minimum and maximum of a RowGroup for a given column. By comparing
your filters against these statistics, you can skip the RowGroups that cannot
contain matching rows and read only a subset of the file using
`read_row_group`.
RowGroup filtering only pays off, though, if your Parquet files were written
with more than one RowGroup. You can achieve this by setting a `chunk_size` in
`write_table` (named `row_group_size` in newer `pyarrow` releases) that suits
your problem. Normally `pyarrow` will only write Parquet files with a single
RowGroup.
[ Full content available at: https://github.com/apache/arrow/issues/2491 ]