[GitHub] [arrow] xhochy commented on issue #2491: Querying files with 10,000s of columns and 100,000s of rows

GitHub Wed, 29 Aug 2018 06:45:01 -0700

Parquet supports predicate pushdown at the RowGroup and Page level. While 
page-level filtering is not yet available and will be a bit more complex to 
implement, you can already do basic RowGroup filtering using `pyarrow`. With 
`pyarrow.parquet.ParquetFile(…).metadata.row_group(…).column(…).statistics.{min,max}` 
you get the minimum and maximum of a RowGroup for a given column. By comparing 
your filters against these statistics, you can skip the RowGroups that cannot 
match and read only a subset of the file using `read_row_group`.


This depends, though, on your Parquet files having been written with more than 
one RowGroup. You can achieve this by setting a `chunk_size` in `write_table` 
(named `row_group_size` in newer `pyarrow` releases) that suits your problem. 
Normally `pyarrow` will only write Parquet files with a single RowGroup.

[ Full content available at: https://github.com/apache/arrow/issues/2491 ]
This message was relayed via gitbox.apache.org for [email protected]
