Martin Durant created ARROW-3245:
------------------------------------
Summary: Infer index and/or filtering from parquet column statistics
Key: ARROW-3245
URL: https://issues.apache.org/jira/browse/ARROW-3245
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Reporter: Martin Durant
The metadata included in a parquet file generally gives the min/max of the data
for each chunk of each column. This allows whole chunks to be filtered out early
if they cannot meet some criterion, which can greatly reduce the amount of data
read in some circumstances. In Dask, we care about this for setting an index and
its "divisions" (start/stop values for each data partition) and for leaving
chunks that cannot match out of the graph of tasks to be processed.
Similarly, filtering may be applied on the values of fields defined by the
directory partitioning.
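
As a rough illustration of the chunk skipping described above, here is a minimal
sketch. It assumes the per-chunk min/max values for one column have already been
pulled out of the file metadata into plain Python objects; ChunkStats and
chunks_to_read are hypothetical names used only for this example, not pyarrow or
fastparquet API.

```python
from typing import List, NamedTuple, Optional


class ChunkStats(NamedTuple):
    """Hypothetical container: min/max of one column within one chunk."""
    min: Optional[float]
    max: Optional[float]


def chunks_to_read(stats: List[ChunkStats], lo: float, hi: float) -> List[int]:
    """Return indices of chunks that may hold values in the range [lo, hi]."""
    keep = []
    for i, s in enumerate(stats):
        if s.min is None or s.max is None:
            # No statistics recorded: the chunk cannot be ruled out.
            keep.append(i)
        elif s.max >= lo and s.min <= hi:
            # The chunk's [min, max] interval overlaps the requested range.
            keep.append(i)
    return keep


stats = [
    ChunkStats(0, 9),
    ChunkStats(10, 19),
    ChunkStats(None, None),  # statistics missing for this chunk
    ChunkStats(20, 29),
]
selected = chunks_to_read(stats, 12, 15)
```

Chunks without statistics are kept, since skipping them could silently drop
matching rows; the filter only ever removes chunks that provably cannot match.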
Currently, Dask with the fastparquet backend is able to infer which columns
could be used as an index, filter on that index, and do general filtering on
any column which has statistical or partitioning information. It would be very
helpful to have such facilities via pyarrow also.
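
One way index inference from statistics could work is sketched below. This is
an assumption for illustration, not a description of the actual fastparquet or
Dask code: a column is treated as an index candidate when its chunks are in
strictly increasing, non-overlapping order, and the divisions are taken as each
chunk's minimum plus the last chunk's maximum. The names infer_divisions and
index_candidates are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# (min, max) of one column for each chunk, in file order.
Stats = List[Tuple[int, int]]


def infer_divisions(stats: Stats) -> Optional[List[int]]:
    """Return divisions if the chunks are sorted and disjoint, else None."""
    if not stats:
        return None
    for (_, prev_max), (next_min, _) in zip(stats, stats[1:]):
        if next_min <= prev_max:
            # Ranges overlap or are out of order: not usable as an index.
            return None
    # Start value of every chunk, followed by the stop value of the last one.
    return [mn for mn, _ in stats] + [stats[-1][1]]


def index_candidates(columns: Dict[str, Stats]) -> Dict[str, List[int]]:
    """Map each column that could serve as an index to its divisions."""
    found = {}
    for name, stats in columns.items():
        divisions = infer_divisions(stats)
        if divisions is not None:
            found[name] = divisions
    return found


columns = {
    "ts": [(0, 9), (10, 19), (20, 29)],   # sorted across chunks
    "val": [(3, 50), (1, 40), (2, 60)],   # overlapping ranges
}
candidates = index_candidates(columns)
```

Here only "ts" qualifies, because the ranges of "val" overlap between chunks.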
(Please forgive that some of this has already been mentioned elsewhere; this is
one of the entries in the list at
[https://github.com/dask/fastparquet/issues/374] of features that are useful in
fastparquet.)