[ https://issues.apache.org/jira/browse/ARROW-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Martin Durant updated ARROW-3245:
---------------------------------
    Description: 
The metadata included in Parquet files generally gives the min/max of the data 
in each chunk of each column. This allows whole chunks to be filtered out early 
if they cannot meet some criterion, which can greatly reduce the reading burden 
in some circumstances. In Dask, we care about this for setting an index and its 
"divisions" (the start/stop values of each data partition), and for leaving 
some chunks out of the graph of tasks to be processed altogether. Similarly, 
filtering may be applied to the values of fields defined by the directory 
partitioning.
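
To illustrate, here is a minimal sketch of the kind of chunk-skipping this 
would enable, built on the per-row-group statistics that pyarrow already 
exposes (the file name, column position and predicate below are hypothetical):

{code:python}
import pyarrow.parquet as pq

# Sketch: skip whole row groups whose min/max statistics show they can
# never satisfy x >= threshold. "example.parquet" is a hypothetical flat
# file with a numeric column in position 0.
pf = pq.ParquetFile("example.parquet")
threshold = 100

keep = []
for i in range(pf.metadata.num_row_groups):
    stats = pf.metadata.row_group(i).column(0).statistics
    if stats is None or not stats.has_min_max:
        keep.append(i)  # no statistics recorded: must read the group
    elif stats.max >= threshold:
        keep.append(i)  # group may contain qualifying rows

# Only the surviving row groups are ever read from storage.
tables = [pf.read_row_group(i) for i in keep]
{code}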

Currently, Dask with the fastparquet backend is able to infer possible columns 
to use as an index, perform filtering on that index, and do general filtering 
on any column that has statistics or partitioning information. It would be very 
helpful to have such facilities via pyarrow too.

This is probably the most important of the requests from Dask.
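
For reference, the facility as exposed by Dask through the fastparquet engine 
looks roughly like this (the path, column names and predicate here are 
hypothetical):

{code:python}
import dask.dataframe as dd

# Roughly the fastparquet-backed interface Dask offers today.
df = dd.read_parquet(
    "data/",
    engine="fastparquet",
    index="timestamp",              # sorted column -> known partition divisions
    filters=[("value", ">", 100)],  # row groups excluded via min/max statistics
)
{code}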

(Please forgive that some of this has already been mentioned elsewhere; this is 
one of the entries in the list at 
[https://github.com/dask/fastparquet/issues/374] of features that are useful in 
fastparquet.)

> Infer index and/or filtering from parquet column statistics
> -----------------------------------------------------------
>
>                 Key: ARROW-3245
>                 URL: https://issues.apache.org/jira/browse/ARROW-3245
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Martin Durant
>            Priority: Major
>
> The metadata included in parquet generally gives the min/max of data for each 
> chunk of each column. This allows early filtering out of whole chunks if they 
> do not meet some criterion, and can greatly reduce reading burden in some 
> circumstances. In Dask, we care about this for setting an index and its 
> "divisions" (start/stop values for each data partition) and for directly 
> avoiding including some chunks in the graph of tasks to be processed. 
> Similarly, filtering may be applied on the values of fields defined by the 
> directory partitioning.
> Currently, dask using the fastparquet backend is able to infer possible 
> columns to use as an index, perform filtering on that index and do general 
> filtering on any column which has statistical or partitioning information. It 
> would be very helpful to have such facilities via pyarrow also.
>  This is probably the most important of the requests from Dask.
> (please forgive that some of this has already been mentioned elsewhere; this 
> is one of the entries in the list at 
> [https://github.com/dask/fastparquet/issues/374] as a feature that is useful 
> in fastparquet)


