[ https://issues.apache.org/jira/browse/ARROW-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622921#comment-16622921 ]
Matthew Rocklin commented on ARROW-3245:
----------------------------------------

After some fooling around this worked for me

{{import pyarrow.parquet as pq}}
{{import pandas as pd}}
{{df = pd.DataFrame({'a': [1, 0]})}}
{{df.to_parquet('out.parq', engine='pyarrow')}}
{{pf = pq.ParquetDataset('out.parq')}}
{{piece = pf.pieces[0]}}
{{import functools}}
{{piece.get_metadata(functools.partial(open, mode='rb'))}}

I had to dive into the source a bit to figure out how to interpret the docstring.

> [Python] Infer index and/or filtering from parquet column statistics
> --------------------------------------------------------------------
>
>                 Key: ARROW-3245
>                 URL: https://issues.apache.org/jira/browse/ARROW-3245
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Martin Durant
>            Priority: Major
>              Labels: parquet
>
> The metadata included in parquet generally gives the min/max of data for each
> chunk of each column. This allows early filtering out of whole chunks if they
> do not meet some criterion, and can greatly reduce reading burden in some
> circumstances. In Dask, we care about this for setting an index and its
> "divisions" (start/stop values for each data partition) and for directly
> avoiding including some chunks in the graph of tasks to be processed.
> Similarly, filtering may be applied on the values of fields defined by the
> directory partitioning.
> Currently, dask using the fastparquet backend is able to infer possible
> columns to use as an index, perform filtering on that index and do general
> filtering on any column which has statistical or partitioning information. It
> would be very helpful to have such facilities via pyarrow also.
> This is probably the most important of the requests from Dask.
> (please forgive that some of this has already been mentioned elsewhere; this
> is one of the entries in the list at
> [https://github.com/dask/fastparquet/issues/374] as a feature that is useful
> in fastparquet)

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
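
For illustration only, here is a rough sketch of the two uses of min/max statistics that the issue describes: skipping chunks that cannot match a filter, and inferring "divisions" for an index column. It is plain Python and does not use pyarrow or fastparquet. The names ChunkStats, chunks_overlapping and infer_divisions are hypothetical, and the statistics are assumed to have already been read from the parquet metadata. The strict non-overlap requirement in infer_divisions is also an assumption, chosen to be conservative.

```python
from typing import Any, List, NamedTuple, Optional


class ChunkStats(NamedTuple):
    """Min/max of one column within one chunk (hypothetical container)."""
    chunk_id: int
    min: Any
    max: Any


def chunks_overlapping(stats: List[ChunkStats],
                       lower: Optional[Any] = None,
                       upper: Optional[Any] = None) -> List[int]:
    """Return ids of chunks whose [min, max] range may hold values in [lower, upper].

    Chunks that cannot contain a matching value are dropped without being read.
    """
    keep = []
    for s in stats:
        if lower is not None and s.max < lower:
            continue  # every value in this chunk is below the requested range
        if upper is not None and s.min > upper:
            continue  # every value in this chunk is above the requested range
        keep.append(s.chunk_id)
    return keep


def infer_divisions(stats: List[ChunkStats]) -> Optional[List[Any]]:
    """Return start values of each chunk plus the final stop value.

    Returns None unless the chunks are sorted and strictly non-overlapping
    (an assumed, conservative condition for using the column as an index).
    """
    if not stats:
        return None
    for prev, cur in zip(stats, stats[1:]):
        if prev.max >= cur.min:
            return None
    return [s.min for s in stats] + [stats[-1].max]


if __name__ == "__main__":
    example = [ChunkStats(0, 0, 9), ChunkStats(1, 10, 19), ChunkStats(2, 20, 29)]
    print(chunks_overlapping(example, lower=12, upper=15))
    print(infer_divisions(example))
```

With real data, the list of ChunkStats would be built from the per-chunk column statistics stored in the file metadata, such as the metadata object returned by the piece.get_metadata call in the comment above.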