[ https://issues.apache.org/jira/browse/ARROW-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621990#comment-16621990 ]

Martin Durant commented on ARROW-3245:
--------------------------------------

(pyarrow 0.10.0)

```

In [7]: df = pd.DataFrame({'a': [1, 0]})

In [8]: df.to_parquet('out.parq', engine='pyarrow')

In [9]: pf = pq.ParquetDataset('out.parq')

In [10]: pf.pieces[0].get_metadata()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-10-5f47bb2e5193> in <module>()
----> 1 pf.pieces[0].get_metadata()

~/anaconda/envs/tester/lib/python3.6/site-packages/pyarrow/parquet.py in get_metadata(self, open_file_func)
    412         file's metadata
    413         """
--> 414         return self._open(open_file_func).metadata
    415
    416     def _open(self, open_file_func=None):

~/anaconda/envs/tester/lib/python3.6/site-packages/pyarrow/parquet.py in _open(self, open_file_func)
    418         Returns instance of ParquetFile
    419         """
--> 420         reader = open_file_func(self.path)
    421         if not isinstance(reader, ParquetFile):
    422             reader = ParquetFile(reader)

TypeError: 'NoneType' object is not callable

```

> [Python] Infer index and/or filtering from parquet column statistics
> --------------------------------------------------------------------
>
>                 Key: ARROW-3245
>                 URL: https://issues.apache.org/jira/browse/ARROW-3245
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Martin Durant
>            Priority: Major
>              Labels: parquet
>
> The metadata included in parquet generally gives the min/max of data for each 
> chunk of each column. This allows early filtering out of whole chunks if they 
> do not meet some criterion, and can greatly reduce reading burden in some 
> circumstances. In Dask, we care about this for setting an index and its 
> "divisions" (start/stop values for each data partition) and for directly 
> avoiding including some chunks in the graph of tasks to be processed. 
> Similarly, filtering may be applied on the values of fields defined by the 
> directory partitioning.
> Currently, Dask with the fastparquet backend is able to infer possible
> columns to use as an index, perform filtering on that index, and do general
> filtering on any column that has statistical or partitioning information. It
> would be very helpful to have such facilities via pyarrow also.
>  This is probably the most important of the requests from Dask.
> (please forgive that some of this has already been mentioned elsewhere; this 
> is one of the entries in the list at 
> [https://github.com/dask/fastparquet/issues/374] as a feature that is useful 
> in fastparquet)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)