[
https://issues.apache.org/jira/browse/ARROW-8039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17055470#comment-17055470
]
Neal Richardson commented on ARROW-8039:
----------------------------------------
We might focus this by saying that the objective is to satisfy the {{.read()}}
method of ParquetDataset and to at least support the {{filters}} argument to
the init method (with the bonus feature that you can filter on any column, not
just partition keys, as an incentive to use the new code). This would exclude
supporting object attributes like "pieces", which we could address separately
for dask et al.
See
https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html
and
https://arrow.apache.org/docs/python/parquet.html#partitioned-datasets-multiple-files.
> [C++][Python][Dataset] Assemble a minimal ParquetDataset shim
> -------------------------------------------------------------
>
> Key: ARROW-8039
> URL: https://issues.apache.org/jira/browse/ARROW-8039
> Project: Apache Arrow
> Issue Type: Sub-task
> Components: C++ - Dataset, Python
> Affects Versions: 0.16.0
> Reporter: Ben Kietzman
> Assignee: Ben Kietzman
> Priority: Major
> Fix For: 0.17.0
>
>
> Assemble a minimal ParquetDataset shim backed by {{pyarrow.dataset.*}}.
> Replace the existing ParquetDataset with the shim by default, and allow
> opt-out for users who need the current ParquetDataset.
> This is mostly exploratory, to see which of the Python tests fail.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)