[
https://issues.apache.org/jira/browse/ARROW-8039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17055470#comment-17055470
]
Neal Richardson commented on ARROW-8039:
----------------------------------------
We might focus this by saying that the objective is to satisfy the {{.read()}}
method of ParquetDataset and to at least support the {{filters}} argument to
the init method (with the bonus feature that you can filter on any column, not
just partition keys, as an incentive to use the new code). This would exclude
supporting object attributes like "pieces", which we could address separately
for dask et al.
See
https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html
and
https://arrow.apache.org/docs/python/parquet.html#partitioned-datasets-multiple-files.
> [C++][Python][Dataset] Assemble a minimal ParquetDataset shim
> -------------------------------------------------------------
>
> Key: ARROW-8039
> URL: https://issues.apache.org/jira/browse/ARROW-8039
> Project: Apache Arrow
> Issue Type: Sub-task
> Components: C++ - Dataset, Python
> Affects Versions: 0.16.0
> Reporter: Ben Kietzman
> Assignee: Ben Kietzman
> Priority: Major
> Fix For: 0.17.0
>
>
> Assemble a minimal ParquetDataset shim backed by {{pyarrow.dataset.*}}.
> Replace the existing ParquetDataset with the shim by default, and allow
> opt-out for users who need the current ParquetDataset.
> This is mostly exploratory, to see which of the Python tests fail.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)