[
https://issues.apache.org/jira/browse/ARROW-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17298151#comment-17298151
]
Ben Kietzman commented on ARROW-7224:
-------------------------------------
[~andydoug] the metadata you mentioned makes me think of
[pyarrow.dataset.parquet_dataset|https://arrow.apache.org/docs/python/dataset.html#working-with-parquet-datasets],
which allows construction of a dataset from a single metadata-only parquet
file. When such a {{_metadata}} file can be written, filters will be applied
not only to partition keys but also to row group statistics, without any IO
against the data files; it might be worth a look.
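As a sketch of that approach (adapting the pattern from the pyarrow docs; the local paths and column names below are illustrative stand-ins, not part of this issue):
{code:python}
import os
import tempfile

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

root = tempfile.mkdtemp()
table = pa.table({'year': [2019, 2019, 2020], 'value': [1.0, 2.0, 3.0]})

# Write the data files, collecting each written file's metadata
metadata_collector = []
pq.write_to_dataset(table, root, metadata_collector=metadata_collector)

# Summarize the row group statistics of all files into a single _metadata file
pq.write_metadata(table.schema, os.path.join(root, '_metadata'),
                  metadata_collector=metadata_collector)

# The dataset is built from _metadata alone; a filter can then be checked
# against row group statistics without opening the data files
dataset = ds.parquet_dataset(os.path.join(root, '_metadata'))
kindness_table = dataset.to_table(filter=ds.field('year') == 2019)
{code}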
The fundamental problem I see is a mismatch in how {{pq.ParquetDataset}} and
{{pyarrow.dataset.*}} consider filters: the former considers a filter during
construction while the latter is intended to handle multiple reads with
differing filters. In the latter case, the dataset is intended to _encapsulate_
the mapping between partitions and files:
{code:python}
import pyarrow.dataset as ds
dataset = ds.dataset('s3://bucket-name/dataset-name', format='parquet')
kindness_day_filter = ((ds.field('year') == 2019) & (ds.field('month') == 11)
                       & (ds.field('day') == 13))
kindness_day_table = dataset.to_table(filter=kindness_day_filter)
# repeat with other filters
kindness_week_filter = ((ds.field('year') == 2019) & (ds.field('month') == 11)
                        & (ds.field('day') >= 10) & (ds.field('day') <= 16))
{code}
If neither of these is acceptable, note that {{pq.ParquetDataset}} can
be constructed from a list of paths of data files. If an explicit list of files
is available then the dataset can be (repeatedly) constructed without invoking
a directory listing method.
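A short sketch of that variant (the local files below stand in for known paths in object storage):
{code:python}
import os
import tempfile

import pyarrow as pa
import pyarrow.parquet as pq

root = tempfile.mkdtemp()
paths = [os.path.join(root, 'part-0.parquet'),
         os.path.join(root, 'part-1.parquet')]

# Write two data files standing in for already-known files on S3
pq.write_table(pa.table({'day': [13], 'value': [1.0]}), paths[0])
pq.write_table(pa.table({'day': [14], 'value': [2.0]}), paths[1])

# Constructing the dataset from explicit paths never issues a listing call
dataset = pq.ParquetDataset(paths)
table = dataset.read()
{code}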
FWIW, we have https://issues.apache.org/jira/browse/ARROW-8163 to track lazier
construction of datasets, and the only issue I'm aware of for supporting more
sophisticated listing behavior is
https://issues.apache.org/jira/browse/ARROW-6257 . It seems that before lazy
construction of a dataset could benefit this case, we'd also need to support
producing a stream of results from querying a FileSelector [~apitrou]
https://github.com/apache/arrow/pull/9632
[~jorisvandenbossche]
> [C++][Dataset] Partition level filters should be able to provide filtering to
> file systems
> ------------------------------------------------------------------------------------------
>
> Key: ARROW-7224
> URL: https://issues.apache.org/jira/browse/ARROW-7224
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Micah Kornfield
> Priority: Major
> Labels: dataset
>
> When providing a filter for partitions, it should be possible in some cases
> to use it to optimize file system list calls. This can greatly improve the
> speed of reading data from partitions, because fewer directories/files need
> to be explored/expanded. I've fallen behind on the
> dataset code, but I want to make sure this issue is tracked someplace. This
> came up in SO question linked below (feel free to correct my analysis if I
> missed the functionality someplace).
> Reference:
> [https://stackoverflow.com/questions/58868584/pyarrow-parquetdataset-read-is-slow-on-a-hive-partitioned-s3-dataset-despite-u/58951477#58951477]
--
This message was sent by Atlassian Jira
(v8.3.4#803005)