[
https://issues.apache.org/jira/browse/ARROW-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17297664#comment-17297664
]
Micah Kornfield commented on ARROW-7224:
----------------------------------------
I think this would require changing some dataset assumptions (I'm not
familiar enough with the code to say how, or whether it is worth the work).
But it could be done as follows. If we know the directories look like
"{a}/{b}/{c}/{files}" (either through inference or because the user provides
this), then given a predicate "a=foo", instead of listing all objects and
caching them, a directory listing of "/foo/*" could be issued.
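To make the idea concrete, here is a minimal sketch in Python (not the Arrow C++ implementation). The function name `listing_prefix` and its arguments are hypothetical, and it assumes plain directory partitioning where each path segment is the value of one partition key, as in "{a}/{b}/{c}/{files}". It only handles equality predicates, and it can only narrow the listing while the leading keys are pinned.

```python
def listing_prefix(partition_keys, equality_predicates):
    """Return the longest directory prefix pinned by equality predicates.

    partition_keys: ordered key names, e.g. ["a", "b", "c"] (hypothetical).
    equality_predicates: dict such as {"a": "foo"} meaning a == "foo".
    An empty string means no narrowing is possible (list everything).
    """
    parts = []
    for key in partition_keys:
        # Stop at the first key without an equality predicate: everything
        # below that level has to be listed.
        if key not in equality_predicates:
            break
        parts.append(str(equality_predicates[key]))
    return "/".join(parts) + "/" if parts else ""


prefix_a = listing_prefix(["a", "b", "c"], {"a": "foo"})
prefix_ab = listing_prefix(["a", "b", "c"], {"a": "foo", "b": "bar"})
prefix_b_only = listing_prefix(["a", "b", "c"], {"b": "bar"})
```

With a predicate only on "b", the sketch falls back to an empty prefix, because the top level is not pinned and a single prefix listing cannot express the filter.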
So this might involve some of the following for datasets:
1. Making construction lazier.
2. Tracking which top-level structures have been explored and which ones
haven't.
3. Constructing listings in parallel given a predicate.
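The three points above could be sketched roughly as follows. This is an illustration only, not how Arrow datasets are structured: `LazyListing`, `files_for`, and the in-memory `FAKE_STORE` are hypothetical names, and `fake_list_dir` stands in for a real file system list call.

```python
from concurrent.futures import ThreadPoolExecutor


class LazyListing:
    """Lists prefixes on demand and remembers which ones were explored."""

    def __init__(self, list_dir, max_workers=8):
        self._list_dir = list_dir  # callable: prefix -> list of file paths
        self._max_workers = max_workers
        self._explored = {}        # prefix -> cached listing

    def files_for(self, prefixes):
        # Only prefixes that have not been explored yet trigger a list call.
        missing = [p for p in dict.fromkeys(prefixes) if p not in self._explored]
        if missing:
            workers = min(self._max_workers, len(missing))
            # Issue the outstanding listings in parallel.
            with ThreadPoolExecutor(max_workers=workers) as pool:
                for prefix, files in zip(missing, pool.map(self._list_dir, missing)):
                    self._explored[prefix] = files
        return [f for p in prefixes for f in self._explored[p]]


# Hypothetical in-memory stand-in for an object store.
FAKE_STORE = {
    "foo/": ["foo/x/1/part-0.parquet", "foo/y/2/part-0.parquet"],
    "bar/": ["bar/x/1/part-0.parquet"],
}
calls = []


def fake_list_dir(prefix):
    calls.append(prefix)  # record each simulated list call
    return FAKE_STORE.get(prefix, [])


listing = LazyListing(fake_list_dir)
first = listing.files_for(["foo/"])           # lists only "foo/"
second = listing.files_for(["foo/", "bar/"])  # reuses "foo/", lists "bar/"
```

Nothing is listed at construction time, and a prefix that was already explored is served from the cache, so the first query only pays for the partitions it touches.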
A success metric for this would be latency to the first returned data. The
existing datasets contract seems to be optimized around minimizing total
latency across all queries.
If you think about the common case of a date-partitioned data warehouse, the
most common queries are going to be on recent data. Listing only the
partitions needed can reduce latency (potentially by quite a bit if the
underlying file system doesn't support reverse lexicographic listing).
> [C++][Dataset] Partition level filters should be able to provide filtering to
> file systems
> ------------------------------------------------------------------------------------------
>
> Key: ARROW-7224
> URL: https://issues.apache.org/jira/browse/ARROW-7224
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Micah Kornfield
> Priority: Major
> Labels: dataset
>
> When providing a filter for partitions, it should be possible in some cases
> to use it to optimize file system list calls. This can greatly improve the
> speed of reading data from partitions because fewer
> directories/files need to be explored/expanded. I've fallen behind on the
> dataset code, but I want to make sure this issue is tracked someplace. This
> came up in the SO question linked below (feel free to correct my analysis
> if I missed the functionality someplace).
> Reference:
> [https://stackoverflow.com/questions/58868584/pyarrow-parquetdataset-read-is-slow-on-a-hive-partitioned-s3-dataset-despite-u/58951477#58951477]