[jira] [Commented] (ARROW-7224) [C++][Dataset] Partition level filters should be able to provide filtering to file systems

Micah Kornfield (Jira) Mon, 08 Mar 2021 11:28:06 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17297664#comment-17297664
 ]


Micah Kornfield commented on ARROW-7224:
----------------------------------------

I think this would require changing some dataset assumptions (I'm not 
familiar enough with the code to say how, or whether it is worth the work), 
but it could be done as follows.  If we know the directories look like 
"{a}/{b}/{c}/{files}" (either through inference or because the user provides 
the layout), then given a predicate "a=foo", a directory listing of "/foo/*" 
could be issued instead of listing all objects and caching them.
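
As a rough illustration of the idea (not the Arrow API; the function name 
and the plain-dict predicate representation are assumptions made up for 
this sketch), the listing prefix can be derived by walking the partition 
keys in directory order and stopping at the first key that has no equality 
constraint:

```python
from typing import Dict, List


def listing_prefix(partition_keys: List[str], equalities: Dict[str, str]) -> str:
    """Return the longest directory prefix fixed by equality predicates.

    partition_keys: directory levels in order, e.g. ["a", "b", "c"].
    equalities: hypothetical predicate form, e.g. {"a": "foo"} for a=foo.
    """
    parts = []
    for key in partition_keys:
        if key not in equalities:
            # First unconstrained level: everything below must be listed.
            break
        parts.append(equalities[key])
    if not parts:
        return "/"
    return "/" + "/".join(parts) + "/"
```

With the layout above, a predicate on "a" alone narrows the listing to 
"/foo/", while a predicate only on "b" or "c" cannot narrow it, because the 
level above is unconstrained.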

So this might involve some of the following for datasets:
1. Making construction lazier.
2. Tracking which top-level structures have been explored and which ones 
haven't.
3. Constructing listings in parallel given a predicate.
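
A minimal sketch of how those three points could fit together, assuming a 
hypothetical `list_dir(prefix)` callable that stands in for a file system 
list call (the class and method names are invented for illustration and are 
not part of the datasets code):

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List


class LazyListing:
    """Lists prefixes on demand and remembers which ones were explored."""

    def __init__(self, list_dir: Callable[[str], List[str]], max_workers: int = 8):
        self._list_dir = list_dir          # hypothetical file system call
        self._max_workers = max_workers
        self._explored: Dict[str, List[str]] = {}  # prefix -> cached entries

    def is_explored(self, prefix: str) -> bool:
        return prefix in self._explored

    def list_many(self, prefixes: Iterable[str]) -> Dict[str, List[str]]:
        """List the given prefixes, issuing calls only for unexplored ones."""
        wanted = list(dict.fromkeys(prefixes))  # de-duplicate, keep order
        missing = [p for p in wanted if p not in self._explored]
        if missing:
            workers = min(self._max_workers, len(missing))
            with ThreadPoolExecutor(max_workers=workers) as pool:
                # pool.map preserves input order, so zip pairs results correctly.
                for prefix, entries in zip(missing, pool.map(self._list_dir, missing)):
                    self._explored[prefix] = entries
        return {p: self._explored[p] for p in wanted}
```

Nothing is listed at construction time; a later query that touches an 
already explored prefix is served from the cache, and only new prefixes 
trigger list calls.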

One success metric is latency to the first returned data.  The existing 
datasets contract seems to be optimized for minimizing total latency across 
all queries.

If you think about the common case of a date-partitioned data warehouse, 
the most common queries are going to be on recent data.  Listing only the 
partitions needed can reduce latency (potentially by quite a bit if the 
underlying file system doesn't support reverse lexicographic listing).
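
To make the date case concrete, here is a small sketch that assumes a 
top-level layout of "/YYYY-MM-DD/" directories (the layout and the function 
name are assumptions for illustration only).  A query over the last few 
days needs just these prefixes, rather than a full listing that starts at 
the oldest date:

```python
from datetime import date, timedelta
from typing import List


def recent_date_prefixes(end: date, days: int) -> List[str]:
    """Prefixes for the `days` most recent dates, newest first.

    Assumes one top-level directory per day named in ISO format.
    """
    return [f"/{(end - timedelta(days=i)).isoformat()}/" for i in range(days)]
```

Each returned prefix can then be listed on its own (and in parallel), so 
the cost scales with the number of dates queried rather than with the age 
of the warehouse.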





> [C++][Dataset] Partition level filters should be able to provide filtering to 
> file systems
> ------------------------------------------------------------------------------------------
>
>                 Key: ARROW-7224
>                 URL: https://issues.apache.org/jira/browse/ARROW-7224
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Micah Kornfield
>            Priority: Major
>              Labels: dataset
>
> When providing a filter for partitions, it should be possible in some cases 
> to use it to optimize file system list calls.  This can greatly improve the 
> speed of reading data from partitions because fewer directories/files need 
> to be explored/expanded.  I've fallen behind on the dataset code, but I want 
> to make sure this issue is tracked someplace.  This came up in the SO 
> question linked below (feel free to correct my analysis if I missed the 
> functionality someplace).
> Reference: 
> [https://stackoverflow.com/questions/58868584/pyarrow-parquetdataset-read-is-slow-on-a-hive-partitioned-s3-dataset-despite-u/58951477#58951477]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
