[jira] [Commented] (ARROW-8658) [C++][Dataset] Implement subtree pruning for FileSystemDataset::GetFragments

David Li (Jira) Mon, 08 Mar 2021 12:28:12 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-8658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17297685#comment-17297685
 ]


David Li commented on ARROW-8658:
---------------------------------

After looking through things here, I have a few thoughts:
 * Doing this with arbitrary expressions is expensive; we need a way to 
determine whether one is an 'ancestor' of another. Presumably, A is an ancestor 
of B if !A => !B, but evaluating that is the expensive operation we're hoping 
to avoid in the first place! If recovering a tree from expressions involves at 
least O(|fragments|) expression evaluations anyways, any gains you might get 
would be erased, especially in the case of ARROW-11781, where we'd do all that 
work, query it once, and then throw it all away. 
 * With arbitrary user-provided expressions, there may not be any tree 
structure worth recovering anyways.

 Given that, it might be worth only implementing for the case of a 
KeyValuePartitioning, where we can define a sorting and ancestry test on 
vector<Key> and cheaply construct the tree. Datasets built by other means would 
continue to use a simple vector of fragments as they have been. Or we could try 
converting each expression back into a vector<Key>, and give up on any 
expression that doesn't fit - but there are potentially lots of cases to 
consider and it would be easy to accidentally make the conversion from Key to 
Expression and back lossy.

Also, fs::PathForest is now completely unused, so we can just refactor it into 
a "KeyVector"Forest and save some work. And since the internal representation 
of a *Forest is just a vector, we can easily interpret it either way as 
appropriate.

> [C++][Dataset] Implement subtree pruning for FileSystemDataset::GetFragments
> ----------------------------------------------------------------------------
>
>                 Key: ARROW-8658
>                 URL: https://issues.apache.org/jira/browse/ARROW-8658
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>    Affects Versions: 0.17.0
>            Reporter: Ben Kietzman
>            Assignee: Ben Kietzman
>            Priority: Major
>              Labels: dataset
>
> This is a very handy optimization for large datasets with multiple partition 
> fields. For example, given a hive-style directory {{$base_dir/a=3/}} and a 
> filter {{"a"_ == 2}} none of its files or subdirectories need be examined.
> After ARROW-8318 FileSystemDataset stores only files so subtree pruning 
> (whose implementation depended on the presence of directories to represent 
> subtrees) was disabled. It should be possible to reintroduce this without 
> reference to directories by examining partition expressions directly and 
> extracting a tree structure from their subexpressions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (ARROW-8658) [C++][Dataset] Implement subtree pruning for FileSystemDataset::GetFragments

Reply via email to