[ 
https://issues.apache.org/jira/browse/DRILL-3735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14742112#comment-14742112
 ] 

Aman Sinha commented on DRILL-3735:
-----------------------------------

Let's treat the 64K limit issue separately from the multi-phase directory 
pruning.  For the multi-phase pruning I have opened a separate enhancement: 
DRILL-3759.  I believe the 64K limit is more urgent - especially since 
CTAS auto-partitioning creates files in a flat structure, not a hierarchical 
one - so we'll try to address it first.  

My proposal is to have the PartitionDescriptor implement the Iterable interface 
and use the {code}com.google.common.collect.Lists.partition(List<T> list, int 
size){code} method to subdivide the file list into chunks of at most 64K 
entries each (the last sublist may be smaller).  The iterator will return 
sublists to the PruneScanRule, which will then iterate within each sublist and 
perform the same logic as before, except that the VectorContainer used by the 
rule will be cleared after each sublist is processed.  The final output list 
of files is already stored in a List on the heap, and I am not proposing to 
change that. 
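A minimal sketch of the chunked iteration described above (class and field names here are hypothetical, not the actual Drill implementation). The {code}partition(){code} helper below is a stand-in with the same semantics as Guava's {code}Lists.partition(list, size){code}: sublist views of at most {code}size{code} elements, with the last sublist possibly smaller.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch: a partition descriptor that hands the file list
// to the pruning rule in batches of at most CHUNK_SIZE paths, so each
// batch fits within the 64K value-vector limit.
public class ChunkedPartitionDescriptor implements Iterable<List<String>> {
  static final int CHUNK_SIZE = 64000;  // one batch per value-vector fill

  private final List<String> filePaths;
  private final int chunkSize;

  ChunkedPartitionDescriptor(List<String> filePaths, int chunkSize) {
    this.filePaths = filePaths;
    this.chunkSize = chunkSize;
  }

  // Stand-in for com.google.common.collect.Lists.partition(): splits the
  // list into consecutive sublists of `size` elements; the final sublist
  // may be smaller.
  static <T> List<List<T>> partition(List<T> list, int size) {
    List<List<T>> chunks = new ArrayList<>();
    for (int i = 0; i < list.size(); i += size) {
      chunks.add(list.subList(i, Math.min(i + size, list.size())));
    }
    return chunks;
  }

  @Override
  public Iterator<List<String>> iterator() {
    return partition(filePaths, chunkSize).iterator();
  }
}
```

The rule would then consume the descriptor roughly as: for each sublist, populate the VectorContainer, evaluate the pruning condition, append surviving paths to the output List, and clear the container before the next sublist.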


> Directory pruning is not happening when number of files is larger than 64k
> --------------------------------------------------------------------------
>
>                 Key: DRILL-3735
>                 URL: https://issues.apache.org/jira/browse/DRILL-3735
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization
>    Affects Versions: 1.1.0
>            Reporter: Hao Zhu
>            Assignee: Aman Sinha
>
> When the number of files is larger than 64k limit, directory pruning is not 
> happening. 
> We need to increase this limit further to handle most use cases.
> My proposal is to separate the code for directory pruning and partition 
> pruning. 
> Say in a parent directory there are 100 directories and 1 million files.
> If we only query files from one directory, we should first read the 100 
> directories and narrow down to the matching directory, and only then read the 
> file paths in that directory into memory and do the rest of the work.
> The current behavior is that Drill reads all 1 million file paths into 
> memory first, and then does directory pruning or partition pruning. 
> This is neither performance-efficient nor memory-efficient, and it does not 
> scale.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)