[
https://issues.apache.org/jira/browse/DRILL-3735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14745636#comment-14745636
]
ASF GitHub Bot commented on DRILL-3735:
---------------------------------------
Github user amansinha100 commented on a diff in the pull request:
https://github.com/apache/drill/pull/156#discussion_r39528224
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/ParquetPartitionDescriptor.java ---
@@ -125,4 +117,16 @@ private String getBaseTableLocation() {
     final FormatSelection origSelection = (FormatSelection) scanRel.getDrillTable().getSelection();
     return origSelection.getSelection().selectionRoot;
   }
+
+  @Override
+  protected void createPartitionSublists() {
+    Set<String> fileLocations = ((ParquetGroupScan) scanRel.getGroupScan()).getFileSet();
+    List<PartitionLocation> locations = new LinkedList<>();
+    for (String file : fileLocations) {
+      locations.add(new DFSPartitionLocation(MAX_NESTED_SUBDIRS, getBaseTableLocation(), file));
--- End diff ---
Long file names are an issue not just for partition pruning but for metadata in
general; that's what I was saying previously about FormatSelection.getAsFiles(),
ParquetGroupScan.getFileSet(), etc. If we want to put the names into direct
memory rather than the heap, a broader change is needed. I think we should file a
separate JIRA for that.
> Directory pruning is not happening when number of files is larger than 64k
> --------------------------------------------------------------------------
>
> Key: DRILL-3735
> URL: https://issues.apache.org/jira/browse/DRILL-3735
> Project: Apache Drill
> Issue Type: Bug
> Components: Query Planning & Optimization
> Affects Versions: 1.1.0
> Reporter: Hao Zhu
> Assignee: Mehant Baid
> Fix For: 1.2.0
>
>
> When the number of files exceeds the 64k limit, directory pruning does not
> happen. We need to raise this limit to handle most use cases.
> My proposal is to separate the code for directory pruning from the code for
> partition pruning.
> Say a parent directory contains 100 subdirectories and 1 million files.
> If a query touches files from only one subdirectory, Drill should first read
> the 100 directory paths and narrow down to the matching directory, and only
> then read the file paths in that directory into memory and do the rest of the
> work.
> The current behavior is that Drill first reads all 1 million file paths into
> memory, and only then performs directory pruning or partition pruning.
> This is neither performance efficient nor memory efficient, and it does not
> scale.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)