[
https://issues.apache.org/jira/browse/DRILL-3735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14745636#comment-14745636
]
ASF GitHub Bot commented on DRILL-3735:
---------------------------------------
Github user amansinha100 commented on a diff in the pull request:
https://github.com/apache/drill/pull/156#discussion_r39528224
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/ParquetPartitionDescriptor.java ---
@@ -125,4 +117,16 @@ private String getBaseTableLocation() {
     final FormatSelection origSelection = (FormatSelection) scanRel.getDrillTable().getSelection();
     return origSelection.getSelection().selectionRoot;
   }
+
+  @Override
+  protected void createPartitionSublists() {
+    Set<String> fileLocations = ((ParquetGroupScan) scanRel.getGroupScan()).getFileSet();
+    List<PartitionLocation> locations = new LinkedList<>();
+    for (String file : fileLocations) {
+      locations.add(new DFSPartitionLocation(MAX_NESTED_SUBDIRS, getBaseTableLocation(), file));
--- End diff ---
Long file names are an issue not just for partition pruning but for metadata in
general; that's what I was saying previously about FormatSelection.getAsFiles(),
ParquetGroupScan.getFileSet(), etc. If we want to put the names into direct
memory rather than the heap, a broader change is needed. I think we should file a
separate JIRA for that.
> Directory pruning is not happening when number of files is larger than 64k
> --------------------------------------------------------------------------
>
> Key: DRILL-3735
> URL: https://issues.apache.org/jira/browse/DRILL-3735
> Project: Apache Drill
> Issue Type: Bug
> Components: Query Planning & Optimization
> Affects Versions: 1.1.0
> Reporter: Hao Zhu
> Assignee: Mehant Baid
> Fix For: 1.2.0
>
>
> When the number of files exceeds the 64k limit, directory pruning does not
> happen. We need to raise this limit to handle most use cases.
> My proposal is to separate the code for directory pruning from the code for
> partition pruning.
> Say a parent directory contains 100 subdirectories and 1 million files.
> If a query touches files from only one subdirectory, Drill should first read
> the 100 directory paths and narrow down to the matching directory, and only
> then read the file paths in that directory into memory and do the rest of the
> work.
> The current behavior is that Drill first reads all 1 million file paths into
> memory, and only then performs directory pruning or partition pruning.
> This is neither performance efficient nor memory efficient, and it does not
> scale.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)