[
https://issues.apache.org/jira/browse/DRILL-3735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14744839#comment-14744839
]
ASF GitHub Bot commented on DRILL-3735:
---------------------------------------
Github user amansinha100 commented on a diff in the pull request:
https://github.com/apache/drill/pull/156#discussion_r39474948
--- Diff:
exec/java-exec/src/main/java/org/apache/drill/exec/planner/ParquetPartitionDescriptor.java
---
@@ -125,4 +117,16 @@ private String getBaseTableLocation() {
final FormatSelection origSelection = (FormatSelection) scanRel.getDrillTable().getSelection();
return origSelection.getSelection().selectionRoot;
}
+
+ @Override
+ protected void createPartitionSublists() {
+ Set<String> fileLocations = ((ParquetGroupScan) scanRel.getGroupScan()).getFileSet();
+ List<PartitionLocation> locations = new LinkedList<>();
+ for (String file: fileLocations) {
+ locations.add(new DFSPartitionLocation(MAX_NESTED_SUBDIRS, getBaseTableLocation(), file));
--- End diff ---
Actually, this patch was not about reducing memory footprint per se. It
was to eliminate the 64K files limit for partition pruning. The above function
logic is the same as we had before for getPartitions() plus the new splitting
of the list into sublists. The long filenames seem less of an issue for the
JVM heap usage: suppose we have 100K files, each with a name 200 bytes long.
That is only 20 MB, which is small relative to the heap size. However, we
should try to build a better framework for propagating the filenames throughout
the planning process. Right now, methods such as FormatSelection.getAsFiles()
populate all the filenames at once. Ideally, these could also expose an
iterator model.
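
A minimal sketch of the sublist-splitting idea mentioned above: instead of handing the planner a single list capped at the former 64K limit, the full file list is split into bounded chunks. The class and constant names here are illustrative and do not correspond to Drill's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionSublists {

  // Stand-in for the former hard limit on files per list (64K).
  static final int MAX_SUBLIST_SIZE = 65536;

  // Split a list into consecutive sublists of at most `chunk` elements.
  static <T> List<List<T>> split(List<T> all, int chunk) {
    List<List<T>> out = new ArrayList<>();
    for (int i = 0; i < all.size(); i += chunk) {
      out.add(all.subList(i, Math.min(i + chunk, all.size())));
    }
    return out;
  }

  public static void main(String[] args) {
    // 150K synthetic "files" end up in ceil(150000 / 65536) = 3 sublists,
    // so no single list ever exceeds the old 64K bound.
    List<Integer> files = new ArrayList<>();
    for (int i = 0; i < 150000; i++) {
      files.add(i);
    }
    List<List<Integer>> sublists = split(files, MAX_SUBLIST_SIZE);
    System.out.println(sublists.size()); // prints 3
  }
}
```

With this shape, pruning code can iterate sublist by sublist rather than requiring the whole file set to fit in one bounded collection.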
> Directory pruning is not happening when number of files is larger than 64k
> --------------------------------------------------------------------------
>
> Key: DRILL-3735
> URL: https://issues.apache.org/jira/browse/DRILL-3735
> Project: Apache Drill
> Issue Type: Bug
> Components: Query Planning & Optimization
> Affects Versions: 1.1.0
> Reporter: Hao Zhu
> Assignee: Mehant Baid
> Fix For: 1.2.0
>
>
> When the number of files is larger than 64k limit, directory pruning is not
> happening.
> We need to increase this limit further to handle most use cases.
> My proposal is to separate the code for directory pruning and partition
> pruning.
> Say in a parent directory there are 100 directories and 1 million files.
> If we only query files from one directory, we should first read the 100
> directory names and narrow down to the matching directory, then read only
> that directory's file paths into memory and do the rest of the work.
> The current behavior is that Drill first reads all the file paths of those 1
> million files into memory, and then does directory pruning or partition
> pruning. This is neither performance efficient nor memory efficient, and it
> does not scale.
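
The two-phase approach proposed in the report can be sketched as follows. This is a hedged illustration, not Drill's code: the directory-to-files map stands in for a real filesystem listing, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class TwoPhasePruning {

  // Phase 1: prune on directory names (cheap: one entry per directory).
  // Phase 2: enumerate file paths only under the surviving directory.
  static List<String> prune(Map<String, List<String>> dirToFiles, String wantedDir) {
    List<String> result = new ArrayList<>();
    for (Map.Entry<String, List<String>> e : dirToFiles.entrySet()) {
      if (!e.getKey().equals(wantedDir)) {
        continue; // pruned before its file list is ever read
      }
      result.addAll(e.getValue());
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, List<String>> fs = new java.util.HashMap<>();
    fs.put("dir0", java.util.Arrays.asList("dir0/a.parquet", "dir0/b.parquet"));
    fs.put("dir1", java.util.Arrays.asList("dir1/c.parquet"));

    // Only dir0's two files are materialized; dir1's are never touched.
    System.out.println(prune(fs, "dir0").size()); // prints 2
  }
}
```

In the report's scenario (100 directories, 1 million files), phase 1 touches only 100 directory entries, and phase 2 reads just the one directory's file paths instead of all 1 million.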
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)