[
https://issues.apache.org/jira/browse/DRILL-3918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14952372#comment-14952372
]
ASF GitHub Bot commented on DRILL-3918:
---------------------------------------
Github user jacques-n commented on a diff in the pull request:
https://github.com/apache/drill/pull/196#discussion_r41713670
--- Diff:
exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/FileSelection.java
---
@@ -44,6 +45,10 @@
public List<String> files;
public String selectionRoot;
+ // this is a temporary location for the reference to Parquet metadata
+ // TODO: ideally this should be in a Parquet specific derived class.
+ public ParquetTableMetadata_v1 parquetMeta = null;
--- End diff --
Can we please make this a private field and add the appropriate getter
before merging?
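
For illustration, the requested change might look roughly like the sketch below. This is not the actual patch: `FileSelectionSketch` and `ParquetTableMetadataStub` are hypothetical stand-ins for `FileSelection` and `ParquetTableMetadata_v1`, and the accessor names are assumed, not taken from the pull request.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for ParquetTableMetadata_v1; the real class is not reproduced here.
class ParquetTableMetadataStub {
}

// Simplified, hypothetical version of FileSelection showing only the fields from the diff.
class FileSelectionSketch {
  public List<String> files;
  public String selectionRoot;

  // Temporary location for the reference to Parquet metadata.
  // Private, so callers must go through the accessors below.
  private ParquetTableMetadataStub parquetMeta = null;

  FileSelectionSketch(List<String> files, String selectionRoot) {
    this.files = new ArrayList<>(files);
    this.selectionRoot = selectionRoot;
  }

  // Accessor name is an assumption; returns null if no metadata has been attached.
  public ParquetTableMetadataStub getParquetMetadata() {
    return parquetMeta;
  }

  // Mutator name is an assumption; attaches metadata that was already deserialized.
  public void setParquetMetadata(ParquetTableMetadataStub parquetMeta) {
    this.parquetMeta = parquetMeta;
  }
}
```

Keeping the field private leaves room to move it into a Parquet-specific derived class later, as the TODO suggests, without touching callers that only use the getter.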
> Avoid extra loading of the metadata cache file
> ----------------------------------------------
>
> Key: DRILL-3918
> URL: https://issues.apache.org/jira/browse/DRILL-3918
> Project: Apache Drill
> Issue Type: Bug
> Components: Metadata
> Reporter: Aman Sinha
> Assignee: Aman Sinha
>
> The metadata cache file is currently being deserialized and read twice: once
> during {{ParquetFormatPlugin.expandSelection()}}, which happens as part of the
> creation of DynamicDrillTable, and once during ParquetGroupScan. This was
> also pointed out by [~sphillips] in DRILL-3901. We should avoid doing the
> read twice.
> The performance issue is more exposed now because of the fix for
> DRILL-3917, which corrected the behavior of expandSelection() by reading the
> metadata cache file through the correct interface (it was previously erroring
> out and not spending any time in the expansion). This fix is needed for
> correct functionality. However, performance numbers show a slowdown of
> about 2.7x for the 400K files test using caching. In my view, this
> performance comparison is not very meaningful because of the prior bug.
> This JIRA specifically targets the extra load of the metadata cache file.
> There are other opportunities for improvement (for instance, reading from
> the metadata cache is single-threaded whereas reading from Parquet files is
> parallelized), but those should be handled in a separate JIRA.
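
A minimal sketch of the idea described above, assuming the metadata read during selection expansion is carried on the selection object and reused when the group scan is created. All class and method names here (`CachedMetadata`, `MetadataLoader`, `SelectionWithMeta`, `PlannerSketch`) are hypothetical and do not correspond to Drill's actual API; the load counter exists only to make the single read observable.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the deserialized contents of the metadata cache file.
class CachedMetadata {
  final String path;

  CachedMetadata(String path) {
    this.path = path;
  }
}

// Hypothetical loader that counts how many times the cache file is "deserialized".
class MetadataLoader {
  static final AtomicInteger loads = new AtomicInteger();

  static CachedMetadata load(String path) {
    loads.incrementAndGet();
    return new CachedMetadata(path);
  }
}

// Hypothetical selection object that remembers the metadata read during expansion.
class SelectionWithMeta {
  private CachedMetadata meta;

  CachedMetadata getMeta() {
    return meta;
  }

  void setMeta(CachedMetadata meta) {
    this.meta = meta;
  }
}

class PlannerSketch {
  // Step 1: expansion reads the cache file once and keeps the result on the selection.
  static SelectionWithMeta expandSelection(String cachePath) {
    SelectionWithMeta selection = new SelectionWithMeta();
    selection.setMeta(MetadataLoader.load(cachePath));
    return selection;
  }

  // Step 2: scan creation reuses the attached metadata and only falls back
  // to reading the cache file when nothing was attached.
  static CachedMetadata metadataForScan(SelectionWithMeta selection, String cachePath) {
    CachedMetadata meta = selection.getMeta();
    return meta != null ? meta : MetadataLoader.load(cachePath);
  }
}
```

With this arrangement, calling `expandSelection` followed by `metadataForScan` on the same selection triggers one load rather than two; a selection built without metadata still works, at the cost of a read at scan time.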
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)