[ https://issues.apache.org/jira/browse/DRILL-3918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14952380#comment-14952380 ]

ASF GitHub Bot commented on DRILL-3918:
---------------------------------------

Github user amansinha100 commented on a diff in the pull request:

    https://github.com/apache/drill/pull/196#discussion_r41713896
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/FileSelection.java ---
    @@ -44,6 +45,10 @@
       public List<String> files;
       public String selectionRoot;
     
    +  // this is a temporary location for the reference to Parquet metadata
    +  // TODO: ideally this should be in a Parquet specific derived class.
    +  public ParquetTableMetadata_v1 parquetMeta = null;
    --- End diff --
    
    Yes, will do. 


> Avoid extra loading of the metadata cache file
> ----------------------------------------------
>
>                 Key: DRILL-3918
>                 URL: https://issues.apache.org/jira/browse/DRILL-3918
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Metadata
>            Reporter: Aman Sinha
>            Assignee: Aman Sinha
>
> The metadata cache file is currently deserialized and read twice: once 
> in {{ParquetFormatPlugin.expandSelection()}}, which runs as part of the 
> creation of DynamicDrillTable, and once in ParquetGroupScan.  
> [~sphillips] also pointed this out in DRILL-3901.  We should avoid doing 
> the read twice.  
> The performance issue is more exposed now because of the fix for 
> DRILL-3917, which corrected expandSelection() so that it reads the 
> metadata cache file through the correct interface (previously it errored 
> out and spent no time in the expansion).  That fix is needed for 
> correct functionality.  However, performance numbers show a slowdown of 
> about 2.7x for the 400K files test using caching.  In my view, this 
> performance comparison is not very meaningful because of the prior bug.  
> This JIRA specifically targets the extra load of the metadata cache 
> file.  There are other opportunities for improvement (for instance, reading 
> from the metadata cache is single-threaded, whereas reading from Parquet 
> files is parallelized), but those should be a separate JIRA.  
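
The idea in the description (read the cache file once and let the second consumer reuse the parsed result) can be sketched roughly as below. This is only an illustration, not the actual patch: CachedMetadata, Selection and MetadataReuseSketch are hypothetical stand-ins rather than Drill classes, and the reader function is a placeholder for the expensive deserialization. Only the parquetMeta and selectionRoot field names are borrowed from the diff above.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for the parsed cache contents (not Drill's ParquetTableMetadata_v1).
final class CachedMetadata {
    final List<String> files;
    CachedMetadata(List<String> files) { this.files = files; }
}

// Hypothetical stand-in for a file selection that carries the parsed metadata along.
final class Selection {
    final String selectionRoot;
    CachedMetadata parquetMeta; // null until the cache file has been read
    Selection(String selectionRoot) { this.selectionRoot = selectionRoot; }
}

final class MetadataReuseSketch {
    private final Function<String, CachedMetadata> reader; // performs the expensive read
    int reads = 0; // counts how often the cache file was actually read

    MetadataReuseSketch(Function<String, CachedMetadata> reader) { this.reader = reader; }

    // First consumer: reads the cache file and remembers the result on the selection.
    Selection expandSelection(String root) {
        Selection s = new Selection(root);
        s.parquetMeta = read(root);
        return s;
    }

    // Second consumer: reuses the attached metadata and reads only if none is attached.
    CachedMetadata metadataForScan(Selection s) {
        if (s.parquetMeta == null) {
            s.parquetMeta = read(s.selectionRoot);
        }
        return s.parquetMeta;
    }

    private CachedMetadata read(String root) {
        reads++;
        return reader.apply(root);
    }
}
```

With this shape, a selection produced by expandSelection() already carries the metadata, so the scan side triggers no second read; a selection built without metadata still works because the scan side falls back to reading it once.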



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
