[ 
https://issues.apache.org/jira/browse/DRILL-3918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Sinha updated DRILL-3918:
------------------------------
    Description: 
The metadata cache file is currently being deserialized and read twice: once 
during {{ParquetFormatPlugin.expandSelection()}} that happens as part of the 
creation of DynamicDrillTable and once during ParquetGroupScan.  This was also 
pointed out by [~sphillips] in DRILL-3901.   We should avoid doing the read 
twice.  
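
One possible way to avoid the second read, sketched here for illustration only: parse the cache file once and let both consumers share the parsed result. The class {{MetadataCacheMemo}}, its methods, and the loader function are hypothetical names, not Drill APIs, and the sketch assumes the cache file does not change between the two reads.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/**
 * Hypothetical helper (not part of Drill): memoizes the parsed metadata
 * per cache-file path so that the file is deserialized at most once.
 */
final class MetadataCacheMemo<T> {
    // Parsed metadata keyed by the path of the cache file.
    private final Map<String, T> parsed = new ConcurrentHashMap<>();
    // Stand-in for whatever routine actually deserializes the cache file.
    private final Function<String, T> loader;
    // Counts real deserializations, only to make the saving observable.
    private final AtomicInteger loads = new AtomicInteger();

    MetadataCacheMemo(Function<String, T> loader) {
        this.loader = loader;
    }

    /** Returns the parsed metadata, invoking the loader only on the first request for a path. */
    T get(String cacheFilePath) {
        return parsed.computeIfAbsent(cacheFilePath, path -> {
            loads.incrementAndGet();
            return loader.apply(path);
        });
    }

    /** Number of times the loader actually ran. */
    int loadCount() {
        return loads.get();
    }
}
```

Under this assumption, the selection-expansion step and the group scan would both call {{get()}} on the same instance, and only the first call would pay the deserialization cost.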

The performance issue is getting exposed more now because of the fix for 
DRILL-3917 which fixed the behavior of expandSelection() by reading the 
metadata cache file through the correct interface (it was previously erroring 
out and not spending any time in the expansion). This fix is needed for correct 
functionality.   However, performance numbers show a slowdown of about 2.7x for 
the 400K files test using caching.  In my view, this performance comparison is 
not very meaningful because of the prior bug.  

This JIRA specifically targets the extra load of the metadata cache file.  
There are other opportunities for improvement; for instance, reading from the 
metadata cache is single-threaded whereas reading from parquet files is 
parallelized.  That should be a separate JIRA.  


  was:
The metadata cache file is currently being deserialized and read twice: once 
during {{ParquetFormatPlugin.expandSelection()}} that happens as part of the 
creation of DynamicDrillTable and once during ParquetGroupScan.  This was also 
pointed out by [~sphillips] in DRILL-3901.   We should avoid doing the read 
twice.  

The performance issue is getting exposed more now because of the fix for 
DRILL-3917 which fixed the behavior of expandSelection() by reading the 
metadata cache file through the correct interface (it was previously erroring 
out and not spending any time in the expansion). This fix is needed for correct 
functionality.   However, performance numbers show a slowdown of about 2.7x for 
the 400K files test using caching.  In my view, this performance comparison is 
not very meaningful because of the prior bug.  

This JIRA is to specifically target the extra load of the metadata cache file.  
There are other opportunities for improvement (for instance reading from the 
metadata cache is single treaded whereas reading from parquet files gets 
parallelized.  That should be a separate JIRA).  



> Avoid extra loading of the metadata cache file
> ----------------------------------------------
>
>                 Key: DRILL-3918
>                 URL: https://issues.apache.org/jira/browse/DRILL-3918
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Metadata
>            Reporter: Aman Sinha
>
> The metadata cache file is currently being deserialized and read twice: once 
> during {{ParquetFormatPlugin.expandSelection()}} that happens as part of the 
> creation of DynamicDrillTable and once during ParquetGroupScan.  This was 
> also pointed out by [~sphillips] in DRILL-3901.   We should avoid doing the 
> read twice.  
> The performance issue is getting exposed more now because of the fix for 
> DRILL-3917 which fixed the behavior of expandSelection() by reading the 
> metadata cache file through the correct interface (it was previously erroring 
> out and not spending any time in the expansion). This fix is needed for 
> correct functionality.   However, performance numbers show a slowdown of 
> about 2.7x for the 400K files test using caching.  In my view, this 
> performance comparison is not very meaningful because of the prior bug.  
> This JIRA specifically targets the extra load of the metadata cache file.  
> There are other opportunities for improvement; for instance, reading from 
> the metadata cache is single-threaded whereas reading from parquet files is 
> parallelized.  That should be a separate JIRA.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
