Aman Sinha created DRILL-3918:
---------------------------------
Summary: Avoid extra loading of the metadata cache file
Key: DRILL-3918
URL: https://issues.apache.org/jira/browse/DRILL-3918
Project: Apache Drill
Issue Type: Bug
Components: Metadata
Reporter: Aman Sinha
The metadata cache file is currently deserialized and read twice: once in
{{ParquetFormatPlugin.expandSelection()}}, which runs as part of creating the
DynamicDrillTable, and once in ParquetGroupScan. [~sphillips] also pointed
this out in DRILL-3901. We should avoid doing the read twice.
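One way to picture the fix is a holder that loads the file on first request and hands the same object to every later caller. The sketch below is only an illustration under stated assumptions, not Drill's code: {{MetadataHolder}}, {{TableMetadata}}, {{readFromDisk()}} and the {{loads}} counter are hypothetical names, and the real classes and call sites may look quite different.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: none of these names exist in Drill.
class MetadataCacheSketch {

    /** Stand-in for the deserialized contents of the metadata cache file. */
    static final class TableMetadata {
        final String path;

        TableMetadata(String path) {
            this.path = path;
        }
    }

    /** Counts how many times the expensive read actually ran. */
    static final AtomicInteger loads = new AtomicInteger();

    /** Stand-in for the costly read + deserialization of the cache file. */
    static TableMetadata readFromDisk(String path) {
        loads.incrementAndGet();
        return new TableMetadata(path);
    }

    /**
     * Holder shared by both consumers (table creation and group scan in the
     * description above). Not thread-safe; assumes single-threaded planning.
     */
    static final class MetadataHolder {
        private final Map<String, TableMetadata> loaded = new HashMap<>();

        /** Reads the file only on the first request for a given path. */
        TableMetadata get(String path) {
            return loaded.computeIfAbsent(path, MetadataCacheSketch::readFromDisk);
        }
    }
}
```

If both the selection expansion and the group scan were given the same holder, the second {{get()}} for a path would return the already-deserialized object instead of reading the file again.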
The performance issue is more exposed now because of the fix for DRILL-3917,
which corrected the behavior of expandSelection() so that it reads the
metadata cache file through the correct interface (it was previously erroring
out and not spending any time in the expansion). That fix is needed for correct
functionality. However, performance numbers show a slowdown of about 2.7x for
the 400K files test using caching. In my view, this performance comparison is
not very meaningful because of the prior bug.
This JIRA specifically targets the extra load of the metadata cache file.
There are other opportunities for improvement: for instance, reading from the
metadata cache is single-threaded, whereas reading from parquet files is
parallelized. That should be a separate JIRA.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)