Aman Sinha created DRILL-4861:
---------------------------------
Summary: Revisit the 'entries' stored as part of ParquetGroupScan
Key: DRILL-4861
URL: https://issues.apache.org/jira/browse/DRILL-4861
Project: Apache Drill
Issue Type: Bug
Components: Storage - Parquet
Affects Versions: 1.7.0
Reporter: Aman Sinha
The ParquetGroupScan stores a list of ReadEntryWithPath objects in the
'entries' field
(https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java#L104)
as well as a hash set of file names in the 'fileSet' field
(https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java#L263).
Both fields hold essentially the same set of file names.
We should try to consolidate them into a single entity. This is not just a
code simplification: the duplication has a real performance cost. When a
ParquetGroupScan is serialized and sent as part of a JSON plan fragment, the
overhead is quite high if the number of files is large (tens of thousands or
more).
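
A rough sketch of the consolidation idea, not the actual Drill code: the
class below keeps only the list of entries and derives the file set on demand,
then compares the size of a hand-rolled JSON form with and without the
duplicated set. ScanEntriesSketch, ReadEntry, toJsonDuplicated and
toJsonConsolidated are hypothetical names made up for illustration; the real
ParquetGroupScan is serialized through its own annotations and may need the
set for other reasons.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified stand-in for ParquetGroupScan (not Drill's class).
final class ScanEntriesSketch {
    // Simplified stand-in for ReadEntryWithPath: it only wraps a file path.
    static final class ReadEntry {
        final String path;
        ReadEntry(String path) { this.path = path; }
    }

    // Single source of truth for the file names.
    private final List<ReadEntry> entries = new ArrayList<>();

    ScanEntriesSketch(List<String> paths) {
        for (String p : paths) {
            entries.add(new ReadEntry(p));
        }
    }

    // Derived view: rebuilt from 'entries' instead of being stored and serialized.
    Set<String> fileSet() {
        Set<String> set = new LinkedHashSet<>();
        for (ReadEntry e : entries) {
            set.add(e.path);
        }
        return set;
    }

    // Naive JSON for the entries list (assumes paths need no escaping).
    private String entriesJson() {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < entries.size(); i++) {
            if (i > 0) sb.append(',');
            sb.append("{\"path\":\"").append(entries.get(i).path).append("\"}");
        }
        return sb.append(']').toString();
    }

    // Mimics the situation in this issue: every path is written twice.
    String toJsonDuplicated() {
        StringBuilder sb = new StringBuilder("[");
        boolean first = true;
        for (String f : fileSet()) {
            if (!first) sb.append(',');
            sb.append('"').append(f).append('"');
            first = false;
        }
        sb.append(']');
        return "{\"entries\":" + entriesJson() + ",\"fileSet\":" + sb + "}";
    }

    // Consolidated form: every path is written once.
    String toJsonConsolidated() {
        return "{\"entries\":" + entriesJson() + "}";
    }

    public static void main(String[] args) {
        List<String> paths = new ArrayList<>();
        for (int i = 0; i < 20000; i++) {
            paths.add("/data/table/part-" + i + ".parquet");
        }
        ScanEntriesSketch scan = new ScanEntriesSketch(paths);
        System.out.println("duplicated:   " + scan.toJsonDuplicated().length() + " chars");
        System.out.println("consolidated: " + scan.toJsonConsolidated().length() + " chars");
    }
}
```

In this sketch the consolidated form drops one full copy of every path from
the serialized fragment; the receiving side would rebuild the set from the
entries if it needs fast lookups.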
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)