Aman Sinha created DRILL-4861:
---------------------------------

             Summary: Revisit the 'entries' stored as part of ParquetGroupScan
                 Key: DRILL-4861
                 URL: https://issues.apache.org/jira/browse/DRILL-4861
             Project: Apache Drill
          Issue Type: Bug
          Components: Storage - Parquet
    Affects Versions: 1.7.0
            Reporter: Aman Sinha


The ParquetGroupScan stores a list of ReadEntryWithPath in the 'entries' field
(https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java#L104)
as well as a hash set of file names in the 'fileSet' field
(https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java#L263).
   

The underlying data stored by both is essentially the same set of filenames.
We should try to consolidate these into a single entity. Besides simplifying
the code, this would remove a real performance cost: when a ParquetGroupScan
is serialized and sent as part of a JSON plan fragment, the duplicated file
list adds significant overhead if the number of files is large (tens of
thousands or more).
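
One possible direction, shown only as a rough sketch and not as Drill's
actual code: keep 'entries' as the single stored list and derive the file-name
set from it on demand, so that only one copy of the paths would need to be
serialized. The classes below (ConsolidatedEntriesSketch, ScanSketch, and the
simplified ReadEntryWithPath) are hypothetical stand-ins, and the sketch
assumes each entry maps to exactly one file path.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class ConsolidatedEntriesSketch {

    // Hypothetical, simplified stand-in for Drill's ReadEntryWithPath.
    static final class ReadEntryWithPath {
        private final String path;

        ReadEntryWithPath(String path) {
            this.path = path;
        }

        String getPath() {
            return path;
        }
    }

    // Hypothetical scan object: 'entries' is the only stored state.
    static final class ScanSketch {
        private final List<ReadEntryWithPath> entries;

        // Derived on first use; intended to be left out of the serialized form.
        private transient Set<String> fileSet;

        ScanSketch(List<ReadEntryWithPath> entries) {
            this.entries = new ArrayList<>(entries);
        }

        List<ReadEntryWithPath> getEntries() {
            return Collections.unmodifiableList(entries);
        }

        // Builds the set of file names from 'entries' instead of storing it.
        Set<String> getFileSet() {
            if (fileSet == null) {
                Set<String> names = new LinkedHashSet<>();
                for (ReadEntryWithPath entry : entries) {
                    names.add(entry.getPath());
                }
                fileSet = Collections.unmodifiableSet(names);
            }
            return fileSet;
        }
    }

    public static void main(String[] args) {
        List<ReadEntryWithPath> entries = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            entries.add(new ReadEntryWithPath("/data/part-" + i + ".parquet"));
        }
        ScanSketch scan = new ScanSketch(entries);
        System.out.println(scan.getFileSet().contains("/data/part-1.parquet"));
    }
}
```

The trade-off in this sketch is that the set is rebuilt once per deserialized
scan, in exchange for sending each path only once in the plan fragment.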



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
