[ 
https://issues.apache.org/jira/browse/DRILL-5292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15880951#comment-15880951
 ] 

Nate Putnam commented on DRILL-5292:
------------------------------------

Digging into this further, one approach would be to modify the 
ParquetScanBatchCreator and ParquetRecordReader classes with the following 
changes:

* ParquetScanBatchCreator - Read the full list of footers for the 
SelectionRoot so that a Map of Parquet files to footers can be passed to the 
record reader. Looking at that class, there is a TODO noting this would be a 
desired change for performance reasons anyway. 

* ParquetRecordReader - Refactor nullFilledVectors to hold the more general 
NullableVector instead of the specific NullableIntVector. 

* ParquetRecordReader - Use the Map of footers passed in from the 
ParquetScanBatchCreator to reconcile the schema. 
** If the requested vector is not present in the file being read but is 
present in a different file and is optional, then add it as a NullableVector 
to the current file. 
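
The reconciliation step above could look roughly like the sketch below. This 
is only an illustration of the idea, not Drill code: SchemaReconciler, the 
String-based type maps, and the "NULLABLE_" marker are all hypothetical 
stand-ins for the actual footer metadata and NullableVector types.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: merge the current file's columns with columns seen in
// other files' footers. A column absent from the current file but present
// elsewhere is filled in as a nullable placeholder (instead of always falling
// back to a nullable-int column, as the reader does today).
public class SchemaReconciler {

  static Map<String, String> reconcile(Map<String, String> currentFile,
                                       Map<String, Map<String, String>> allFooters) {
    // Start from the columns the current file actually contains.
    Map<String, String> merged = new LinkedHashMap<>(currentFile);
    for (Map<String, String> footer : allFooters.values()) {
      for (Map.Entry<String, String> col : footer.entrySet()) {
        // Column missing here but defined in another file: add it as
        // nullable, preserving the type declared in that other file.
        merged.putIfAbsent(col.getKey(), "NULLABLE_" + col.getValue());
      }
    }
    return merged;
  }
}
```

A real implementation would also need to detect conflicting definitions of 
the same column across files (as the issue description notes) and verify the 
column is OPTIONAL in the file that defines it before filling it as nullable.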


> Better Parquet handling of sparse columns
> -----------------------------------------
>
>                 Key: DRILL-5292
>                 URL: https://issues.apache.org/jira/browse/DRILL-5292
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Parquet
>    Affects Versions: 1.10.0
>            Reporter: Nate Putnam
>
> It appears the current implementation of ParquetRecordReader will fill in 
> missing columns between files as a NullableIntVector. It would be better if 
> the code could determine whether that column was defined in a different file 
> (and didn't conflict) and use the defined data type. 
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
