[
https://issues.apache.org/jira/browse/DRILL-5292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15880951#comment-15880951
]
Nate Putnam commented on DRILL-5292:
------------------------------------
Digging into this further, one approach would be to modify the
ParquetScanBatchCreator and ParquetRecordReader classes along the following
lines:
* ParquetScanBatchCreator - Read the footers of all files under the
SelectionRoot up front so that a Map of Parquet files to footers can be passed
to the record reader. There is already a TODO in that class noting this would
be a desirable change for performance reasons anyway.
* ParquetRecordReader - Refactor the nullFilledVectors field to use the more
general NullableVector type instead of the specific NullableIntVector.
* ParquetRecordReader - Use the Map of footers passed in from the
ParquetScanBatchCreator to reconcile the schema across files.
** If the requested vector is not present in the file being read but is
present in a different file and is optional, then add it as a NullableVector
of the corresponding type to the current file.
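The reconciliation step above can be sketched as follows. This is a minimal, self-contained illustration using plain collections, not Drill's actual APIs; the names ColumnMeta and reconcileType are hypothetical stand-ins for the footer metadata and the lookup the record reader would perform.

```java
import java.util.Map;

public class SchemaReconciliation {
    // Hypothetical stand-in for the column metadata carried in a Parquet footer.
    record ColumnMeta(String type, boolean optional) {}

    /**
     * For a column missing from the current file, scan the footers of the other
     * files in the selection. If a single, non-conflicting OPTIONAL definition
     * exists, return its type so the null-filled vector can use it; otherwise
     * fall back to "INT" (today's NullableIntVector behavior).
     */
    static String reconcileType(String column,
                                Map<String, Map<String, ColumnMeta>> footers,
                                String currentFile) {
        String resolved = null;
        for (var entry : footers.entrySet()) {
            if (entry.getKey().equals(currentFile)) continue;
            ColumnMeta meta = entry.getValue().get(column);
            if (meta == null) continue;
            // A REQUIRED definition elsewhere cannot safely be null-filled here.
            if (!meta.optional()) return "INT";
            // Conflicting types across files: fall back to the default.
            if (resolved != null && !resolved.equals(meta.type())) return "INT";
            resolved = meta.type();
        }
        return resolved != null ? resolved : "INT";
    }

    public static void main(String[] args) {
        Map<String, Map<String, ColumnMeta>> footers = Map.of(
            "a.parquet", Map.of(),
            "b.parquet", Map.of("score", new ColumnMeta("FLOAT8", true)));
        // "score" is absent from a.parquet but OPTIONAL FLOAT8 in b.parquet,
        // so the current file can null-fill it as FLOAT8 rather than INT.
        System.out.println(reconcileType("score", footers, "a.parquet"));
        // A column defined in no footer still defaults to INT.
        System.out.println(reconcileType("missing", footers, "a.parquet"));
    }
}
```

The fallback-to-INT cases (required elsewhere, or conflicting types) are one possible policy; the comment only specifies the optional, non-conflicting case.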
> Better Parquet handling of sparse columns
> -----------------------------------------
>
> Key: DRILL-5292
> URL: https://issues.apache.org/jira/browse/DRILL-5292
> Project: Apache Drill
> Issue Type: Improvement
> Components: Storage - Parquet
> Affects Versions: 1.10.0
> Reporter: Nate Putnam
>
> It appears the current implementation of ParquetRecordReader will fill in
> missing columns between files as a NullableIntVector. It would be better if
> the code could determine whether that column is defined in a different file
> (and doesn't conflict) and use the defined data type.
>
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)