[
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16475815#comment-16475815
]
Oleksandr Kalinin commented on DRILL-5797:
------------------------------------------
[~arina] Thanks for your feedback. The problem with ParquetSchema seems to be
that buildSchema() can't be called on a file with a complex schema, because
classes used in ParquetSchema, such as ParquetReaderUtility, rely on a flat
schema. E.g. getColNameToSchemaElementMapping() returns a corrupted / unusable
structure if called on a nested schema.
In other words, ParquetSchema can only be built and used after ensuring that
the schema is flat, unless more refactoring-like work is done to support nested
data (I spotted other locations that explicitly rely on a flat schema).
So, to keep things simple, I am considering adding a static method to
ParquetSchema or even ParquetRecordUtility, something like
isSuitableForFastReader(), which would do the necessary checks based on input
parameters (footer, selected columns, etc.) and serve as a gate for using the
new reader.
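A rough sketch of what such a gate could look like is shown here, only to
illustrate the idea. It does not use Drill's or Parquet's real classes: Field
is a hypothetical stand-in for a top-level footer schema entry, FastReaderGate
is a made-up holder class, and treating an empty projection as "all columns"
and matching names case-insensitively are both assumptions.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical holder class; the real method might live in ParquetSchema. */
final class FastReaderGate {

  /** Simplified, hypothetical stand-in for a top-level field in the footer schema. */
  static final class Field {
    final String name;
    final boolean primitive; // false for group types (nested data)
    final boolean repeated;  // true for repeated fields

    Field(String name, boolean primitive, boolean repeated) {
      this.name = name;
      this.primitive = primitive;
      this.repeated = repeated;
    }
  }

  /**
   * Returns true only if every selected top-level column is a non-repeated
   * primitive. Assumption: an empty projection list means all columns are selected.
   */
  static boolean isSuitableForFastReader(List<Field> footerFields,
                                         List<String> projectedColumns) {
    boolean selectAll = projectedColumns.isEmpty();

    // Assumption: column names are matched case-insensitively.
    Set<String> projected = new HashSet<>();
    for (String column : projectedColumns) {
      projected.add(column.toLowerCase(Locale.ROOT));
    }

    for (Field field : footerFields) {
      boolean selected =
          selectAll || projected.contains(field.name.toLowerCase(Locale.ROOT));
      if (selected && (!field.primitive || field.repeated)) {
        return false; // a selected column is complex, so fall back to the regular reader
      }
    }
    // Projected columns that are absent from the file do not block the fast reader here.
    return true;
  }
}
```

With a file containing id, name and a nested address column, a projection of
(id, name) would pass this gate, while (id, address) or a select-all would not.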
> Use more often the new parquet reader
> -------------------------------------
>
> Key: DRILL-5797
> URL: https://issues.apache.org/jira/browse/DRILL-5797
> Project: Apache Drill
> Issue Type: Improvement
> Components: Storage - Parquet
> Reporter: Damien Profeta
> Assignee: Oleksandr Kalinin
> Priority: Major
> Fix For: 1.14.0
>
>
> The choice between the regular parquet reader and the optimized one is based
> on what types of columns are in the file. The columns that are actually read
> by the query are not taken into account. We can slightly increase the number
> of cases where the optimized reader is used by checking whether the projected
> columns are simple or not.
> This is an interim optimization until the fast parquet reader can handle
> complex structures.