[ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16475815#comment-16475815
 ] 

Oleksandr Kalinin commented on DRILL-5797:
------------------------------------------

[~arina] Thanks for your feedback. The problem with ParquetSchema seems to be 
that buildSchema() can't be called on a file with a complex schema, because 
classes such as ParquetReaderUtility that ParquetSchema uses rely on a flat 
schema. E.g. getColNameToSchemaElementMapping() returns a corrupted / unusable 
structure if called on a nested schema.

In other words, ParquetSchema can only be built and used after ensuring that 
the schema is flat, unless more refactoring-like work is done to support nested 
data (I spotted other locations that explicitly rely on a flat schema).

So to keep things simple I am considering adding a static method to 
ParquetSchema or even ParquetReaderUtility, something like 
isSuitableForFastReader(), which would do the necessary checks based on input 
parameters (footer, selected columns etc.) and serve as a gate for using the 
new reader.
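
A rough sketch of what such a gate could look like. It does not use Drill or 
Parquet classes: FastReaderGate and Field are hypothetical stand-ins for the 
footer schema, and the rules shown (only projected top-level columns are 
checked, an empty projection means "all columns", a column is simple if it is 
primitive and not repeated) are assumptions, not the final checks.

```java
import java.util.List;

public class FastReaderGate {

  /** Hypothetical stand-in for a top-level field of the footer schema. */
  public static final class Field {
    final String name;
    final boolean primitive; // false for groups (nested / complex types)
    final boolean repeated;  // true for repeated fields

    public Field(String name, boolean primitive, boolean repeated) {
      this.name = name;
      this.primitive = primitive;
      this.repeated = repeated;
    }
  }

  /**
   * Returns true only if every projected top-level column is a simple
   * (primitive, non-repeated) field. Assumption: an empty projection list
   * stands for a star query, so every field in the file is checked.
   */
  public static boolean isSuitableForFastReader(List<Field> topLevelFields,
                                                List<String> projectedColumns) {
    boolean selectAll = projectedColumns.isEmpty();
    for (Field field : topLevelFields) {
      if (!selectAll && !projectedColumns.contains(field.name)) {
        continue; // column is not read by the query, so its type is ignored
      }
      if (!field.primitive || field.repeated) {
        return false; // complex column is projected: use the regular reader
      }
    }
    return true;
  }
}
```

With a file containing a primitive "id" column and a nested "tags" column, 
this gate would accept a query that projects only "id" and reject one that 
projects "tags" or selects all columns.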

> Use more often the new parquet reader
> -------------------------------------
>
>                 Key: DRILL-5797
>                 URL: https://issues.apache.org/jira/browse/DRILL-5797
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Parquet
>            Reporter: Damien Profeta
>            Assignee: Oleksandr Kalinin
>            Priority: Major
>             Fix For: 1.14.0
>
>
> The choice between the regular parquet reader and the optimized one is based 
> on the types of columns in the file; the columns actually read by the query 
> are not taken into account. We can slightly increase the number of cases 
> where the optimized reader is used by checking whether the projected columns 
> are simple or not.
> This is an interim optimization until the fast parquet reader can handle 
> complex structures.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
