[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

ASF GitHub Bot (JIRA) Wed, 01 Nov 2017 10:51:39 -0700

    [ 
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16234473#comment-16234473
 ]


ASF GitHub Bot commented on DRILL-5797:
---------------------------------------

Github user sachouche commented on the issue:

    https://github.com/apache/drill/pull/976
  
    Looking at the stack trace:
    - The code is definitely initializing a column of type REPEATABLE
    - The Fast Reader didn't expect this scenario, so it used a default 
container (NullableVarBinary) for the variable-length binary data type
    
    Why is this happening?
    - The code in ReadState::buildReader() is processing all selected columns
    - This information is obtained from the ParquetSchema
    - Looking at the code, this seems to be a case-sensitivity issue
    - The ParquetSchema is case-insensitive whereas the Parquet GroupType is not
    - Damien added a catch handler (column not found) to handle use-cases where 
we are projecting non-existing columns
    - This is leading to an unforeseen use-case:
    - Assume column XYZ is complex
    - User uses an alias (xyz)
    - The new code will allow this column to pass and treat it as simple
    - The ParquetSchema, being case-insensitive, will process this column
    - Hence the exception in the test suite
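
    The scenario above can be sketched with plain Java collections. This is
    only an illustration, not Drill code: the class, the Kind enum, and both
    methods are hypothetical stand-ins, and the map merely imitates a
    case-sensitive lookup such as the one attributed to the Parquet GroupType.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in; none of these names exist in Drill or Parquet.
final class CaseMismatchSketch {
    enum Kind { SIMPLE, COMPLEX }

    // Case-sensitive lookup: returns null when the exact name is absent.
    static Kind lookupExact(Map<String, Kind> fileColumns, String name) {
        return fileColumns.get(name);
    }

    // Imitates the "column not found" handler: a projected column that
    // cannot be found is assumed to be a non-existing, hence simple, column.
    static boolean looksSimple(Map<String, Kind> fileColumns, String projected) {
        Kind kind = lookupExact(fileColumns, projected);
        return kind == null || kind == Kind.SIMPLE;
    }

    public static void main(String[] args) {
        Map<String, Kind> fileColumns = new HashMap<>();
        fileColumns.put("XYZ", Kind.COMPLEX);

        // Exact name: correctly reported as not simple.
        System.out.println(looksSimple(fileColumns, "XYZ"));
        // Lower-case alias: not found, so wrongly treated as simple.
        System.out.println(looksSimple(fileColumns, "xyz"));
    }
}
```

    A case-insensitive schema would still resolve xyz to the complex column
    XYZ, which is the mismatch described above.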
    
    Suggested Fix
    - Create a map (key to-lower-case) and register all current row-group 
columns
    - Use this map to look up the type of each selected column
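
    A minimal sketch of the suggested map, assuming column names are plain
    strings. The class name, the Kind enum, and the register/find methods are
    hypothetical; the real change would key on the current row-group's
    Parquet column types rather than on this placeholder enum.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper; not part of Drill or Parquet.
final class LowerCaseColumnIndex {
    enum Kind { SIMPLE, COMPLEX }

    // Keys are lower-cased so lookups ignore the case used in the query.
    private final Map<String, Kind> byLowerName = new HashMap<>();

    // Register one column of the current row group.
    void register(String name, Kind kind) {
        byLowerName.put(name.toLowerCase(Locale.ROOT), kind);
    }

    // Locate the type of a selected column; empty if it is truly absent.
    Optional<Kind> find(String projected) {
        return Optional.ofNullable(
            byLowerName.get(projected.toLowerCase(Locale.ROOT)));
    }

    public static void main(String[] args) {
        LowerCaseColumnIndex index = new LowerCaseColumnIndex();
        index.register("XYZ", Kind.COMPLEX);

        // The alias now resolves to the complex column instead of
        // falling through to the "column not found" path.
        System.out.println(index.find("xyz").isPresent());
        System.out.println(index.find("missing").isPresent());
    }
}
```

    Locale.ROOT is used so that lower-casing does not depend on the default
    locale of the JVM.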



> Use more often the new parquet reader
> -------------------------------------
>
>                 Key: DRILL-5797
>                 URL: https://issues.apache.org/jira/browse/DRILL-5797
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Parquet
>            Reporter: Damien Profeta
>            Assignee: Damien Profeta
>            Priority: Major
>             Fix For: 1.12.0
>
>
> The choice between the regular parquet reader and the optimized one is based 
> on what types of columns are in the file. But the columns that are read by 
> the query are not taken into account. We can slightly increase the number of 
> cases where the optimized reader is used by checking whether the projected 
> columns are simple or not.
> This is an interim optimization until the fast parquet reader can handle 
> complex structures.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
