Junjie Chen commented on HIVE-17261:

Actually, Hive uses two deprecated Parquet APIs: one is ParquetInputSplit, 
the other is filterRowGroup. They are deprecated because Parquet introduced 
a new dictionary filter. The key point here is how to leverage both the 
statistics filter and the dictionary filter. In the existing code, Hive 
explicitly applies the statistics filter on the Hive side. 

To apply both the statistics and dictionary filters, we can either explicitly 
change the filterRowGroup API or pass the predicate through the job 
configuration to Parquet and filter on the Parquet side. The patch I provided 
passes the predicate and skips the explicit filtering on the Hive side.
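
To illustrate why it is worth applying both filters, here is a minimal sketch 
in plain Java. It is a toy model, not the Parquet or Hive API: the RowGroup 
class, its fields, and the method names are all hypothetical. It only shows 
that a min/max statistics check can keep a row group that a dictionary check 
would still be able to drop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of row-group pruning for a predicate "x == value".
// Hypothetical types and names; this is NOT the Parquet API.
final class RowGroupFilterSketch {

    static final class RowGroup {
        final String name;
        final int min;                  // column minimum from statistics
        final int max;                  // column maximum from statistics
        final Set<Integer> dictionary;  // distinct values in the dictionary page

        RowGroup(String name, int min, int max, Integer... dictionaryValues) {
            this.name = name;
            this.min = min;
            this.max = max;
            this.dictionary = new HashSet<>(Arrays.asList(dictionaryValues));
        }
    }

    // Statistics filter: the value can only be present if it lies in [min, max].
    static boolean statsMightMatch(RowGroup rg, int value) {
        return value >= rg.min && value <= rg.max;
    }

    // Dictionary filter: the value can only be present if the dictionary has it.
    static boolean dictionaryMightMatch(RowGroup rg, int value) {
        return rg.dictionary.contains(value);
    }

    // Returns the names of the row groups that still have to be read.
    static List<String> survivingGroups(List<RowGroup> groups, int value,
                                        boolean useDictionary) {
        List<String> kept = new ArrayList<>();
        for (RowGroup rg : groups) {
            if (!statsMightMatch(rg, value)) {
                continue; // dropped by statistics
            }
            if (useDictionary && !dictionaryMightMatch(rg, value)) {
                continue; // dropped by dictionary
            }
            kept.add(rg.name);
        }
        return kept;
    }

    static List<RowGroup> sampleGroups() {
        return Arrays.asList(
            new RowGroup("rg0", 1, 10, 1, 5, 10),  // contains 5
            new RowGroup("rg1", 1, 10, 1, 2, 10),  // range covers 5, dictionary does not
            new RowGroup("rg2", 20, 30, 20, 30));  // range excludes 5
    }

    public static void main(String[] args) {
        List<RowGroup> groups = sampleGroups();
        System.out.println("statistics only: " + survivingGroups(groups, 5, false));
        System.out.println("statistics + dictionary: " + survivingGroups(groups, 5, true));
    }
}
```

In this model, rg1 passes the statistics check because 5 lies between its 
min and max, and only the dictionary check can rule it out. That is the 
pruning that is lost when the dictionary filter is skipped.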

> Hive use deprecated ParquetInputSplit constructor which blocked parquet 
> dictionary filter
> -----------------------------------------------------------------------------------------
>                 Key: HIVE-17261
>                 URL: https://issues.apache.org/jira/browse/HIVE-17261
>             Project: Hive
>          Issue Type: Improvement
>          Components: Database/Schema
>    Affects Versions: 2.2.0
>            Reporter: Junjie Chen
>            Assignee: Junjie Chen
>            Priority: Minor
>         Attachments: HIVE-17261.2.patch, HIVE-17261.diff, HIVE-17261.patch
> Hive uses the deprecated ParquetInputSplit in 
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/ParquetRecordReaderBase.java#L128]
> Please see interface definition in 
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputSplit.java#L80]
> The old interface sets rowgroupoffset values, which causes Parquet to skip 
> the dictionary filter.

This message was sent by Atlassian JIRA