[jira] [Commented] (HIVE-21599) Parquet predicate pushdown on partition columns may cause wrong result if files contain partition columns

Stamatis Zampetakis (Jira) Thu, 17 Nov 2022 06:30:45 -0800


    [ 
https://issues.apache.org/jira/browse/HIVE-21599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17635377#comment-17635377
 ]


Stamatis Zampetakis commented on HIVE-21599:
--------------------------------------------

The solution based on {{ReadContext#getRequestedSchema}} causes other 
problems when the schema of the table evolves. I had assumed that 
{{getRequestedSchema}} always returns a subset of the columns of the original 
file schema (i.e., {{FileMetaData#getSchema}}). This is not true, because 
{{getRequestedSchema}} is also used to handle schema evolution (and not only 
column pruning).

+Example:+
{code:sql}
create table person (id int, fname string, lname string, age int) stored as 
parquet;
{code}
+FileMetaData#getSchema+
{noformat}
message hive_schema {
  optional int32 id;
  optional binary fname (STRING);
  optional binary lname (STRING);
  optional int32 age;
}
{noformat}
{code:sql}
select fname from person where age >=25;
{code}
+ReadContext#getRequestedSchema+
{noformat}
message hive_schema {
  optional binary fname (STRING);
  optional int32 age;
}
{noformat}
{code:sql}
ALTER TABLE person CHANGE COLUMN age years_from_birth int;
select fname from person where years_from_birth >=25;
{code}
+ReadContext#getRequestedSchema+
{noformat}
message hive_schema {
  optional binary fname (STRING);
  optional binary years_from_birth;
}
{noformat}
Observe that after renaming the column the result of {{getRequestedSchema}} is 
no longer a subset of {{FileMetaData#getSchema}}: the years_from_birth column 
does not appear in the file. Creating a Parquet filter predicate for a column 
that does not exist in the file can cause various problems. For instance, 
Parquet [tries to determine which blocks match the 
filter|https://github.com/apache/parquet-mr/blob/d057b39d93014fe40f5067ee4a33621e65c91552/parquet-hadoop/src/main/java/org/apache/parquet/filter2/compat/RowGroupFilter.java#L103]
 and, if the filter column does not appear in a block, it can wrongly conclude 
that the block has no data matching the predicate.

Moreover, after the rename the Parquet types are not retained (int32 vs. 
binary), which can also cause problems when creating the filter predicate.
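
To make the two failure modes above concrete, here is a minimal, self-contained 
sketch of the kind of check that would have to hold before a filter could be 
built from the requested schema. It is an illustration only: 
RequestedSchemaCheck and unsafeFilterColumns are hypothetical names, schemas are 
modelled as plain maps from column name to primitive type, and no Hive or 
Parquet API is used.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Hive or Parquet. Schemas are modelled as
// ordered maps from column name to primitive type name (an assumption made
// only to keep the sketch self-contained).
final class RequestedSchemaCheck {

  // Returns the requested columns that are missing from the file schema or
  // whose primitive type differs from the one stored in the file.
  static List<String> unsafeFilterColumns(Map<String, String> fileSchema,
                                          Map<String, String> requestedSchema) {
    List<String> unsafe = new ArrayList<>();
    for (Map.Entry<String, String> column : requestedSchema.entrySet()) {
      String fileType = fileSchema.get(column.getKey());
      if (fileType == null || !fileType.equals(column.getValue())) {
        unsafe.add(column.getKey());
      }
    }
    return unsafe;
  }

  public static void main(String[] args) {
    // FileMetaData#getSchema from the example above.
    Map<String, String> file = new LinkedHashMap<>();
    file.put("id", "int32");
    file.put("fname", "binary");
    file.put("lname", "binary");
    file.put("age", "int32");

    // ReadContext#getRequestedSchema before the rename.
    Map<String, String> beforeRename = new LinkedHashMap<>();
    beforeRename.put("fname", "binary");
    beforeRename.put("age", "int32");

    // ReadContext#getRequestedSchema after the rename.
    Map<String, String> afterRename = new LinkedHashMap<>();
    afterRename.put("fname", "binary");
    afterRename.put("years_from_birth", "binary");

    System.out.println(unsafeFilterColumns(file, beforeRename)); // prints []
    System.out.println(unsafeFilterColumns(file, afterRename));  // prints [years_from_birth]
  }
}
```

In this model the pre-rename schema passes the check, while the post-rename 
schema is flagged because years_from_birth is absent from the file; a column 
with a matching name but a different type would be flagged in the same way.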

All in all, relying on {{getRequestedSchema}} to build the filter predicate is 
not viable at this stage.

> Parquet predicate pushdown on partition columns may cause wrong result if 
> files contain partition columns
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-21599
>                 URL: https://issues.apache.org/jira/browse/HIVE-21599
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Planning
>            Reporter: Vineet Garg
>            Assignee: Soumyakanti Das
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: HIVE-21599.1.patch
>
>          Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> Filter predicates are pushed to Table Scan (to be pushed to and used by the 
> storage handler/input format). Such predicates may include partition 
> columns, which are of no use to storage handlers or input formats. Therefore 
> they should be removed from the TS filter expression.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
