[
https://issues.apache.org/jira/browse/HIVE-21599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17635377#comment-17635377
]
Stamatis Zampetakis commented on HIVE-21599:
--------------------------------------------
The solution based on {{ReadContext#getRequestedSchema}} creates other
problems when the schema of the table evolves. My assumption was that
{{getRequestedSchema}} always returns a subset of the columns of the original
schema (i.e., {{FileMetaData#getSchema}}). This is not true, since
{{getRequestedSchema}} is also used to handle schema evolution (and not only
column pruning).
+Example:+
{code:sql}
create table person (id int, fname string, lname string, age int) stored as
parquet;
{code}
+FileMetaData#getSchema+
{noformat}
message hive_schema {
optional int32 id;
optional binary fname (STRING);
optional binary lname (STRING);
optional int32 age;
}
{noformat}
{code:sql}
select fname from person where age >=25;
{code}
+ReadContext#getRequestedSchema+
{noformat}
message hive_schema {
optional binary fname (STRING);
optional int32 age;
}
{noformat}
{code:sql}
ALTER TABLE person CHANGE COLUMN age years_from_birth int;
select fname from person where years_from_birth >=25;
{code}
+ReadContext#getRequestedSchema+
{noformat}
message hive_schema {
optional binary fname (STRING);
optional binary years_from_birth;
}
{noformat}
Observe that after renaming the column, the result of {{getRequestedSchema}} is
no longer a subset of {{FileMetaData#getSchema}}: the {{years_from_birth}}
column does not appear in the file. Creating a Parquet filter predicate for a
column that does not exist in the file can cause various problems. For instance,
Parquet [tries to determine which blocks match the
filter|https://github.com/apache/parquet-mr/blob/d057b39d93014fe40f5067ee4a33621e65c91552/parquet-hadoop/src/main/java/org/apache/parquet/filter2/compat/RowGroupFilter.java#L103]
and, if the filter column does not appear in a block, it can wrongly conclude
that the block has no data matching the filter predicate.
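To make the failure mode concrete, here is a toy model of statistics-based block pruning. It is not the parquet-mr API: the class {{RowGroupPruningSketch}}, the method {{canDrop}}, and the min/max map are all hypothetical, and the sketch simply assumes that a column without statistics is treated as containing only nulls.

```java
import java.util.Map;

/** Toy model of statistics-based block pruning; hypothetical, not parquet-mr code. */
public final class RowGroupPruningSketch {

    private RowGroupPruningSketch() {}

    /**
     * Decides whether a block can be skipped for the predicate "column >= lowerBound".
     *
     * @param minMaxByColumn per-column statistics of one block, stored as {min, max}
     * @param column         name of the column used in the predicate
     * @param lowerBound     right-hand side of the ">=" comparison
     */
    public static boolean canDrop(Map<String, int[]> minMaxByColumn, String column, int lowerBound) {
        int[] minMax = minMaxByColumn.get(column);
        if (minMax == null) {
            // Assumption: a column absent from the file is treated as all nulls,
            // so no row can satisfy the comparison and the block is dropped.
            return true;
        }
        // The block can be skipped only if even its maximum is below the bound.
        return minMax[1] < lowerBound;
    }
}
```

With statistics for {{age}} only, the predicate on {{age}} keeps the block, while the same predicate on {{years_from_birth}} drops it even though the data is there under the old name.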
Moreover, after the rename the Parquet types are not retained ({{int32}} vs.
{{binary}}), which can also cause problems when creating the filter predicate.
All in all, relying on {{getRequestedSchema}} to build the filter predicate is
not possible at this stage.
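The two mismatches above (a column missing from the file and a changed primitive type) can be illustrated with a small check. This is only a sketch: schemas are modelled as plain name-to-type maps rather than Parquet {{MessageType}} objects, and {{SchemaSubsetCheckSketch}} and {{incompatibleColumns}} are hypothetical names, not part of Hive or Parquet.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper; schemas are simplified to column name -> primitive type. */
public final class SchemaSubsetCheckSketch {

    private SchemaSubsetCheckSketch() {}

    /**
     * Returns the requested columns that are either missing from the file schema
     * or present with a different primitive type.
     */
    public static List<String> incompatibleColumns(Map<String, String> fileSchema,
                                                   Map<String, String> requestedSchema) {
        List<String> incompatible = new ArrayList<>();
        for (Map.Entry<String, String> requested : requestedSchema.entrySet()) {
            String fileType = fileSchema.get(requested.getKey());
            // A null fileType means the column does not exist in the file at all.
            if (fileType == null || !fileType.equals(requested.getValue())) {
                incompatible.add(requested.getKey());
            }
        }
        return incompatible;
    }
}
```

For the {{person}} table above, the requested schema before the rename ({{fname}}, {{age}}) yields an empty list, whereas the one after the rename reports {{years_from_birth}}.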
> Parquet predicate pushdown on partition columns may cause wrong result if
> files contain partition columns
> ---------------------------------------------------------------------------------------------------------
>
> Key: HIVE-21599
> URL: https://issues.apache.org/jira/browse/HIVE-21599
> Project: Hive
> Issue Type: Improvement
> Components: Query Planning
> Reporter: Vineet Garg
> Assignee: Soumyakanti Das
> Priority: Major
> Labels: pull-request-available
> Attachments: HIVE-21599.1.patch
>
> Time Spent: 3.5h
> Remaining Estimate: 0h
>
> Filter predicates are pushed to Table Scan (to be pushed to and used by
> the storage handler/input format). Such predicates may reference partition
> columns, which are of no use to the storage handler or input format.
> Therefore they should be removed from the TS filter expression.