[jira] [Commented] (IMPALA-13364) Schema resolution doesn't work for migrated partitioned Iceberg tables that have complex types

ASF subversion and git services (Jira) Wed, 02 Oct 2024 14:40:22 -0700


    [ 
https://issues.apache.org/jira/browse/IMPALA-13364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17886518#comment-17886518
 ]


ASF subversion and git services commented on IMPALA-13364:
----------------------------------------------------------

Commit 3a861500b669f6cfd7283a43b542ff126ad44cea in impala's branch 
refs/heads/master from Zoltan Borok-Nagy
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=3a861500b ]

IMPALA-13364: Schema resolution doesn't work for migrated partitioned Iceberg 
tables that have complex types

Schema resolution doesn't work correctly for migrated partitioned
Iceberg tables that have complex types. When we encounter a Parquet/ORC
file in an Iceberg table that has no field IDs in its file metadata, we
assume that it is an old data file written before the migration and that
its schema is the table's very first schema. Hence we can mimic Iceberg's
field ID generation to assign field IDs to the file schema elements.

This process didn't take the partition columns into account. Partition
columns are not part of the data file but they still get field IDs. This
only matters when there are complex types in the table, as partition
columns are always the last columns in legacy Hive tables, and field IDs
are assigned via a "BFS-like" traversal: the top-level columns, including
the trailing partition columns, get their IDs before any nested field
does. I.e. if there are only primitive types in the table we don't have
any problems, but the children of complex-type columns are assigned
incorrect field IDs.
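
To illustrate the mismatch, here is a minimal Python sketch of a
"BFS-like" ID assignment. It is not Impala's or Iceberg's actual code:
the assign_ids helper, the tuple-based schema encoding and the column
names (id, s, p) are all made up for this example, and only struct
types are modelled.

```python
from itertools import count

def assign_ids(fields, counter=None, prefix=""):
    """Return {dotted_path: field_id} for a list of (name, type) pairs.

    A type is either a string (primitive) or a list of (name, type)
    pairs (a struct). Hypothetical encoding, for illustration only.
    """
    counter = counter if counter is not None else count(1)
    ids = {}
    # First give every field at this level an ID ...
    for name, _ in fields:
        ids[prefix + name] = next(counter)
    # ... then descend into the struct-typed fields, in order.
    for name, ftype in fields:
        if isinstance(ftype, list):
            ids.update(assign_ids(ftype, counter, prefix + name + "."))
    return ids

# Legacy Hive table: data columns first, partition column "p" last.
data_cols = [("id", "int"), ("s", [("a", "int"), ("b", "string")])]
partition_cols = [("p", "int")]

# IDs as the table schema sees them (partition column included).
table_ids = assign_ids(data_cols + partition_cols)
# IDs assigned from the data file alone (partition column missing).
file_ids = assign_ids(data_cols)

print(table_ids)
print(file_ids)
```

In this sketch the table schema gives s.a and s.b the IDs 4 and 5,
because p takes ID 3 first, while the file-only assignment gives them
3 and 4. The top-level columns id and s agree in both cases, which is
why tables with only primitive types are unaffected.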

This patch fixes field ID generation by taking the number of partition
columns into account. If none of the partition columns are included in
the data file (the common case) we adjust the file-level field IDs
accordingly. It is also OK to have all the partition columns in the data
files (it is not common, but we've seen such data files). We raise an
error in all other cases (some partition columns are in the data file,
while others aren't).
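
The three cases above can be sketched in Python as follows. This is an
illustrative model only, not the code of the patch: the function name
assign_file_field_ids, the schema encoding (a type is a string for a
primitive or a list of (name, type) pairs for a struct), and matching
partition columns by name are all assumptions made for the example.

```python
from itertools import count

def _assign_level(fields, prefix, counter, ids):
    # Give every field of one struct level the next free ID.
    for name, _ in fields:
        ids[prefix + name] = next(counter)

def _descend(fields, prefix, counter, ids):
    # Visit struct-typed fields in order and number their children.
    for name, ftype in fields:
        if isinstance(ftype, list):
            child_prefix = prefix + name + "."
            _assign_level(ftype, child_prefix, counter, ids)
            _descend(ftype, child_prefix, counter, ids)

def assign_file_field_ids(file_fields, partition_names):
    """Assign field IDs to a file schema that carries no IDs itself."""
    present = [n for n, _ in file_fields if n in partition_names]
    if present and len(present) != len(partition_names):
        # Only some partition columns are in the file: unsupported.
        raise ValueError("data file contains only some partition columns")
    ids = {}
    counter = count(1)
    _assign_level(file_fields, "", counter, ids)
    if not present:
        # Common case: reserve the IDs the table schema gave to the
        # partition columns, so nested fields line up again.
        for _ in partition_names:
            next(counter)
    _descend(file_fields, "", counter, ids)
    return ids

data_cols = [("id", "int"), ("s", [("a", "int"), ("b", "string")])]

# No partition column in the file: nested IDs are shifted by one.
ids_without = assign_file_field_ids(data_cols, ["p"])
# All partition columns in the file: no adjustment is needed.
ids_with = assign_file_field_ids(data_cols + [("p", "int")], ["p"])

print(ids_without)
print(ids_with)
```

With the hypothetical schema above, both calls give s.a and s.b the IDs
4 and 5, matching a table schema in which the partition column p holds
ID 3. Passing a file that contains only one of two partition columns
raises ValueError in this sketch.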

Testing:
 * e2e tests added
 * added negative tests

Change-Id: Ie32952021b63d6b55b8820489e434bfc2a91580b
Reviewed-on: http://gerrit.cloudera.org:8080/21761
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Schema resolution doesn't work for migrated partitioned Iceberg tables that 
> have complex types
> ----------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-13364
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13364
>             Project: IMPALA
>          Issue Type: Bug
>            Reporter: Zoltán Borók-Nagy
>            Assignee: Zoltán Borók-Nagy
>            Priority: Major
>              Labels: impala-iceberg
>
> Schema resolution doesn't work correctly for migrated partitioned Iceberg 
> tables that have complex types.
> When we encounter a Parquet/ORC file in an Iceberg table that has no field 
> IDs in its file metadata, we assume that it is an old data file written 
> before the migration and that its schema is the table's very first schema. 
> Hence we can mimic Iceberg's field ID generation to assign field IDs to the 
> file schema elements.
> This process didn't take the partition columns into account. This only 
> matters when there are complex types in the table, as partition columns are 
> always the last columns in legacy Hive tables, and field IDs are assigned via 
> a "BFS-like" traversal. I.e. if there are only primitive types in the table 
> we don't have any problems, but the children of complex-type columns are 
> assigned incorrect field IDs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

