[GitHub] [iceberg] rdblue commented on pull request #5932: Parquet: support nested fields when assigning fallback ids

GitBox Fri, 07 Oct 2022 13:51:42 -0700


rdblue commented on PR #5932:
URL: https://github.com/apache/iceberg/pull/5932#issuecomment-1272077852


   Thanks for taking a look at this, @the-other-tim-brown, but I'm not sure 
that this is the right way to do what you want to accomplish.
   
   The fallback ID assignment is really old and makes assumptions about how the
data has evolved -- specifically that position-based column resolution is valid
(just like CSV with no header). This works for top-level columns, but it does
not give correct results for nested fields. With position-based column
resolution, you can safely add columns to the end of the schema. So you can
have some files with columns `1: a, 2: b` and some with columns
`1: a, 2: b, 3: c` (note that ID assignment is consistent). The problem with
nested columns is that adding a top-level field can change the IDs assigned to
nested fields. For example: `1: a struct<3: x, 4: y>, 2: b` and
`1: a struct<4: x, 5: y>, 2: b, 3: c`, where `x` and `y` get different IDs in
the two files.
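
   To make the shift concrete, here is a minimal sketch in Python of a
level-by-level ID assignment (all top-level fields first, then nested ones).
It is a toy model that reproduces the numbering in the example above, not the
actual Iceberg code; `assign_fallback_ids` and the tuple-based schema shape
are hypothetical names chosen for illustration.

```python
from collections import deque


def assign_fallback_ids(schema):
    """Assign IDs level by level: top-level fields first, then nested ones.

    `schema` is a list of (name, children) pairs, where `children` has the
    same shape (an empty list for primitive columns). This is a hypothetical
    model of position-based assignment, not Iceberg's implementation.
    Returns a dict mapping dotted field paths to assigned IDs.
    """
    ids = {}
    next_id = 1
    queue = deque([("", schema)])
    while queue:
        prefix, fields = queue.popleft()
        for name, children in fields:
            path = prefix + name
            ids[path] = next_id
            next_id += 1
            if children:
                # Nested fields are numbered only after the current level.
                queue.append((path + ".", children))
    return ids


# Two files written before and after appending top-level column `c`.
old_file = [("a", [("x", []), ("y", [])]), ("b", [])]
new_file = [("a", [("x", []), ("y", [])]), ("b", []), ("c", [])]

old_ids = assign_fallback_ids(old_file)
new_ids = assign_fallback_ids(new_file)

print(old_ids)  # {'a': 1, 'b': 2, 'a.x': 3, 'a.y': 4}
print(new_ids)  # {'a': 1, 'b': 2, 'c': 3, 'a.x': 4, 'a.y': 5}
```

   The top-level IDs for `a` and `b` agree across both files, but `a.x` is 3
in one file and 4 in the other, so reading both files with one table schema
would resolve the nested columns inconsistently.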
   
   I think what you probably want is to use a name mapping, which is more 
flexible and can probably handle what you want to do.
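
   As a rough illustration of why name-based resolution avoids the problem,
the sketch below resolves IDs by field name instead of by position. The
mapping is modeled loosely on the shape of an Iceberg name mapping (entries
with `field-id`, `names`, and nested `fields`); the helper
`assign_ids_by_name` and the tuple-based schema are hypothetical and are not
part of any Iceberg API.

```python
# Hypothetical mapping from field names to fixed IDs, including nested fields.
NAME_MAPPING = [
    {"field-id": 1, "names": ["a"], "fields": [
        {"field-id": 4, "names": ["x"]},
        {"field-id": 5, "names": ["y"]},
    ]},
    {"field-id": 2, "names": ["b"]},
    {"field-id": 3, "names": ["c"]},
]


def assign_ids_by_name(schema, mapping, prefix=""):
    """Look up each field's ID by name; position in the file is irrelevant.

    `schema` is a list of (name, children) pairs. Fields without a mapping
    entry are skipped in this sketch.
    """
    by_name = {name: entry for entry in mapping for name in entry["names"]}
    ids = {}
    for name, children in schema:
        entry = by_name.get(name)
        if entry is None:
            continue  # assumption: unmapped fields simply get no ID here
        ids[prefix + name] = entry["field-id"]
        if children:
            nested = entry.get("fields", [])
            ids.update(assign_ids_by_name(children, nested, prefix + name + "."))
    return ids


old_file = [("a", [("x", []), ("y", [])]), ("b", [])]
new_file = [("a", [("x", []), ("y", [])]), ("b", []), ("c", [])]

old_by_name = assign_ids_by_name(old_file, NAME_MAPPING)
new_by_name = assign_ids_by_name(new_file, NAME_MAPPING)

print(old_by_name)  # {'a': 1, 'a.x': 4, 'a.y': 5, 'b': 2}
print(new_by_name)  # {'a': 1, 'a.x': 4, 'a.y': 5, 'b': 2, 'c': 3}
```

   Here `a.x` and `a.y` resolve to the same IDs in both files, regardless of
whether `c` was present when the file was written.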


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


