[GitHub] [iceberg] rdblue opened a new pull request #1336: Fix schema name conflicts

GitBox Thu, 13 Aug 2020 09:50:21 -0700


rdblue opened a new pull request #1336:
URL: https://github.com/apache/iceberg/pull/1336



   Iceberg indexes schemas by name to look up fields, but uses short names that 
omit the "element" and "value" names when a list or map has a struct element or 
value. For example, a list, `locations`, of structs `struct<lat: double, long: 
double>` is indexed with names like `locations.lat` instead of 
`locations.element.lat`.
   
   This works most of the time, but can lead to conflicts when a map value 
contains a field named `key`, because that field's short name, `some_map.key`, 
is the same as the name of the map's key. In that case, the schema is rejected 
with a failure message like this: `ValidationException: Invalid schema: 
multiple fields for name some_map.key: 146 and 144`
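
   The conflict can be reproduced with a minimal sketch of short-name-only 
indexing. This is not Iceberg's code: the tuple-based schema model, the 
`index_short_names` helper, and the field IDs are all hypothetical and exist 
only to illustrate the naming rule described above.

```python
# Hypothetical schema model (an assumption for illustration, not Iceberg's API):
#   field = (id, name, type)
#   type  = "double" | ("struct", [fields]) | ("list", elem_id, elem_type)
#         | ("map", key_id, key_type, value_id, value_type)

def is_struct(t):
    return isinstance(t, tuple) and t[0] == "struct"

def index_short_names(root_fields):
    """Index fields by dotted name, omitting element/value before a struct."""
    index = {}

    def add(path, field_id):
        name = ".".join(path)
        if name in index:
            raise ValueError(
                f"Invalid schema: multiple fields for name {name}: "
                f"{index[name]} and {field_id}")
        index[name] = field_id

    def visit(t, path):
        if not isinstance(t, tuple):
            return  # primitive type, nothing nested to index
        if t[0] == "struct":
            for field_id, name, field_type in t[1]:
                add(path + (name,), field_id)
                visit(field_type, path + (name,))
        elif t[0] == "list":
            _, elem_id, elem_type = t
            add(path + ("element",), elem_id)
            # Short-name rule: fields of a struct element sit directly under the list.
            visit(elem_type, path if is_struct(elem_type) else path + ("element",))
        elif t[0] == "map":
            _, key_id, key_type, value_id, value_type = t
            add(path + ("key",), key_id)
            add(path + ("value",), value_id)
            visit(key_type, path + ("key",))
            # Short-name rule: fields of a struct value sit directly under the map.
            visit(value_type, path if is_struct(value_type) else path + ("value",))

    visit(("struct", root_fields), ())
    return index

# A list of structs indexes cleanly under short names.
locations_fields = [
    (1, "locations", ("list", 2, ("struct", [(3, "lat", "double"),
                                             (4, "long", "double")]))),
]
print(index_short_names(locations_fields))

# A map whose struct value has a field named "key" collides with the map key.
some_map_fields = [
    (1, "some_map", ("map", 2, "string", 3, ("struct", [(4, "key", "string")]))),
]
try:
    index_short_names(some_map_fields)
except ValueError as e:
    print(e)
```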
   
   In addition, Spark passes names through `alterTable` that include the 
`value` and `element` names that the short names omit.
   
   This PR fixes the problem by keeping track of two names, the full name with 
`element` and `value`, and secondary "short" names that omit them. Indexing now 
uses all of the full names and adds any short names that are not ambiguous.
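
   A minimal sketch of the two-name approach, under stated assumptions: the 
tuple-based schema model, the `index_names` helper, and the field IDs are 
hypothetical, and the rule that the first short name seen wins over a later 
duplicate short name is an assumption of this sketch rather than something the 
description above specifies.

```python
# Hypothetical schema model (an assumption for illustration, not Iceberg's API):
#   field = (id, name, type)
#   type  = "double" | ("struct", [fields]) | ("list", elem_id, elem_type)
#         | ("map", key_id, key_type, value_id, value_type)

def is_struct(t):
    return isinstance(t, tuple) and t[0] == "struct"

def index_names(root_fields):
    """Index every full name, then add short names that do not collide."""
    full, short = {}, {}

    def add(full_path, short_path, field_id):
        full_name = ".".join(full_path)
        if full_name in full:
            raise ValueError(f"multiple fields for name {full_name}")
        full[full_name] = field_id
        short_name = ".".join(short_path)
        if short_name != full_name:
            # Assumption: keep the first short name seen, ignore later duplicates.
            short.setdefault(short_name, field_id)

    def visit(t, full_path, short_path):
        if not isinstance(t, tuple):
            return  # primitive type, nothing nested to index
        if t[0] == "struct":
            for field_id, name, field_type in t[1]:
                add(full_path + (name,), short_path + (name,), field_id)
                visit(field_type, full_path + (name,), short_path + (name,))
        elif t[0] == "list":
            _, elem_id, elem_type = t
            add(full_path + ("element",), short_path + ("element",), elem_id)
            child_short = short_path if is_struct(elem_type) else short_path + ("element",)
            visit(elem_type, full_path + ("element",), child_short)
        elif t[0] == "map":
            _, key_id, key_type, value_id, value_type = t
            add(full_path + ("key",), short_path + ("key",), key_id)
            add(full_path + ("value",), short_path + ("value",), value_id)
            visit(key_type, full_path + ("key",), short_path + ("key",))
            child_short = short_path if is_struct(value_type) else short_path + ("value",)
            visit(value_type, full_path + ("value",), child_short)

    visit(("struct", root_fields), (), ())

    index = dict(full)
    for name, field_id in short.items():
        # A short name that matches a full name is ambiguous, so it is skipped.
        index.setdefault(name, field_id)
    return index

fields = [
    (1, "locations", ("list", 2, ("struct", [(3, "lat", "double"),
                                             (4, "long", "double")]))),
    (5, "some_map", ("map", 6, "string", 7, ("struct", [(8, "key", "string"),
                                                        (9, "x", "int")]))),
]
index = index_names(fields)
print(index["locations.lat"], index["locations.element.lat"])
print(index["some_map.key"], index["some_map.value.key"])
```

   In this sketch `some_map.key` resolves to the map's key, while the struct 
field remains reachable through its full name `some_map.value.key`.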
   
   This change also requires indexing IDs separately rather than using a 
`BiMap` because IDs can have multiple names, which is not valid when inverting 
the `BiMap`.
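
   The inversion problem can be sketched with plain dictionaries. Everything 
here is illustrative: `invert_one_to_one` is a hypothetical stand-in for the 
one-to-one constraint a bidirectional map enforces, and building the ID index 
from full names only is an assumption of this sketch.

```python
def invert_one_to_one(name_to_id):
    """Invert a name -> ID mapping, rejecting IDs that have more than one name."""
    inverse = {}
    for name, field_id in name_to_id.items():
        if field_id in inverse:
            raise ValueError(
                f"ID {field_id} already has name {inverse[field_id]!r}, "
                f"cannot also map it to {name!r}")
        inverse[field_id] = name
    return inverse

# Hypothetical IDs for the `locations` example.
full_names = {"locations": 1, "locations.element": 2, "locations.element.lat": 3}
short_names = {"locations.lat": 3}

# The lookup index holds both names for ID 3, so it cannot be inverted.
name_to_id = {**full_names, **short_names}
try:
    invert_one_to_one(name_to_id)
except ValueError as e:
    print(e)

# Indexing IDs separately (here, from full names only) avoids the problem.
id_to_name = invert_one_to_one(full_names)
print(id_to_name[3])
```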

