rdsr commented on issue #40: Add external schema mappings for files written with name-based schemas
URL: https://github.com/apache/incubator-iceberg/issues/40#issuecomment-465452378

Thanks @rdblue for that information. Here's a rough spec of what we can do:

1. Update `Avro.ReadBuilder` to take in a `Map` of field names to field ids, which is passed on to `ProjectionDatumReader`.
1. In `ProjectionDatumReader`, if the passed-in `fileSchema` does not have any field ids, we use a modified `SchemaToType` class to assign ids to fields with the help of the mapping.
1. We'll need to update `SchemaToType` to take in a function which can either allocate increasing ids (current default behavior) or assign ids using the map.
1. Reconvert the returned Iceberg schema (`SchemaToType` returns `Type`) to Avro using `AvroSchemaUtil`, which can then be passed to `AvroSchemaUtil#buildAvroProjection`.

**Updating `SchemaToType`**

In order to write a default mapping function that emulates the current behavior of `SchemaToType`, I need to clear up some doubts about how we currently allocate ids. In `SchemaToType`, when called with a table schema or record, the top-level fields are assigned ids starting from 0; when called with another type (map, list, etc.), ids are assigned starting from 1. We can possibly write a function to emulate the current behavior, or do you see scope here to simplify and only assign ever-increasing ids, without caring whether the top-level fields of a schema get ids starting from 0?

For the new function which uses a map to assign ids to fields, I presume that for other types (map, list, etc.) we can use non-conflicting increasing ids to assign to each field in these container types.

**Possible Alternative**

Another possible alternative is that we don't modify `SchemaToType`, but use it nonetheless to generate an Iceberg schema.
We define a new method [assignIds] under `TypeUtil`, similar to `TypeUtil#assignFreshIds(Schema, TypeUtil.NextID)`, which takes that schema and a function [similar to `NextID`] that assigns ids using the mapping. We would basically be overriding all the ids which `SchemaToType` allocated. I think there's scope for reusing `com.netflix.iceberg.types.AssignFreshIds` and making it more generic to serve two use cases: assigning fresh ids and assigning ids from a map. [For non-record container types we'd still have to use non-conflicting ids to assign to their fields.]
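
To make the "function which can either allocate increasing ids or assign ids using the map" idea concrete, here is a rough, standalone sketch. It does not use any Iceberg classes: `IdAssigner`, `fresh`, and `fromMapping` are hypothetical names invented for illustration, and keying the mapping by a dotted field name is an assumption, not something decided in this thread.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Standalone sketch; none of these types exist in Iceberg.
final class IdAssignmentSketch {

  /** Hypothetical strategy: returns the id to use for a field, given its name. */
  interface IdAssigner {
    int idFor(String fieldName);
  }

  /** Emulates "ever-increasing ids": ignores the name and counts up from firstId. */
  static IdAssigner fresh(int firstId) {
    AtomicInteger next = new AtomicInteger(firstId);
    return name -> next.getAndIncrement();
  }

  /**
   * Uses the name-to-id mapping where it has an entry. Fields without an entry
   * (for example map keys/values or list elements) get non-conflicting ids
   * starting just above the largest mapped id.
   */
  static IdAssigner fromMapping(Map<String, Integer> nameToId) {
    Map<String, Integer> ids = new HashMap<>(nameToId); // defensive copy
    int maxMapped = ids.values().stream().mapToInt(Integer::intValue).max().orElse(-1);
    AtomicInteger next = new AtomicInteger(maxMapped + 1);
    // Remember fallback ids so the same name always resolves to the same id.
    return name -> ids.computeIfAbsent(name, n -> next.getAndIncrement());
  }

  public static void main(String[] args) {
    Map<String, Integer> mapping = new HashMap<>();
    mapping.put("id", 1);
    mapping.put("data", 2);

    IdAssigner assigner = fromMapping(mapping);
    System.out.println("data -> " + assigner.idFor("data"));
    System.out.println("tags.element -> " + assigner.idFor("tags.element"));
  }
}
```

With either approach (a modified `SchemaToType` or a generalized `AssignFreshIds`), the visitor would only need to call something like `idFor` wherever it currently allocates the next id; whether the default should start at 0 or 1 is the open question above, which is why `fresh` takes the starting id as a parameter.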
