[GitHub] rdsr commented on issue #40: Add external schema mappings for files written with name-based schemas

GitBox Tue, 19 Feb 2019 23:07:24 -0800

rdsr commented on issue #40: Add external schema mappings for files written with name-based schemas
URL: https://github.com/apache/incubator-iceberg/issues/40#issuecomment-465452378
 
 
   Thanks @rdblue for that information. Here's a rough spec of what we can do:
   1. Update `Avro.ReadBuilder` to accept a `Map` of field names to field ids, 
which is passed on to `ProjectionDatumReader`.
   1. In `ProjectionDatumReader`, if the passed-in `fileSchema` does not have 
any field ids, use a modified `SchemaToType` class to assign ids to fields 
with the help of the mapping.
   1. Update `SchemaToType` to take a function that can either allocate 
increasing ids (the current default behavior) or assign ids using the map.
   1. Convert the returned Iceberg schema (`SchemaToType` returns a `Type`) 
back to Avro using `AvroSchemaUtil`, which can then be passed to 
`AvroSchemaUtil#buildAvroProjection`.
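
   To make step 3 concrete, here is a minimal standalone sketch of the two id-assignment strategies behind one function type. It does not use Iceberg classes: `IdAssigners`, `incrementing`, and `fromMapping` are hypothetical names, and throwing on an unmapped name is an assumption rather than settled behavior.

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.ToIntFunction;

// Hypothetical sketch; none of these names exist in Iceberg.
final class IdAssigners {
  private IdAssigners() {}

  // Emulates "allocate increasing ids": ignores the field name and counts up from firstId.
  static ToIntFunction<String> incrementing(int firstId) {
    AtomicInteger next = new AtomicInteger(firstId);
    return fieldName -> next.getAndIncrement();
  }

  // Looks the id up in an external name-to-id mapping.
  // Assumption: an unmapped name is treated as an error.
  static ToIntFunction<String> fromMapping(Map<String, Integer> nameToId) {
    return fieldName -> {
      Integer id = nameToId.get(fieldName);
      if (id == null) {
        throw new IllegalArgumentException("No id mapped for field: " + fieldName);
      }
      return id;
    };
  }
}
```

   A schema visitor could then be handed either function without knowing which strategy is in use.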
   
   **Updating `SchemaToType`**
   To write a default mapping function that emulates the current behavior of 
`SchemaToType`, I need to clear up some doubts about how we currently allocate 
ids.
   
   In `SchemaToType`, when it is called with a table schema or record, the 
top-level fields are assigned ids starting from 0; when it is called with 
another type (map, list, etc.), ids are assigned starting from 1. We could 
write a function to emulate the current behavior, or do you see scope to 
simplify here and only assign ever-increasing ids, without caring whether the 
top-level fields of a schema get ids starting from 0?
   
   For the new function that uses a map to assign ids to fields, I presume 
that for other types (map, list, etc.) we can assign non-conflicting increasing 
ids to each field in these container types.
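
   One way to keep the container ids non-conflicting is to start the fallback counter just above the largest id in the mapping. A rough sketch, assuming a flat name-to-id map; `MappedIdAllocator` and its methods are hypothetical and not part of Iceberg, and starting at 1 for an empty mapping is an arbitrary choice:

```java
import java.util.Collections;
import java.util.Map;

// Hypothetical helper, not part of Iceberg.
final class MappedIdAllocator {
  private final Map<String, Integer> nameToId;
  private int nextFreeId;

  MappedIdAllocator(Map<String, Integer> nameToId) {
    this.nameToId = nameToId;
    // Fallback ids begin just above the largest mapped id, so they cannot collide with it.
    // Assumption: with an empty mapping, counting starts at 1.
    this.nextFreeId = nameToId.isEmpty() ? 1 : Collections.max(nameToId.values()) + 1;
  }

  // Id for a named record field, taken from the mapping.
  int idForField(String name) {
    Integer id = nameToId.get(name);
    if (id == null) {
      throw new IllegalArgumentException("No id mapped for field: " + name);
    }
    return id;
  }

  // Id for an unnamed position such as a list element, map key, or map value.
  int nextContainerId() {
    return nextFreeId++;
  }
}
```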
   
   **Possible Alternative**
   Another alternative is to leave `SchemaToType` unmodified but still use it 
to generate an Iceberg schema. We would define a new method [`assignIds`] under 
`TypeUtil`, similar to `TypeUtil#assignFreshIds(Schema, TypeUtil.NextID)`, 
which takes that schema and a function [similar to `NextID`] that assigns ids 
using the mapping. We would basically be overriding all the ids that 
`SchemaToType` allocated. I think there's scope for reusing 
`com.netflix.iceberg.types.AssignFreshIds` and making it more generic to serve 
two use cases: assigning fresh ids and assigning ids from a map. [For 
non-record container types we'd still have to assign non-conflicting ids to 
their fields.]
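
   As a rough illustration of this alternative, the sketch below rewrites the ids of an already-built field tree using a pluggable function. It is standalone and does not use Iceberg's `Schema` or `TypeUtil`; `SimpleField` and `IdReassigner` are hypothetical stand-ins, and keying the function on dot-separated full names is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToIntFunction;

// Hypothetical stand-in for a schema field; Iceberg's real types are not used here.
final class SimpleField {
  final String name;
  final int id;
  final List<SimpleField> children;

  SimpleField(String name, int id, List<SimpleField> children) {
    this.name = name;
    this.id = id;
    this.children = children;
  }
}

final class IdReassigner {
  private IdReassigner() {}

  // Returns a copy of the tree in which every id is replaced by assigner(fullName),
  // where fullName is the dot-separated path from the root. Ids are requested
  // parent-first, so a counting assigner numbers fields in pre-order.
  static SimpleField assignIds(
      SimpleField field, String parentPath, ToIntFunction<String> assigner) {
    String fullName = parentPath.isEmpty() ? field.name : parentPath + "." + field.name;
    int newId = assigner.applyAsInt(fullName);
    List<SimpleField> newChildren = new ArrayList<>();
    for (SimpleField child : field.children) {
      newChildren.add(assignIds(child, fullName, assigner));
    }
    return new SimpleField(field.name, newId, newChildren);
  }
}
```

   The same walk serves both use cases: pass a counter to get fresh ids, or a map lookup to apply ids from the mapping.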


