Some of the conversions we are seeing are:

- Decimal to Decimal; not just limited to increasing precision as with Iceberg
- varchar<N> to string
- numeric type to numeric type (float to Decimal, double to Decimal, Decimal to double, etc.)
- numeric type to string
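To make the per-record cost of a relaxed mapping concrete, here is a minimal sketch in plain Java of what the coercions above boil down to on the read path. The class and method names are hypothetical and this is not code from the Iceberg/Hive integration; it only illustrates the conversions listed above.

    import java.math.BigDecimal;
    import java.math.RoundingMode;

    // Hypothetical illustration only (class and method names are made up);
    // not part of the Iceberg/Hive integration. It shows the kind of
    // per-record coercions a relaxed type mapping would have to perform
    // for the conversions listed above.
    public class LenientCoercions {

      // varchar(N)/char(N) value to string: the value carries over,
      // only the length constraint is dropped.
      static String varcharToString(String varcharValue) {
        return varcharValue;
      }

      // decimal(P1,S1) to decimal(P2,S2): not only precision widening as in
      // Iceberg; the scale may change too, so a rounding rule is needed.
      static BigDecimal rescaleDecimal(BigDecimal value, int precision, int scale) {
        BigDecimal result = value.setScale(scale, RoundingMode.HALF_UP);
        if (result.precision() > precision) {
          throw new ArithmeticException(
              "Value does not fit into decimal(" + precision + "," + scale + ")");
        }
        return result;
      }

      // float/double to decimal(P,S): same rescaling after converting the
      // binary floating-point value.
      static BigDecimal doubleToDecimal(double value, int precision, int scale) {
        return rescaleDecimal(BigDecimal.valueOf(value), precision, scale);
      }

      // decimal to double: may silently lose precision.
      static double decimalToDouble(BigDecimal value) {
        return value.doubleValue();
      }

      // any numeric type to string.
      static String numericToString(Number value) {
        return value.toString();
      }
    }

Each of these runs for every value read, which is part of why a strict mapping is attractive performance-wise.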
On Tue, Nov 24, 2020 at 11:43 PM Owen O'Malley <owen.omal...@gmail.com> wrote:

> You left the complex types off of your list (struct, map, array,
> uniontype). All of them have natural mappings in Iceberg, except for
> uniontype. Interval is supported on output, but not as a column type.
> Unfortunately, we have some tables with uniontype, so we'll need a
> solution for how to deal with it.
>
> I'm generally in favor of a strict mapping in both type and column
> mappings. One piece that I think will help a lot is if we add type
> annotations to Iceberg so that, for example, we could mark a struct as
> actually being a uniontype. If someone has a use case where they need to
> support Hive's char or varchar types, it would make sense to define an
> attribute for the max length.
>
> Vivekanand, what kind of conversions are you needing? Hive has a *lot* of
> conversions. Many of those conversions are more error-prone than useful.
> (For example, I seriously doubt anyone found Hive's conversion of
> timestamps to booleans useful...)
>
> .. Owen
>
> On Tue, Nov 24, 2020 at 3:46 PM Vivekanand Vellanki <vi...@dremio.com> wrote:
>
>> One of the challenges we've had is that Hive is more flexible with schema
>> evolution compared to Iceberg. Are you guys also looking at this aspect?
>>
>> On Tue, Nov 24, 2020 at 8:21 PM Peter Vary <pv...@cloudera.com.invalid> wrote:
>>
>>> Hi Team,
>>>
>>> With Shardul we had a longer discussion yesterday about the schema
>>> synchronization between Iceberg and Hive, and we thought that it would
>>> be good to ask the opinion of the greater community too.
>>>
>>> We can have 2 sources for the schemas:
>>>
>>> 1. Hive table definition / schema
>>> 2. Iceberg schema
>>>
>>> If we want Iceberg and Hive to work together, we have to find a way to
>>> synchronize them: either by defining a master schema, or by defining a
>>> compatibility matrix and conversions between them.
>>> In previous Hive integrations we can see examples of both:
>>>
>>> - With Avro there is a possibility to read the schema from the data
>>>   file directly, and the master schema is the one in Avro.
>>> - With HBase you can provide a mapping between HBase columns by
>>>   providing the *hbase.columns.mapping* table property.
>>>
>>> Maybe the differences are caused by how the storage format is perceived
>>> (Avro being a simple storage format, HBase being an independent query
>>> engine), but this is just a questionable opinion :)
>>>
>>> I would like us to decide how the Iceberg - Hive integration should be
>>> handled.
>>>
>>> There are at least 2 questions:
>>>
>>> 1. How flexible should we be with the type mapping between Hive and
>>>    Iceberg types?
>>>    1. Shall we have a strict mapping? This way, if we have an Iceberg
>>>       schema, we can immediately derive the Hive schema from it.
>>>    2. Shall we be more relaxed? Automatic casting / conversions can be
>>>       built into the integration, allowing users to skip view and/or
>>>       UDF creation for typical conversions.
>>> 2. How flexible should we be with column mappings?
>>>    1. Shall we have a strict 1-on-1 mapping? This way, if we have an
>>>       Iceberg schema, we can immediately derive the Hive schema from
>>>       it. We still have to omit Iceberg columns which do not have a
>>>       representation available in Hive.
>>>    2. Shall we allow flexibility at Hive table creation to choose
>>>       specific Iceberg columns instead of immediately creating a Hive
>>>       table with all of the columns from the Iceberg table?
>>>
>>> Currently I would choose:
>>>
>>> - Strict type mapping, for the following reasons:
>>>   - Faster execution (we want as few checks and conversions as
>>>     possible, since they will be executed for every record)
>>>   - Complexity increases exponentially with every conversion
>>> - Flexible column mapping:
>>>   - I think it will be a typical situation that we have a huge Iceberg
>>>     table storing the facts with a big number of columns, and we would
>>>     like to create multiple Hive tables above it. The problem could be
>>>     solved by creating the table and adding a view above that table,
>>>     but I think it would be more user-friendly if we could avoid this
>>>     extra step.
>>>   - The added complexity is at table creation / query planning, which
>>>     has a far smaller impact on the overall performance.
>>>
>>> I would love to hear your thoughts as well, since the choice should
>>> really depend on the user base and the expected use cases.
>>>
>>> Thanks,
>>> Peter
>>>
>>>
>>> Appendix 1 - Type mapping proposal:
>>>
>>> Iceberg type   Hive2 type         Hive3 type                       Status
>>> boolean        BOOLEAN            BOOLEAN                          OK
>>> int            INTEGER            INTEGER                          OK
>>> long           BIGINT             BIGINT                           OK
>>> float          FLOAT              FLOAT                            OK
>>> double         DOUBLE             DOUBLE                           OK
>>> decimal(P,S)   DECIMAL(P,S)       DECIMAL(P,S)                     OK
>>> binary         BINARY             BINARY                           OK
>>> date           DATE               DATE                             OK
>>> timestamp      TIMESTAMP          TIMESTAMP                        OK
>>> timestamptz    TIMESTAMP          TIMESTAMP WITH LOCAL TIMEZONE    TODO
>>> string         STRING             STRING                           OK
>>> uuid           STRING or BINARY   STRING or BINARY                 TODO
>>> time           -                  -                                -
>>> fixed(L)       -                  -                                -
>>> -              TINYINT            TINYINT                          -
>>> -              SMALLINT           SMALLINT                         -
>>> -              INTERVAL           INTERVAL                         -
>>> -              VARCHAR            VARCHAR                          -
>>> -              CHAR               CHAR                             -
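For reference, below is a minimal sketch of what the strict primitive-type mapping in Appendix 1 could look like, written against Iceberg's org.apache.iceberg.types.Type/Types API. The class name, the exact Hive type spellings, and the decision to reject unmapped types are my assumptions; this is not the actual integration code.

    import org.apache.iceberg.types.Type;
    import org.apache.iceberg.types.Types;

    // Sketch only: a strict, one-way mapping of Iceberg primitive types to
    // Hive type names, following Appendix 1. Anything without a clear Hive
    // equivalent (uuid, time, fixed(L), complex types) is rejected rather
    // than coerced.
    public class StrictTypeMapping {

      static String toHiveTypeName(Type type) {
        switch (type.typeId()) {
          case BOOLEAN:   return "boolean";
          case INTEGER:   return "int";
          case LONG:      return "bigint";
          case FLOAT:     return "float";
          case DOUBLE:    return "double";
          case DATE:      return "date";
          case STRING:    return "string";
          case BINARY:    return "binary";
          case DECIMAL:
            Types.DecimalType decimal = (Types.DecimalType) type;
            return "decimal(" + decimal.precision() + "," + decimal.scale() + ")";
          case TIMESTAMP:
            // timestamptz maps to Hive 3's TIMESTAMP WITH LOCAL TIME ZONE;
            // Hive 2 would have to fall back to plain TIMESTAMP (the TODO row).
            return ((Types.TimestampType) type).shouldAdjustToUTC()
                ? "timestamp with local time zone"
                : "timestamp";
          default:
            // uuid, time, fixed(L), struct/list/map, uniontype, etc.
            throw new UnsupportedOperationException(
                "No strict Hive mapping for Iceberg type: " + type);
        }
      }
    }

Going the other direction (Hive schema to Iceberg schema) is where TINYINT, SMALLINT, VARCHAR, CHAR, and INTERVAL would need either promotion rules or the kind of type annotations Owen mentioned.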