Some of the conversions we are seeing are:

- Decimal to Decimal; not just limited to increasing precision as with Iceberg
- varchar<N> to string
- numeric type to numeric type (float to Decimal, double to Decimal, Decimal to double, etc.)
- numeric type to string
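To make the per-record cost of a relaxed mapping concrete, here is a minimal sketch in plain Java of what the coercions above boil down to on the read path. The class and method names are hypothetical and this is not code from the Iceberg/Hive integration; it only illustrates the conversions listed above.

    import java.math.BigDecimal;
    import java.math.RoundingMode;

    // Hypothetical illustration only (class and method names are made up);
    // not part of the Iceberg/Hive integration. It shows the kind of
    // per-record coercions a relaxed type mapping would have to perform
    // for the conversions listed above.
    public class LenientCoercions {

      // varchar(N)/char(N) value to string: the value carries over,
      // only the length constraint is dropped.
      static String varcharToString(String varcharValue) {
        return varcharValue;
      }

      // decimal(P1,S1) to decimal(P2,S2): not only precision widening as in
      // Iceberg; the scale may change too, so a rounding rule is needed.
      static BigDecimal rescaleDecimal(BigDecimal value, int precision, int scale) {
        BigDecimal result = value.setScale(scale, RoundingMode.HALF_UP);
        if (result.precision() > precision) {
          throw new ArithmeticException(
              "Value does not fit into decimal(" + precision + "," + scale + ")");
        }
        return result;
      }

      // float/double to decimal(P,S): same rescaling after converting the
      // binary floating-point value.
      static BigDecimal doubleToDecimal(double value, int precision, int scale) {
        return rescaleDecimal(BigDecimal.valueOf(value), precision, scale);
      }

      // decimal to double: may silently lose precision.
      static double decimalToDouble(BigDecimal value) {
        return value.doubleValue();
      }

      // any numeric type to string.
      static String numericToString(Number value) {
        return value.toString();
      }
    }

Each of these runs for every value read, which is part of why a strict mapping is attractive performance-wise.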
On Tue, Nov 24, 2020 at 11:43 PM Owen O'Malley <owen.omal...@gmail.com> wrote:

> You left the complex types off of your list (struct, map, array,
> uniontype). All of them have natural mappings in Iceberg, except for
> uniontype. Interval is supported on output, but not as a column type.
> Unfortunately, we have some tables with uniontype, so we'll need a
> solution for how to deal with it.
>
> I'm generally in favor of a strict mapping in both type and column
> mappings. One piece that I think will help a lot is if we add type
> annotations to Iceberg so that, for example, we could mark a struct as
> actually being a uniontype. If someone has a use case where they need to
> support Hive's char or varchar types, it would make sense to define an
> attribute for the max length.
>
> Vivekanand, what kind of conversions are you needing? Hive has a *lot* of
> conversions. Many of those conversions are more error-prone than useful.
> (For example, I seriously doubt anyone found Hive's conversion of
> timestamps to booleans useful...)
>
> .. Owen
>
> On Tue, Nov 24, 2020 at 3:46 PM Vivekanand Vellanki <vi...@dremio.com> wrote:
>
>> One of the challenges we've had is that Hive is more flexible with schema
>> evolution compared to Iceberg. Are you guys also looking at this aspect?
>>
>> On Tue, Nov 24, 2020 at 8:21 PM Peter Vary <pv...@cloudera.com.invalid> wrote:
>>
>>> Hi Team,
>>>
>>> With Shardul we had a longer discussion yesterday about the schema
>>> synchronization between Iceberg and Hive, and we thought that it would
>>> be good to ask the opinion of the greater community too.
>>>
>>> We can have 2 sources for the schemas:
>>>
>>> 1. Hive table definition / schema
>>> 2. Iceberg schema
>>>
>>> If we want Iceberg and Hive to work together, we have to find a way to
>>> synchronize them: either by defining a master schema, or by defining a
>>> compatibility matrix and conversions between them.
>>> In previous Hive integrations we can see examples of both:
>>>
>>> - With Avro there is a possibility to read the schema from the data
>>>   file directly, and the master schema is the one in Avro.
>>> - With HBase you can provide a mapping between HBase columns by
>>>   providing the *hbase.columns.mapping* table property.
>>>
>>> Maybe the differences are caused by how the storage format is perceived
>>> (Avro being a simple storage format, HBase being an independent query
>>> engine), but this is just a questionable opinion :)
>>>
>>> I would like us to decide how the Iceberg - Hive integration should be
>>> handled.
>>>
>>> There are at least 2 questions:
>>>
>>> 1. How flexible should we be with the type mapping between Hive and
>>>    Iceberg types?
>>>    1. Shall we have a strict mapping? This way, if we have an Iceberg
>>>       schema, we can immediately derive the Hive schema from it.
>>>    2. Shall we be more relaxed? Automatic casting / conversions can be
>>>       built into the integration, allowing users to skip view and/or
>>>       UDF creation for typical conversions.
>>> 2. How flexible should we be with column mappings?
>>>    1. Shall we have a strict 1-on-1 mapping? This way, if we have an
>>>       Iceberg schema, we can immediately derive the Hive schema from
>>>       it. We still have to omit Iceberg columns which do not have a
>>>       representation available in Hive.
>>>    2. Shall we allow flexibility at Hive table creation to choose
>>>       specific Iceberg columns instead of immediately creating a Hive
>>>       table with all of the columns from the Iceberg table?
>>>
>>> Currently I would choose:
>>>
>>> - Strict type mapping, for the following reasons:
>>>   - Faster execution (we want as few checks and conversions as
>>>     possible, since they will be executed for every record)
>>>   - Complexity increases exponentially with every conversion
>>> - Flexible column mapping:
>>>   - I think it will be a typical situation that we have a huge Iceberg
>>>     table storing the facts with a big number of columns, and we would
>>>     like to create multiple Hive tables above it. The problem could be
>>>     solved by creating the table and adding a view above that table,
>>>     but I think it would be more user-friendly if we could avoid this
>>>     extra step.
>>>   - The added complexity is at table creation / query planning, which
>>>     has a far smaller impact on the overall performance.
>>>
>>> I would love to hear your thoughts as well, since the choice should
>>> really depend on the user base and the expected use cases.
>>>
>>> Thanks,
>>> Peter
>>>
>>>
>>> Appendix 1 - Type mapping proposal:
>>>
>>> Iceberg type   Hive2 type         Hive3 type                       Status
>>> boolean        BOOLEAN            BOOLEAN                          OK
>>> int            INTEGER            INTEGER                          OK
>>> long           BIGINT             BIGINT                           OK
>>> float          FLOAT              FLOAT                            OK
>>> double         DOUBLE             DOUBLE                           OK
>>> decimal(P,S)   DECIMAL(P,S)       DECIMAL(P,S)                     OK
>>> binary         BINARY             BINARY                           OK
>>> date           DATE               DATE                             OK
>>> timestamp      TIMESTAMP          TIMESTAMP                        OK
>>> timestamptz    TIMESTAMP          TIMESTAMP WITH LOCAL TIMEZONE    TODO
>>> string         STRING             STRING                           OK
>>> uuid           STRING or BINARY   STRING or BINARY                 TODO
>>> time           -                  -                                -
>>> fixed(L)       -                  -                                -
>>> -              TINYINT            TINYINT                          -
>>> -              SMALLINT           SMALLINT                         -
>>> -              INTERVAL           INTERVAL                         -
>>> -              VARCHAR            VARCHAR                          -
>>> -              CHAR               CHAR                             -
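For reference, below is a minimal sketch of what the strict primitive-type mapping in Appendix 1 could look like, written against Iceberg's org.apache.iceberg.types.Type/Types API. The class name, the exact Hive type spellings, and the decision to reject unmapped types are my assumptions; this is not the actual integration code.

    import org.apache.iceberg.types.Type;
    import org.apache.iceberg.types.Types;

    // Sketch only: a strict, one-way mapping of Iceberg primitive types to
    // Hive type names, following Appendix 1. Anything without a clear Hive
    // equivalent (uuid, time, fixed(L), complex types) is rejected rather
    // than coerced.
    public class StrictTypeMapping {

      static String toHiveTypeName(Type type) {
        switch (type.typeId()) {
          case BOOLEAN:   return "boolean";
          case INTEGER:   return "int";
          case LONG:      return "bigint";
          case FLOAT:     return "float";
          case DOUBLE:    return "double";
          case DATE:      return "date";
          case STRING:    return "string";
          case BINARY:    return "binary";
          case DECIMAL:
            Types.DecimalType decimal = (Types.DecimalType) type;
            return "decimal(" + decimal.precision() + "," + decimal.scale() + ")";
          case TIMESTAMP:
            // timestamptz maps to Hive 3's TIMESTAMP WITH LOCAL TIME ZONE;
            // Hive 2 would have to fall back to plain TIMESTAMP (the TODO row).
            return ((Types.TimestampType) type).shouldAdjustToUTC()
                ? "timestamp with local time zone"
                : "timestamp";
          default:
            // uuid, time, fixed(L), struct/list/map, uniontype, etc.
            throw new UnsupportedOperationException(
                "No strict Hive mapping for Iceberg type: " + type);
        }
      }
    }

Going the other direction (Hive schema to Iceberg schema) is where TINYINT, SMALLINT, VARCHAR, CHAR, and INTERVAL would need either promotion rules or the kind of type annotations Owen mentioned.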