pvary commented on a change in pull request #1612:
URL: https://github.com/apache/iceberg/pull/1612#discussion_r528748947
##########
File path: mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergMetaHook.java
##########
@@ -80,6 +81,10 @@ public void
preCreateTable(org.apache.hadoop.hive.metastore.api.Table hmsTable)
Preconditions.checkArgument(catalogProperties.getProperty(InputFormatConfig.PARTITION_SPEC)
== null,
"Iceberg table already created - can not use provided partition
specification");
+ Schema hmsSchema = HiveSchemaUtil.schema(hmsTable.getSd().getCols());
+ Preconditions.checkArgument(HiveSchemaUtil.compatible(hmsSchema,
icebergTable.schema()),
+ "Iceberg table already created - with different specification");
Review comment:
Good summary @shardulm94! Thanks!
> There are a few problems I think of when it comes to Hive schemas when
compared to Iceberg
>
> 1. Field names are always lowercase
> 2. Field are always nullable
> 3. Difference in supported types
And we should add different partitioning philosophy, but that will be
something for later 😄
Good point on the differences. Definitely we should create a PR handling all
those differences.
> So, I don't see a good reason why we should even look at the Hive schema
when `InputFormatConfig.TABLE_SCHEMA` is present. Can we just disregard the
Hive schema in such cases? I would prefer looking at Hive schema when thats the
only schema information we have available. We won't need compatibility checks
as there would be nothing to compare against.
>
> For the use case to read a subset of columns from a non-Hive catalog
table, can we just make the user specify the the column names they want to
read? The user having to provide type information would be error-prone and
especially tedious for nested column definitions. The user provided type
information can also go stale overtime in reference to the source table.
For the longer term:
1. I think we should stick to the SQL standard as much as possible. I think
it would be an overkill to introduce new SQL language elements only for the
case where the Iceberg table is already created and we are creating a new Hive
table with fever columns.
2. If the above point stands, then we have a schema from the SQL and the
schema from the table which we have to compare anyway.
For short term:
1. We can decide to stick to the Hive schema generated from the Iceberg
schema if we accept that Hive recreates the SerDe quite often and we have to
load the table every case. This is "only" a performance question.
2. The bigger problem is that there is no technically correct way to
differentiate cases when the column specification is provided by the user or
when it is generated based on the Iceberg schema. We can rely on the column
comments ("from deserializer") or we can use undocumented feature to add a new
serDeProperties, but I would no like to depend on these for features. This
means that we either have to compare the Hive schema and the Iceberg schema, or
we have to throw away the schema provided by Hive. This later solution could be
a usability problem since we will hide the case when the user provided a schema
and it is not the same as the schema generated from the Iceberg schema.
So my thought process was the following:
1. In the long term we need the schema comparison
2. In the short term it would be good to have a schema comparison to provide
good usability
3. This is why I decided to implement the comparison.
The options I see:
1. Move forward with the change as it is 😄
2. Remove comparison and if there is an Iceberg schema available for us, use
it even if the user provided columns. We can remove the unsupported columns /
struct fields / maps, so the Hive table could be created. This could raise some
eyebrows when the table specification is provided in the SQL and the created
table does not conform to that specification, but we can document this
"feature" 😄
3. I am open to other suggestions
Thanks,
Peter
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]