shardulm94 commented on a change in pull request #1612:
URL: https://github.com/apache/iceberg/pull/1612#discussion_r528248059
##########
File path: mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergMetaHook.java
##########
@@ -80,6 +81,10 @@ public void
preCreateTable(org.apache.hadoop.hive.metastore.api.Table hmsTable)
Preconditions.checkArgument(catalogProperties.getProperty(InputFormatConfig.PARTITION_SPEC)
== null,
"Iceberg table already created - can not use provided partition
specification");
+ Schema hmsSchema = HiveSchemaUtil.schema(hmsTable.getSd().getCols());
+ Preconditions.checkArgument(HiveSchemaUtil.compatible(hmsSchema,
icebergTable.schema()),
+ "Iceberg table already created - with different specification");
Review comment:
We have combinations of two sources of schemas when creating tables
1) `InputFormatConfig.TABLE_SCHEMA` (either provided by the user when
creating table, or coming from an existing table)
2) Hive schema (provided by the user, or based on our previous comment seems
like Hive will generate it based on ObjectInspector anyways)
There are a few problems I think of when it comes to Hive schemas when
compared to Iceberg
1) Field names are always lowercase
2) Field are always nullable
3) Difference in supported types
So, I don't see a good reason why we should even look at the Hive schema
when `InputFormatConfig.TABLE_SCHEMA` is present. Can we just disregard the
Hive schema in such cases? I would prefer looking at Hive schema when thats the
only schema information we have available. We won't need compatibility checks
as there would be nothing to compare against.
For the use case to read a subset of columns from a non-Hive catalog table,
can we just make the user specify the the column names they want to read? The
user having to provide type information would be error-prone and especially
tedious for nested column definitions. The user provided type information can
also go stale overtime in reference to the source table.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]