[GitHub] [iceberg] shardulm94 commented on a change in pull request #1612: Hive: Using Hive schema to create tables and partition specification

GitBox Sat, 21 Nov 2020 14:04:31 -0800


shardulm94 commented on a change in pull request #1612:
URL: https://github.com/apache/iceberg/pull/1612#discussion_r528248059




##########
File path: mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergMetaHook.java
##########
@@ -80,6 +81,10 @@ public void preCreateTable(org.apache.hadoop.hive.metastore.api.Table hmsTable)
         Preconditions.checkArgument(catalogProperties.getProperty(InputFormatConfig.PARTITION_SPEC) == null,
             "Iceberg table already created - can not use provided partition specification");
 
+        Schema hmsSchema = HiveSchemaUtil.schema(hmsTable.getSd().getCols());
+        Preconditions.checkArgument(HiveSchemaUtil.compatible(hmsSchema, icebergTable.schema()),
+            "Iceberg table already created - with different specification");

Review comment:
   When creating tables, the schema can come from two sources, in any combination:
   1) `InputFormatConfig.TABLE_SCHEMA` (either provided by the user when creating the table, or coming from an existing table)
   2) The Hive schema (provided by the user or, based on our previous comment, apparently generated by Hive from the ObjectInspector anyway)
   
   There are a few problems I can think of with Hive schemas compared to Iceberg schemas:
   1) Field names are always lowercase
   2) Fields are always nullable
   3) The supported types differ
   
   So I don't see a good reason to look at the Hive schema at all when `InputFormatConfig.TABLE_SCHEMA` is present. Can we just disregard the Hive schema in such cases? I would prefer to look at the Hive schema only when that's the only schema information we have available. Then we won't need compatibility checks, as there would be nothing to compare against.
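
To make the suggested precedence concrete, here is a minimal sketch in plain Java. It is not the PR's code: `SchemaSourceSketch`, `Resolved`, and `resolve` are hypothetical names, and a schema is reduced to a list of column names so that the example stays self-contained.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch: none of these names are Iceberg or Hive APIs.
final class SchemaSourceSketch {

    /** Records which source won and the column names it supplied. */
    static final class Resolved {
        final String source;
        final List<String> columns;

        Resolved(String source, List<String> columns) {
            this.source = source;
            this.columns = columns;
        }
    }

    /**
     * Prefers the explicitly configured table schema and falls back to the
     * Hive columns only when no table schema is configured. No compatibility
     * check is performed, because only one source is ever consulted.
     */
    static Resolved resolve(Optional<List<String>> tableSchema, List<String> hiveColumns) {
        if (tableSchema.isPresent()) {
            // The Hive columns are deliberately ignored in this branch.
            return new Resolved("TABLE_SCHEMA", tableSchema.get());
        }
        if (hiveColumns.isEmpty()) {
            throw new IllegalArgumentException("No schema information available");
        }
        return new Resolved("HIVE", hiveColumns);
    }
}
```

With this shape, a mixed-case name such as `customerId` in the configured schema is never compared against Hive's lowercased `customerid`, which is the mismatch a compatibility check would otherwise have to tolerate.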
   
   For the use case of reading a subset of columns from a non-Hive catalog table, can we just make the user specify the column names they want to read? Having the user provide type information would be error-prone, and especially tedious for nested column definitions. User-provided type information can also go stale over time relative to the source table.
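
A rough sketch of name-only projection, again in plain Java with hypothetical names (`ColumnProjectionSketch`, `project`). It assumes the source table's schema is available as an ordered map from column name to a type string, and that requested names should match case-insensitively because Hive lowercases field names. Neither assumption comes from the PR.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the user lists column names only, and the types are
// always taken from the source table's schema.
final class ColumnProjectionSketch {

    /**
     * Returns the requested columns, in the requested order, with the names
     * and types exactly as the source schema defines them.
     */
    static Map<String, String> project(Map<String, String> sourceSchema, List<String> requested) {
        Map<String, String> projected = new LinkedHashMap<>();
        for (String name : requested) {
            String match = null;
            for (String candidate : sourceSchema.keySet()) {
                // Case-insensitive match; the first matching column wins.
                if (candidate.equalsIgnoreCase(name)) {
                    match = candidate;
                    break;
                }
            }
            if (match == null) {
                throw new IllegalArgumentException("Unknown column: " + name);
            }
            projected.put(match, sourceSchema.get(match));
        }
        return projected;
    }
}
```

Because the types are looked up at projection time, a nested definition such as `struct<street:string,zip:string>` never has to be retyped by the user, and it cannot drift from the source table.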






