[GitHub] [iceberg] pvary commented on a change in pull request #1612: Hive: Using Hive schema to create tables and partition specification

GitBox Mon, 23 Nov 2020 06:39:03 -0800


pvary commented on a change in pull request #1612:
URL: https://github.com/apache/iceberg/pull/1612#discussion_r528748947




##########
File path: mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergMetaHook.java
##########
@@ -80,6 +81,10 @@ public void 
preCreateTable(org.apache.hadoop.hive.metastore.api.Table hmsTable)
         
Preconditions.checkArgument(catalogProperties.getProperty(InputFormatConfig.PARTITION_SPEC)
 == null,
             "Iceberg table already created - can not use provided partition 
specification");
 
+        Schema hmsSchema = HiveSchemaUtil.schema(hmsTable.getSd().getCols());
+        Preconditions.checkArgument(HiveSchemaUtil.compatible(hmsSchema, 
icebergTable.schema()),
+            "Iceberg table already created - with different specification");

Review comment:
       Good summary @shardulm94! Thanks!
   
   > There are a few problems I think of when it comes to Hive schemas when 
compared to Iceberg
   > 
   > 1. Field names are always lowercase
   > 2. Field are always nullable
   > 3. Difference in supported types
   
   And we should add different partitioning philosophy, but that will be 
something for later 😄 
   Good point on the differences. Definitely we should create a PR handling all 
those differences.
   
   > So, I don't see a good reason why we should even look at the Hive schema 
when `InputFormatConfig.TABLE_SCHEMA` is present. Can we just disregard the 
Hive schema in such cases? I would prefer looking at Hive schema when thats the 
only schema information we have available. We won't need compatibility checks 
as there would be nothing to compare against.
   > 
   > For the use case to read a subset of columns from a non-Hive catalog 
table, can we just make the user specify the the column names they want to 
read? The user having to provide type information would be error-prone and 
especially tedious for nested column definitions. The user provided type 
information can also go stale overtime in reference to the source table.
   
   For the longer term:
   1. I think we should stick to the SQL standard as much as possible. I think 
it would be an overkill to introduce new SQL language elements only for the 
case where the Iceberg table is already created and we are creating a new Hive 
table with fever columns.
   2. If the above point stands, then we have a schema from the SQL and the 
schema from the table which we have to compare anyway.
   
   For short term:
   1. We can decide to stick to the Hive schema generated from the Iceberg 
schema if we accept that Hive recreates the SerDe quite often and we have to 
load the table every case. This is "only" a performance question.
   2. The bigger problem is that there is no technically correct way to 
differentiate cases when the column specification is provided by the user or 
when it is generated based on the Iceberg schema. We can rely on the column 
comments ("from deserializer") or we can use undocumented feature to add a new 
serDeProperties, but I would no like to depend on these for features. This 
means that we either have to compare the Hive schema and the Iceberg schema, or 
we have to throw away the schema provided by Hive. This later solution could be 
a usability problem since we will hide the case when the user provided a schema 
and it is not the same as the schema generated from the Iceberg schema.
   
   So my thought process was the following:
   1. In the long term we need the schema comparison
   2. In the short term it would be good to have a schema comparison to provide 
good usability
   3. This is why I decided to implement the comparison.
   
   The options I see:
   1. Move forward with the change as it is 😄 
   2. Remove comparison and if there is an Iceberg schema available for us, use 
it even if the user provided columns. We can remove the unsupported columns / 
struct fields / maps, so the Hive table could be created. This could raise some 
eyebrows when the table specification is provided in the SQL and the created 
table does not conform to that specification, but we can document this 
"feature" 😄 
   3. I am open to other suggestions
   
   Thanks,
   Peter




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] pvary commented on a change in pull request #1612: Hive: Using Hive schema to create tables and partition specification

Reply via email to