wypoon commented on pull request #4017: URL: https://github.com/apache/iceberg/pull/4017#issuecomment-1032165351
@rdblue I realize now that you view this issue differently than I do. You view engine.hive.enabled as a property of the Hive environment, related to the Iceberg classes being available on the classpath. I view engine.hive.enabled as a property of the table: whether or not a table is Hive-enabled is a trait of the table rather than of the Hive environment. So I do not see having the property in the table as "leaking Hive environment", hence my proposal to write engine.hive.enabled into the table properties when the table is created (not "from some point in time") if it is created as Hive-enabled.

From what I observe as a Spark user, I have been able to create, read and write Hive-enabled Iceberg tables without having the Hive Iceberg StorageHandler, SerDe, InputFormat and OutputFormat classes on my classpath; I only need the Iceberg Spark runtime jar. So I don't think of whether a table is Hive-enabled as tied to my environment and classpath.

The way we encountered this issue is in a setting with a shared HMS, where I create an Iceberg table using Spark (with iceberg.engine.hive.enabled=true in my conf, as we want Hive-enabled tables for interop with Hive). There is no Hive on this cluster. In a different cluster, we have Hive; iceberg.engine.hive.enabled is not set in the environment of that Hive cluster (so it defaults to false), as there did not seem to be a need for it. @pvary or @mbod can speak to the Hive side, since I'm not knowledgeable about it, but I believe that when you create an Iceberg table using Hive QL, Hive creates it with engine.hive.enabled=true in the properties. It's just that when Spark creates the table, it doesn't write engine.hive.enabled=true into the properties (which is what this change fixes).
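The create-path behavior I'm proposing can be sketched roughly as follows. The two property keys are the real Iceberg ones (iceberg.engine.hive.enabled in the engine conf, engine.hive.enabled in table properties), but `propagateHiveEnabled` is a hypothetical helper for illustration, not the actual code in this PR:

```java
import java.util.HashMap;
import java.util.Map;

public class CreateSketch {
    // Real Iceberg property keys (engine conf key vs. table property key).
    static final String CONF_KEY = "iceberg.engine.hive.enabled";
    static final String TABLE_KEY = "engine.hive.enabled";

    // Hypothetical helper: when the engine conf asks for a Hive-enabled table,
    // record that choice in the table's own properties at create time, so the
    // trait travels with the table rather than with the environment.
    static Map<String, String> propagateHiveEnabled(Map<String, String> conf,
                                                    Map<String, String> tableProps) {
        Map<String, String> props = new HashMap<>(tableProps);
        if (Boolean.parseBoolean(conf.getOrDefault(CONF_KEY, "false"))) {
            props.putIfAbsent(TABLE_KEY, "true");
        }
        return props;
    }

    public static void main(String[] args) {
        // Spark cluster with iceberg.engine.hive.enabled=true in its conf.
        Map<String, String> conf = Map.of(CONF_KEY, "true");
        Map<String, String> props = propagateHiveEnabled(conf, new HashMap<>());
        System.out.println(props.get(TABLE_KEY)); // prints "true"
    }
}
```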
Hive can read the Iceberg table Spark created (because HMS says it has the Iceberg StorageHandler, SerDe, and Input/OutputFormat), but if it commits an update, `hiveEngineEnabled` will be false in `HiveTableOperations#doCommit`, so the table is turned into a non-Hive-enabled table (it will no longer have the Iceberg StorageHandler, SerDe, or Input/OutputFormat), and Hive won't be able to read it after that. We could also have a situation where Spark (in another cluster where iceberg.engine.hive.enabled is not set in the conf) updates the table, which would likewise turn a Hive-enabled table into a non-Hive-enabled table if engine.hive.enabled=true is not in the table properties.
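The commit-time failure mode, and how the table property avoids it, can be illustrated with a small sketch. `resolveHiveEngineEnabled` is a hypothetical resolution function, not the actual Iceberg code; the point is that giving the table property precedence over the session conf keeps a commit from an engine without the conf set from silently stripping the Hive metadata:

```java
import java.util.Map;

public class CommitSketch {
    static final String CONF_KEY = "iceberg.engine.hive.enabled";
    static final String TABLE_KEY = "engine.hive.enabled";

    // Hypothetical resolution: the table property, if present, wins over the
    // session conf; only when the table is silent do we fall back to the conf
    // (defaulting to false, as described above).
    static boolean resolveHiveEngineEnabled(Map<String, String> tableProps,
                                            Map<String, String> conf) {
        if (tableProps.containsKey(TABLE_KEY)) {
            return Boolean.parseBoolean(tableProps.get(TABLE_KEY));
        }
        return Boolean.parseBoolean(conf.getOrDefault(CONF_KEY, "false"));
    }

    public static void main(String[] args) {
        // Table created as Hive-enabled by Spark; the commit comes from a
        // cluster whose conf does not set iceberg.engine.hive.enabled.
        Map<String, String> tableProps = Map.of(TABLE_KEY, "true");
        Map<String, String> conf = Map.of();
        // With the property in the table, the commit keeps it Hive-enabled.
        System.out.println(resolveHiveEngineEnabled(tableProps, conf)); // prints "true"
    }
}
```

Without the table property (today's behavior when Spark created the table), the same commit would resolve to false and drop the StorageHandler/SerDe from the HMS entry.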