pvary commented on pull request #1505:
URL: https://github.com/apache/iceberg/pull/1505#issuecomment-698518875


   > Would option 3 work? The reflection-based loader would need to be in the 
Hive classpath at all times. So we would have the same problem with that one 
right? This is a problem that Hive would need to fix.
   
   Without the HiveIcebergStorageHandler, Hive will not even know the columns 
of the table (a historic version of the schema is stored in HMS, but it is not 
considered authoritative and is queried from the StorageHandler again every 
time), so Hive considers these tables corrupt. Not being able to drop them is 
definitely a problem, but for the other operations it is questionable whether 
we provide any way to read or manipulate the data.
   
   > I lean toward the approach of making it optional to set the serde and 
storage handler, and using either reflection to see if the classes currently 
exist or a table property to change the behavior. A table property makes sense 
to me: `engine.hive.enabled=true` could signal that these should be real 
classes.
   
   A config would still mean that if we turn it on for a table, then we need 
the HiveIcebergStorageHandler on the classpath even for the Spark processes 
which access the table.
   
   In my opinion this is really a static vs. dynamic decision. With a config we 
statically mark a table as Hive- or Spark-readable. With the reflection-based 
solution we let the accessor dynamically decide which StorageHandler is needed. 
But you are absolutely right that if Hive access is required, then something 
has to be on the classpath.
   
   Would combining the two be overkill?
   That is, a config for whether you want Hive access, and if you do, a 
ReflectionStorageHandler would be set. With that enabled, Spark could still 
access the HiveCatalog without the StorageHandler on its classpath, and Hive 
would use the HiveIcebergStorageHandler.
   
   I see the following use-cases:
   1. Users have data on S3 but need some consistent store in the cloud. With 
no HDFS at hand, they need the HiveCatalog, but really use only Spark
   2. Users use Spark and only occasionally Hive. Since they already have Hive 
at hand, they use the HiveCatalog
   3. Users use both Spark and Hive. The HiveCatalog is an obvious choice
   4. Users use Hive but want to use Iceberg for the partitioning and such
   
   You might know more about other use-cases to consider...


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


