pvary commented on pull request #1505: URL: https://github.com/apache/iceberg/pull/1505#issuecomment-698518875
> Would option 3 work? The reflection-based loader would need to be in the Hive classpath at all times. So we would have the same problem with that one, right?

This is a problem that Hive would need to fix. Without the HiveIcebergStorageHandler, Hive does not even know the columns in the table (a historic version is stored in HMS, but it is not considered correct and is queried again from the StorageHandler every time), so Hive considers these tables corrupt. Not being able to drop them is definitely a problem, but for the other functions it is questionable whether we could provide any data or any way to manipulate the table.

> I lean toward the approach of making it optional to set the serde and storage handler, and using either reflection to see if the classes currently exist or a table property to change the behavior. A table property makes sense to me: `engine.hive.enabled=true` could signal that these should be real classes.

A config would still mean that if we turn it on for a table, we need the HiveIcebergStorageHandler on the classpath even for the Spark processes that access the table.

In my opinion this is really a static/dynamic decision. With a config we statically mark a table as Hive-readable or Spark-readable. With the reflection-based solution we let the accessor decide dynamically which StorageHandler is needed. But you are absolutely right that if Hive access is required, then something has to be on the classpath.

Would combining the two be overkill? That is, a config says whether you want Hive access or not, and if you do, the ReflectionStorageHandler is set. With it enabled, Spark could still access the HiveCatalog without the StorageHandler, and Hive would use HiveIcebergStorageHandler.

I see the following use cases:

1. Users have data on S3 but need some consistent store on the cloud. There is no HDFS at hand, so they need HiveCatalog, but they really use only Spark.
2. Users use Spark and only occasionally Hive. Since they already have Hive at hand, they use HiveCatalog.
3. Users use both Spark and Hive. HiveCatalog is an obvious choice.
4. Users use Hive but want to use Iceberg for the partitioning and such.

You might know more about other use cases to consider...
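To make the combined config-plus-reflection idea concrete, here is a rough sketch of the decision logic. It is not the actual Iceberg code: the class `StorageHandlerChooser`, its two methods, and the placeholder name `ReflectionStorageHandler` are hypothetical, and only the `engine.hive.enabled` property comes from the discussion. The real handler class name is passed in as a parameter so the sketch does not assume its package.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical helper illustrating the static (config) and dynamic (reflection) steps.
final class StorageHandlerChooser {
  // Property name proposed in the discussion.
  static final String HIVE_ENABLED = "engine.hive.enabled";
  // Placeholder name for the reflection-based handler; an assumption, not a real class.
  static final String REFLECTION_HANDLER = "ReflectionStorageHandler";

  private StorageHandlerChooser() {}

  // Static step: decide at table creation whether any handler is recorded at all.
  static Optional<String> storedHandler(Map<String, String> tableProperties) {
    boolean enabled = Boolean.parseBoolean(tableProperties.getOrDefault(HIVE_ENABLED, "false"));
    return enabled ? Optional.of(REFLECTION_HANDLER) : Optional.empty();
  }

  // Dynamic step: the accessor uses the real handler only if it can be loaded.
  static Optional<String> resolve(Optional<String> stored, String realHandlerClass) {
    if (!stored.isPresent()) {
      return Optional.empty(); // table was never marked for Hive access
    }
    try {
      // Look the class up without initializing it.
      Class.forName(realHandlerClass, false, StorageHandlerChooser.class.getClassLoader());
      return Optional.of(realHandlerClass);
    } catch (ClassNotFoundException | LinkageError e) {
      return Optional.empty(); // e.g. a Spark process without the Hive handler jar
    }
  }
}
```

With this split, a Spark process that lacks the handler jar gets an empty result and can carry on with HiveCatalog alone, while Hive, which has the jar, resolves the real handler.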
