Hi Iceberg and Hive Teams,

As some of you already know, we are working on making Iceberg available as a first-class storage layer for Hive.
Folks on the Iceberg side did a good job of utilizing the existing Hive SerDe API for the released Hive 2.3.8 and 3.1.2 versions. Thanks to their efforts, we have read support for queries on Iceberg-backed Hive tables, with predicate pushdown and column pruning. In the last few months we added basic write and DDL support, so one can now create an Iceberg-backed Hive table and insert data into it with Hive queries. The code for these features is in the Iceberg repo and is available through the released iceberg-mr-runtime.jar for everyone to try out.

There are some important features where the current Hive query execution model and SerDe API are not enough to achieve what we need. Just to name a few:

- CREATE TABLE AS ...: Here we need to create an Iceberg table first, then write the data. Hive currently writes the data to a temporary directory and uses MoveTask to move it to its final place.
- INSERT OVERWRITE ...: We need information about the jobs/tasks on the HS2 side to commit the changes to an Iceberg table. This information is not currently available in the DefaultHiveMetaHook.commitInsert method.
- We would like to extend the Hive query language with Iceberg-specific features, such as time travel and Iceberg-specific partitioning.

We fully expect to find even more roadblocks as we progress with our roadmap. We might be able to work around the limitations with some hacky solutions, but those do not pave the way for a stable long-term integration. The right solution to this problem is to extend the SerDe API and enhance the query execution logic based on the new SerDe API. This will be an iterative process in which the API keeps evolving until we reach the "final" stable stage.

To streamline this process, we propose to create an iceberg-handler module in Hive and use the existing iceberg-mr/iceberg-hive3 Iceberg modules as a baseline for it. We can extend and use the new SerDe API in this new iceberg-handler module and iterate faster.
When there is a Hive release, we can decide our next steps based on the situation at that time. In the meantime, we can port between the two repos those changes that do not require the new APIs.

I would like to hear both teams' opinions on the proposed solution.

Thanks,
Peter