Hi Iceberg and Hive Teams,

As some of you already know, we are working on making Iceberg available as a
first-class storage layer for Hive.

Folks on the Iceberg side did a good job of utilizing the existing Hive SerDe
API for the released Hive 2.3.8 and 3.1.2 versions. Thanks to their efforts we
have read support for queries on Iceberg-backed Hive tables, with predicate
pushdown and column pruning. In the last few months we added basic write and
DDL support, so one can now create an Iceberg-backed Hive table and insert data
into it with Hive queries. The code for these features is in the Iceberg repo
and is available through the released iceberg-mr-runtime.jar for everyone to
try out.

There are some important features where the current Hive query execution model
and SerDe API are not enough to achieve what we need. Just to name a few:
- CREATE TABLE AS ... - Here we need to create an Iceberg table first, then
  write the data. Hive currently writes the data to a temporary dir and uses
  MoveTask to move it to the final place.
- INSERT OVERWRITE ... - We need information about the jobs/tasks on the HS2
  side to commit the changes to an Iceberg table. This information is not
  available at the moment in the DefaultHiveMetaHook.commitInsert method.
- We would like to extend the Hive query language with Iceberg-specific bits,
  like time travel, Iceberg-specific partitioning, etc.

We fully expect to find even more roadblocks as we progress with our roadmap.
We might be able to work around the limitations with some hacky solutions, but
those do not pave the road for a long-term, stable integration. A better
solution is to extend the SerDe API and enhance the query execution logic
based on the new SerDe API. This will be an iterative process in which the API
keeps evolving until we reach the "final" stable stage.

To streamline this process, we propose to create an iceberg-handler module in
Hive and use the existing iceberg-mr/iceberg-hive3 Iceberg modules as a
baseline for it. We can extend and use the new SerDe API in this new
iceberg-handler module and iterate faster. When there is a Hive release we can
decide our next steps based on the actual landscape, and in the meantime we can
port the changes that do not require the new APIs between the two repos.

I would like to hear both teams' opinions on the proposed solution.

Thanks,
Peter
