Hi Team,

We are starting to implement insert overwrites for Iceberg tables in Hive.
The current situation is that we commit our inserts on the Tez AM (or MR
Application Master) side, where we have no information about whether the
insert query was an insert overwrite or not. To make insert overwrites
work, we have to migrate our job commit logic to the HS2 side, into the
HiveMetaHook, which does receive the overwrite flag we need.
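
For context, the HS2-side entry point we are targeting looks roughly like
this (a minimal sketch, assuming the DefaultHiveMetaHook insert callbacks;
the class and its name are illustrative only, and the unrelated create/drop
hooks are elided by keeping it abstract):

import org.apache.hadoop.hive.metastore.DefaultHiveMetaHook;
import org.apache.hadoop.hive.metastore.api.Table;

// Illustrative only: kept abstract so the create/drop hooks
// don't have to be spelled out here.
public abstract class OverwriteAwareMetaHook extends DefaultHiveMetaHook {
  @Override
  public void commitInsertTable(Table table, boolean overwrite) {
    // 'overwrite' distinguishes INSERT OVERWRITE from plain INSERT:
    // exactly the flag that is not visible on the Tez AM / MR AM side.
    if (overwrite) {
      // replace the table's current contents with the new data files
    } else {
      // append the new data files
    }
  }
}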

However, in the HiveMetaHook we lack some crucial information for the
commit that we previously relied on, such as the JobID, the Tez VertexId,
and the number of map/reduce tasks that produced data files. What we do
have access to is the table location and the query id. So our solution
would be to collect all the information under the tableLocation/queryId
folder during Tez/MR execution, and then, in the HiveMetaHook, use a file
listing to read the contents of that folder, which would give us all the
info we need for the commit to work reliably.
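
To make the mechanism concrete, here is a rough sketch of the two sides,
using Hadoop's FileSystem API (the task-N.forCommit naming and the exact
serialization are placeholders, not a final design):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CommitFileSketch {

  // Executor side: each task writes a small file describing the data
  // files it produced, under <tableLocation>/<queryId>/.
  static void writeTaskCommitFile(Configuration conf, String tableLocation,
      String queryId, int taskId, byte[] serializedDataFiles) throws IOException {
    Path commitFile =
        new Path(tableLocation, queryId + "/task-" + taskId + ".forCommit");
    FileSystem fs = commitFile.getFileSystem(conf);
    try (FSDataOutputStream out = fs.create(commitFile, /* overwrite */ false)) {
      out.write(serializedDataFiles);
    }
  }

  // HS2 side (HiveMetaHook): a single listing of the queryId folder
  // recovers everything the commit needs; the file count also tells us
  // how many tasks produced output, so no JobID, VertexId or task-count
  // bookkeeping is required.
  static FileStatus[] listTaskCommitFiles(Configuration conf,
      String tableLocation, String queryId) throws IOException {
    Path commitDir = new Path(tableLocation, queryId);
    FileSystem fs = commitDir.getFileSystem(conf);
    return fs.listStatus(commitDir);
  }
}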

This would mean a single listing operation per query, so while there is
some performance overhead, it shouldn't be significant. Also, now that S3
listing is strongly consistent, it would be safe to rely on the results.
However, given how the project has previously tried to minimize listing
operations, I wanted to get your opinions on this: do you have any
objections, or do you see any risks?

Thanks a lot,
Marton
