Hi Team,

We are starting to implement insert overwrites for Iceberg tables in Hive. Currently we commit our inserts on the TezAM/Application Master (MR) side, where we have no way of telling whether the query was an insert overwrite or a plain insert. To make insert overwrites work, we have to migrate our job commit logic to the HS2 side, into the HiveMetaHook, which does provide the overwrite flag we need (rough sketch below).
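For reference, the HS2-side entry point would look roughly like this, assuming we build on the DefaultHiveMetaHook variant of the hook and its commitInsertTable(Table, boolean) callback; the class name and the method bodies are placeholders, not the actual implementation:

    import org.apache.hadoop.hive.metastore.DefaultHiveMetaHook;
    import org.apache.hadoop.hive.metastore.api.MetaException;
    import org.apache.hadoop.hive.metastore.api.Table;

    // Sketch of the HS2-side commit entry point. Unlike on the TezAM/MR
    // side, the overwrite flag is handed to us here.
    public class IcebergCommitHookSketch extends DefaultHiveMetaHook {

      @Override
      public void commitInsertTable(Table table, boolean overwrite) throws MetaException {
        String tableLocation = table.getSd().getLocation();
        if (overwrite) {
          // build an Iceberg overwrite commit (e.g. replace partitions)
        } else {
          // build a plain append commit
        }
      }

      @Override
      public void preInsertTable(Table table, boolean overwrite) { /* no-op in this sketch */ }

      @Override
      public void rollbackInsertTable(Table table, boolean overwrite) { /* no-op in this sketch */ }

      // HiveMetaHook create/drop callbacks left as no-ops for brevity
      @Override public void preCreateTable(Table table) { }
      @Override public void rollbackCreateTable(Table table) { }
      @Override public void commitCreateTable(Table table) { }
      @Override public void preDropTable(Table table) { }
      @Override public void rollbackDropTable(Table table) { }
      @Override public void commitDropTable(Table table, boolean deleteData) { }
    }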
However, in the HiveMetaHook we lack some crucial information that the commit previously relied on, such as the JobID, the Tez vertex id, and the number of map/reduce tasks that produced data files. What we do have access to is the table location and the query id.

Our proposal is therefore to collect all of this information under the tableLocation/queryId folder during Tez/MR execution, and then, in the HiveMetaHook, list the contents of that folder, which would give us everything we need for the commit to work reliably. This means a single listing operation per query, so while there is some performance overhead, it should not be significant. Also, now that S3 listing is strongly consistent, it should be safe to rely on the results.

However, given that the project has previously tried to minimize listing operations, I wanted to get your opinions on this: do you have any objections, or do you see any risks? To make the mechanics concrete, I have put a rough sketch of both sides at the end of this mail.
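The file naming, the payload format, and the helper class below are all made up for illustration; the real implementation would presumably serialize Iceberg DataFile descriptors, but any per-task payload works for the purposes of the sketch:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public final class CommitFileSketch {

      // Task side (Tez/MR): every task that produced data files drops a
      // small "forCommit" file under <tableLocation>/<queryId>/, keyed by
      // its task attempt id, holding whatever the commit later needs.
      public static void writeCommitFile(Configuration conf, String tableLocation,
          String queryId, String taskAttemptId, byte[] payload) throws IOException {
        Path commitFile = new Path(new Path(tableLocation, queryId), taskAttemptId + ".forCommit");
        FileSystem fs = commitFile.getFileSystem(conf);
        try (FSDataOutputStream out = fs.create(commitFile, false /* fail if it already exists */)) {
          out.write(payload);
        }
      }

      // HS2 side (HiveMetaHook): one listStatus() call per query recovers
      // what all tasks wrote, with no need to know the JobID, the Tez
      // vertex id, or the task counts up front.
      public static List<byte[]> readCommitFiles(Configuration conf, String tableLocation,
          String queryId) throws IOException {
        Path commitDir = new Path(tableLocation, queryId);
        FileSystem fs = commitDir.getFileSystem(conf);
        List<byte[]> payloads = new ArrayList<>();
        for (FileStatus status : fs.listStatus(commitDir)) {  // the single listing
          byte[] buf = new byte[(int) status.getLen()];
          try (FSDataInputStream in = fs.open(status.getPath())) {
            in.readFully(buf);
          }
          payloads.add(buf);
        }
        return payloads;
      }
    }

Since everything lives under one folder per query, the hook needs only the table location and the query id, which is exactly the information it has.

Thanks a lot,
Marton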