[PR] HIVE-27458: Support HCatStorer and HCatLoader for Iceberg tables [hive]

via GitHub Sat, 31 May 2025 03:04:49 -0700


maswin opened a new pull request, #4446:
URL: https://github.com/apache/hive/pull/4446

Changes Made:
[HIVE-27458: Support multiple OutputCommitter in a single
Vertex](https://github.com/apache/hive/pull/4446/commits/c5ae473d84cc7d1ab332c6c056681aef1d72216d)
:
Reason: A vertex can have more than one OutputCommitter. If a job writes to
two Iceberg tables in the same vertex, two commit tasks can run in that vertex.
In that case, the first committer to finish commitTask removes all writers
belonging to the task from the WriterRegistry, including writers that belong
to the other committer. WriterRegistry holds all writers in a concurrent map
keyed by task attempt ID:
https://github.com/apache/hive/blob/9da7488179e7c69d986dbc8a6654a5c3dc6c0210/iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/writer/WriterRegistry.java#L29
To fix this:

- Added a new setting "iceberg.output.id" to differentiate between output
committers, so that commitTask removes only the writers belonging to its own
OutputCommitter.
- Existing Hive jobs are unaffected: the id will be null, so the key becomes
(taskId, null), which behaves like the existing flow.
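
The keying change described above can be illustrated with a minimal sketch.
It is not the actual WriterRegistry code: the class `WriterRegistrySketch`,
its method names, and the use of plain strings in place of writer objects are
all hypothetical, chosen only to show how a composite (task attempt id,
output id) key keeps one committer from removing another committer's writers.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified stand-in for WriterRegistry; not the Hive class.
final class WriterRegistrySketch {

  // Composite key: task attempt id plus an optional output id.
  // The output id is null for existing Hive jobs that never set it.
  static final class Key {
    private final String taskAttemptId;
    private final String outputId; // may be null

    Key(String taskAttemptId, String outputId) {
      this.taskAttemptId = Objects.requireNonNull(taskAttemptId);
      this.outputId = outputId;
    }

    @Override
    public boolean equals(Object o) {
      if (this == o) {
        return true;
      }
      if (!(o instanceof Key)) {
        return false;
      }
      Key other = (Key) o;
      return taskAttemptId.equals(other.taskAttemptId)
          && Objects.equals(outputId, other.outputId);
    }

    @Override
    public int hashCode() {
      return Objects.hash(taskAttemptId, outputId);
    }
  }

  // Writers are represented as strings here; the real registry stores writer objects.
  private final Map<Key, Map<String, String>> writers = new ConcurrentHashMap<>();

  void registerWriter(String taskAttemptId, String outputId, String table, String writer) {
    writers.computeIfAbsent(new Key(taskAttemptId, outputId), k -> new ConcurrentHashMap<>())
        .put(table, writer);
  }

  // Returns the writers registered under exactly this (task attempt, output) pair.
  Map<String, String> writers(String taskAttemptId, String outputId) {
    Map<String, String> found = writers.get(new Key(taskAttemptId, outputId));
    return found == null ? Collections.<String, String>emptyMap() : found;
  }

  // Removes only the writers for this pair; other outputs of the same task stay.
  Map<String, String> removeWriters(String taskAttemptId, String outputId) {
    Map<String, String> removed = writers.remove(new Key(taskAttemptId, outputId));
    return removed == null ? Collections.<String, String>emptyMap() : removed;
  }
}
```

With this shape, a job that never sets an output id keeps using a single
(taskId, null) entry per task attempt, matching the existing behaviour.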

[HIVE-27458: Support HCatStorer and HCatLoader for Iceberg
tables](https://github.com/apache/hive/pull/4446/commits/af8b44889152194639511a48bf6e9ea5cd32c9fd)
:

Reason: The [Pig
Reader](https://github.com/apache/iceberg/tree/master/pig/src/main/java/org/apache/iceberg/pig)
available in Iceberg can only read Parquet tables using the Hadoop catalog.
There is no way to read tables in the Glue or Hive catalog via Pig, and no way
to write data via Pig.

- Little change is required apart from setting the write operation, which for
now is always OTHER because delete/update are not supported in Pig.
- The "iceberg.mr.output.tables" setting was missing because it is set in
configureJob. It can instead be set in configureOutputProperties along with the
"iceberg.mr.operation.type" property.
- configureOutputProperties can be called multiple times, and the
implementation should take care of [populating similar
properties](https://github.com/apache/hive/blob/9da7488179e7c69d986dbc8a6654a5c3dc6c0210/ql/src/java/org/apache/hadoop/hive/ql/metadata/HiveStorageHandler.java#L161).
To avoid adding the same table name repeatedly, added a check for whether the
output table name is already present in the config.
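
The duplicate check in the last point can be sketched as follows. This is an
illustration, not the code in the PR: the class `OutputTablesSketch`, the
method signature, the use of a plain `Map` for job properties, and the `","`
separator are assumptions. Only the two property names and the OTHER
operation value come from the description above.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper; the real logic lives in the Iceberg storage handler.
final class OutputTablesSketch {
  static final String OUTPUT_TABLES = "iceberg.mr.output.tables";
  static final String OPERATION_TYPE = "iceberg.mr.operation.type";
  // Assumed separator for this sketch; the actual handler may use a different one.
  static final String SEPARATOR = ",";

  // Safe to call repeatedly for the same table: the table name is appended at most once.
  static void configureOutputProperties(Map<String, String> jobProperties, String tableName) {
    // Pig has no delete/update support, so the operation is always OTHER.
    jobProperties.put(OPERATION_TYPE, "OTHER");

    String current = jobProperties.get(OUTPUT_TABLES);
    if (current == null || current.isEmpty()) {
      jobProperties.put(OUTPUT_TABLES, tableName);
      return;
    }
    // Compare whole entries so that "db.t1" does not match "db.t10".
    boolean alreadyPresent =
        Arrays.asList(current.split(Pattern.quote(SEPARATOR))).contains(tableName);
    if (!alreadyPresent) {
      jobProperties.put(OUTPUT_TABLES, current + SEPARATOR + tableName);
    }
  }
}
```

Splitting the existing value and comparing whole entries, rather than using a
substring match, avoids treating one table name as present just because it is
a prefix of another.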

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
