maswin opened a new pull request, #4446: URL: https://github.com/apache/hive/pull/4446
Changes Made:

[HIVE-27458: Support multiple OutputCommitter in a single Vertex](https://github.com/apache/hive/pull/4446/commits/c5ae473d84cc7d1ab332c6c056681aef1d72216d):

Reason: There can be more than one OutputCommitter per vertex. If a job writes to two Iceberg tables in the same vertex, two commit tasks can run in that vertex. In that case, the first committer to finish commitTask removes every writer belonging to the task from the WriterRegistry, including the writers of the other committer. WriterRegistry holds all writers in a ConcurrentMap keyed by task attempt id: https://github.com/apache/hive/blob/9da7488179e7c69d986dbc8a6654a5c3dc6c0210/iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/writer/WriterRegistry.java#L29

To fix this:
- Added a new setting "iceberg.output.id" to differentiate between output committers, so that commitTask only removes the writers belonging to its own OutputCommitter.
- Existing Hive jobs are not affected: the id will be null, so the key becomes (taskId, null), which behaves like the existing flow.

[HIVE-27458: Support HCatStorer and HCatLoader for Iceberg tables](https://github.com/apache/hive/pull/4446/commits/af8b44889152194639511a48bf6e9ea5cd32c9fd):

Reason: The [Pig Reader](https://github.com/apache/iceberg/tree/master/pig/src/main/java/org/apache/iceberg/pig) available in Iceberg can only read Parquet tables using the Hadoop catalog. There is no way to read tables in the Glue or Hive catalog via Pig, and there is no way to write data via Pig.
- Not much change is required apart from setting the write operation, which for now will always be OTHER because delete/update are not supported in Pig.
- The "iceberg.mr.output.tables" setting was missing because it is set in configureJob. It can be set in configureOutputProperties along with the "iceberg.mr.operation.type" property.
- configureOutputProperties can be called multiple times.
The implementation should take care of [populating similar properties](https://github.com/apache/hive/blob/9da7488179e7c69d986dbc8a6654a5c3dc6c0210/ql/src/java/org/apache/hadoop/hive/ql/metadata/HiveStorageHandler.java#L161). To avoid adding the same table name repeatedly, a check was added to see whether the outputTable name is already present in the config.
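
The two ideas described above (keying writers by task attempt id plus an output id, and appending a table name to the config only once) can be illustrated with a minimal, self-contained sketch. This is not the code from the PR: the class `WriterRegistrySketch`, its methods, the use of `String` in place of real writer objects, and the `,` separator are all hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical sketch; NOT the actual Hive WriterRegistry or storage handler.
final class WriterRegistrySketch {

  // Composite key: task attempt id plus an optional output id.
  // outputId is null for jobs that never set "iceberg.output.id".
  static final class Key {
    final String taskAttemptId;
    final String outputId;

    Key(String taskAttemptId, String outputId) {
      this.taskAttemptId = Objects.requireNonNull(taskAttemptId);
      this.outputId = outputId; // null is allowed
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof Key)) {
        return false;
      }
      Key other = (Key) o;
      return taskAttemptId.equals(other.taskAttemptId)
          && Objects.equals(outputId, other.outputId);
    }

    @Override
    public int hashCode() {
      return Objects.hash(taskAttemptId, outputId);
    }
  }

  // Writers are plain strings here; real writers would be objects.
  private static final Map<Key, List<String>> WRITERS = new ConcurrentHashMap<>();

  static void register(String taskAttemptId, String outputId, String writer) {
    WRITERS.computeIfAbsent(new Key(taskAttemptId, outputId),
        k -> new CopyOnWriteArrayList<>()).add(writer);
  }

  // Read-only snapshot of the writers for one (task attempt, output) pair.
  static List<String> writers(String taskAttemptId, String outputId) {
    List<String> found = WRITERS.get(new Key(taskAttemptId, outputId));
    return found == null ? Collections.emptyList() : new ArrayList<>(found);
  }

  // Removes only the writers of one committer; other outputs of the same task stay.
  static List<String> removeWriters(String taskAttemptId, String outputId) {
    List<String> removed = WRITERS.remove(new Key(taskAttemptId, outputId));
    return removed == null ? Collections.emptyList() : new ArrayList<>(removed);
  }

  // Appends tableName to a separator-joined list unless it is already present,
  // so calling it several times leaves the value unchanged.
  // The "," separator is an assumption; the real config may use a different one.
  static String addOutputTable(String existing, String tableName) {
    if (existing == null || existing.isEmpty()) {
      return tableName;
    }
    if (Arrays.asList(existing.split(",")).contains(tableName)) {
      return existing;
    }
    return existing + "," + tableName;
  }

  public static void main(String[] args) {
    register("attempt_0", "tableA", "writerA");
    register("attempt_0", "tableB", "writerB");
    System.out.println(removeWriters("attempt_0", "tableA"));
    System.out.println(writers("attempt_0", "tableB"));
    System.out.println(addOutputTable(addOutputTable(null, "db.t1"), "db.t1"));
  }
}
```

With a key of this shape, a job that never sets an output id registers and removes its writers under (taskAttemptId, null), so there is one entry per task attempt, as with a key of task attempt id alone.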