maswin opened a new pull request, #4446:
URL: https://github.com/apache/hive/pull/4446

   Changes Made:
   [HIVE-27458: Support multiple OutputCommitter in a single 
Vertex](https://github.com/apache/hive/pull/4446/commits/c5ae473d84cc7d1ab332c6c056681aef1d72216d)
 : 
   Reason: There can be more than one OutputCommitter per vertex. If a job 
writes to 2 Iceberg tables in the same vertex, 2 commit tasks can run in 
that vertex. In that case the first committer, after finishing its 
commitTask, removes all the writers in the WriterRegistry that belong to 
this task attempt, including the ones the second committer still needs. 
WriterRegistry holds all writers in a concurrent map with the task attempt 
id as key: 
https://github.com/apache/hive/blob/9da7488179e7c69d986dbc8a6654a5c3dc6c0210/iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/writer/WriterRegistry.java#L29
   To fix this:
   
   - Added a new setting "iceberg.output.id" to differentiate between 
output committers, so that commitTask only removes the writers belonging to 
its own OutputCommitter.
   - Existing Hive jobs are not affected: the id is null for them, so the 
key becomes (taskId, null), which behaves like the existing flow.
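   
   A minimal sketch of the keying idea described above, not the code in this 
PR: the class `WriterRegistrySketch`, its `Key` type, and the use of plain 
strings in place of task attempt ids and writers are all hypothetical stand-ins. 
It only illustrates that removing the writers for one output id leaves the 
writers of another output id in the same task attempt untouched, and that a 
null output id still forms a valid key.

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration only; the real WriterRegistry in Hive differs.
final class WriterRegistrySketch {

  // Composite key: task attempt id plus the value of "iceberg.output.id".
  static final class Key {
    private final String taskAttemptId;
    private final String outputId; // null for existing Hive jobs

    Key(String taskAttemptId, String outputId) {
      this.taskAttemptId = Objects.requireNonNull(taskAttemptId);
      this.outputId = outputId;
    }

    @Override
    public boolean equals(Object o) {
      if (this == o) {
        return true;
      }
      if (!(o instanceof Key)) {
        return false;
      }
      Key other = (Key) o;
      return taskAttemptId.equals(other.taskAttemptId)
          && Objects.equals(outputId, other.outputId);
    }

    @Override
    public int hashCode() {
      return Objects.hash(taskAttemptId, outputId);
    }
  }

  // key -> (table name -> writer); a String stands in for the writer object.
  private final Map<Key, Map<String, String>> writers = new ConcurrentHashMap<>();

  void register(String taskAttemptId, String outputId, String table, String writer) {
    writers
        .computeIfAbsent(new Key(taskAttemptId, outputId), k -> new ConcurrentHashMap<>())
        .put(table, writer);
  }

  // Returns the writers registered for this (task attempt, output id), or null.
  Map<String, String> get(String taskAttemptId, String outputId) {
    return writers.get(new Key(taskAttemptId, outputId));
  }

  // Removes only the writers of one output committer within the task attempt.
  Map<String, String> remove(String taskAttemptId, String outputId) {
    return writers.remove(new Key(taskAttemptId, outputId));
  }
}
```

   With a key of task attempt id alone, the first `remove` call would also 
drop the second committer's writers; with the composite key each committer 
only sees its own entry.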
   
   [HIVE-27458: Support HCatStorer and HCatLoader for Iceberg 
tables](https://github.com/apache/hive/pull/4446/commits/af8b44889152194639511a48bf6e9ea5cd32c9fd)
 : 
   
   Reason: The [Pig 
Reader](https://github.com/apache/iceberg/tree/master/pig/src/main/java/org/apache/iceberg/pig)
 available in Iceberg can only read Parquet tables using the Hadoop catalog. 
There is no way to read tables in the Glue or Hive catalogs via Pig, and no 
way to write data via Pig at all.
   
   - Not much change is required apart from setting the write operation, 
which for now is always OTHER because delete/update are not supported in Pig.
   - The "iceberg.mr.output.tables" setting was missing because it is set 
only in configureJob. It can be set in configureOutputProperties along with 
the "iceberg.mr.operation.type" property. 
   - configureOutputProperties can be called multiple times, and the 
implementation should take care of [populating similar 
properties](https://github.com/apache/hive/blob/9da7488179e7c69d986dbc8a6654a5c3dc6c0210/ql/src/java/org/apache/hadoop/hive/ql/metadata/HiveStorageHandler.java#L161).
 To avoid adding the same table name again and again, added a check for 
whether the outputTable name is already present in the config. 
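   
   The duplicate check in the last point can be sketched roughly as below. 
This is an illustration under assumptions, not the change itself: the class 
`OutputTablesSketch`, the method `addOutputTable`, the use of a plain 
`Map<String, String>` for the job properties, and the `..` separator are all 
assumed here. Only the property name "iceberg.mr.output.tables" comes from 
the description above.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical illustration of an idempotent "add output table" step.
final class OutputTablesSketch {

  static final String OUTPUT_TABLES = "iceberg.mr.output.tables";

  // Assumed separator between table names; chosen so that names such as
  // "db.table" (which contain a single dot) can still be split apart.
  static final String SEPARATOR = "..";

  private OutputTablesSketch() {
  }

  // Safe to call repeatedly: the table name is appended at most once.
  static void addOutputTable(Map<String, String> jobProperties, String tableName) {
    String existing = jobProperties.get(OUTPUT_TABLES);
    if (existing == null || existing.isEmpty()) {
      jobProperties.put(OUTPUT_TABLES, tableName);
      return;
    }
    // Pattern.quote treats the separator literally instead of as a regex.
    boolean alreadyPresent =
        Arrays.asList(existing.split(Pattern.quote(SEPARATOR))).contains(tableName);
    if (!alreadyPresent) {
      jobProperties.put(OUTPUT_TABLES, existing + SEPARATOR + tableName);
    }
  }
}
```

   Comparing whole table names after splitting, rather than using a substring 
match, avoids treating `db.t1` as already present when only `db.t10` was 
added.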


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

