marton-bod commented on a change in pull request #2228:
URL: https://github.com/apache/iceberg/pull/2228#discussion_r573633680
##########
File path:
mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergOutputCommitter.java
##########
@@ -135,29 +153,31 @@ public void abortTask(TaskAttemptContext originalContext) throws IOException {
@Override
public void commitJob(JobContext originalContext) throws IOException {
JobContext jobContext = TezUtil.enrichContextWithVertexId(originalContext);
-
- JobConf conf = jobContext.getJobConf();
- Table table = Catalogs.loadTable(conf);
+ JobConf jobConf = jobContext.getJobConf();
long startTime = System.currentTimeMillis();
- LOG.info("Committing job has started for table: {}, using location: {}", table,
- generateJobLocation(conf, jobContext.getJobID()));
+ LOG.info("Committing job {} has started", jobContext.getJobID());
- FileIO io = HiveIcebergStorageHandler.io(jobContext.getJobConf());
- List<DataFile> dataFiles = dataFiles(jobContext, io, true);
+ Map<String, String> outputs = SerializationUtil.deserializeFromBase64(jobConf.get(InputFormatConfig.OUTPUT_TABLES));
- if (dataFiles.size() > 0) {
- // Appending data files to the table
- AppendFiles append = table.newAppend();
- dataFiles.forEach(append::appendFile);
- append.commit();
- LOG.info("Commit took {} ms for table: {} with {} file(s)", System.currentTimeMillis() - startTime, table,
- dataFiles.size());
- LOG.debug("Added files {}", dataFiles);
- } else {
- LOG.info("Commit took {} ms for table: {} with no new files", System.currentTimeMillis() - startTime, table);
+ ExecutorService fileExecutor = fileExecutor(jobConf);
+ ExecutorService tableExecutor = tableExecutor(jobConf, outputs.size());
+ try {
+ // Commits the changes for the output tables in parallel
+ Tasks.foreach(outputs.entrySet())
+ .throwFailureWhenFinished()
+ .stopOnFailure()
+ .executeWith(tableExecutor)
+ .run(entry -> commitTable(fileExecutor, jobContext, entry.getKey(), entry.getValue()));
+ } finally {
+ fileExecutor.shutdown();
+ if (tableExecutor != null) {
+ tableExecutor.shutdown();
+ }
}
+ LOG.info("Commit took {} ms for job {}", System.currentTimeMillis() - startTime, jobContext.getJobID());
Review comment:
If all the commits are dispatched asynchronously, could this message be
logged too early, and could the `cleanup` run while a commit is still in
progress (e.g. with long-running commits or many table locations)?
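
To make the ordering in question concrete, here is a minimal sketch that uses only `java.util.concurrent` rather than Iceberg's `Tasks`, so it says nothing about how `Tasks.run` itself behaves. `BlockingCommitSketch`, `commitAll`, and the sleeping `commitTable` are hypothetical stand-ins, not code from this PR. The point it illustrates is that `ExecutorService.shutdown()` does not wait for submitted work, so the final log line and any cleanup are only safe if the caller has already waited on every dispatched commit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

public class BlockingCommitSketch {

  // Hypothetical stand-in for a per-table commit: it only sleeps and counts.
  static void commitTable(String table, AtomicInteger committed) {
    try {
      Thread.sleep(50);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while committing " + table, e);
    }
    committed.incrementAndGet();
  }

  // Dispatches one commit per table and returns only after all of them finished.
  static int commitAll(List<String> tables) throws Exception {
    AtomicInteger committed = new AtomicInteger();
    ExecutorService tableExecutor = Executors.newFixedThreadPool(Math.max(1, tables.size()));
    try {
      List<Future<?>> futures = new ArrayList<>();
      for (String table : tables) {
        futures.add(tableExecutor.submit(() -> commitTable(table, committed)));
      }
      // Block on every future; a failed commit surfaces here as an ExecutionException.
      for (Future<?> future : futures) {
        future.get();
      }
    } finally {
      // shutdown() only stops new submissions; it does not wait for running tasks.
      tableExecutor.shutdown();
    }
    return committed.get();
  }

  public static void main(String[] args) throws Exception {
    long startTime = System.currentTimeMillis();
    int done = commitAll(Arrays.asList("t1", "t2", "t3"));
    // Safe to log (and clean up) here because commitAll has already waited.
    System.out.println("Committed " + done + " tables in "
        + (System.currentTimeMillis() - startTime) + " ms");
  }
}
```

If the dispatch call in `commitJob` already blocks in this way, the log line and the `cleanup` cannot run ahead of the commits; if it does not, an explicit wait like the `future.get()` loop above would be needed before them.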