[GitHub] [iceberg] pvary commented on a change in pull request #2228: Hive: Implement multi-table inserts

GitBox Mon, 29 Mar 2021 04:41:26 -0700


pvary commented on a change in pull request #2228:
URL: https://github.com/apache/iceberg/pull/2228#discussion_r603223107




##########
File path: mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergOutputCommitter.java
##########
@@ -118,170 +134,269 @@ public void abortTask(TaskAttemptContext 
originalContext) throws IOException {
     TaskAttemptContext context = 
TezUtil.enrichContextWithAttemptWrapper(originalContext);
 
     // Clean up writer data from the local store
-    HiveIcebergRecordWriter writer = 
HiveIcebergRecordWriter.removeWriter(context.getTaskAttemptID());
+    Map<String, HiveIcebergRecordWriter> writers = 
HiveIcebergRecordWriter.removeWriters(context.getTaskAttemptID());
 
     // Remove files if it was not done already
-    if (writer != null) {
-      writer.close(true);
+    if (writers != null) {
+      for (HiveIcebergRecordWriter writer : writers.values()) {
+        writer.close(true);
+      }
     }
   }
 
   /**
-   * Reads the commit files stored in the temp directory and collects the 
generated committed data files.
-   * Appends the data files to the table. At the end removes the temporary 
directory.
+   * Reads the commit files stored in the temp directories and collects the 
generated committed data files.
+   * Appends the data files to the tables. At the end removes the temporary 
directories.
    * @param originalContext The job context
-   * @throws IOException if there is a failure deleting the files
+   * @throws IOException if there is a failure accessing the files
    */
   @Override
   public void commitJob(JobContext originalContext) throws IOException {
     JobContext jobContext = TezUtil.enrichContextWithVertexId(originalContext);
-
-    JobConf conf = jobContext.getJobConf();
-    Table table = Catalogs.loadTable(conf);
+    JobConf jobConf = jobContext.getJobConf();
 
     long startTime = System.currentTimeMillis();
-    LOG.info("Committing job has started for table: {}, using location: {}", 
table,
-        generateJobLocation(conf, jobContext.getJobID()));
+    LOG.info("Committing job {} has started", jobContext.getJobID());
 
-    FileIO io = HiveIcebergStorageHandler.table(jobContext.getJobConf()).io();
-    List<DataFile> dataFiles = dataFiles(jobContext, io, true);
+    Collection<String> outputs = 
HiveIcebergStorageHandler.outputTables(jobContext.getJobConf());
+    Queue<String> jobLocations = new ConcurrentLinkedQueue<>();
 
-    if (dataFiles.size() > 0) {
-      // Appending data files to the table
-      AppendFiles append = table.newAppend();
-      dataFiles.forEach(append::appendFile);
-      append.commit();
-      LOG.info("Commit took {} ms for table: {} with {} file(s)", 
System.currentTimeMillis() - startTime, table,
-          dataFiles.size());
-      LOG.debug("Added files {}", dataFiles);
-    } else {
-      LOG.info("Commit took {} ms for table: {} with no new files", 
System.currentTimeMillis() - startTime, table);
+    ExecutorService fileExecutor = fileExecutor(jobConf);
+    ExecutorService tableExecutor = tableExecutor(jobConf, outputs.size());
+    try {
+      // Commits the changes for the output tables in parallel
+      Tasks.foreach(outputs)
+          .throwFailureWhenFinished()
+          .stopOnFailure()
+          .executeWith(tableExecutor)
+          .run(output -> {
+            Table table = HiveIcebergStorageHandler.table(jobConf, output);
+            jobLocations.add(generateJobLocation(table.location(), jobConf, 
jobContext.getJobID()));
+            commitTable(table.io(), fileExecutor, jobContext, output, 
table.location());
+          });
+    } finally {
+      fileExecutor.shutdown();
+      if (tableExecutor != null) {
+        tableExecutor.shutdown();
+      }
     }
 
-    cleanup(jobContext);
+    LOG.info("Commit took {} ms for job {}", System.currentTimeMillis() - 
startTime, jobContext.getJobID());
+
+    cleanup(jobContext, jobLocations);
   }
 
   /**
-   * Removes the generated data files, if there is a commit file already 
generated for them.
-   * The cleanup at the end removes the temporary directory as well.
+   * Removes the generated data files if there is a commit file already 
generated for them.
+   * The cleanup at the end removes the temporary directories as well.
    * @param originalContext The job context
    * @param status The status of the job
    * @throws IOException if there is a failure deleting the files
    */
   @Override
   public void abortJob(JobContext originalContext, int status) throws 
IOException {
     JobContext jobContext = TezUtil.enrichContextWithVertexId(originalContext);
+    JobConf jobConf = jobContext.getJobConf();
 
-    String location = generateJobLocation(jobContext.getJobConf(), 
jobContext.getJobID());
-    LOG.info("Job {} is aborted. Cleaning job location {}", 
jobContext.getJobID(), location);
-
-    FileIO io = HiveIcebergStorageHandler.table(jobContext.getJobConf()).io();
-    List<DataFile> dataFiles = dataFiles(jobContext, io, false);
+    LOG.info("Job {} is aborted. Data file cleaning started", 
jobContext.getJobID());
+    Collection<String> outputs = 
HiveIcebergStorageHandler.outputTables(jobContext.getJobConf());
+    Queue<String> jobLocations = new ConcurrentLinkedQueue<>();

Review comment:
   With concurrent sets we have to check whether the element is already in the 
set, so I assume that also means synchronization.
   
   OTOH I get your point. What about using Collection as the type of the target 
variable, so we can show that FIFO processing is not needed?
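
   A minimal sketch of that suggestion, not the actual change in the PR: the 
variable is declared as Collection while the backing implementation stays a 
ConcurrentLinkedQueue, so readers see that only concurrent add and later 
iteration are needed, not ordering. The class name, the jobLocation helper 
(a stand-in for generateJobLocation), the path layout, and the fixed pool 
size are all assumptions made up for illustration.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class JobLocationsSketch {

  // Hypothetical stand-in for generateJobLocation(...); the path layout is an assumption.
  static String jobLocation(String tableLocation, String jobId) {
    return tableLocation + "/temp/" + jobId;
  }

  static Collection<String> collectLocations(Collection<String> tableLocations, String jobId)
      throws InterruptedException {
    // Declared as Collection: the code only adds concurrently and iterates later,
    // so no FIFO contract is advertised even though the implementation is a queue.
    Collection<String> jobLocations = new ConcurrentLinkedQueue<>();

    // Assumed pool size; the PR sizes its executors from the job configuration.
    ExecutorService pool = Executors.newFixedThreadPool(4);
    try {
      for (String location : tableLocations) {
        pool.execute(() -> jobLocations.add(jobLocation(location, jobId)));
      }
    } finally {
      pool.shutdown();
    }

    // Wait for the submitted tasks so every add is visible to the caller.
    if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
      throw new IllegalStateException("Tasks did not finish in time");
    }
    return jobLocations;
  }

  public static void main(String[] args) throws InterruptedException {
    Collection<String> locations =
        collectLocations(Arrays.asList("/warehouse/t1", "/warehouse/t2"), "job_1");
    // Iteration order is not relied upon; cleanup would simply visit every entry.
    for (String location : locations) {
      System.out.println(location);
    }
  }
}
```

   Only the declared type changes at the call sites; cleanup can still take 
a Collection<String> and iterate over it in whatever order it gets.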




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.