pvary commented on a change in pull request #1407:
URL: https://github.com/apache/iceberg/pull/1407#discussion_r486493706



##########
File path: 
mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergOutputFormat.java
##########
@@ -271,28 +271,25 @@ public void setupTask(TaskAttemptContext 
taskAttemptContext) {
     }
 
     @Override
-    public boolean needsTaskCommit(TaskAttemptContext taskAttemptContext) {
-      return true;
+    public boolean needsTaskCommit(TaskAttemptContext context) {
+      // We need to commit if this is the last phase of a MapReduce process
+      return 
TaskType.REDUCE.equals(context.getTaskAttemptID().getTaskID().getTaskType()) ||
+          context.getJobConf().getNumReduceTasks() == 0;

Review comment:
       We can define OutputCommitter on job level only, and AFAIK Hive writes 
only on the last stage of the job be it a Reducer for a MapReduce job, or a Map 
for a MapOnly job.
   
   This way we can save IO by not generating and reading those unnecessary 
files, but depending on assumptions of the way how Hive works.
   
   Your thoughts? Still voting for the more general solution? 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to