[GitHub] [iceberg] pvary commented on a change in pull request #1407: Hive: HiveIcebergOutputFormat first implementation for handling Hive inserts into unpartitioned Iceberg tables - WIP

GitBox Fri, 11 Sep 2020 11:24:10 -0700


pvary commented on a change in pull request #1407:
URL: https://github.com/apache/iceberg/pull/1407#discussion_r487214980




##########
File path: mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergOutputFormat.java
##########
@@ -271,28 +271,25 @@ public void setupTask(TaskAttemptContext taskAttemptContext) {
     }
 
     @Override
-    public boolean needsTaskCommit(TaskAttemptContext taskAttemptContext) {
-      return true;
+    public boolean needsTaskCommit(TaskAttemptContext context) {
+      // We need to commit if this is the last phase of a MapReduce process
+      return TaskType.REDUCE.equals(context.getTaskAttemptID().getTaskID().getTaskType()) ||
+          context.getJobConf().getNumReduceTasks() == 0;

Review comment:
       I tested this by adding the jars to the Hive master from the Apache 
repo and running the queries in local mode. I got an exception and started 
to investigate its source.
   
   The query I used for this had an ORDER BY, so it had both a Mapper and a 
Reducer phase:
   ```
   insert into purchases select * from purchases order by id;
   ```
   My debugging revealed that commitTask was called for both the Mappers and 
the Reducers. The JobRunner infrastructure makes those calls, and it has no 
information about whether a task has actually written anything anywhere. I 
think this is the purpose of the _needsTaskCommit()_ method: it lets the 
developer of the OutputCommitter decide.
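
To make the intended rule concrete, here is a minimal sketch of the decision in isolation. It deliberately avoids the Hadoop classes: `Phase` and the two-argument `needsTaskCommit` are hypothetical stand-ins for `TaskType` and the `getNumReduceTasks()` lookup in the diff above, not the actual implementation.

```java
public class CommitDecision {
  // Hypothetical stand-in for Hadoop's TaskType; only the two phases discussed here.
  enum Phase { MAP, REDUCE }

  // Commit only from the last phase of the job:
  // reducers always, mappers only when the job has no reducers.
  static boolean needsTaskCommit(Phase phase, int numReduceTasks) {
    return phase == Phase.REDUCE || numReduceTasks == 0;
  }

  public static void main(String[] args) {
    // Mapper in a job that also has a reducer (e.g. the ORDER BY query): no commit.
    System.out.println(needsTaskCommit(Phase.MAP, 1));    // false
    // Reducer of the same job: commit.
    System.out.println(needsTaskCommit(Phase.REDUCE, 1)); // true
    // Mapper of a map-only job: commit.
    System.out.println(needsTaskCommit(Phase.MAP, 0));    // true
  }
}
```

With this rule the Mappers of the ORDER BY query above would skip the commit, and only the Reducers would perform it.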
   
   @massdosage: Do you have an easy example for HiveRunner write tests?
   
   Thanks,
   Peter





