[jira] [Work logged] (HIVE-25006) Commit Iceberg writes in HiveMetaHook instead of TezAM

ASF GitHub Bot (Jira) Tue, 20 Apr 2021 04:53:06 -0700


     [ 
https://issues.apache.org/jira/browse/HIVE-25006?focusedWorklogId=585741&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585741
 ]


ASF GitHub Bot logged work on HIVE-25006:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 20/Apr/21 11:52
            Start Date: 20/Apr/21 11:52
    Worklog Time Spent: 10m 
      Work Description: marton-bod commented on a change in pull request #2161:
URL: https://github.com/apache/hive/pull/2161#discussion_r616608877



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergOutputCommitter.java
##########
@@ -105,13 +105,18 @@ public void commitTask(TaskAttemptContext originalContext) throws IOException {
           .executeWith(tableExecutor)
           .run(output -> {
             Table table = HiveIcebergStorageHandler.table(context.getJobConf(), output);
-            HiveIcebergRecordWriter writer = writers.get(output);
-            DataFile[] closedFiles = writer != null ? writer.dataFiles() : new DataFile[0];
-            String fileForCommitLocation = generateFileForCommitLocation(table.location(), jobConf,
-                attemptID.getJobID(), attemptID.getTaskID().getId());
-
-            // Creating the file containing the data files generated by this task for this table
-            createFileForCommit(closedFiles, fileForCommitLocation, table.io());
+            if (table != null) {

Review comment:
       This happens during task commit, that is, before the commitInsert hook
is called.

   The essential problem here is that `OUTPUT_TABLES` contains all the target
tables, but only the tables relevant to the given task are serialized into the
job config. The task therefore tries to iterate over tables 1...N (based on
`OUTPUT_TABLES`) while only having access to the serialized Table 1 (hence the
`if`). I think the whole parallel commit logic for multi-table inserts is
broken on both the task commit and the job commit side whenever more than one
vertex writes to the target tables. The tests currently pass because they
create a single writer vertex, which has both tables serialized into its
config.
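
   To illustrate the guard described above, here is a minimal, self-contained
sketch. It is not the actual Hive/Iceberg code: the class `TaskCommitSketch`,
the method `tablesToCommit`, and the `serializedTables` map (a stand-in for the
tables serialized into the job config) are hypothetical names used only for
illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TaskCommitSketch {

  // Hypothetical stand-in for the task commit loop: outputTables plays the
  // role of OUTPUT_TABLES (all target tables of the query), while
  // serializedTables holds only the tables serialized into this task's config.
  static List<String> tablesToCommit(List<String> outputTables,
                                     Map<String, String> serializedTables) {
    List<String> committed = new ArrayList<>();
    for (String output : outputTables) {
      // Assumption: the lookup yields null when the table was not serialized
      // for this task, which is what the null check in the diff guards against.
      String table = serializedTables.get(output);
      if (table != null) {
        committed.add(output);
      }
    }
    return committed;
  }

  public static void main(String[] args) {
    // Two target tables overall, but this task's vertex writes only to t1.
    List<String> outputTables = Arrays.asList("t1", "t2");
    Map<String, String> serializedTables = new HashMap<>();
    serializedTables.put("t1", "serialized-table-1");

    // Only t1 is committed by this task; t2 is skipped instead of failing.
    System.out.println(tablesToCommit(outputTables, serializedTables));
  }
}
```

   With a single writer vertex both tables would be present in the map, so the
guard would never skip anything, which matches why the existing tests pass.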




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 585741)
    Time Spent: 1h 50m  (was: 1h 40m)

> Commit Iceberg writes in HiveMetaHook instead of TezAM
> ------------------------------------------------------
>
>                 Key: HIVE-25006
>                 URL: https://issues.apache.org/jira/browse/HIVE-25006
>             Project: Hive
>          Issue Type: Task
>            Reporter: Marton Bod
>            Assignee: Marton Bod
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Trigger the write commits in the HiveIcebergStorageHandler#commitInsertTable. 
> This will enable us to implement insert overwrites for iceberg tables.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
