marton-bod commented on a change in pull request #2347:
URL: https://github.com/apache/hive/pull/2347#discussion_r645459374



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezTask.java
##########
@@ -384,24 +382,28 @@ private void collectCommitInformation(TezWork work) 
throws IOException, TezExcep
               // get all target tables this vertex wrote to
               List<String> tables = new ArrayList<>();
               for (Map.Entry<String, String> entry : jobConf) {
-                if (entry.getKey().startsWith("iceberg.mr.serialized.table.")) 
{
-                  
tables.add(entry.getKey().substring("iceberg.mr.serialized.table.".length()));
+                if 
(entry.getKey().startsWith(ICEBERG_SERIALIZED_TABLE_PREFIX)) {
+                  
tables.add(entry.getKey().substring(ICEBERG_SERIALIZED_TABLE_PREFIX.length()));
                 }
               }
-              // save information for each target table (jobID, task num, 
query state)
+              // find iceberg props in jobConf as they can be needed, but not 
available, during job commit
+              Map<String, String> icebergProperties = new HashMap<>();
+              jobConf.forEach(e -> {
+                // don't copy the serialized tables, they're not needed 
anymore and take up lots of space
+                if (e.getKey().startsWith("iceberg.mr.") && 
!e.getKey().startsWith(ICEBERG_SERIALIZED_TABLE_PREFIX)) {
+                  icebergProperties.put(e.getKey(), e.getValue());
+                }
+              });
+              // save information for each target table (jobID, task num)
               for (String table : tables) {
-                sessionConf.set(HIVE_TEZ_COMMIT_JOB_ID_PREFIX + table, 
jobIdStr);
-                sessionConf.setInt(HIVE_TEZ_COMMIT_TASK_COUNT_PREFIX + table,
-                    status.getProgress().getSucceededTaskCount());
+                SessionStateUtil.newCommitInfo(jobConf, table)

Review comment:
       So what we want to do here is somehow store several pieces information 
that belong together, related to commits.
   We have a few options:
   1. Have a single util method with a long list of parameters, e.g. 
`SessionStateUtil.addCommitInfo(conf, tableName, jobID, taskNum, 
additionalProps)`. The `CommitInfo` object would be constructed then in the 
Util, and retrieved on the client side in the MetaHook/JobCommitter.
   2. Get a `CommitInfo` object from the Util, populate it and save/cache it 
with a fluent API (as in the PR)
   2. Get a `CommitInfo` object from the Util, populate it but use a Util 
method to actually save it, so make it a 3-step process e.g.  `1. CommitInfo 
info = Util.newCommit(); 2. // populate info fields ; 3. 
Util.addCommitInfo(info);`
   
   I don't have a strong preference for either. I chose the second option 
because I don't favour long parameter lists which are unnamed, and I like 
fluent APIs :)  But I'm open to option 1 and 3 as well. With only this many 
parameters, option 1 seems alright. What is your take on this?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to