[GitHub] [hudi] alexeykudinkin commented on a change in pull request #4716: [HUDI-3322][HUDI-3343] Fixing Metadata Table Records Duplication Issues

GitBox Mon, 31 Jan 2022 13:27:24 -0800


alexeykudinkin commented on a change in pull request #4716:
URL: https://github.com/apache/hudi/pull/4716#discussion_r796077526




##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/SparkRDDWriteClient.java
##########
@@ -442,8 +443,11 @@ private void updateTableMetadata(HoodieTable<T, JavaRDD<HoodieRecord<T>>, JavaRD
             metaClient, config, context, SparkUpgradeDowngradeHelper.getInstance())
             .run(HoodieTableVersion.current(), instantTime);
         metaClient.reloadActiveTimeline();
-        initializeMetadataTable(Option.of(instantTime));
       }
+      // Initialize Metadata Table to make sure it's bootstrapped _before_ the operation,
+      // if it didn't exist before
+      // See https://issues.apache.org/jira/browse/HUDI-3343 for more details
+      initializeMetadataTable(Option.of(instantTime));

Review comment:
       The thinking is as follows:
   1. MT (the Metadata Table) should be aware of all the (relevant) files in the directory.
   2. MT should not be bootstrapped in the middle of the operation (to avoid it seeing intermediate state).
   
   In case of multi-writers, there are multiple possible scenarios:
   1. Writers running concurrently: this should be handled by taking the mutex, to make sure MT is bootstrapped only once.
   2. A writer bootstrapping MT runs after a previous writer failed to commit (leaving intermediate state in the directory): in that case I'd still argue it's better for MT to be aware of these files than not, so that it's in line with the FS-based implementation. Whether these files will be part of the output should be decided by the record-reader, not MT.
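
   To make the two scenarios concrete, here is a minimal sketch of the idea, not Hudi's actual implementation. `MetadataTableSketch`, `initializeIfNeeded`, `knownFiles` and `readableFiles` are hypothetical names made up for illustration, and the "directory listing" is assumed to be a plain map from file name to the instant that wrote it. Bootstrap happens under a mutex, before the operation, and at most once; it records every file it sees, and filtering out uncommitted files is left to the reader.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.locks.ReentrantLock;
import java.util.stream.Collectors;

// Hypothetical stand-in for the Metadata Table (MT); none of these names are Hudi APIs.
class MetadataTableSketch {
    private final ReentrantLock lock = new ReentrantLock();
    // file name -> instant that wrote it (assumed shape of a directory listing)
    private final Map<String, String> fileToInstant = new HashMap<>();
    private boolean bootstrapped = false;
    private int bootstrapCount = 0;

    // Every writer calls this before starting its operation.
    // The mutex guarantees that only the first caller performs the bootstrap.
    void initializeIfNeeded(Map<String, String> filesOnStorage) {
        lock.lock();
        try {
            if (bootstrapped) {
                return; // another writer already bootstrapped MT
            }
            // Record all files, including leftovers of a writer that failed to commit.
            fileToInstant.putAll(filesOnStorage);
            bootstrapped = true;
            bootstrapCount++;
        } finally {
            lock.unlock();
        }
    }

    int bootstrapCount() {
        lock.lock();
        try {
            return bootstrapCount;
        } finally {
            lock.unlock();
        }
    }

    // Everything MT is aware of, committed or not.
    Set<String> knownFiles() {
        lock.lock();
        try {
            return new TreeSet<>(fileToInstant.keySet());
        } finally {
            lock.unlock();
        }
    }

    // The reader side decides what is part of the output:
    // only files written by committed instants are returned.
    List<String> readableFiles(Set<String> committedInstants) {
        lock.lock();
        try {
            return fileToInstant.entrySet().stream()
                .filter(e -> committedInstants.contains(e.getValue()))
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
        } finally {
            lock.unlock();
        }
    }
}
```

   In this sketch a second call to `initializeIfNeeded` is a no-op, which models scenario 1, while a file left behind by a failed instant stays visible in `knownFiles()` but is dropped by `readableFiles(...)`, which models scenario 2.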
   




