alexeykudinkin commented on a change in pull request #4716:
URL: https://github.com/apache/hudi/pull/4716#discussion_r796077526
##########
File path:
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/SparkRDDWriteClient.java
##########
@@ -442,8 +443,11 @@ private void updateTableMetadata(HoodieTable<T, JavaRDD<HoodieRecord<T>>, JavaRD
metaClient, config, context,
SparkUpgradeDowngradeHelper.getInstance())
.run(HoodieTableVersion.current(), instantTime);
metaClient.reloadActiveTimeline();
- initializeMetadataTable(Option.of(instantTime));
}
+ // Initialize Metadata Table to make sure it's bootstrapped _before_ the operation,
+ // if it didn't exist before
+ // See https://issues.apache.org/jira/browse/HUDI-3343 for more details
+ initializeMetadataTable(Option.of(instantTime));
Review comment:
So the thinking is as follows:
1. MT should be aware of all the (relevant) files in the directory
2. MT should not be bootstrapped in the middle of the operation (to avoid it
seeing intermediate state)
In the case of multi-writers, there are multiple possible scenarios:
1. Writers running concurrently: this should be handled by taking the mutex,
to make sure MT is bootstrapped once
2. Writer bootstrapping MT runs after a previous Writer failed to commit
(leaving intermediate state in the directory): in that case I'd still argue
it's better for MT to be aware of these files than not, so that it's in line
with the FS-based implementation. Whether these files will be part of the
output should be decided by the record-reader, not MT.
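
To make the reasoning above concrete, here is a minimal sketch of "bootstrap once, under a mutex, before the operation". It is not Hudi's implementation: `MetadataTableSketch`, `ensureBootstrapped`, `write`, and the file names are all hypothetical, and an in-process `ReentrantLock` stands in for whatever lock the writers actually share.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

// Hypothetical stand-in for the metadata table (MT); not Hudi's actual API.
class MetadataTableSketch {
    // Assumption: one in-process lock models the mutex shared by writers.
    private final ReentrantLock lock = new ReentrantLock();
    private final List<String> knownFiles = new ArrayList<>();
    private boolean bootstrapped = false;
    private int bootstrapCount = 0;

    // Bootstraps the MT from a file listing at most once.
    // Returns true only for the call that actually performed the bootstrap.
    boolean ensureBootstrapped(Supplier<List<String>> fileLister) {
        lock.lock();
        try {
            if (bootstrapped) {
                return false;
            }
            // Record every file currently in the directory, including files
            // left behind by a writer that failed to commit. Deciding whether
            // they belong in the output is left to the reader, not the MT.
            knownFiles.addAll(fileLister.get());
            bootstrapped = true;
            bootstrapCount++;
            return true;
        } finally {
            lock.unlock();
        }
    }

    // A write makes sure the MT exists before the operation starts,
    // so the bootstrap never observes this write's intermediate state.
    void write(Supplier<List<String>> fileLister, String newFile) {
        ensureBootstrapped(fileLister);
        lock.lock();
        try {
            knownFiles.add(newFile);
        } finally {
            lock.unlock();
        }
    }

    // Returns a snapshot copy of the files the MT is aware of.
    List<String> knownFiles() {
        lock.lock();
        try {
            return new ArrayList<>(knownFiles);
        } finally {
            lock.unlock();
        }
    }

    // Number of times the bootstrap actually ran.
    int bootstrapCount() {
        lock.lock();
        try {
            return bootstrapCount;
        } finally {
            lock.unlock();
        }
    }
}
```

With this shape, a second writer calling `ensureBootstrapped` after the first gets `false` back and skips the listing, while leftover files from a failed writer are still visible to the MT, matching what an FS-based listing would return.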