nsivabalan commented on code in PR #8684:
URL: https://github.com/apache/hudi/pull/8684#discussion_r1220135253
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##########
@@ -1138,7 +1120,11 @@ public DirectoryInfo(String relativePath, FileStatus[] fileStatus) {
this.isHoodiePartition = true;
} else if (FSUtils.isDataFile(status.getPath())) {
// Regular HUDI data file (base file or log file)
- filenameToSizeMap.put(status.getPath().getName(), status.getLen());
+ String dataFileCommitTime = FSUtils.getCommitTime(status.getPath().getName());
Review Comment:
In the case of a MOR table, the base instant time of a log file could be
earlier than the actual delta commit time. So, might we skip the log files
based on this logic, since in L1125 we are filtering for files whose commit
time < max instant time? Maybe we should use the last modification time
instead?
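
To make the concern concrete, here is a minimal, self-contained sketch. It is not Hudi code: the class and method names are hypothetical, and it assumes a log file name of the form `.<fileId>_<baseInstant>.log.<version>_<writeToken>` and instant times formatted as `yyyyMMddHHmmss` in UTC. It only illustrates that a check based on the instant embedded in the name and a check based on the modification time can disagree for a log file appended by a later delta commit.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class LogFileInstantSketch {
  // Assumed instant format for this sketch only.
  private static final DateTimeFormatter INSTANT_FORMAT =
      DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

  // Hypothetical helper: extracts the base instant from an assumed
  // log file name layout ".<fileId>_<baseInstant>.log.<version>_<writeToken>".
  static String baseInstantFromLogFileName(String name) {
    return name.split("_")[1].split("\\.")[0];
  }

  // Converts an instant string to epoch millis, assuming UTC.
  static long toEpochMillis(String instant) {
    return LocalDateTime.parse(instant, INSTANT_FORMAT)
        .toInstant(ZoneOffset.UTC)
        .toEpochMilli();
  }

  // Name-based check: compares the instant embedded in the file name.
  static boolean beforeMaxByName(String fileName, String maxInstant) {
    return baseInstantFromLogFileName(fileName).compareTo(maxInstant) < 0;
  }

  // Modification-time-based check: compares when the file was last written.
  static boolean beforeMaxByModTime(long modTimeMillis, String maxInstant) {
    return modTimeMillis < toEpochMillis(maxInstant);
  }

  public static void main(String[] args) {
    // Log file whose name carries the base instant of its file slice.
    String logFile = ".f1_20230101000000.log.1_0-1-0";
    // Assume the delta commit that appended to it ran two days later.
    long modTime = toEpochMillis("20230103000000");
    String maxInstant = "20230102000000";

    System.out.println(beforeMaxByName(logFile, maxInstant));    // true
    System.out.println(beforeMaxByModTime(modTime, maxInstant)); // false
  }
}
```

Whether modification time is reliable enough on every storage layer is a separate question; the sketch only shows why the two signals are not interchangeable.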
##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/SparkRDDWriteClient.java:
##########
@@ -337,30 +337,15 @@ protected void initMetadataTable(WriteOperationType
operationType, Option<String
*
* @param inFlightInstantTimestamp - The in-flight action responsible for the metadata table initialization
*/
- private void initializeMetadataTable(WriteOperationType operationType, Option<String> inFlightInstantTimestamp) {
+ private void initializeMetadataTable(Option<String> inFlightInstantTimestamp) {
if (!config.isMetadataTableEnabled()) {
return;
}
try (HoodieTableMetadataWriter writer =
SparkHoodieBackedTableMetadataWriter.create(context.getHadoopConf().get(),
config,
- context, Option.empty(), inFlightInstantTimestamp)) {
+ context, Option.empty(), inFlightInstantTimestamp)) {
if (writer.isInitialized()) {
Review Comment:
We added the guard rail so that table services in the MDT are triggered only
by regular writers in the data table, so that in single-writer mode with async
table services there won't be any race conditions.
Ref: https://github.com/apache/hudi/pull/3900
But the code has evolved, and we now automatically enable the in-process lock
provider for single-writer mode with async table services. So we should be
good to remove the constraint. The only catch is that we might keep triggering
the scheduling of compaction in the MDT every time. Maybe we can check the
active timeline for when compaction was last triggered and add some
optimization.
Nothing is required for this patch, but it is worth a follow-up.
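
As a rough illustration of the follow-up idea, the sketch below checks a timeline for the most recent compaction before attempting to schedule another one. Everything in it is an assumption made for illustration: `TimelineEntry`, the action labels `"deltacommit"` and `"compaction"`, and the `minDeltaCommits` threshold are hypothetical stand-ins and do not correspond to Hudi's timeline classes or configs.

```java
import java.util.Arrays;
import java.util.List;

public class CompactionScheduleSketch {
  // Hypothetical stand-in for a timeline entry; not Hudi's HoodieInstant.
  static final class TimelineEntry {
    final String timestamp;
    final String action;

    TimelineEntry(String timestamp, String action) {
      this.timestamp = timestamp;
      this.action = action;
    }
  }

  // Counts delta commits that appear after the most recent compaction.
  // The timeline is assumed to be ordered from oldest to newest.
  static int deltaCommitsSinceLastCompaction(List<TimelineEntry> timeline) {
    int count = 0;
    for (int i = timeline.size() - 1; i >= 0; i--) {
      String action = timeline.get(i).action;
      if (action.equals("compaction")) {
        break; // stop at the last time compaction was triggered
      }
      if (action.equals("deltacommit")) {
        count++;
      }
    }
    return count;
  }

  // Only attempt scheduling when enough delta commits have accumulated.
  static boolean shouldAttemptScheduling(List<TimelineEntry> timeline,
                                         int minDeltaCommits) {
    return deltaCommitsSinceLastCompaction(timeline) >= minDeltaCommits;
  }

  public static void main(String[] args) {
    List<TimelineEntry> timeline = Arrays.asList(
        new TimelineEntry("001", "deltacommit"),
        new TimelineEntry("002", "compaction"),
        new TimelineEntry("003", "deltacommit"),
        new TimelineEntry("004", "deltacommit"));

    System.out.println(deltaCommitsSinceLastCompaction(timeline)); // 2
    System.out.println(shouldAttemptScheduling(timeline, 3));      // false
  }
}
```

A check like this would let each writer skip the scheduling attempt cheaply instead of invoking it on every write.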