[GitHub] [hudi] nsivabalan commented on a diff in pull request #8684: [HUDI-6200] Enhancements to the MDT for improving performance of larger indexes.

via GitHub Tue, 06 Jun 2023 11:59:27 -0700


nsivabalan commented on code in PR #8684:
URL: https://github.com/apache/hudi/pull/8684#discussion_r1220135253



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##########
@@ -1138,7 +1120,11 @@ public DirectoryInfo(String relativePath, FileStatus[] fileStatus) {
           this.isHoodiePartition = true;
         } else if (FSUtils.isDataFile(status.getPath())) {
           // Regular HUDI data file (base file or log file)
-          filenameToSizeMap.put(status.getPath().getName(), status.getLen());
+          String dataFileCommitTime = FSUtils.getCommitTime(status.getPath().getName());

Review Comment:
   In a MOR table, the instant time in a log file's name is the base instant
time, which can be earlier than the actual delta commit time. So could we end up
skipping log files with this logic, given that L1125 filters for files whose
commit time < max instant time? Maybe we should use the last modification time
instead.
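   To make the concern concrete, here is a minimal sketch, not Hudi code. `FileEntry`, `keptByNameTime`, and `keptByModificationTime` are hypothetical names, and the modification-time check assumes instant times are `yyyyMMddHHmmss` strings in UTC on a clock that agrees with the file system. It only shows that the two filters can disagree for a log file whose name carries an older base instant time.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical illustration; none of these types are Hudi classes.
final class LogFileFilterSketch {

    // Stand-in for a listed file: the instant time parsed from its name and
    // the file system's last modification time.
    static final class FileEntry {
        final String instantTimeInName;
        final long modificationTimeMs;

        FileEntry(String instantTimeInName, long modificationTimeMs) {
            this.instantTimeInName = instantTimeInName;
            this.modificationTimeMs = modificationTimeMs;
        }
    }

    // Assumption: instant times look like yyyyMMddHHmmss and are in UTC.
    private static final DateTimeFormatter INSTANT_FORMAT =
        DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    static long toEpochMillis(String instantTime) {
        return LocalDateTime.parse(instantTime, INSTANT_FORMAT)
            .toInstant(ZoneOffset.UTC)
            .toEpochMilli();
    }

    // Filter on the time embedded in the file name (lexicographic compare).
    static boolean keptByNameTime(FileEntry file, String maxInstantTime) {
        return file.instantTimeInName.compareTo(maxInstantTime) < 0;
    }

    // Filter on when the file was actually last written.
    static boolean keptByModificationTime(FileEntry file, String maxInstantTime) {
        return file.modificationTimeMs < toEpochMillis(maxInstantTime);
    }
}
```

   For a log file named with base instant `20230601000000` but last written at `20230605000000`, a max instant of `20230603000000` passes the name-based check and fails the modification-time check, so the two approaches select different file sets.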



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/SparkRDDWriteClient.java:
##########
@@ -337,30 +337,15 @@ protected void initMetadataTable(WriteOperationType operationType, Option<String
    *
    * @param inFlightInstantTimestamp - The in-flight action responsible for the metadata table initialization
    */
-  private void initializeMetadataTable(WriteOperationType operationType, Option<String> inFlightInstantTimestamp) {
+  private void initializeMetadataTable(Option<String> inFlightInstantTimestamp) {
     if (!config.isMetadataTableEnabled()) {
       return;
     }
 
     try (HoodieTableMetadataWriter writer = SparkHoodieBackedTableMetadataWriter.create(context.getHadoopConf().get(), config,
-            context, Option.empty(), inFlightInstantTimestamp)) {
+        context, Option.empty(), inFlightInstantTimestamp)) {
       if (writer.isInitialized()) {

Review Comment:
   We added the guard rail so that table services in the MDT are triggered only
by regular writers to the data table; that way, in single-writer mode with async
table services, there are no race conditions.
   Ref: https://github.com/apache/hudi/pull/3900
   But the code has evolved, and we now automatically enable the in-process lock
provider for single-writer mode with async table services. So we should be fine
removing the constraint. The one downside is that we might keep triggering
compaction scheduling in the MDT every time. Maybe we can check the active
timeline for when compaction was last triggered and add some optimization there.
   Nothing required for this patch, just a follow-up.
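   As a rough idea of what that follow-up could look like, here is a minimal sketch, not Hudi's timeline API. `TimelineEntry`, the `"compaction"` and `"deltacommit"` action labels, and the `minDeltaCommits` threshold are all hypothetical. It walks a timeline from newest to oldest and only asks for compaction scheduling once enough delta commits have landed since the last compaction.

```java
import java.util.List;

// Hypothetical illustration; not Hudi's HoodieTimeline or HoodieInstant.
final class CompactionScheduleSketch {

    // Stand-in for a completed instant on the MDT timeline.
    static final class TimelineEntry {
        final String action;
        final String timestamp;

        TimelineEntry(String action, String timestamp) {
            this.action = action;
            this.timestamp = timestamp;
        }
    }

    // Counts delta commits that are newer than the most recent compaction.
    // The list is assumed to be ordered from oldest to newest.
    static int deltaCommitsSinceLastCompaction(List<TimelineEntry> timeline) {
        int count = 0;
        for (int i = timeline.size() - 1; i >= 0; i--) {
            String action = timeline.get(i).action;
            if ("compaction".equals(action)) {
                break; // reached the last compaction, stop counting
            }
            if ("deltacommit".equals(action)) {
                count++;
            }
        }
        return count;
    }

    // Skip the scheduling attempt unless enough delta commits accumulated.
    static boolean shouldScheduleCompaction(List<TimelineEntry> timeline, int minDeltaCommits) {
        return deltaCommitsSinceLastCompaction(timeline) >= minDeltaCommits;
    }
}
```

   With a check like this, a writer that finds only a couple of delta commits after the last compaction can return early instead of attempting to schedule again.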
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
