Re: [PR] perf(metadata): Parse RLI instant time once per batch instead of per … [hudi]

via GitHub Wed, 10 Jun 2026 23:53:14 -0700


voonhous commented on code in PR #18965:
URL: https://github.com/apache/hudi/pull/18965#discussion_r3393838872



##########
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##########
@@ -2499,8 +2501,39 @@ public static HoodieRecordGlobalLocation 
getLocationFromRecordIndexInfo(
       fileId = originalFileId;
     }
 
-    final java.util.Date instantDate = new java.util.Date(instantTime);
-    return new HoodieRecordGlobalLocation(partition, 
HoodieInstantTimeGenerator.formatDate(instantDate), fileId);
+    return new HoodieRecordGlobalLocation(partition, 
formatRecordIndexInstant(instantTime), fileId);
+  }
+
+  // The formatted instant of a record-index entry is a pure function of the 
entry's epoch millis
+  // and the JVM default time zone used by 
HoodieInstantTimeGenerator.formatDate, so cached strings
+  // are only valid for the zone they were formatted under. The cache is per 
thread because
+  // lookups run on concurrent executor task threads and decode per-record 
last-write instants,
+  // where a shared slot would just be eviction churn.
+  private static final int FORMATTED_INSTANT_CACHE_MAX_SIZE = 1024;
+  private static final ThreadLocal<FormattedInstantCache> 
FORMATTED_INSTANT_CACHE =
+      ThreadLocal.withInitial(FormattedInstantCache::new);
+
+  private static final class FormattedInstantCache {
+    private ZoneId zoneId;
+    private final Map<Long, String> formattedByMillis = new HashMap<>();
+  }
+
+  private static String formatRecordIndexInstant(long instantTimeMillis) {
+    FormattedInstantCache cache = FORMATTED_INSTANT_CACHE.get();
+    ZoneId currentZoneId = ZoneId.systemDefault();

Review Comment:
   On the timezone question: yes, the millis are computed by 
`TimelineUtils.parseDateFromInstantTime`, which resolves the instant digits in 
the JVM default time zone, and `formatDate` converts back through the same 
default zone. That is pre-existing behavior on master; this PR changed neither 
how the millis are computed nor what is stored in the index entries, only how 
often the parse runs.
   
   On the cache: agreed, it needed zone-change invalidation to stay correct, 
which is more machinery than the formatting cost justifies. Dropped it -- 
`getLocationFromRecordIndexInfo` now formats directly per call exactly as on 
master, and the PR is scoped to the write-side parse hoisting only.
   
   TLDR: Optimization is only on write-side which does not require cache, 
read-side optimization reverted to match master.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] perf(metadata): Parse RLI instant time once per batch instead of per … [hudi]

Reply via email to