voonhous commented on code in PR #18965:
URL: https://github.com/apache/hudi/pull/18965#discussion_r3393838872
##########
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##########
@@ -2499,8 +2501,39 @@ public static HoodieRecordGlobalLocation
getLocationFromRecordIndexInfo(
fileId = originalFileId;
}
- final java.util.Date instantDate = new java.util.Date(instantTime);
- return new HoodieRecordGlobalLocation(partition,
HoodieInstantTimeGenerator.formatDate(instantDate), fileId);
+ return new HoodieRecordGlobalLocation(partition,
formatRecordIndexInstant(instantTime), fileId);
+ }
+
+ // The formatted instant of a record-index entry is a pure function of the
entry's epoch millis
+ // and the JVM default time zone used by
HoodieInstantTimeGenerator.formatDate, so cached strings
+ // are only valid for the zone they were formatted under. The cache is per
thread because
+ // lookups run on concurrent executor task threads and decode per-record
last-write instants,
+ // where a shared slot would just be eviction churn.
+ private static final int FORMATTED_INSTANT_CACHE_MAX_SIZE = 1024;
+ private static final ThreadLocal<FormattedInstantCache>
FORMATTED_INSTANT_CACHE =
+ ThreadLocal.withInitial(FormattedInstantCache::new);
+
+ private static final class FormattedInstantCache {
+ private ZoneId zoneId;
+ private final Map<Long, String> formattedByMillis = new HashMap<>();
+ }
+
+ private static String formatRecordIndexInstant(long instantTimeMillis) {
+ FormattedInstantCache cache = FORMATTED_INSTANT_CACHE.get();
+ ZoneId currentZoneId = ZoneId.systemDefault();
Review Comment:
On the timezone question: yes, the millis are computed by
`TimelineUtils.parseDateFromInstantTime`, which resolves the instant digits in
the JVM default time zone, and `formatDate` converts back through the same
default zone. That is pre-existing behavior on master; this PR changed neither
how the millis are computed nor what is stored in the index entries, only how
often the parse runs.
On the cache: agreed, it needed zone-change invalidation to stay correct,
which is more machinery than the formatting cost justifies. Dropped it --
`getLocationFromRecordIndexInfo` now formats directly per call exactly as on
master, and the PR is scoped to the write-side parse hoisting only.
TLDR: Optimization is only on write-side which does not require cache,
read-side optimization reverted to match master.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]