Re: [PR] [IcebergIO] Improve TableCache [beam]

via GitHub Tue, 09 Jun 2026 14:29:27 -0700


gemini-code-assist[bot] commented on code in PR #38882:
URL: https://github.com/apache/beam/pull/38882#discussion_r3383964564



##########
sdks/java/io/iceberg/src/main/java/org/apache/beam/sdk/io/iceberg/WritePartitionedRowsToFiles.java:
##########
@@ -91,8 +86,8 @@ private static class WriteDoFn extends DoFn<KV<Row, Iterable<Row>>, FileWriteRes
     private final IcebergCatalogConfig catalogConfig;
     private final String filePrefix;
     private final Schema dataSchema;
-    static final Cache<TableIdentifier, LastRefreshedTable> LAST_REFRESHED_TABLE_CACHE =
-        CacheBuilder.newBuilder().expireAfterAccess(10, TimeUnit.MINUTES).build();
+    private int specId = Integer.MIN_VALUE;
+    private Map<String, PartitionField> partitionFieldMap = Maps.newHashMap();

Review Comment:
   ![high](https://www.gstatic.com/codereviewagent/high-priority.svg)
   
   Storing `specId` and `partitionFieldMap` as single instance variables in 
`WriteDoFn` is incorrect because a single `WriteDoFn` instance can process 
elements for multiple different tables (destinations) within the same bundle. 
Since different tables can have the same `specId` (e.g., `0`), this can lead to 
using the partition field map of one table for another table, causing incorrect 
partition paths or failures. We should key these by `TableIdentifier` instead.
   
   ```suggestion
       private final Map<TableIdentifier, Integer> specIds = Maps.newHashMap();
       private final Map<TableIdentifier, Map<String, PartitionField>> partitionFieldMaps = Maps.newHashMap();
   ```
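
To make the failure mode concrete, here is a minimal, self-contained sketch of the collision. It does not use the Iceberg API: `FakeTable` and `SingleSlotCache` are hypothetical stand-ins for `Table` and the instance-level `specId`/`partitionFieldMap` pair, and the string values stand in for `PartitionField` objects.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SpecIdCollisionDemo {
  /** Hypothetical stand-in for an Iceberg table; not the real org.apache.iceberg.Table. */
  static final class FakeTable {
    final String name;
    final int specId;
    final List<String> partitionFields;

    FakeTable(String name, int specId, List<String> partitionFields) {
      this.name = name;
      this.specId = specId;
      this.partitionFields = partitionFields;
    }
  }

  /** Mirrors the single-slot caching in the diff: one spec ID and one map per instance. */
  static final class SingleSlotCache {
    private int specId = Integer.MIN_VALUE;
    private Map<String, String> fieldMap = new HashMap<>();

    Map<String, String> get(FakeTable table) {
      if (table.specId == specId) {
        // Cache hit on spec ID alone; the map may belong to a different table.
        return fieldMap;
      }
      Map<String, String> fresh = new HashMap<>();
      for (String field : table.partitionFields) {
        fresh.put(field, table.name + "." + field);
      }
      specId = table.specId;
      fieldMap = fresh;
      return fresh;
    }
  }

  public static void main(String[] args) {
    FakeTable orders = new FakeTable("orders", 0, List.of("day"));
    FakeTable users = new FakeTable("users", 0, List.of("region"));

    SingleSlotCache cache = new SingleSlotCache();
    cache.get(orders);
    // Both tables report spec ID 0, so this lookup returns the map built for "orders".
    System.out.println(cache.get(users).keySet()); // prints [day]
  }
}
```

The second lookup never sees `region`, because the cache compares spec IDs only and has no notion of which table the cached map came from.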



##########
sdks/java/io/iceberg/src/main/java/org/apache/beam/sdk/io/iceberg/WritePartitionedRowsToFiles.java:
##########
@@ -113,9 +108,8 @@ public void processElement(
       String partitionPath = checkStateNotNull(element.getKey().getString(PARTITION));
 
       IcebergDestination destination = dynamicDestinations.instantiateDestination(tableIdentifier);
-      LastRefreshedTable lastRefreshedTable = getOrCreateTable(destination, dataSchema);
-      Table table = lastRefreshedTable.table;
-      partitionPath = getPartitionDataPath(partitionPath, lastRefreshedTable.partitionFieldMap);
+      Table table = getOrCreateTable(destination, dataSchema);
+      partitionPath = getPartitionDataPath(partitionPath, getPartitionFieldMap(table));

Review Comment:
   ![high](https://www.gstatic.com/codereviewagent/high-priority.svg)
   
   Update the call to `getPartitionFieldMap` to pass the table identifier so 
that the correct partition field map is retrieved for the specific table.
   
   ```suggestion
         partitionPath = getPartitionDataPath(partitionPath, getPartitionFieldMap(destination.getTableIdentifier(), table));
   ```



##########
sdks/java/io/iceberg/src/main/java/org/apache/beam/sdk/io/iceberg/WritePartitionedRowsToFiles.java:
##########
@@ -146,60 +140,30 @@ public void processElement(
               .build());
     }
 
-    static final class LastRefreshedTable {
-      final Table table;
-      volatile Instant lastRefreshTime;
-      static final Duration STALENESS_THRESHOLD = Duration.ofMinutes(2);
-      private int specId;
-      volatile Map<String, PartitionField> partitionFieldMap = Maps.newHashMap();
-
-      LastRefreshedTable(Table table, Instant lastRefreshTime) {
-        this.table = table;
-        this.specId = table.spec().specId();
-        this.lastRefreshTime = lastRefreshTime;
-        for (PartitionField partitionField : table.spec().fields()) {
-          partitionFieldMap.put(partitionField.name(), partitionField);
-        }
+    private Map<String, PartitionField> getPartitionFieldMap(Table table) {
+      if (table.spec().specId() == this.specId) {
+        return partitionFieldMap;
       }
-
-      /**
-       * Refreshes the table metadata if it is considered stale (older than 2 minutes).
-       *
-       * <p>This method first performs a non-synchronized check on the table's freshness. This
-       * provides a lock-free fast path that avoids synchronization overhead in the common case
-       * where the table does not need to be refreshed. If the table might be stale, it then enters
-       * a synchronized block to ensure that only one thread performs the refresh operation.
-       */
-      void refreshIfStale() {
-        // Fast path: Avoid entering the synchronized block if the table is not stale.
-        if (lastRefreshTime.isAfter(Instant.now().minus(STALENESS_THRESHOLD))) {
-          return;
-        }
-        synchronized (this) {
-          if (lastRefreshTime.isBefore(Instant.now().minus(STALENESS_THRESHOLD))) {
-            table.refresh();
-            lastRefreshTime = Instant.now();
-            if (table.spec().specId() != this.specId) {
-              partitionFieldMap = Maps.newHashMap();
-              for (PartitionField partitionField : table.spec().fields()) {
-                partitionFieldMap.put(partitionField.name(), partitionField);
-              }
-              this.specId = table.spec().specId();
-            }
-          }
-        }
+      Map<String, PartitionField> partitionFieldMap = Maps.newHashMap();
+      for (PartitionField partitionField : table.spec().fields()) {
+        partitionFieldMap.put(partitionField.name(), partitionField);
       }
+      this.specId = table.spec().specId();
+      this.partitionFieldMap = partitionFieldMap;
+      return partitionFieldMap;
     }

Review Comment:
   ![high](https://www.gstatic.com/codereviewagent/high-priority.svg)
   
   Update `getPartitionFieldMap` to accept the `TableIdentifier` and look 
up/store the partition field map and spec ID per table identifier.
   
   ```suggestion
       private Map<String, PartitionField> getPartitionFieldMap(TableIdentifier tableId, Table table) {
         int currentSpecId = table.spec().specId();
         if (currentSpecId == specIds.getOrDefault(tableId, Integer.MIN_VALUE)) {
           return partitionFieldMaps.get(tableId);
         }
         Map<String, PartitionField> fieldMap = Maps.newHashMap();
         for (PartitionField partitionField : table.spec().fields()) {
           fieldMap.put(partitionField.name(), partitionField);
         }
         specIds.put(tableId, currentSpecId);
         partitionFieldMaps.put(tableId, fieldMap);
         return fieldMap;
       }
   ```
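
One possible variation on the suggestion above, offered only as a sketch: keep the spec ID and the field map in a single per-table entry, so the two can never fall out of sync. The class below is self-contained and makes several assumptions: `PerTablePartitionCache` and `Entry` are hypothetical names, a `String` stands in for `TableIdentifier`, and field names with string values stand in for `PartitionField` objects.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PerTablePartitionCache {
  /** Spec ID and field map stored together, so one lookup covers both. */
  private static final class Entry {
    final int specId;
    final Map<String, String> fieldMap;

    Entry(int specId, Map<String, String> fieldMap) {
      this.specId = specId;
      this.fieldMap = fieldMap;
    }
  }

  // Keyed by table identifier (a String here; TableIdentifier in the real code).
  private final Map<String, Entry> entries = new HashMap<>();

  /** Returns the cached map for this table, rebuilding it when the spec ID changes. */
  public Map<String, String> get(String tableId, int currentSpecId, List<String> fieldNames) {
    Entry cached = entries.get(tableId);
    if (cached != null && cached.specId == currentSpecId) {
      return cached.fieldMap;
    }
    Map<String, String> fresh = new HashMap<>();
    for (String name : fieldNames) {
      fresh.put(name, tableId + "." + name);
    }
    entries.put(tableId, new Entry(currentSpecId, fresh));
    return fresh;
  }

  public static void main(String[] args) {
    PerTablePartitionCache cache = new PerTablePartitionCache();
    cache.get("db.orders", 0, List.of("day"));
    // Same spec ID, different table: a separate entry is built.
    System.out.println(cache.get("db.users", 0, List.of("region")).keySet()); // prints [region]
    // Same table, new spec ID: the entry is rebuilt.
    System.out.println(cache.get("db.orders", 1, List.of("hour")).keySet()); // prints [hour]
  }
}
```

Like the suggestion, this uses a plain `HashMap` and so assumes the cache is only touched by the thread running the `DoFn` instance.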



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

