gemini-code-assist[bot] commented on code in PR #38882:
URL: https://github.com/apache/beam/pull/38882#discussion_r3383964564
##########
sdks/java/io/iceberg/src/main/java/org/apache/beam/sdk/io/iceberg/WritePartitionedRowsToFiles.java:
##########
@@ -91,8 +86,8 @@ private static class WriteDoFn extends DoFn<KV<Row, Iterable<Row>>, FileWriteRes
private final IcebergCatalogConfig catalogConfig;
private final String filePrefix;
private final Schema dataSchema;
- static final Cache<TableIdentifier, LastRefreshedTable> LAST_REFRESHED_TABLE_CACHE =
- CacheBuilder.newBuilder().expireAfterAccess(10, TimeUnit.MINUTES).build();
+ private int specId = Integer.MIN_VALUE;
+ private Map<String, PartitionField> partitionFieldMap = Maps.newHashMap();
Review Comment:

Storing `specId` and `partitionFieldMap` as single instance fields in
`WriteDoFn` is incorrect: one `WriteDoFn` instance can process elements for
several tables (destinations) within the same bundle. Because different tables
can share a `specId` (e.g., `0`), the partition field map cached for one table
can be returned for another, producing incorrect partition paths or failures.
These fields should be keyed by `TableIdentifier` instead.
```suggestion
private final Map<TableIdentifier, Integer> specIds = Maps.newHashMap();
private final Map<TableIdentifier, Map<String, PartitionField>> partitionFieldMaps =
    Maps.newHashMap();
```
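The collision described above can be reproduced without Iceberg. The following is a minimal sketch, not the PR's code: `FakeTable` and `SingleSlotCacheDemo` are hypothetical stand-ins, and field names map to plain strings rather than `PartitionField` objects. It only illustrates why a cache hit decided by spec ID alone is unsafe when two tables share spec ID `0`.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins: FakeTable replaces Iceberg's Table, and partition fields are
// represented as plain strings instead of PartitionField objects.
class SingleSlotCacheDemo {
  static final class FakeTable {
    final String id;
    final int specId;
    final List<String> partitionFields;

    FakeTable(String id, int specId, List<String> partitionFields) {
      this.id = id;
      this.specId = specId;
      this.partitionFields = partitionFields;
    }
  }

  // One slot shared by every table, mirroring the two fields under review.
  private int specId = Integer.MIN_VALUE;
  private Map<String, String> partitionFieldMap = new HashMap<>();

  Map<String, String> getPartitionFieldMap(FakeTable table) {
    if (table.specId == this.specId) {
      // The hit is decided by spec ID alone; the table identity is never checked.
      return partitionFieldMap;
    }
    Map<String, String> fresh = new HashMap<>();
    for (String field : table.partitionFields) {
      fresh.put(field, table.id + "." + field);
    }
    this.specId = table.specId;
    this.partitionFieldMap = fresh;
    return fresh;
  }

  public static void main(String[] args) {
    SingleSlotCacheDemo cache = new SingleSlotCacheDemo();
    FakeTable orders = new FakeTable("db.orders", 0, Arrays.asList("day"));
    FakeTable users = new FakeTable("db.users", 0, Arrays.asList("country"));

    cache.getPartitionFieldMap(orders);
    // Both tables use spec ID 0, so this lookup returns the map built for db.orders.
    System.out.println(cache.getPartitionFieldMap(users).keySet());
  }
}
```

In this sketch the second lookup returns the map built for `db.orders`, so `country` is never registered for `db.users`.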
##########
sdks/java/io/iceberg/src/main/java/org/apache/beam/sdk/io/iceberg/WritePartitionedRowsToFiles.java:
##########
@@ -113,9 +108,8 @@ public void processElement(
String partitionPath = checkStateNotNull(element.getKey().getString(PARTITION));
IcebergDestination destination = dynamicDestinations.instantiateDestination(tableIdentifier);
- LastRefreshedTable lastRefreshedTable = getOrCreateTable(destination, dataSchema);
- Table table = lastRefreshedTable.table;
- partitionPath = getPartitionDataPath(partitionPath, lastRefreshedTable.partitionFieldMap);
+ Table table = getOrCreateTable(destination, dataSchema);
+ partitionPath = getPartitionDataPath(partitionPath, getPartitionFieldMap(table));
Review Comment:

Update the call to `getPartitionFieldMap` to pass the table identifier so
that the correct partition field map is retrieved for the specific table.
```suggestion
partitionPath = getPartitionDataPath(partitionPath,
getPartitionFieldMap(destination.getTableIdentifier(), table));
```
##########
sdks/java/io/iceberg/src/main/java/org/apache/beam/sdk/io/iceberg/WritePartitionedRowsToFiles.java:
##########
@@ -146,60 +140,30 @@ public void processElement(
.build());
}
- static final class LastRefreshedTable {
- final Table table;
- volatile Instant lastRefreshTime;
- static final Duration STALENESS_THRESHOLD = Duration.ofMinutes(2);
- private int specId;
- volatile Map<String, PartitionField> partitionFieldMap = Maps.newHashMap();
-
- LastRefreshedTable(Table table, Instant lastRefreshTime) {
- this.table = table;
- this.specId = table.spec().specId();
- this.lastRefreshTime = lastRefreshTime;
- for (PartitionField partitionField : table.spec().fields()) {
- partitionFieldMap.put(partitionField.name(), partitionField);
- }
+ private Map<String, PartitionField> getPartitionFieldMap(Table table) {
+ if (table.spec().specId() == this.specId) {
+ return partitionFieldMap;
}
-
- /**
- * Refreshes the table metadata if it is considered stale (older than 2 minutes).
- *
- * <p>This method first performs a non-synchronized check on the table's freshness. This
- * provides a lock-free fast path that avoids synchronization overhead in the common case
- * where the table does not need to be refreshed. If the table might be stale, it then enters
- * a synchronized block to ensure that only one thread performs the refresh operation.
- */
- void refreshIfStale() {
- // Fast path: Avoid entering the synchronized block if the table is not stale.
- if (lastRefreshTime.isAfter(Instant.now().minus(STALENESS_THRESHOLD))) {
- return;
- }
- synchronized (this) {
- if (lastRefreshTime.isBefore(Instant.now().minus(STALENESS_THRESHOLD))) {
- table.refresh();
- lastRefreshTime = Instant.now();
- if (table.spec().specId() != this.specId) {
- partitionFieldMap = Maps.newHashMap();
- for (PartitionField partitionField : table.spec().fields()) {
- partitionFieldMap.put(partitionField.name(), partitionField);
- }
- this.specId = table.spec().specId();
- }
- }
- }
+ Map<String, PartitionField> partitionFieldMap = Maps.newHashMap();
+ for (PartitionField partitionField : table.spec().fields()) {
+ partitionFieldMap.put(partitionField.name(), partitionField);
}
+ this.specId = table.spec().specId();
+ this.partitionFieldMap = partitionFieldMap;
+ return partitionFieldMap;
}
Review Comment:

Update `getPartitionFieldMap` to accept the `TableIdentifier` and look
up/store the partition field map and spec ID per table identifier.
```suggestion
private Map<String, PartitionField> getPartitionFieldMap(
    TableIdentifier tableId, Table table) {
  int currentSpecId = table.spec().specId();
  if (currentSpecId == specIds.getOrDefault(tableId, Integer.MIN_VALUE)) {
    return partitionFieldMaps.get(tableId);
  }
  Map<String, PartitionField> fieldMap = Maps.newHashMap();
  for (PartitionField partitionField : table.spec().fields()) {
    fieldMap.put(partitionField.name(), partitionField);
  }
  specIds.put(tableId, currentSpecId);
  partitionFieldMaps.put(tableId, fieldMap);
  return fieldMap;
}
```
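As a possible variation on the suggestion above, the spec ID and its field map could live in one entry per table, so the two values cannot drift apart the way two parallel maps might. The following is a minimal sketch under assumptions, not part of the PR: `PerTableSpecCache`, `Entry`, and `FakeTable` are hypothetical names, a `String` stands in for `TableIdentifier`, and plain strings stand in for `PartitionField`.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins: a String id replaces TableIdentifier, FakeTable replaces
// Iceberg's Table, and partition fields are plain strings.
class PerTableSpecCache {
  static final class FakeTable {
    final String id;
    final int specId;
    final List<String> partitionFields;

    FakeTable(String id, int specId, List<String> partitionFields) {
      this.id = id;
      this.specId = specId;
      this.partitionFields = partitionFields;
    }
  }

  // Spec ID and field map are stored together so they are always replaced as a pair.
  private static final class Entry {
    final int specId;
    final Map<String, String> fieldMap;

    Entry(int specId, Map<String, String> fieldMap) {
      this.specId = specId;
      this.fieldMap = fieldMap;
    }
  }

  private final Map<String, Entry> entries = new HashMap<>();

  Map<String, String> getPartitionFieldMap(FakeTable table) {
    Entry cached = entries.get(table.id);
    if (cached != null && cached.specId == table.specId) {
      return cached.fieldMap; // Same table and same spec: reuse the cached map.
    }
    // First lookup for this table, or its spec changed: rebuild and replace the entry.
    Map<String, String> fresh = new HashMap<>();
    for (String field : table.partitionFields) {
      fresh.put(field, table.id + "." + field);
    }
    entries.put(table.id, new Entry(table.specId, fresh));
    return fresh;
  }

  public static void main(String[] args) {
    PerTableSpecCache cache = new PerTableSpecCache();
    FakeTable orders = new FakeTable("db.orders", 0, Arrays.asList("day"));
    FakeTable users = new FakeTable("db.users", 0, Arrays.asList("country"));

    // Both tables share spec ID 0 but now resolve to their own field maps.
    System.out.println(cache.getPartitionFieldMap(orders).keySet());
    System.out.println(cache.getPartitionFieldMap(users).keySet());
  }
}
```

Whether one combined map or the two maps from the suggestion reads better in `WriteDoFn` is a style choice; both key the lookup by table, which is the property that matters here.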