Daniel Becker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/23113 )

Change subject: IMPALA-13267: Display number of partitions for Iceberg tables
......................................................................


Patch Set 12:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/23113/12/fe/src/main/java/org/apache/impala/catalog/IcebergFileMetadataLoader.java
File fe/src/main/java/org/apache/impala/catalog/IcebergFileMetadataLoader.java:

http://gerrit.cloudera.org:8080/#/c/23113/12/fe/src/main/java/org/apache/impala/catalog/IcebergFileMetadataLoader.java@242
PS12, Line 242:     synchronized (loadedIcebergPartitions_) {
              :       partitionId = loadedIcebergPartitions_.get(partition);
              :       if (partitionId == null) {
              :         partitionId = loadedIcebergPartitions_.size();
              :         loadedIcebergPartitions_.put(partition, partitionId);
              :       }
              :     }
> We can increase concurrency with this construct:
Can't we have concurrency problems with all of these approaches?


In the original one:
First we query the size of the map, then we insert an element. Because it is a 
concurrent map, it is guaranteed that these are "atomic" individually, but not 
together: after we query the size, another thread may insert or remove elements 
before we do the insertion. Note that it is not guaranteed (and probably not 
true) that the map uses its own default monitor for implementing its methods 
thread-safely, so having the two statements in a synchronized block does not 
help. Also, we have very similar code on L189 without a synchronized block.


The computeIfAbsent() approach:
It can still collide with the code on L189, or any other code: if somebody 
modifies the map concurrently, the counter could be out-of-sync.


Are this and L189 the only places where we modify this map? If so, then maybe 
we don't need a ConcurrentHashMap, instead we could manually synchronize a 
normal map. The problem is complicated by the getIcebergPartitions() method, 
which returns a reference to this map. Do we know whether access to the map 
through getIcebergPartitions() can happen simultaneously with these 
modifications? If not, we could simply return a constant view of the map (or 
not even bother because we're not modifying it anyway).

If access and modification can be concurrent, we could consider returning a 
(shallow) copy in getIcebergPartitions(), although that may lead to worse 
performance if this map is huge. Alternatively, we could return a constant view 
of a ConcurrentHashMap so that reading the map (by users of 
getIcebergPartitions()) is thread-safe in case we concurrently modify it (here 
in this class), but in the code within this class we don't need to worry about 
some other code (the caller of getIcebergPartitions()) modifying the map.



--
To view, visit http://gerrit.cloudera.org:8080/23113
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ifb2f654bc6c9bdf9cfafc27b38b5ca2f7b6b4872
Gerrit-Change-Number: 23113
Gerrit-PatchSet: 12
Gerrit-Owner: Noemi Pap-Takacs <npaptak...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <daniel.bec...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Noemi Pap-Takacs <npaptak...@cloudera.com>
Gerrit-Reviewer: Peter Rozsa <pro...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Thu, 11 Sep 2025 13:07:20 +0000
Gerrit-HasComments: Yes

Reply via email to