Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/23113 )
Change subject: IMPALA-13267: Display number of partitions for Iceberg tables ...................................................................... Patch Set 12: (2 comments) http://gerrit.cloudera.org:8080/#/c/23113/12/fe/src/main/java/org/apache/impala/catalog/IcebergFileMetadataLoader.java File fe/src/main/java/org/apache/impala/catalog/IcebergFileMetadataLoader.java: http://gerrit.cloudera.org:8080/#/c/23113/12/fe/src/main/java/org/apache/impala/catalog/IcebergFileMetadataLoader.java@242 PS12, Line 242: synchronized (loadedIcebergPartitions_) { : partitionId = loadedIcebergPartitions_.get(partition); : if (partitionId == null) { : partitionId = loadedIcebergPartitions_.size(); : loadedIcebergPartitions_.put(partition, partitionId); : } : } We can increase concurrency with this construct: partitionId = loadedIcebergPartitions_.get(partition); if (partitionId == null) { synchronized (loadedIcebergPartitions_) { // Repeat get() as someone might have alredy insterted the partition. partitionId = loadedIcebergPartitions_.get(partition); if (partitionId == null) { partitionId = loadedIcebergPartitions_.size(); loadedIcebergPartitions_.put(partition, partitionId); } } } So different threads can freely check in parallel if 'partition' is present in the map. But 'loadedIcebergPartitions_.size()' can still be problematic as it is not guaranteed to be O(1). Consider the following which should be the most efficient, also much cleaner than the above double-checked locking: private final AtomicInteger nextPartitionId = new AtomicInteger(); ... private int addPartitionInfo(ContentFile<?> contentFile) { TIcebergPartition partition = IcebergUtil.createIcebergPartitionInfo(iceTbl_, contentFile); return loadedIcebergPartitions_.computeIfAbsent( partition, p -> nextPartitionId.getAndIncrement() ); } http://gerrit.cloudera.org:8080/#/c/23113/12/fe/src/main/java/org/apache/impala/catalog/IcebergFileMetadataLoader.java@354 PS12, Line 354: createLocatedFd Braindumping some future optimization possibilities: here we call this for content files that live in the same directory, i.e. probably in the same partition as well, so we could probably fast-check if they have the same partition then we can use the same partition id. We could even group the content files based on the actual partitions instead of using the parent directories. That happens in collectPartitionPaths() which is currently single-threaded so it is not an obvious win. Just to be clear, none if these has to happen in the current CR. -- To view, visit http://gerrit.cloudera.org:8080/23113 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ifb2f654bc6c9bdf9cfafc27b38b5ca2f7b6b4872 Gerrit-Change-Number: 23113 Gerrit-PatchSet: 12 Gerrit-Owner: Noemi Pap-Takacs <npaptak...@cloudera.com> Gerrit-Reviewer: Daniel Becker <daniel.bec...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Noemi Pap-Takacs <npaptak...@cloudera.com> Gerrit-Reviewer: Peter Rozsa <pro...@cloudera.com> Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Comment-Date: Thu, 11 Sep 2025 09:21:12 +0000 Gerrit-HasComments: Yes