Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/23113 )

Change subject: IMPALA-13267: Display number of partitions for Iceberg tables
......................................................................


Patch Set 12:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/23113/12/fe/src/main/java/org/apache/impala/catalog/IcebergFileMetadataLoader.java
File fe/src/main/java/org/apache/impala/catalog/IcebergFileMetadataLoader.java:

http://gerrit.cloudera.org:8080/#/c/23113/12/fe/src/main/java/org/apache/impala/catalog/IcebergFileMetadataLoader.java@242
PS12, Line 242:     synchronized (loadedIcebergPartitions_) {
              :       partitionId = loadedIcebergPartitions_.get(partition);
              :       if (partitionId == null) {
              :         partitionId = loadedIcebergPartitions_.size();
              :         loadedIcebergPartitions_.put(partition, partitionId);
              :       }
              :     }
We can increase concurrency with this construct:

    partitionId = loadedIcebergPartitions_.get(partition);
    if (partitionId == null) {
      synchronized (loadedIcebergPartitions_) {
        // Repeat get() as someone might have alredy insterted the partition.
        partitionId = loadedIcebergPartitions_.get(partition);
        if (partitionId == null) {
          partitionId = loadedIcebergPartitions_.size();
          loadedIcebergPartitions_.put(partition, partitionId);
        }
      }
    }

So different threads can freely check in parallel if 'partition' is present in 
the map.

But 'loadedIcebergPartitions_.size()' can still be problematic as it is not 
guaranteed to be O(1).

Consider the following which should be the most efficient, also much cleaner 
than the above double-checked locking:

  private final AtomicInteger nextPartitionId = new AtomicInteger();

  ...
  private int addPartitionInfo(ContentFile<?> contentFile) {
    TIcebergPartition partition =
        IcebergUtil.createIcebergPartitionInfo(iceTbl_, contentFile);
    return loadedIcebergPartitions_.computeIfAbsent(
        partition,
        p -> nextPartitionId.getAndIncrement()
    );
  }


http://gerrit.cloudera.org:8080/#/c/23113/12/fe/src/main/java/org/apache/impala/catalog/IcebergFileMetadataLoader.java@354
PS12, Line 354: createLocatedFd
Braindumping some future optimization possibilities: here we call this for 
content files that live in the same directory, i.e. probably in the same 
partition as well, so we could probably fast-check if they have the same 
partition then we can use the same partition id.

We could even group the content files based on the actual partitions instead of 
using the parent directories. That happens in collectPartitionPaths() which is 
currently single-threaded so it is not an obvious win.

Just to be clear, none if these has to happen in the current CR.



--
To view, visit http://gerrit.cloudera.org:8080/23113
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ifb2f654bc6c9bdf9cfafc27b38b5ca2f7b6b4872
Gerrit-Change-Number: 23113
Gerrit-PatchSet: 12
Gerrit-Owner: Noemi Pap-Takacs <npaptak...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <daniel.bec...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Noemi Pap-Takacs <npaptak...@cloudera.com>
Gerrit-Reviewer: Peter Rozsa <pro...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Thu, 11 Sep 2025 09:21:12 +0000
Gerrit-HasComments: Yes

Reply via email to