[Impala-ASF-CR] IMPALA-13122: Add detailed file metadata statistics to table loading logs

Quanlong Huang (Code Review) Mon, 09 Feb 2026 02:40:45 -0800

Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/23906 )


Change subject: IMPALA-13122: Add detailed file metadata statistics to table 
loading logs
......................................................................


Patch Set 2:

(2 comments)

> Can you review this related to efforts to reduce Iceberg metadata memory 
> pressure? I'm a bit worried about the extra file metadata we'll be keeping 
> around.

I think IcebergTable is fine the file metadata loading code path is different 
and not modified by this patch.
https://github.com/apache/impala/blob/09bed366fe7e681108f164ee53433b791dada90a/fe/src/main/java/org/apache/impala/catalog/IcebergTable.java#L523-L549

BTW, when loading Iceberg tables, there are logs from 
LoggingMetricsReporter.java showing a scan report which has more information 
like number of data/delete files. I think they are good enough.

> About the catalogd logs: assert_catalogd_log_contains() is only used in 
> custom cluster tests, probably it doesn't work in normal tests.
> The following suggests that it may need non-default flags to work:
> https://github.com/apache/impala/blob/3dac0135fba0717dd977043e7ecc6b52bf55189f/tests/common/impala_test_suite.py#L1664C7-L1664C18
>
> It also doesn't seem reliable in EE tests by design, even if only one test is 
> run in parallel: the function doesn't check if the log line comes from the 
> given query (e.g. by checking timestamp of log lines) and a previous test can 
> make the log "dirty". This is ok in tests that always restart catalog (and 
> create a new log file).
>
> It would be nice to improve the log detection logic by ignoring older log 
> lines, but in this case I think that it is ok to simply not write automatic 
> tests at all and just write in the commit message that the logs were checked 
> manually.

Or we can simply implement the test to be a custom-cluster test to avoid impact 
by other tests.

http://gerrit.cloudera.org:8080/#/c/23906/2/fe/src/main/java/org/apache/impala/catalog/FeFsTable.java
File fe/src/main/java/org/apache/impala/catalog/FeFsTable.java:

http://gerrit.cloudera.org:8080/#/c/23906/2/fe/src/main/java/org/apache/impala/catalog/FeFsTable.java@124
PS2, Line 124:     // Set of unique host indices (for HDFS/Ozone only)
             :     public Set<Integer> uniqueHostIndices = new HashSet<>();
             :     // Set of unique disk IDs (for HDFS/Ozone only)
             :     public Set<Short> uniqueDiskIds = new HashSet<>();
> Should we change this to track host:disk pairs instead, or add it as an add
Agree on this. Disk IDs without the host is useless since they are 0-based 
indexes on a given host. Just replace this with host:disk pairs.

https://github.com/apache/impala/blob/049e105/fe/src/main/java/org/apache/impala/catalog/DiskIdMapper.java#L80


http://gerrit.cloudera.org:8080/#/c/23906/2/fe/src/main/java/org/apache/impala/catalog/FeFsTable.java@188
PS2, Line 188:       // Note: We cannot accurately update min/max values when 
removing stats
             :       // Remove hosts and disks from the sets if they belong to 
the removed files
> Simple removeAll() is O(n), while counter maps may add memory overhead and
I think this is not that important and we can leave it for future tasks. This 
remove() method is only used in HdfsTable.dropPartition() and not used in the 
metadata loading code paths. When it's used, it updates fileMetadataStats_ of 
the table. The new fields of fileMetadataStats_ are not used anywhere (e.g. 
WebUI/metrics) so it's fine to pick a simple solution or even not updating them 
here. But at least we should add a comment on fileMetadataStats_ to explain 
this. Make sure future developers won't trust these new fields of 
fileMetadataStats_. You can also file a JIRA for correctly updating these 
fields of fileMetadataStats_ in remove().

On the other side, everytime when HdfsTable.load() runs, it will update 
fileMetadataStats_ with all current partitions to get the fresh values. So most 
of the time fileMetadataStats_ are still correct.
https://github.com/apache/impala/blob/09bed366fe7e681108f164ee53433b791dada90a/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java#L1324



--
To view, visit http://gerrit.cloudera.org:8080/23906
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6f4592f173c047e5064058402f83be6d1f5c9a79
Gerrit-Change-Number: 23906
Gerrit-PatchSet: 2
Gerrit-Owner: Arnab Karmakar <[email protected]>
Gerrit-Reviewer: Arnab Karmakar <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Jason Fehr <[email protected]>
Gerrit-Reviewer: Michael Smith <[email protected]>
Gerrit-Reviewer: Pranav Lodha <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Mon, 09 Feb 2026 10:40:31 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-13122: Add detailed file metadata statistics to table loading logs

Reply via email to