Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/23906 )
Change subject: IMPALA-13122: Add detailed file metadata statistics to table loading logs ...................................................................... Patch Set 2: (2 comments) > Can you review this related to efforts to reduce Iceberg metadata memory > pressure? I'm a bit worried about the extra file metadata we'll be keeping > around. I think IcebergTable is fine the file metadata loading code path is different and not modified by this patch. https://github.com/apache/impala/blob/09bed366fe7e681108f164ee53433b791dada90a/fe/src/main/java/org/apache/impala/catalog/IcebergTable.java#L523-L549 BTW, when loading Iceberg tables, there are logs from LoggingMetricsReporter.java showing a scan report which has more information like number of data/delete files. I think they are good enough. > About the catalogd logs: assert_catalogd_log_contains() is only used in > custom cluster tests, probably it doesn't work in normal tests. > The following suggests that it may need non-default flags to work: > https://github.com/apache/impala/blob/3dac0135fba0717dd977043e7ecc6b52bf55189f/tests/common/impala_test_suite.py#L1664C7-L1664C18 > > It also doesn't seem reliable in EE tests by design, even if only one test is > run in parallel: the function doesn't check if the log line comes from the > given query (e.g. by checking timestamp of log lines) and a previous test can > make the log "dirty". This is ok in tests that always restart catalog (and > create a new log file). > > It would be nice to improve the log detection logic by ignoring older log > lines, but in this case I think that it is ok to simply not write automatic > tests at all and just write in the commit message that the logs were checked > manually. Or we can simply implement the test to be a custom-cluster test to avoid impact by other tests. http://gerrit.cloudera.org:8080/#/c/23906/2/fe/src/main/java/org/apache/impala/catalog/FeFsTable.java File fe/src/main/java/org/apache/impala/catalog/FeFsTable.java: http://gerrit.cloudera.org:8080/#/c/23906/2/fe/src/main/java/org/apache/impala/catalog/FeFsTable.java@124 PS2, Line 124: // Set of unique host indices (for HDFS/Ozone only) : public Set<Integer> uniqueHostIndices = new HashSet<>(); : // Set of unique disk IDs (for HDFS/Ozone only) : public Set<Short> uniqueDiskIds = new HashSet<>(); > Should we change this to track host:disk pairs instead, or add it as an add Agree on this. Disk IDs without the host is useless since they are 0-based indexes on a given host. Just replace this with host:disk pairs. https://github.com/apache/impala/blob/049e105/fe/src/main/java/org/apache/impala/catalog/DiskIdMapper.java#L80 http://gerrit.cloudera.org:8080/#/c/23906/2/fe/src/main/java/org/apache/impala/catalog/FeFsTable.java@188 PS2, Line 188: // Note: We cannot accurately update min/max values when removing stats : // Remove hosts and disks from the sets if they belong to the removed files > Simple removeAll() is O(n), while counter maps may add memory overhead and I think this is not that important and we can leave it for future tasks. This remove() method is only used in HdfsTable.dropPartition() and not used in the metadata loading code paths. When it's used, it updates fileMetadataStats_ of the table. The new fields of fileMetadataStats_ are not used anywhere (e.g. WebUI/metrics) so it's fine to pick a simple solution or even not updating them here. But at least we should add a comment on fileMetadataStats_ to explain this. Make sure future developers won't trust these new fields of fileMetadataStats_. You can also file a JIRA for correctly updating these fields of fileMetadataStats_ in remove(). On the other side, everytime when HdfsTable.load() runs, it will update fileMetadataStats_ with all current partitions to get the fresh values. So most of the time fileMetadataStats_ are still correct. https://github.com/apache/impala/blob/09bed366fe7e681108f164ee53433b791dada90a/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java#L1324 -- To view, visit http://gerrit.cloudera.org:8080/23906 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6f4592f173c047e5064058402f83be6d1f5c9a79 Gerrit-Change-Number: 23906 Gerrit-PatchSet: 2 Gerrit-Owner: Arnab Karmakar <[email protected]> Gerrit-Reviewer: Arnab Karmakar <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Jason Fehr <[email protected]> Gerrit-Reviewer: Michael Smith <[email protected]> Gerrit-Reviewer: Pranav Lodha <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Mon, 09 Feb 2026 10:40:31 +0000 Gerrit-HasComments: Yes
