[ https://issues.apache.org/jira/browse/IMPALA-13254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18019311#comment-18019311 ]
ASF subversion and git services commented on IMPALA-13254:
----------------------------------------------------------

Commit 711797e7fbda6f30fc49d91e30ad6ab31a4f4a69 in impala's branch refs/heads/master from Zoltan Borok-Nagy
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=711797e7f ]

IMPALA-14349: Encode FileDescriptors in time in loading Iceberg Tables

With this patch we create Iceberg file descriptors from
LocatedFileStatus objects during IcebergFileMetadataLoader's
parallelListing(). This has the following benefits:
* We parallelize the creation of Iceberg file descriptor objects
* We don't need to maintain a large hash map with all the
  LocatedFileStatus objects at once. Now we only need to keep a few
  LocatedFileStatus objects per partition in memory while we are
  converting them to Iceberg file descriptors. I.e., the GC is free to
  destroy the LocatedFileStatus objects we don't use anymore.

This patch retires startup flag 'iceberg_reload_new_files_threshold'.
Since IMPALA-13254 we only list partitions that have new data files,
and we load them in parallel, i.e. efficient incremental table loading
is already covered. From that point the startup flag only added
unnecessary code complexity.
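
The idea described above (convert each partition's listing to compact descriptors inside the parallel listing task, instead of first collecting every status object in one large map) can be sketched roughly as follows. This is an illustrative Python sketch, not Impala's Java code: FileStatus, FileDescriptor, list_partition, load_partition and load_table are hypothetical stand-ins, and the in-memory `listing` dict is an assumption that replaces real filesystem calls.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStatus:
    """Hypothetical stand-in for a heavyweight LocatedFileStatus object."""
    path: str
    length: int
    block_hosts: tuple


@dataclass(frozen=True)
class FileDescriptor:
    """Hypothetical compact descriptor kept after loading."""
    rel_path: str
    length: int


def list_partition(partition_dir, listing):
    # Assumption: `listing` maps a partition directory to (name, size) pairs
    # and stands in for a real filesystem listing call.
    return [FileStatus(f"{partition_dir}/{name}", size, ("host1",))
            for name, size in listing[partition_dir]]


def load_partition(partition_dir, listing, table_root):
    # List one partition and convert its statuses right away. Only this
    # partition's status objects are alive while the task runs; they become
    # garbage as soon as the function returns.
    statuses = list_partition(partition_dir, listing)
    return [FileDescriptor(s.path[len(table_root) + 1:], s.length)
            for s in statuses]


def load_table(listing, table_root, workers=4):
    # Each worker both lists and converts, so descriptor creation is
    # parallelized along with the listing itself.
    descriptors = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        tasks = pool.map(
            lambda p: load_partition(p, listing, table_root), sorted(listing))
        for partition_descriptors in tasks:
            for fd in partition_descriptors:
                descriptors[fd.rel_path] = fd
    return descriptors
```

Only the small FileDescriptor objects are retained in the result; no table-wide map of FileStatus objects is ever built.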

Measurements

I created two tables (from tpcds.store_sales) to measure table loading
times for large tables:

Table #1:
  PARTITIONED BY SPEC(ss_item_sk, BUCKET(5, ss_sold_time_sk))
  partitions: 107818
  files: 754726

Table #2:
  PARTITIONED BY SPEC(ss_item_sk)
  partitions: 18000
  files: 504224

Time taken in IcebergFileMetadataLoader.load() during full table reload:
+----------+-------+------+---------+
|          | Base  | New  | Speedup |
+----------+-------+------+---------+
| Table #1 | 17.3s | 8.1s | 2.14    |
| Table #2 | 7.8s  | 4.3s | 1.8     |
+----------+-------+------+---------+

I measured incremental table loading only for Table #2 (since there are
more files per partition this is the worse scenario for the new code, as
it only uses file listings, and each new file was created in a separate
partition)

Time taken in IcebergFileMetadataLoader.load() during incremental table
reload:
+------------+------+------+---------+
| #new files | Base | New  | Speedup |
+------------+------+------+---------+
| 1          | 1.4s | 1.6s | 0.9     |
| 100        | 1.5s | 1.9s | 0.8     |
| 200        | 1.5s | 1.5s | 1       |
+------------+------+------+---------+

We lose a few tenths of a second, but I think the simplified code
justifies it.
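
For readers who want to check the Speedup columns, the arithmetic can be reproduced with a few lines of Python. The sketch assumes that Speedup means Base time divided by New time, which the commit message does not state explicitly; the variable names are illustrative only, and the timings are copied from the tables above.

```python
def speedup(base_s, new_s):
    # Assumed definition: how many times faster the new code is than the base.
    return base_s / new_s


# (Base, New) timings in seconds, copied from the measurement tables.
full_reload = {"Table #1": (17.3, 8.1), "Table #2": (7.8, 4.3)}
incremental = {1: (1.4, 1.6), 100: (1.5, 1.9), 200: (1.5, 1.5)}

full_speedups = {name: round(speedup(base, new), 2)
                 for name, (base, new) in full_reload.items()}
incr_speedups = {files: round(speedup(base, new), 1)
                 for files, (base, new) in incremental.items()}
# Absolute cost of the new code in the incremental case, in seconds.
incr_delta_s = {files: round(new - base, 1)
                for files, (base, new) in incremental.items()}

for name, value in full_speedups.items():
    print(f"{name}: {value}x")
for files, value in incr_speedups.items():
    print(f"{files} new files: {value}x, {incr_delta_s[files]:+}s")
```

Under that assumption the ratios agree with the reported columns to the printed precision, and the incremental differences range from 0 to 0.4 seconds, matching the "few tenths of a second" remark.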

Testing:
* some tests were updated because we don't have startup flag
  'iceberg_reload_new_files_threshold' anymore

Change-Id: Ia1c2a7119d76db7ce7c43caec2ccb122a014851b
Reviewed-on: http://gerrit.cloudera.org:8080/23363
Reviewed-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>

> Optimizing incremental reload performance of Iceberg tables
> -----------------------------------------------------------
>
>                 Key: IMPALA-13254
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13254
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Catalog
>    Affects Versions: Impala 4.4.0
>            Reporter: Fu Lili
>            Assignee: Fu Lili
>            Priority: Major
>             Fix For: Impala 4.5.0
>
>
> When performing a {{REFRESH}} on an Iceberg table, if the number of changed
> files exceeds the {{iceberg_reload_new_files_threshold}} configuration
> (default is 100), a highly inefficient reload operation is triggered.
> The main issue with this code lies in the
> {{IcebergFileMetadataLoader.getFileStatuses}} function. During incremental
> loading, the {{listWithLocations}} parameter is always set to {{false}},
> resulting in {{fs.getFileStatus}} and {{fs.getFileBlockLocations}} operations
> being performed on each {{contentFile}} sequentially (if the filesystem
> supports {{StorageIds}}).
> To optimize this logic, the following changes can be made:
> # In the {{IcebergFileMetadataLoader.getFileStatuses}} function, always
> trigger {{parallelListing}} to quickly retrieve {{nameToFileStatus}},
> avoiding the sequential fetching for each {{contentFile}}.
> # Increase the default value of {{iceberg_reload_new_files_threshold}} to
> 1000. When changes are fewer than {{iceberg_reload_new_files_threshold}},
> perform a single RPC for each changed file to get the {{FileDescriptor}}.
> The average time for a single operation is 1 to 3 milliseconds, so 1000
> operations would take approximately 1 to 3 seconds, which is within a
> reasonable range.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)