Opened the following tickets to track the items that came out of this thread:
1. IMPALA-13651 <https://issues.apache.org/jira/browse/IMPALA-13651>: Iceberg tables load column stats twice
2. IMPALA-13652 <https://issues.apache.org/jira/browse/IMPALA-13652>: Optimize plain SELECT LIMIT queries for Iceberg tables

Cheers,
Zoltan

On Thu, Jan 9, 2025 at 2:13 PM Saulius Valatka <saulius...@gmail.com> wrote:

sure, attached the logs

On 2025-01-09 (Thu) at 14:02, Zoltán Borók-Nagy <borokna...@cloudera.com> wrote:

Thanks Saulius, this is very interesting. Trino seems to be much better optimized for small queries on huge tables. This is definitely room for improvement for us.

Where most of the time was being spent:
  - Metadata of all 2 tables cached: 42s528ms (42s528ms)
  - Single node plan created: 1m13s (30s493ms)
  - Completed admission: 1m55s (28s320ms)

Since the query has a LIMIT 10 clause, we could probably short-circuit the first two (I guess there are no delete files), and then admission would probably also be faster.

Is it possible to filter the coordinator logs for the query id and send them to us, if they don't contain any sensitive information (column names, etc.)? That'll provide us with more hints about where the time was spent.

Cheers,
Zoltan

On Thu, Jan 9, 2025 at 12:47 PM Saulius Valatka <saulius...@gmail.com> wrote:

Yeah, sorry, I was probably wrong about the "blocked" part: it's the select statement without a filter that's taking very long, regardless of any refreshes running in parallel. Adding a partition filter for the last day brings the execution time down to ~10 seconds (still well short of Trino, which can do it in ~2 seconds).

On 2025-01-09 (Thu) at 13:31, Saulius Valatka <saulius...@gmail.com> wrote:

The table being "blocked" is just my hypothesis, maybe I'm wrong. I can see that while a REFRESH statement is running, a simple "select * from table limit 10" sits in the "CREATED Query submitted" state until roughly when the refresh statement finishes.
Just out of curiosity I launched a Trino cluster on the same servers as Impala and hooked it up to the same Iceberg catalog: an identical "select * from table limit 10" runs in ~2 seconds there, whereas in Impala it takes ~2 minutes.
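For reference, the two query shapes being compared look roughly like this; events_table and event_date are made-up placeholder names, not the real schema:

    -- Plain LIMIT query: ~2 seconds in Trino vs ~2 minutes in Impala on the same table
    SELECT * FROM events_table LIMIT 10;

    -- The same query with a partition filter for the last day: ~10 seconds in Impala
    SELECT * FROM events_table
    WHERE event_date >= '2025-01-08'
    LIMIT 10;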
Here are the details for the select query execution:

Query Compilation: 1m14s
  - Metadata of all 2 tables cached: 42s528ms (42s528ms)
  - Analysis finished: 42s539ms (10.112ms)
  - Authorization finished (ranger): 42s540ms (1.246ms)
  - Value transfer graph computed: 42s540ms (609.596us)
  - Single node plan created: 1m13s (30s493ms)
  - Runtime filters computed: 1m13s (44.325us)
  - Distributed plan created: 1m13s (29.075us)
  - Parallel plans created: 1m13s (266.400us)
  - Planning finished: 1m14s (1s727ms)
Query Timeline: 2m13s
  - Query submitted: 50.627us (50.627us)
  - Planning finished: 1m27s (1m27s)
  - Submit for admission: 1m27s (273.383us)
  - Completed admission: 1m55s (28s320ms)
  - Ready to start on 121 backends: 1m55s (91.685ms)
  - All 121 execution backends (1921 fragment instances) started: 2m7s (11s578ms)
  - Rows available: 2m7s (1.412ms)
  - First row fetched: 2m7s (291.051ms)
  - Last row fetched: 2m7s (225.054ms)
  - Released admission control resources: 2m13s (5s433ms)
  - Unregister query: 2m13s (100.938ms)

Catalog Server Operation: 49s120ms
  - Got Metastore client: 3.116us (3.116us)
  - Got catalog version read lock: 2.250ms (2.247ms)
  - Got catalog version write lock and table write lock: 2.315ms (65.074us)
  - Got Metastore client: 2.320ms (4.258us)
  - Fetched table from Metastore: 12.885ms (10.565ms)
  - Loaded Iceberg API table: 72.570ms (59.684ms)
  - Loaded schema from Iceberg: 72.751ms (181.127us)
  - Loaded Iceberg files: 7s925ms (7s852ms)
  - Loaded all column stats: 7s953ms (28.404ms)
  - Loaded table schema: 7s958ms (5.226ms)
  - Start refreshing file metadata: 7s959ms (431.115us)
  - Loaded file metadata for 1 partitions: 22s081ms (14s122ms)
  - Loaded all column stats: 49s120ms (27s038ms)
  - Loaded table: 49s120ms (30.027us)
  - Finished resetMetadata request: 49s120ms (312.849us)
Query Compilation: 53s727ms
  - Metadata of all 2 tables cached: 53s726ms (53s726ms)
  - Analysis finished: 53s726ms (67.048us)
  - Authorization finished (ranger): 53s727ms (422.569us)
  - Planning finished: 53s727ms (137.463us)
Query Timeline: 1m44s
  - Query submitted: 42.071us (42.071us)
  - Planning finished: 53s735ms (53s735ms)
  - CatalogDdlRequest finished: 1m42s (49s122ms)
  - Applied catalog updates from DDL: 1m43s (1s049ms)
  - Request finished: 1m43s (14.695ms)
  - Unregister query: 1m44s (912.626ms)

On 2025-01-09 (Thu) at 12:33, Zoltán Borók-Nagy <borokna...@cloudera.com> wrote:

Thanks for the update.

So the whole REFRESH operation took 1m50s. Of this, CatalogDdlRequest was only 46s313ms. This 46s313ms is the time when the table is blocked, right? Within CatalogDdlRequest the longest operation was loading column stats, which took 27s300ms. This is a single RPC (getTableColumnStatistics()) towards HMS; it would be good to know why it took so long, especially given that loading file metadata for this huge table took around 19 seconds.
It's also interesting that "Loaded all column stats" appears twice in the catalog timeline. At first it took 11.697ms, and the second invocation was the one that took 27s300ms.
Hopefully we can get rid of the second invocation, but that'll require a code change.

I also wonder why the table is not queryable from the Coordinator cache while it is being reloaded in CatalogD; I hope we can fix this as well.

Is the table expected to grow indefinitely? Or do you drop/relocate old partitions after some time?

Cheers,
Zoltan

On Wed, Jan 8, 2025 at 9:19 PM Saulius Valatka <saulius...@gmail.com> wrote:

Hi,

so I just tried applying IMPALA-13254 on top of 4.4.1, redeployed, and refresh times for the largest table went down from ~80 minutes to ~2 minutes!
That's waaay better, but still not ideal: if we issue a refresh every 15 minutes, the table still spends a fair amount of time blocked, a minute or two per refresh, but at least now it's queryable.

Here's an example REFRESH timeline:

Catalog Server Operation: 46s213ms
  - Got Metastore client: 5.771us (5.771us)
  - Got catalog version read lock: 2.334ms (2.328ms)
  - Got catalog version write lock and table write lock: 2.459ms (125.470us)
  - Got Metastore client: 2.465ms (6.181us)
  - Fetched table from Metastore: 12.831ms (10.366ms)
  - Loaded Iceberg API table: 139.024ms (126.192ms)
  - Loaded schema from Iceberg: 139.175ms (150.949us)
  - Loaded Iceberg files: 5s036ms (4s897ms)
  - Loaded all column stats: 5s047ms (11.697ms)
  - Loaded table schema: 5s053ms (5.279ms)
  - Start refreshing file metadata: 5s053ms (291.959us)
  - Loaded file metadata for 1 partitions: 18s912ms (13s859ms)
  - Loaded all column stats: 46s213ms (27s300ms)
  - Loaded table: 46s213ms (36.600us)
  - Finished resetMetadata request: 46s213ms (485.329us)
Query Compilation: 1m3s
  - Metadata of all 2 tables cached: 1m3s (1m3s)
  - Analysis finished: 1m3s (237.705us)
  - Authorization finished (ranger): 1m3s (808.267us)
  - Planning finished: 1m3s (581.713us)
Query Timeline: 1m51s
  - Query submitted: 40.227us (40.227us)
  - Planning finished: 1m3s (1m3s)
  - CatalogDdlRequest finished: 1m50s (46s313ms)
  - Applied catalog updates from DDL: 1m50s (24.206ms)
  - Request finished: 1m50s (202.949us)
  - Unregister query: 1m51s (763.412ms)

On 2025-01-08 (Wed) at 17:50, Zoltán Borók-Nagy <borokna...@cloudera.com> wrote:

Thanks for the info, Saulius.

If you try out IMPALA-13254, please let us know how much it helps in your case.
Hopefully it speeds up table loading times enough so it won't cause too much turbulence.
Some table loading statistics would also be helpful, to know where the time is being spent.

Do you use local catalog mode?
https://impala.apache.org/docs/build/html/topics/impala_metadata.html
I'm not sure how much it will help, but it could be worth trying out.
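For reference, a minimal sketch of how local catalog (on-demand metadata) mode is enabled via startup flags, based on the page above; please double-check the exact flag names for your version:

    # On every coordinator (impalad):
    --use_local_catalog=true

    # On the catalogd:
    --catalog_topic_mode=minimal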
Cheers,
Zoltan

On Wed, Jan 8, 2025 at 2:45 PM Saulius Valatka <saulius...@gmail.com> wrote:

Hi,

sorry, maybe I worded my question wrong: I understand that refreshing is needed (either automatic or manual); my main concerns are the latency of the refresh and the fact that the table is not queryable while it's being refreshed. For large tables that are updated frequently, this combination makes them essentially unqueryable.

On 2025-01-08 (Wed) at 15:17, Gabor Kaszab <gaborkas...@apache.org> wrote:

Hi,

I don't think the issue you describe is specific to Iceberg, in the sense that even for Hive tables, if you make changes using an engine that doesn't trigger HMS events, one has to issue refresh/invalidate metadata to see the changes reflected in Impala.
Could you share what catalog you use for your Iceberg tables? And what tool do you use for data ingestion into these tables?
If you use the HMS-backed HiveCatalog as a catalog and an engine that triggers HMS notifications, like Spark or Hive, then even for Iceberg tables you can avoid executing refresh manually.

Gabor

On Wed, Jan 8, 2025 at 1:48 PM Saulius Valatka <saulius...@gmail.com> wrote:

Hi,

If I understand correctly, once an Iceberg table is mutated outside of Impala one has to run a refresh or invalidate statement. We noticed that running refresh on huge tables can take minutes, and while that is happening querying them is blocked. We have large event tables that are being updated very frequently in real time; by default we run a refresh after each update, so effectively this means such tables are unqueryable, as they're constantly being refreshed.

Is there something I'm missing? What would the recommendation here be?
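For reference, the refresh/invalidate statements discussed throughout this thread take this form; iceberg_db.events is a made-up placeholder name:

    -- Reload file and partition metadata for a single table after an external write
    REFRESH iceberg_db.events;

    -- Mark the table's cached metadata as stale; it gets reloaded on the next access
    INVALIDATE METADATA iceberg_db.events;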