Hi, I just tried applying IMPALA-13254 on top of 4.4.1 and redeployed: refresh times for the largest table went down from ~80 minutes to ~2 minutes! That's way better, but still not ideal: if we issue a refresh every 15 minutes, the table is still blocked for a minute or two each time, but at least now it's queryable.
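As a rough back-of-the-envelope check of those numbers, the sketch below estimates what share of the time the table would be blocked. It assumes (this is not confirmed anywhere in the thread) that the table is blocked for the full duration of every refresh and that refreshes never overlap; `blocked_fraction` is a hypothetical helper, not part of Impala.

```python
def blocked_fraction(refresh_seconds: float, interval_seconds: float) -> float:
    """Estimate the share of wall-clock time a table spends blocked.

    Assumption: the table is blocked for the whole refresh, and a refresh
    that outlasts the interval keeps the table blocked continuously.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return min(refresh_seconds / interval_seconds, 1.0)


interval = 15 * 60  # one refresh every 15 minutes

before = blocked_fraction(80 * 60, interval)  # ~80 minute refresh
after = blocked_fraction(2 * 60, interval)    # ~2 minute refresh

print(f"before patch: {before:.0%}, after patch: {after:.0%}")
```

Under these assumptions an ~80 minute refresh on a 15 minute cadence leaves no queryable window at all, while a ~2 minute refresh blocks the table for roughly an eighth of the time.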
Here's an example REFRESH timeline:

Catalog Server Operation: 46s213ms
   - Got Metastore client: 5.771us (5.771us)
   - Got catalog version read lock: 2.334ms (2.328ms)
   - Got catalog version write lock and table write lock: 2.459ms (125.470us)
   - Got Metastore client: 2.465ms (6.181us)
   - Fetched table from Metastore: 12.831ms (10.366ms)
   - Loaded Iceberg API table: 139.024ms (126.192ms)
   - Loaded schema from Iceberg: 139.175ms (150.949us)
   - Loaded Iceberg files: 5s036ms (4s897ms)
   - Loaded all column stats: 5s047ms (11.697ms)
   - Loaded table schema: 5s053ms (5.279ms)
   - Start refreshing file metadata: 5s053ms (291.959us)
   - Loaded file metadata for 1 partitions: 18s912ms (13s859ms)
   - Loaded all column stats: 46s213ms (27s300ms)
   - Loaded table: 46s213ms (36.600us)
   - Finished resetMetadata request: 46s213ms (485.329us)

Query Compilation: 1m3s
   - Metadata of all 2 tables cached: 1m3s (1m3s)
   - Analysis finished: 1m3s (237.705us)
   - Authorization finished (ranger): 1m3s (808.267us)
   - Planning finished: 1m3s (581.713us)

Query Timeline: 1m51s
   - Query submitted: 40.227us (40.227us)
   - Planning finished: 1m3s (1m3s)
   - CatalogDdlRequest finished: 1m50s (46s313ms)
   - Applied catalog updates from DDL: 1m50s (24.206ms)
   - Request finished: 1m50s (202.949us)
   - Unregister query: 1m51s (763.412ms)

On Wed, 2025-01-08 at 17:50, Zoltán Borók-Nagy <borokna...@cloudera.com> wrote:

> Thanks for the info, Saulius.
>
> If you try out IMPALA-13254, please let us know how much it helps in
> your case.
> Hopefully it speeds up table loading times enough so it won't cause too
> much turbulence.
> Some table loading statistics would also be helpful to know where the
> time is being spent.
>
> Do you use local catalog mode?
> https://impala.apache.org/docs/build/html/topics/impala_metadata.html
> I'm not sure how much it will help, but it could be worth trying out.
>
> Cheers,
> Zoltan
>
>
> On Wed, Jan 8, 2025 at 2:45 PM Saulius Valatka <saulius...@gmail.com>
> wrote:
>
> > Hi,
> >
> > sorry, maybe I worded my question wrong: I understand that refreshing
> > is needed (either automatic or manual), main concerns are the latency
> > of the refresh and the fact that the table is not queryable while it's
> > being refreshed - for large tables that are being updated frequently
> > this combination makes them essentially un-queryable.
> >
> > On Wed, 2025-01-08 at 15:17, Gabor Kaszab <gaborkas...@apache.org>
> > wrote:
> >
> > > Hi,
> > >
> > > I don't think that the issue you describe is specific to Iceberg in
> > > a sense that even for Hive tables if you make changes using an
> > > engine that doesn't trigger HMS events, one has to issue
> > > refresh/invalidate metadata to see the changes reflected in Impala.
> > > Could you share what catalog you use for your Iceberg tables? And
> > > what tool do you use for data ingestion into these tables?
> > > If you use the HMS backed HiveCatalog as a catalog and an engine
> > > that triggers HMS notifications, like Spark or Hive then even for
> > > Iceberg tables you can avoid executing refresh manually.
> > >
> > > Gabor
> > >
> > > On Wed, Jan 8, 2025 at 1:48 PM Saulius Valatka
> > > <saulius...@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > If I understand correctly, once an Iceberg table is mutated
> > > > outside of Impala one has to run a refresh or invalidate
> > > > statement. We noticed that running refresh on huge tables can
> > > > take minutes and while that is happening querying them is
> > > > blocked. We have large event tables that are being updated very
> > > > frequently in real-time, by default we run a refresh after each
> > > > update, so effectively this means such tables are un-queryable,
> > > > as they're constantly being refreshed.
> > > >
> > > > Is there something I'm missing? What would the recommendation
> > > > here be?