Thanks for the info, Saulius. If you try out IMPALA-13254, please let us know how much it helps in your case. Hopefully it speeds up table loading enough that it won't cause too much turbulence. Some table loading statistics would also be helpful, to see where the time is being spent.
Do you use local catalog mode?
https://impala.apache.org/docs/build/html/topics/impala_metadata.html
I'm not sure how much it will help, but it could be worth trying out.

Cheers,
Zoltan

On Wed, Jan 8, 2025 at 2:45 PM Saulius Valatka <saulius...@gmail.com> wrote:

> Hi,
>
> sorry, maybe I worded my question wrong: I understand that refreshing is
> needed (either automatic or manual), main concerns are the latency of the
> refresh and the fact that the table is not queryable while it's being
> refreshed - for large tables that are being updated frequently this
> combination makes them essentially un-queryable.
>
> 2025-01-08, Wed, 15:17 Gabor Kaszab <gaborkas...@apache.org> wrote:
>
> > Hi,
> >
> > I don't think that the issue you describe is specific to Iceberg in a sense
> > that even for Hive tables if you make changes using an engine that doesn't
> > trigger HMS events, one has to issue refresh/invalidate metadata to see the
> > changes reflected in Impala.
> > Could you share what catalog you use for your Iceberg tables? And what tool
> > do you use for data ingestion into these tables?
> > If you use the HMS backed HiveCatalog as a catalog and an engine that
> > triggers HMS notifications, like Spark or Hive then even for Iceberg tables
> > you can avoid executing refresh manually.
> >
> > Gabor
> >
> > On Wed, Jan 8, 2025 at 1:48 PM Saulius Valatka <saulius...@gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > If I understand correctly, once an Iceberg table is mutated outside of
> > > Impala one has to run a refresh or invalidate statement. We noticed that
> > > running refresh on huge tables can take minutes and while that is happening
> > > querying them is blocked. We have large event tables that are being updated
> > > very frequently in real-time, by default we run a refresh after each
> > > update, so effectively this means such tables are un-queryable, as they're
> > > constantly being refreshed.
> > >
> > > Is there something I'm missing? What would the recommendation here be?