Hi, thanks for the swift response. We have not enabled automatic refresh, so we're running it manually for now; we may consider changing this in the future. Our Impala version is 4.4.1. I might try to build it with IMPALA-13254, as it looks like exactly what I'm looking for :)
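For context, this is roughly what we run after each external commit (a minimal sketch; the table name is hypothetical):

```sql
-- Pick up the new Iceberg snapshot after a commit made outside Impala:
REFRESH events.page_views;

-- Heavier-weight alternative that discards the cached metadata entirely
-- and forces a full reload on the next access:
INVALIDATE METADATA events.page_views;
```

We stick to REFRESH since INVALIDATE METADATA makes the next query pay the full metadata-loading cost.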
Our largest event is ~300k msg/s; we batch ~512 MB files and commit, so this ends up being a commit every few minutes. The total size of the table is currently ~400 TB with ~3 million files (it contains data for many years), and we compact it frequently.

On 2025-01-08 (Wed) at 15:20, Zoltán Borók-Nagy <borokna...@cloudera.com> wrote:

> Hey Saulius,
>
> Thanks for reaching out.
>
> > once an Iceberg table is mutated outside of Impala one has to run a refresh
> > or invalidate statement
>
> If the Iceberg table lives in the HiveCatalog, and Automatic
> Invalidation/Refresh of Metadata is enabled, then you don't need to, i.e.
> Impala will eventually pick up the new table state.
> See https://impala.apache.org/docs/build/html/topics/impala_metadata.html
>
> > We noticed that running refresh on huge tables can take minutes and while
> > that is happening querying them is blocked
>
> What version do you use? There were quite a few improvements in that area
> lately.
> Though a major improvement is coming in 4.5.0: IMPALA-13254
> <https://issues.apache.org/jira/browse/IMPALA-13254>
>
> > We have large event tables that are being updated very frequently in
> > real-time
>
> I'm a bit curious about what "very frequent" means here. Is it possible for
> you to share some numbers?
>
> > What would the recommendation here be?
>
> Until Impala 4.5 you can try reducing the frequency of table updates. Also,
> the number of files plays a huge role in table loading times. Maybe you can
> try compacting the table from time to time.
>
> Cheers,
> Zoltan
>
> On Wed, Jan 8, 2025 at 1:48 PM Saulius Valatka <saulius...@gmail.com> wrote:
>
> > Hi,
> >
> > If I understand correctly, once an Iceberg table is mutated outside of
> > Impala one has to run a refresh or invalidate statement. We noticed that
> > running refresh on huge tables can take minutes, and while that is
> > happening querying them is blocked. We have large event tables that are
> > updated very frequently in real-time; by default we run a refresh after
> > each update, so effectively this means such tables are un-queryable, as
> > they're constantly being refreshed.
> >
> > Is there something I'm missing? What would the recommendation here be?