Hi, thanks for the swift response.

We have not enabled automatic refresh, so we're running it manually for
now; we may consider changing this in the future.
Our Impala is 4.4.1. I might try to build it with IMPALA-13254, as it looks
like exactly what I'm looking for :)

Our largest event table receives ~300k msg/s; we batch into ~512 MB files
and commit, so this ends up being a commit every few minutes.
The total size of the table is currently ~400 TB with ~3 million files
(it contains data for many years), and we compact it frequently.
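As a rough sanity check on the figures above (these are the approximate totals from this thread, not exact measurements), the average data file size works out to roughly 130 MB, which helps explain why frequent compaction keeps metadata loading manageable:

```python
# Back-of-the-envelope check of the approximate table stats quoted above:
# ~400 TB spread across ~3 million data files.
TOTAL_BYTES = 400 * 10**12   # ~400 TB (decimal)
NUM_FILES = 3 * 10**6        # ~3 million data files

avg_mb = TOTAL_BYTES / NUM_FILES / 10**6
print(f"average file size ~= {avg_mb:.0f} MB")  # prints: average file size ~= 133 MB
```

Since refresh cost scales with the number of files to load, keeping the average file size large via compaction directly reduces metadata load time.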

On Wed, 2025-01-08 at 15:20, Zoltán Borók-Nagy <borokna...@cloudera.com> wrote:

> Hey Saulius,
>
> Thanks for reaching out.
>
> once an Iceberg table is mutated outside of Impala one has to run a refresh
> > or invalidate statement
>
> If the Iceberg table lives in the HiveCatalog, and Automatic
> Invalidation/Refresh of Metadata is enabled, then you don't need to;
> Impala will eventually pick up the new table state.
> See https://impala.apache.org/docs/build/html/topics/impala_metadata.html
>
> We noticed that running refresh on huge tables can take minutes and while
> > that is happening querying them is blocked
> >
> What version do you use? There were quite a few improvements in that area
> lately.
> Though a major improvement is coming in 4.5.0: IMPALA-13254
> <https://issues.apache.org/jira/browse/IMPALA-13254>
>
> We have large event tables that are being updated very frequently in
> > real-time
> >
> I'm a bit curious about what "very frequent" means here. Is it possible for
> you to share some numbers?
>
> What would the recommendation here be?
> >
> Until Impala 4.5 you can try reducing the frequency of table updates. Also,
> the number of files plays a huge role in table loading times. Maybe you can
> try compacting the table from time to time.
>
> Cheers,
>     Zoltan
>
>
> On Wed, Jan 8, 2025 at 1:48 PM Saulius Valatka <saulius...@gmail.com>
> wrote:
>
> > Hi,
> >
> > If I understand correctly, once an Iceberg table is mutated outside of
> > Impala one has to run a refresh or invalidate statement. We noticed that
> > running refresh on huge tables can take minutes and while that is
> happening
> > querying them is blocked. We have large event tables that are being
> updated
> > very frequently in real-time, by default we run a refresh after each
> > update, so effectively this means such tables are un-queryable, as
> they're
> > constantly being refreshed.
> >
> > Is there something I'm missing? What would the recommendation here be?
> >
>
