Hi, thanks for the swift response. We have not enabled automatic refresh, so we're running it manually for now; we may consider changing this in the future. Our Impala version is 4.4.1. I might try to build it with IMPALA-13254, as it looks like exactly what I'm looking for :)
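For context, this is roughly what we run after each external commit (a minimal sketch; the table name is hypothetical):

```sql
-- Pick up the new Iceberg snapshot after a commit made outside Impala:
REFRESH events.page_views;

-- Heavier-weight alternative that discards the cached metadata entirely
-- and forces a full reload on the next access:
INVALIDATE METADATA events.page_views;
```

We stick to REFRESH since INVALIDATE METADATA makes the next query pay the full metadata-loading cost.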
Our largest event is ~300k msg/s; we batch ~512 MB files and commit, so this ends up being a commit every few minutes. The total size of the table is currently ~400 TB with ~3 million files (it contains data for many years), and we compact it frequently.

On 2025-01-08 (Wed) at 15:20, Zoltán Borók-Nagy <borokna...@cloudera.com> wrote:

> Hey Saulius,
>
> Thanks for reaching out.
>
> > once an Iceberg table is mutated outside of Impala one has to run a refresh
> > or invalidate statement
>
> If the Iceberg table lives in the HiveCatalog, and Automatic
> Invalidation/Refresh of Metadata is enabled, then you don't need to, i.e.
> Impala will eventually pick up the new table state.
> See https://impala.apache.org/docs/build/html/topics/impala_metadata.html
>
> > We noticed that running refresh on huge tables can take minutes and while
> > that is happening querying them is blocked
>
> What version do you use? There were quite a few improvements in that area
> lately.
> Though a major improvement is coming in 4.5.0: IMPALA-13254
> <https://issues.apache.org/jira/browse/IMPALA-13254>
>
> > We have large event tables that are being updated very frequently in
> > real-time
>
> I'm a bit curious about what "very frequent" means here. Is it possible for
> you to share some numbers?
>
> > What would the recommendation here be?
>
> Until Impala 4.5 you can try reducing the frequency of table updates. Also,
> the number of files plays a huge role in table loading times. Maybe you can
> try compacting the table from time to time.
>
> Cheers,
> Zoltan
>
> On Wed, Jan 8, 2025 at 1:48 PM Saulius Valatka <saulius...@gmail.com> wrote:
>
> > Hi,
> >
> > If I understand correctly, once an Iceberg table is mutated outside of
> > Impala one has to run a refresh or invalidate statement. We noticed that
> > running refresh on huge tables can take minutes, and while that is
> > happening querying them is blocked. We have large event tables that are
> > updated very frequently in real-time; by default we run a refresh after
> > each update, so effectively this means such tables are un-queryable, as
> > they're constantly being refreshed.
> >
> > Is there something I'm missing? What would the recommendation here be?