Thanks for the update.

So the whole REFRESH operation took 1m50s, of which CatalogDdlRequest was
only 46s313ms. This 46s313ms is the time during which the table is blocked,
right?
Within CatalogDdlRequest, the longest operation was loading column stats,
which took 27s300ms. This is a single RPC (getTableColumnStatistics())
toward HMS, so it would be good to know why it took so long, especially
given that loading file metadata for this huge table only took around 19
seconds.
It's also interesting that "Loaded all column stats" appears twice in the
catalog timeline. The first invocation took 11.697ms, and the second was
the one that took 27s300ms. Hopefully we can get rid of the second
invocation, but that will require a code change.
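If you want to measure that RPC in isolation, here is a minimal sketch
using the Hive metastore Java client. The HMS URI, database, table, and
column names are placeholders, and the exact method signature can vary
between Hive versions (newer ones add an engine argument), so treat it as
a starting point rather than a drop-in tool:

    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
    import org.apache.hadoop.hive.metastore.api.ColumnStatisticsObj;

    import java.util.Arrays;
    import java.util.List;

    public class ColStatsLatency {
      public static void main(String[] args) throws Exception {
        HiveConf conf = new HiveConf();
        // Placeholder URI; point this at the same HMS that CatalogD uses.
        conf.setVar(HiveConf.ConfVars.METASTOREURIS, "thrift://hms-host:9083");
        HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
        try {
          // Placeholder db/table/columns; use the problematic table and a
          // column list of realistic size to reproduce the 27s call.
          List<String> cols = Arrays.asList("col1", "col2");
          long start = System.nanoTime();
          List<ColumnStatisticsObj> stats =
              client.getTableColumnStatistics("mydb", "events", cols);
          long ms = (System.nanoTime() - start) / 1_000_000;
          System.out.println("Fetched " + stats.size()
              + " column stats in " + ms + " ms");
        } finally {
          client.close();
        }
      }
    }

If the standalone call is also slow, the bottleneck is likely on the HMS
side (e.g. its backing database); if it's fast, the time is probably being
spent in CatalogD itself.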

I also wonder why the table is not queryable from the Coordinator cache
while it is being reloaded in CatalogD. I hope we can fix this as well.

Is the table expected to grow indefinitely? Or do you drop/relocate old
partitions after some time?

Cheers,
    Zoltan


On Wed, Jan 8, 2025 at 9:19 PM Saulius Valatka <saulius...@gmail.com> wrote:

> Hi,
>
> so I just tried applying IMPALA-13254 on top of 4.4.1, redeployed, and
> refresh times for the largest table went down from ~80 minutes to ~2
> minutes!
> That's waaay better, but still not ideal: if we issue a refresh every 15
> minutes, the table is still blocked for a minute or two each time, but at
> least now it's queryable the rest of the time.
>
> Here's an example REFRESH timeline:
>
> Catalog Server Operation: 46s213ms
>    - Got Metastore client: 5.771us (5.771us)
>    - Got catalog version read lock: 2.334ms (2.328ms)
>    - Got catalog version write lock and table write lock: 2.459ms (125.470us)
>    - Got Metastore client: 2.465ms (6.181us)
>    - Fetched table from Metastore: 12.831ms (10.366ms)
>    - Loaded Iceberg API table: 139.024ms (126.192ms)
>    - Loaded schema from Iceberg: 139.175ms (150.949us)
>    - Loaded Iceberg files: 5s036ms (4s897ms)
>    - Loaded all column stats: 5s047ms (11.697ms)
>    - Loaded table schema: 5s053ms (5.279ms)
>    - Start refreshing file metadata: 5s053ms (291.959us)
>    - Loaded file metadata for 1 partitions: 18s912ms (13s859ms)
>    - Loaded all column stats: 46s213ms (27s300ms)
>    - Loaded table: 46s213ms (36.600us)
>    - Finished resetMetadata request: 46s213ms (485.329us)
> Query Compilation: 1m3s
>    - Metadata of all 2 tables cached: 1m3s (1m3s)
>    - Analysis finished: 1m3s (237.705us)
>    - Authorization finished (ranger): 1m3s (808.267us)
>    - Planning finished: 1m3s (581.713us)
> Query Timeline: 1m51s
>    - Query submitted: 40.227us (40.227us)
>    - Planning finished: 1m3s (1m3s)
>    - CatalogDdlRequest finished: 1m50s (46s313ms)
>    - Applied catalog updates from DDL: 1m50s (24.206ms)
>    - Request finished: 1m50s (202.949us)
>    - Unregister query: 1m51s (763.412ms)
>
>
> On 2025-01-08, Wed, 17:50 Zoltán Borók-Nagy <borokna...@cloudera.com> wrote:
>
> > Thanks for the info, Saulius.
> >
> > If you try out IMPALA-13254, please let us know how much it helps in
> > your case.
> > Hopefully it speeds up table loading times enough so it won't cause too
> > much turbulence.
> > Some table loading statistics would also be helpful to know where the
> > time is being spent.
> >
> > Do you use local catalog mode?
> > https://impala.apache.org/docs/build/html/topics/impala_metadata.html
> > I'm not sure how much it will help, but it could be worth trying out.
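> > For reference, a sketch of the startup flags involved, based on the
> > docs above (please verify the exact names against your version):
> >
> >   # on the coordinators (impalad)
> >   --use_local_catalog=true
> >   # on catalogd
> >   --catalog_topic_mode=minimal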
> >
> > Cheers,
> >     Zoltan
> >
> >
> > On Wed, Jan 8, 2025 at 2:45 PM Saulius Valatka <saulius...@gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > sorry, maybe I worded my question wrong: I understand that refreshing
> > > is needed (either automatic or manual); the main concerns are the
> > > latency of the refresh and the fact that the table is not queryable
> > > while it's being refreshed - for large tables that are updated
> > > frequently, this combination makes them essentially un-queryable.
> > >
> > > On 2025-01-08, Wed, 15:17 Gabor Kaszab <gaborkas...@apache.org> wrote:
> > >
> > > > Hi,
> > > >
> > > > I don't think that the issue you describe is specific to Iceberg, in
> > > > the sense that even for Hive tables, if you make changes using an
> > > > engine that doesn't trigger HMS events, one has to issue
> > > > refresh/invalidate metadata to see the changes reflected in Impala.
> > > > Could you share what catalog you use for your Iceberg tables? And
> > > > what tool do you use for data ingestion into these tables?
> > > > If you use the HMS-backed HiveCatalog as a catalog and an engine that
> > > > triggers HMS notifications, like Spark or Hive, then even for Iceberg
> > > > tables you can avoid executing refresh manually.
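> > > > HMS event processing in catalogd is controlled by a startup flag
> > > > along these lines (name from the Impala docs; the default and exact
> > > > behavior depend on your version):
> > > >
> > > >   --hms_event_polling_interval_s=2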
> > > >
> > > > Gabor
> > > >
> > > > On Wed, Jan 8, 2025 at 1:48 PM Saulius Valatka <saulius...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > If I understand correctly, once an Iceberg table is mutated outside
> > > > > of Impala, one has to run a refresh or invalidate statement. We
> > > > > noticed that running refresh on huge tables can take minutes, and
> > > > > while that is happening, querying them is blocked. We have large
> > > > > event tables that are updated very frequently in real time; by
> > > > > default we run a refresh after each update, so effectively this
> > > > > means such tables are un-queryable, as they're constantly being
> > > > > refreshed.
> > > > >
> > > > > Is there something I'm missing? What would the recommendation here
> > > > > be?
> > > > >
> > > >
> > >
> >
>
