Hi,

I just tried applying IMPALA-13254 on top of 4.4.1 and redeployed: refresh
times for the largest table went down from ~80 minutes to ~2 minutes!
That's way better, but still not ideal: if we issue a refresh every 15
minutes, the table is still blocked for a minute or two each time, but at
least now it's queryable.

Here's an example REFRESH timeline:

Catalog Server Operation: 46s213ms
   - Got Metastore client: 5.771us (5.771us)
   - Got catalog version read lock: 2.334ms (2.328ms)
   - Got catalog version write lock and table write lock: 2.459ms (125.470us)
   - Got Metastore client: 2.465ms (6.181us)
   - Fetched table from Metastore: 12.831ms (10.366ms)
   - Loaded Iceberg API table: 139.024ms (126.192ms)
   - Loaded schema from Iceberg: 139.175ms (150.949us)
   - Loaded Iceberg files: 5s036ms (4s897ms)
   - Loaded all column stats: 5s047ms (11.697ms)
   - Loaded table schema: 5s053ms (5.279ms)
   - Start refreshing file metadata: 5s053ms (291.959us)
   - Loaded file metadata for 1 partitions: 18s912ms (13s859ms)
   - Loaded all column stats: 46s213ms (27s300ms)
   - Loaded table: 46s213ms (36.600us)
   - Finished resetMetadata request: 46s213ms (485.329us)
Query Compilation: 1m3s
   - Metadata of all 2 tables cached: 1m3s (1m3s)
   - Analysis finished: 1m3s (237.705us)
   - Authorization finished (ranger): 1m3s (808.267us)
   - Planning finished: 1m3s (581.713us)
Query Timeline: 1m51s
   - Query submitted: 40.227us (40.227us)
   - Planning finished: 1m3s (1m3s)
   - CatalogDdlRequest finished: 1m50s (46s313ms)
   - Applied catalog updates from DDL: 1m50s (24.206ms)
   - Request finished: 1m50s (202.949us)
   - Unregister query: 1m51s (763.412ms)
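
To see where the time goes, here is a minimal Python sketch that ranks the larger Catalog Server Operation steps by the per-step value in parentheses. The durations are copied from the timeline above; the helper name to_seconds, the "1st pass"/"2nd pass" labels (the timeline reports "Loaded all column stats" twice), and the assumption that the table is blocked for the whole 1m51s query are mine, not something Impala reports.

```python
import re

# Per-step durations (the value in parentheses) copied from the
# Catalog Server Operation section of the timeline above.
# The "1st pass"/"2nd pass" suffixes are labels added here for clarity.
STEPS = [
    ("Fetched table from Metastore", "10.366ms"),
    ("Loaded Iceberg API table", "126.192ms"),
    ("Loaded Iceberg files", "4s897ms"),
    ("Loaded all column stats (1st pass)", "11.697ms"),
    ("Loaded file metadata for 1 partitions", "13s859ms"),
    ("Loaded all column stats (2nd pass)", "27s300ms"),
]

UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 1e-3, "us": 1e-6, "ns": 1e-9}


def to_seconds(text):
    """Convert a duration such as '4s897ms', '1m3s' or '5.771us' to seconds."""
    # Two-letter units come first in the alternation so that 'ms' is not
    # read as 'm' followed by a stray 's'.
    parts = re.findall(r"(\d+(?:\.\d+)?)(ms|us|ns|h|m|s)", text)
    return sum(float(value) * UNIT_SECONDS[unit] for value, unit in parts)


TOTAL_S = to_seconds("46s213ms")  # "Catalog Server Operation" total

# Rank the steps, slowest first.
ranked = sorted(
    ((label, to_seconds(duration)) for label, duration in STEPS),
    key=lambda item: item[1],
    reverse=True,
)

for label, seconds in ranked:
    print(f"{label:40s} {seconds:8.3f}s  {100 * seconds / TOTAL_S:5.1f}%")

# Rough upper bound on how much of the time the table is blocked,
# ASSUMING it is blocked for the whole 1m51s query and that a refresh
# is issued every 15 minutes.
blocked_fraction = to_seconds("1m51s") / (15 * 60)
print(f"blocked at most {100 * blocked_fraction:.1f}% of the time")
```

The two slowest steps here are the second "Loaded all column stats" and "Loaded file metadata for 1 partitions", which together make up most of the 46s catalog operation.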


On Wed, Jan 8, 2025 at 5:50 PM Zoltán Borók-Nagy <borokna...@cloudera.com> wrote:

> Thanks for the info, Saulius.
>
> If you try out IMPALA-13254, please let us know how much it helps in
> your case.
> Hopefully it speeds up table loading times enough so it won't cause too
> much turbulence.
> Some table loading statistics would also be helpful to know where the time
> is being spent.
>
> Do you use local catalog mode?
> https://impala.apache.org/docs/build/html/topics/impala_metadata.html
> I'm not sure how much it will help, but it could be worth trying out.
>
> Cheers,
>     Zoltan
>
>
> On Wed, Jan 8, 2025 at 2:45 PM Saulius Valatka <saulius...@gmail.com>
> wrote:
>
> > Hi,
> >
> > sorry, maybe I worded my question wrong: I understand that refreshing is
> > needed (either automatic or manual), main concerns are the latency of the
> > refresh and the fact that the table is not queryable while it's being
> > refreshed - for large tables that are being updated frequently this
> > combination makes them essentially un-queryable.
> >
> > On Wed, Jan 8, 2025 at 3:17 PM Gabor Kaszab <gaborkas...@apache.org> wrote:
> >
> > > Hi,
> > >
> > > I don't think that the issue you describe is specific to Iceberg, in the
> > > sense that even for Hive tables, if you make changes using an engine
> > > that doesn't trigger HMS events, one has to issue refresh/invalidate
> > > metadata to see the changes reflected in Impala.
> > > Could you share what catalog you use for your Iceberg tables? And what
> > > tool do you use for data ingestion into these tables?
> > > If you use the HMS-backed HiveCatalog as a catalog and an engine that
> > > triggers HMS notifications, like Spark or Hive, then even for Iceberg
> > > tables you can avoid executing refresh manually.
> > >
> > > Gabor
> > >
> > > On Wed, Jan 8, 2025 at 1:48 PM Saulius Valatka <saulius...@gmail.com>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > If I understand correctly, once an Iceberg table is mutated outside of
> > > > Impala one has to run a refresh or invalidate statement. We noticed
> > > > that running refresh on huge tables can take minutes and while that is
> > > > happening querying them is blocked. We have large event tables that
> > > > are being updated very frequently in real-time, by default we run a
> > > > refresh after each update, so effectively this means such tables are
> > > > un-queryable, as they're constantly being refreshed.
> > > >
> > > > Is there something I'm missing? What would the recommendation here be?
> > > >
> > >
> >
>
