[ 
https://issues.apache.org/jira/browse/IMPALA-12463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17770739#comment-17770739
 ] 

Zoltán Borók-Nagy commented on IMPALA-12463:
--------------------------------------------

Thanks for clarifying this. From the title it wasn't clear to me that we are 
only talking about partition-level events here.

Looking at the code, I see that we only batch partition-level insert events 
and alter partition events.

I agree that cross-table consistency is a lost cause for external tables, so we 
can relax the canBeBatched() condition quite freely.
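
For reference, the condition being relaxed can be sketched roughly as follows, 
based only on points 1-3 of the issue description quoted below. This is 
illustrative Python, not the Java code in MetastoreEvents.java; the Event 
class, its field names, the function name, and the event type strings are all 
hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # Hypothetical stand-in for a metastore event, not Impala's class.
    event_id: int
    table: str       # fully qualified table name, e.g. "db.tbl"
    event_type: str  # illustrative type name, e.g. "INSERT_PARTITION"


def can_be_batched_strict(prev: Event, nxt: Event) -> bool:
    """Conditions 1-3 from the issue description, as a sketch."""
    return (
        nxt.table == prev.table                    # 1. same table
        and nxt.event_id == prev.event_id + 1      # 2. consecutive event ids
        and nxt.event_type == prev.event_type      # 3. same event type
    )
```

Under this sketch, two inserts into the same table with ids 1 and 3 are not 
batched, because the event with id 2 sits between them even if it belongs to a 
different table.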

Hive ACID: yeah, if the "cut" mechanism is aware of transactions we should be 
good. If that's too hard, then not doing any optimization for Hive ACID tables 
at the beginning should also be fine, I think, as these tables are not too 
common.

Iceberg: INSERTs (and any other modifications to Iceberg tables) actually 
generate ALTER TABLE events (that modify the table property 
"metadata_location"). These are not batched, so we should be fine.

Kudu tables: there are no events fired for INSERTs (but we also do not cache 
anything, so we should be fine).

> Allow batching of non consecutive metastore events
> --------------------------------------------------
>
>                 Key: IMPALA-12463
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12463
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Catalog
>            Reporter: Csaba Ringhofer
>            Assignee: Joe McDonnell
>            Priority: Major
>         Attachments: concurrent_metadata_load.py
>
>
> Currently Impala tries to batch events like partition insert/creation only if:
> 1. the next event is for the same table as the previous one
> 2. the next event's id is the previous one's + 1
> 3. the next event has the same type as the previous one
> (2 can be stricter than 1 if some events were filtered between the two)
> See 
> https://github.com/apache/impala/blob/94f4f1d82461d8f71fbd0d2e9082aa29b5f53a89/fe/src/main/java/org/apache/impala/catalog/events/MetastoreEvents.java#L315
> Another limit is that only events in the same batch from HMS can be merged. 
> Currently 1000 events are polled at the same time: 
> https://github.com/apache/impala/blob/94f4f1d82461d8f71fbd0d2e9082aa29b5f53a89/fe/src/main/java/org/apache/impala/catalog/events/MetastoreEventsProcessor.java#L218
> Making this configurable could also be useful.
> Event batching could be improved by batching all events to the current one if 
> they modify the same table, unless they are "cut" by:
> a. an event on the same table but with a different type
> b. a rename table event where the original or the new name is the same as the 
> current event
> If such an event occurs, the events after it can only be merged into a newer 
> event.
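
The relaxed batching proposed in the description above can be sketched roughly 
as follows. This is illustrative Python, not Impala code: the Event class, the 
batch_events function, and the event type names are hypothetical, and placing 
each batch at the position of its first event is an assumption made for 
simplicity.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Event:
    # Hypothetical stand-in for a metastore event, not Impala's class.
    event_id: int
    table: str
    event_type: str
    new_table: Optional[str] = None  # only set for rename events


# Assumed names for the partition-level event types that may be batched.
BATCHABLE = {"INSERT_PARTITION", "ALTER_PARTITION"}


def batch_events(events: List[Event]) -> List[List[Event]]:
    batches: List[List[Event]] = []
    # Per table, the batch that can still accept later events.
    open_batch: Dict[str, List[Event]] = {}
    for ev in events:
        if ev.event_type == "RENAME_TABLE":
            # Cut (b): a rename closes the batches of both the old and new name.
            open_batch.pop(ev.table, None)
            if ev.new_table is not None:
                open_batch.pop(ev.new_table, None)
            batches.append([ev])
            continue
        current = open_batch.get(ev.table)
        if current is not None and current[0].event_type == ev.event_type:
            # Same table, same type, no cut in between: merge even though
            # the event ids are not consecutive.
            current.append(ev)
            continue
        # No open batch, or cut (a): a different type on the same table.
        new_batch = [ev]
        batches.append(new_batch)
        if ev.event_type in BATCHABLE:
            open_batch[ev.table] = new_batch
        else:
            open_batch.pop(ev.table, None)
    return batches
```

Events on other tables no longer stop a batch, so the processing order across 
tables can differ from the event id order. That is the cross-table consistency 
trade-off discussed in this issue.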



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
