[ https://issues.apache.org/jira/browse/IMPALA-12463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17771609#comment-17771609 ]
Joe McDonnell commented on IMPALA-12463:
----------------------------------------
I put together a prototype for this and posted it here:
[https://gerrit.cloudera.org/#/c/20533/]
There are certain tests that I still need to add, but I wanted to have the code
up and available so people can start to think about whether it is correct (and
whether it is what people had in mind for this JIRA).
Other parts of the event processor currently rely on the events being
monotonically increasing in Event ID, so my prototype maintains that property.
That limits how much batching can affect cross-table correctness.
Basically, for [~boroknagyz]'s scenario, if all the inserts into table 1 come
before the inserts to table 2, then the inserts to table 1 have event IDs that
are earlier than everything for table 2. The event processor would emit the
events for table 1 before the events for table 2 in that scenario. Batching can
move events later, but not earlier.
> Allow batching of non consecutive metastore events
> --------------------------------------------------
>
> Key: IMPALA-12463
> URL: https://issues.apache.org/jira/browse/IMPALA-12463
> Project: IMPALA
> Issue Type: Improvement
> Components: Catalog
> Reporter: Csaba Ringhofer
> Assignee: Joe McDonnell
> Priority: Major
> Attachments: concurrent_metadata_load.py
>
>
> Currently Impala tries to batch events like partition insert/creation only if:
> 1. the next event is for the same table as the previous one
> 2. the next event's id is the previous one's + 1
> 3. the next event has the same type as the previous one
> (2 can be stricter than 1 if some events were filtered between the two)
> See
> https://github.com/apache/impala/blob/94f4f1d82461d8f71fbd0d2e9082aa29b5f53a89/fe/src/main/java/org/apache/impala/catalog/events/MetastoreEvents.java#L315
> Another limit is that only events in the same batch from HMS can be merged.
> Currently 1000 events are polled at a time:
> https://github.com/apache/impala/blob/94f4f1d82461d8f71fbd0d2e9082aa29b5f53a89/fe/src/main/java/org/apache/impala/catalog/events/MetastoreEventsProcessor.java#L218
> Making this configurable could also be useful.
> Event batching could be improved by merging all events that modify the same
> table into the current one, unless they are "cut" by:
> a. an event on the same table but with a different type
> b. a rename table event where the original or the new name is the same as the
> table of the current event
> If such an event occurs, the events after it can only be merged into a newer
> event.
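
The proposed rule from the description can be sketched in Python as follows.
This is an illustration only, not Impala's Java implementation: the `Event`
class, its `new_table` field, the `"RENAME"` type string, and `batch_events`
are all hypothetical names, and the sketch treats every non-rename event type
as batchable, which is a simplification.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Event:
    event_id: int
    table: str
    event_type: str
    # Hypothetical field, set only for rename events: the table's new name.
    new_table: Optional[str] = None


def batch_events(events: List[Event]) -> List[List[Event]]:
    """Group same-table, same-type events even when other tables interleave."""
    batches: List[List[Event]] = []
    # Table name -> the batch that can still accept events for that table.
    open_batch: Dict[str, List[Event]] = {}

    for ev in events:
        if ev.event_type == "RENAME":
            # Cut (b): a rename closes the open batches for both names.
            open_batch.pop(ev.table, None)
            if ev.new_table is not None:
                open_batch.pop(ev.new_table, None)
            batches.append([ev])
            continue

        current = open_batch.get(ev.table)
        if current is not None and current[-1].event_type == ev.event_type:
            current.append(ev)
        else:
            # Cut (a): a different type on the same table (or no open batch)
            # means later events can only join this newer batch.
            current = [ev]
            batches.append(current)
            open_batch[ev.table] = current
    return batches


events = [
    Event(1, "t1", "INSERT"),
    Event(2, "t2", "INSERT"),
    Event(3, "t1", "INSERT"),         # merges with event 1 despite event 2
    Event(4, "t1", "ADD_PARTITION"),  # cut (a) for t1
    Event(5, "t1", "INSERT"),         # cannot rejoin the batch holding 1 and 3
    Event(6, "t2", "RENAME", new_table="t3"),  # cut (b) for t2 and t3
    Event(7, "t2", "INSERT"),         # cannot rejoin the batch holding 2
]
batch_ids = [[e.event_id for e in batch] for batch in batch_events(events)]
```

Under the current rule (same table, same type, consecutive event IDs), events 1
and 3 would not be merged because event 2 sits between them; in this sketch
they end up in one batch.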