[ 
https://issues.apache.org/jira/browse/IMPALA-12463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17770657#comment-17770657
 ] 

Zoltán Borók-Nagy commented on IMPALA-12463:
--------------------------------------------

I am not sure if it is a valid optimization as it can mess up inter-table 
consistency and referential integrity.

E.g.:
 # INSERT into dimension table with new key
 # INSERT into fact table that references the new key

If the above statements are observed in order, then concurrent SELECTs that 
LEFT JOIN the two tables should never see NULLs in the dimension columns.

But if the event processor reorders the above, so they are processed in this order:
 # INSERT into fact table that references the new key
 # INSERT into dimension table with new key

Then a concurrent SELECT between 1 and 2 will produce NULL values in the 
dimension columns.
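
The two orderings above can be illustrated with a small simulation. This is 
only a sketch of the argument, not Impala code: the in-memory tables, the key 
42, and the helper names are all made up for illustration.

```python
# Hypothetical in-memory model of a dimension table and a fact table.
# Nothing here is Impala code; names and values are illustrative only.

def left_join(fact, dim):
    """LEFT JOIN fact keys to dimension values; a missing key yields None (NULL)."""
    return [(key, dim.get(key)) for key in fact]


def observe(events):
    """Apply events one at a time and record what a SELECT would see after each."""
    dim, fact = {}, []
    snapshots = []
    for table, key, value in events:
        if table == "dim":
            dim[key] = value      # INSERT into dimension table with new key
        else:
            fact.append(key)      # INSERT into fact table referencing the key
        snapshots.append(left_join(fact, dim))
    return snapshots


def sees_null(snapshots):
    """True if any intermediate SELECT saw a NULL dimension column."""
    return any(value is None for snap in snapshots for _, value in snap)


in_order = [("dim", 42, "new-customer"), ("fact", 42, None)]
reordered = list(reversed(in_order))

in_order_views = observe(in_order)
reordered_views = observe(reordered)

print(sees_null(in_order_views))   # no NULLs when processed in commit order
print(sees_null(reordered_views))  # a NULL is visible between the two events
```

In the reordered run the first snapshot contains the fact row with no matching 
dimension row, which is exactly the window a concurrent SELECT could hit.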

We haven't really cared about inter-table consistency and referential integrity 
in the past, and we still don't: e.g. users can freely issue REFRESH tbl; at 
any time. Also, the event processor might just refresh the table to the current 
state, and not to the state of the event. But it will be very hard to support 
referential integrity and transactions in the future if the event processor 
starts to reorder events.

> Allow batching of non consecutive metastore events
> --------------------------------------------------
>
>                 Key: IMPALA-12463
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12463
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Catalog
>            Reporter: Csaba Ringhofer
>            Assignee: Joe McDonnell
>            Priority: Major
>         Attachments: concurrent_metadata_load.py
>
>
> Currently Impala tries to batch events like partition insert/creation only if:
> 1. the next event is for the same table as the previous one
> 2. the next event's id is the previous one's + 1
> 3. the next event has the same type as the previous one
> (2 can be stricter than 1 if some events were filtered between the two)
> See 
> https://github.com/apache/impala/blob/94f4f1d82461d8f71fbd0d2e9082aa29b5f53a89/fe/src/main/java/org/apache/impala/catalog/events/MetastoreEvents.java#L315
> Another limit is that only events in the same batch from HMS can be merged. 
> Currently 1000 events are polled at the same time: 
> https://github.com/apache/impala/blob/94f4f1d82461d8f71fbd0d2e9082aa29b5f53a89/fe/src/main/java/org/apache/impala/catalog/events/MetastoreEventsProcessor.java#L218
> Making this configurable could be also useful.
> Event batching could be improved by batching all events to the current one if 
> they modify the same table, unless they are "cut" by:
> a. an event on the same table but with a different type
> b. a rename table event where the original or the new name is the same as the 
> current event
> If such an event occurs, the events after that can be only merged to a newer 
> event.
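
A rough sketch of the two batching rules described in the quoted issue, to 
make the difference concrete. This is not the MetastoreEvents.java logic: the 
Event class, the "RENAME" type name, and the assumption that every non-rename 
event is batchable and that a batch sits at the position of its first event 
are all simplifications made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    """Hypothetical stand-in for a metastore event."""
    event_id: int
    table: str
    event_type: str
    renamed_to: Optional[str] = None  # only set for "RENAME" events


def batch_consecutive(events):
    """Current rule: same table, same type, and id == previous id + 1."""
    batches = []
    for ev in events:
        last = batches[-1][-1] if batches else None
        if (last is not None and last.table == ev.table
                and last.event_type == ev.event_type
                and ev.event_id == last.event_id + 1):
            batches[-1].append(ev)
        else:
            batches.append([ev])
    return batches


def batch_non_consecutive(events):
    """Proposed rule: merge same-table, same-type events unless "cut"."""
    batches = []
    open_batch = {}  # table name -> batch that may still accept events
    for ev in events:
        if ev.event_type == "RENAME":
            # Cut (b): a rename closes batches on both the old and new name.
            open_batch.pop(ev.table, None)
            open_batch.pop(ev.renamed_to, None)
            batches.append([ev])
            continue
        batch = open_batch.get(ev.table)
        if batch is not None and batch[0].event_type == ev.event_type:
            batch.append(ev)
        else:
            # Cut (a): a different type on the same table starts a new batch.
            batch = [ev]
            batches.append(batch)
            open_batch[ev.table] = batch
    return batches


def ids(batches):
    return [[ev.event_id for ev in batch] for batch in batches]


events = [
    Event(1, "t1", "INSERT"),
    Event(2, "t2", "INSERT"),
    Event(3, "t1", "INSERT"),
    Event(4, "t1", "ALTER"),
    Event(5, "t1", "INSERT"),
]

print(ids(batch_consecutive(events)))
print(ids(batch_non_consecutive(events)))
```

In this sketch the non-consecutive rule merges events 1 and 3 on t1 into one 
batch, so event 3 is effectively applied before event 2 on t2. That is the 
kind of cross-table reordering the comment above is concerned about.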



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
