Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20022 )

Change subject: IMPALA-11535: Skip older events in the event processor based on 
the latestRefreshEventID
......................................................................


Patch Set 6:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/20022/6/fe/src/main/java/org/apache/impala/catalog/events/MetastoreEvents.java
File fe/src/main/java/org/apache/impala/catalog/events/MetastoreEvents.java:

http://gerrit.cloudera.org:8080/#/c/20022/6/fe/src/main/java/org/apache/impala/catalog/events/MetastoreEvents.java@1491
PS6, Line 1491:       if (isOlderEvent(null)) {
              :         infoLog("Not processing the alter table event {} as it 
is an older event",
              :             getEventId());
              :         return;
              :       }
              :
              :       if (isRename_) {
              :         processRename();
              :         return;
              :       }
I am trying to understand the RENAME case. I think that it would be clearer to 
skip isOlderEvent() logic for RENAME, as renameTableFromEvent() has its own 
handling of edge cases like renaming a table and then creating a table with the 
same name.


http://gerrit.cloudera.org:8080/#/c/20022/6/fe/src/main/java/org/apache/impala/catalog/events/MetastoreEvents.java@2341
PS6, Line 2341: isOlderEvent
I am trying to understand what will happen in the batched case.

Wouldn't it make sense to call isOlderEvent for batchedEvents_ individually? If 
I understand correctly, currently we consider the eventId of the last event in 
batchedEvents_, so if the refresh happened somewhere in the middle, then we 
won't skip the early events. For example:

from Hive: insert to partition of tbl
from Impala: refresh tbl
from Hive: insert to partition of tbl

The events from the two events could be batched, right?

Also, this path looks untested as we always do only a single insert in the 
tests. This could be covered by an insert with dynamic partitioning in Hive.


http://gerrit.cloudera.org:8080/#/c/20022/6/tests/custom_cluster/test_events_custom_configs.py
File tests/custom_cluster/test_events_custom_configs.py:

http://gerrit.cloudera.org:8080/#/c/20022/6/tests/custom_cluster/test_events_custom_configs.py@315
PS6, Line 315: catalogd_args
Is there a reason why enable_sync_to_latest_event_on_ddls is true in the Java 
test but the default false here?

Is the long term goal to enable that flag by default?



--
To view, visit http://gerrit.cloudera.org:8080/20022
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic0dc5c7396d80616680d8a5805ce80db293b72e1
Gerrit-Change-Number: 20022
Gerrit-PatchSet: 6
Gerrit-Owner: Sai Hemanth Gantasala <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Sai Hemanth Gantasala <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Mon, 14 Aug 2023 16:01:31 +0000
Gerrit-HasComments: Yes

Reply via email to