Sourabh Goyal created IMPALA-10925:
--------------------------------------
Summary: Improved self event detection for event processor in
catalogd
Key: IMPALA-10925
URL: https://issues.apache.org/jira/browse/IMPALA-10925
Project: IMPALA
Issue Type: Epic
Components: Catalog
Reporter: Sourabh Goyal
Assignee: Sourabh Goyal
h3. Problem Statement
Impala catalogd has an events processor which polls metastore events at regular
intervals to automatically apply changes to the metadata in the catalogd.
However, the current design for detecting self-generated events (DDL/DMLs
coming from the same catalogd) has consistency problems which can cause query
failures under certain circumstances.
h3. Current Design
The current design of self-event detection is based on adding markers to the
HMS objects which are detected when the event is received later to determine if
the event is self-generated or not. These markers consist of a serviceID which
is unique to the catalogd instance and a catalog version number which is unique
for each catalog object. When a DDL is executed, catalogd adds these as object
parameters. When the event is received, the events processor checks the
serviceID and the catalog version against the current object with the same name
in the catalogd cache and decides whether to ignore the event or not.
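
The check described above can be sketched roughly as follows. This is an illustrative sketch only, not the catalogd code: the class name, the parameter keys and the use of an exact version match are all assumptions.

```java
import java.util.Map;

// Hypothetical sketch of the marker-based check; names are illustrative,
// not Impala's actual API.
final class MarkerSelfEventCheck {
  // Assumed parameter keys; the issue does not name the real ones.
  static final String SERVICE_ID_KEY = "serviceId";
  static final String CATALOG_VERSION_KEY = "catalogVersion";

  /**
   * Returns true when the event's object parameters carry this catalogd's
   * service id and the catalog version of the cached object with the same
   * name. A null cachedCatalogVersion means no such object is cached.
   */
  static boolean isSelfEvent(Map<String, String> eventParams,
      String localServiceId, Long cachedCatalogVersion) {
    if (cachedCatalogVersion == null) return false; // nothing to compare against
    if (!localServiceId.equals(eventParams.get(SERVICE_ID_KEY))) return false;
    String version = eventParams.get(CATALOG_VERSION_KEY);
    // Assumption: a self-event is recognised by an exact version match.
    return version != null && version.equals(Long.toString(cachedCatalogVersion));
  }
}
```

Note that the decision depends on the object currently in the cache, not on the object the event was generated for.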
h3. Problems with the current design
The approach is problematic when conflicting DDLs are issued in quick
succession. For example, a sequence of create/drop table DDLs will generate
CREATE_TABLE and DROP_TABLE events. When the events are received, it is
possible that the CREATE_TABLE event is processed because, after the drop, the
catalogd doesn’t have the table in its cache.
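
A toy model of that sequence, under the assumption that processing a CREATE_TABLE event for an uncached table adds the table to the cache. The class and method names are hypothetical and the cache is reduced to a set of table names.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the create/drop race; not the catalogd implementation.
final class CreateDropRace {
  private final Set<String> cache = new HashSet<>();

  // Local DDLs update the cache immediately.
  void createTable(String name) { cache.add(name); }
  void dropTable(String name) { cache.remove(name); }

  /**
   * Naive handling of a CREATE_TABLE event: with no cached table to compare
   * markers against, the event is not recognised as self-generated and is
   * applied. Returns true if the event was applied.
   */
  boolean onCreateTableEvent(String name) {
    if (cache.contains(name)) return false; // table known, event skipped
    cache.add(name);                        // dropped table reappears
    return true;
  }

  boolean contains(String name) { return cache.contains(name); }
}
```

Calling `createTable("t")`, then `dropTable("t")`, then `onCreateTableEvent("t")` leaves `t` in the cache in this model until the DROP_TABLE event is also processed.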
h3. Proposed Solution
The main idea of the solution is to store in the Table object the id of the
last event which the catalogd has synced to for that table, called eventId. The
events processor ignores any event whose EVENT_ID is less than or equal to the
eventId stored in the table. Once the events processor successfully processes a
given event, it updates the value of eventId in the table before releasing the
table lock. Also, any DDL or refresh operation on the catalogd will follow the
steps given below to update the event id for the table. The solution relies on
the existing locking mechanism in the catalogd to prevent any other concurrent
updates to the table (even via EventsProcessor).
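
A minimal sketch of the event-id comparison, assuming a per-table lock and a single long field. `TrackedTable`, `processEvent` and the `Runnable` callback are hypothetical names, and the DDL and refresh steps are not modelled here.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of per-table event-id tracking; not the catalogd code.
final class TrackedTable {
  private final ReentrantLock lock = new ReentrantLock();
  // Id of the last event this table has been synced to.
  private long lastSyncedEventId;

  TrackedTable(long initialEventId) { this.lastSyncedEventId = initialEventId; }

  long getLastSyncedEventId() {
    lock.lock();
    try { return lastSyncedEventId; } finally { lock.unlock(); }
  }

  /**
   * Applies the event only if its id is newer than the table's eventId.
   * The eventId is updated after applyChange succeeds and before the lock
   * is released. Returns false when the event is ignored.
   */
  boolean processEvent(long eventId, Runnable applyChange) {
    lock.lock();
    try {
      if (eventId <= lastSyncedEventId) return false; // already synced past it
      applyChange.run();           // if this throws, eventId is left unchanged
      lastSyncedEventId = eventId;
      return true;
    } finally {
      lock.unlock();
    }
  }
}
```

For example, a table synced to event 10 ignores events 9 and 10, applies event 11, and then reports 11 as its eventId.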
In the case of database objects, we will also have a similar eventId which
represents the events on the database object (CREATE, DROP, ALTER database) and
to which the catalogd has synced. Since there is no refresh database command,
catalogOpExecutor will only update the database eventId when there are DDLs at
the database level (e.g. CREATE, DROP, ALTER database).
cc - [~vihangk1] [~kishendas]