[ 
https://issues.apache.org/jira/browse/SENTRY-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16138858#comment-16138858
 ] 

Sergio Peña commented on SENTRY-1895:
-------------------------------------

Based on an offline chat with [~akolb], the proposal for handling duplicated 
events is option #2. This will allow us to keep track of all events (including 
duplicates) and prevent two Sentry servers from processing the same event.

The solution will require changes to the {{SENTRY_HMS_NOTIFICATION_ID}} table. 
Currently, the table has a non-indexed and non-unique column named 
{{NOTIFICATION_ID}}. The change should add a non-unique index on 
{{NOTIFICATION_ID}}, so that getting the max(id) is faster, and a new uniquely 
indexed column called {{NOTIFICATION_HASH}} that will contain the hash of the 
HMS notification. Making the hash unique will prevent another Sentry server 
from processing the same notification a second time.
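
A minimal sketch of the schema change described above, using Python's sqlite3 module as a stand-in for the real backend database. The column types, the index names and the helpers {{try_record}} and {{last_processed_id}} are assumptions made for illustration and are not taken from the Sentry code:

```python
import sqlite3

# In-memory SQLite stands in for the Sentry backend database (an assumption
# for illustration; the column types are guesses, not the real Sentry schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SENTRY_HMS_NOTIFICATION_ID ("
    " NOTIFICATION_ID INTEGER NOT NULL,"
    " NOTIFICATION_HASH TEXT NOT NULL)"
)
# Non-unique index: duplicate IDs stay legal, but max(id) can use the index.
conn.execute(
    "CREATE INDEX NOTIFICATION_ID_IDX"
    " ON SENTRY_HMS_NOTIFICATION_ID (NOTIFICATION_ID)"
)
# Unique index: a given notification hash can be recorded only once.
conn.execute(
    "CREATE UNIQUE INDEX NOTIFICATION_HASH_IDX"
    " ON SENTRY_HMS_NOTIFICATION_ID (NOTIFICATION_HASH)"
)


def try_record(notification_id, notification_hash):
    """Return True if this caller recorded the notification, False otherwise."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO SENTRY_HMS_NOTIFICATION_ID VALUES (?, ?)",
                (notification_id, notification_hash),
            )
        return True
    except sqlite3.IntegrityError:
        # Another writer already stored this hash, so skip the notification.
        return False


def last_processed_id():
    """Return the highest notification ID recorded so far (None if empty)."""
    row = conn.execute(
        "SELECT MAX(NOTIFICATION_ID) FROM SENTRY_HMS_NOTIFICATION_ID"
    ).fetchone()
    return row[0]
```

In this sketch, two different events that share an ID are both recorded because their hashes differ, while a second attempt to record the same hash fails and the caller can skip the notification.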

The HASH should be calculated from the full notification content, such as 
hash(EVENT_TYPE + EVENT_TIME + DB_NAME + TBL_NAME + MESSAGE).
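
One possible way to compute such a hash, as a sketch only. The choice of SHA-1, the UTF-8 encoding, the length prefix on each field and the function name {{notification_hash}} are all assumptions; the proposal above only says that the hash covers the full notification content:

```python
import hashlib


def notification_hash(event_type, event_time, db_name, tbl_name, message):
    """Hash the full notification content into a hex digest (hypothetical helper)."""
    fields = (
        event_type,
        str(event_time),
        db_name or "",
        tbl_name or "",
        message or "",
    )
    digest = hashlib.sha1()  # assumed algorithm; the proposal does not name one
    for field in fields:
        data = field.encode("utf-8")
        # Length-prefix each field so ("ab", "c") and ("a", "bc") hash differently.
        digest.update(len(data).to_bytes(4, "big"))
        digest.update(data)
    return digest.hexdigest()
```

The length prefix is a small deviation from plain concatenation: it keeps two notifications from colliding just because their field boundaries fall in different places.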

Option #2 will be a temporary solution until Hive fixes the issue with 
duplicated events (see HIVE-16886). Once Hive fixes it and Sentry bumps the 
Hive version to one containing the fix, the hash() calculation will be removed. 
To keep backwards compatibility on the DB schema, we will keep the columns and 
indexes, but Sentry will write the same notification ID to both 
{{NOTIFICATION_ID}} and {{NOTIFICATION_HASH}}. Because the ID will be unique, 
the unique constraint on {{NOTIFICATION_HASH}} will still be satisfied.
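
The switch between the two modes could look roughly like this. The helper name {{hash_column_value}} and its parameters are hypothetical and only illustrate which value ends up in the hash column:

```python
def hash_column_value(notification_id, content_hash=None):
    """Pick the value to store in NOTIFICATION_HASH (hypothetical helper).

    content_hash is the content-based hash used while Hive can still emit
    duplicate IDs. Pass None once the Hive fix guarantees unique IDs, in
    which case the notification ID itself is stored in the hash column.
    """
    if content_hash is not None:
        return content_hash
    # IDs are unique after the Hive fix, so reusing the ID cannot violate
    # the unique constraint on the hash column.
    return str(notification_id)
```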

We could then remove {{NOTIFICATION_HASH}} and make {{NOTIFICATION_ID}} a 
primary key in a new major version of Sentry if required.

[~akolb] Does this sound good?

> Sentry should handle the case of multiple notifications with the same ID
> ------------------------------------------------------------------------
>
>                 Key: SENTRY-1895
>                 URL: https://issues.apache.org/jira/browse/SENTRY-1895
>             Project: Sentry
>          Issue Type: Sub-task
>          Components: Sentry
>    Affects Versions: 2.0.0
>            Reporter: Alexander Kolbasov
>            Assignee: Sergio Peña
>             Fix For: 2.0.0
>
>
> As shown in HIVE-16886, notification IDs generated by Hive may be non-unique 
> and there may be cases with different events sharing the same ID. This creates 
> various problems for Sentry/Hive interaction and we should find some 
> short-term solution until it is fixed in Hive.
> The issue was addressed in SENTRY-1803 by removing the primary-key constraint 
> on the notification ID, which allows duplicate IDs. But this creates other 
> problems:
> 1. We are using the primary key constraint to prevent multiple instances of 
> Sentry from processing the same notifications multiple times.
> 2. We are using max(notificationId) to find the last processed event. When 
> the field is a primary key, this operation is an index scan, but when it 
> isn't, it is a full table scan which is more expensive.
> We also have a few other problems caused by duplicate IDs which are not 
> related and not addressed by SENTRY-1803:
> 1. There is a synchronization mechanism between HMS and Sentry which ensures 
> that a given event is processed. This doesn't work in the presence of 
> duplicate IDs.
> 2. Some events may be missed due to the way they are processed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
