[ 
https://issues.apache.org/jira/browse/SENTRY-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16136937#comment-16136937
 ] 

Alexander Kolbasov commented on SENTRY-1895:
--------------------------------------------

Some ideas on what we can do here.

1. Just drop duplicates and wait for HIVE-16886 to be fixed.
2. [~spena] suggested an interesting approach of using other fields from the 
notification as a key. We can compute an MD5 or SHA-1 checksum of the event and 
use it as the primary key. This would solve the uniqueness problem above, but 
would not solve the performance problem caused by the table scan. Still, this 
is better than the original SENTRY-1803 fix.
3. I have another idea inspired by the BASIC language, which used line numbers 
like 10, 20, ... so that lines added later could be numbered 11, 25, etc. If 
we assume that we can't have more than 10 duplicates (a somewhat arbitrary 
assumption, but it may be practically OK), we can store notification IDs 
multiplied by 10. When we see a duplicate, we store it as an increment, so for 
example if we have 3 notifications with ID 1 we'll store them as 10, 11, 12. 
This is used only for storage in the DB. For all outside consumers we divide by 
10 and return the original duplicate values. This approach restores the 
uniqueness constraint and helps us account for missing events from HMS. It 
doesn't solve the issue with HMS synchronization, but this will be addressed 
when HIVE-16886 is fixed.
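
As a rough illustration of options 2 and 3, here is a minimal Python sketch. It 
is not Sentry code: the names SLOTS, event_checksum, storage_id and external_id 
are hypothetical, as are the event_type and message fields used for the 
checksum, and the in-memory set only stands in for the DB table with its 
primary key.

```python
import hashlib

SLOTS = 10  # assumed upper bound on notifications sharing one ID


def event_checksum(event_id, event_type, message):
    """Option 2: derive a key from several notification fields.

    The choice of fields is an assumption made for this sketch.
    """
    payload = f"{event_id}|{event_type}|{message}".encode("utf-8")
    return hashlib.sha1(payload).hexdigest()


def storage_id(event_id, stored_ids):
    """Option 3: map an HMS notification ID to a unique stored ID.

    The first notification with ID n is stored as n * SLOTS, and each
    duplicate takes the next free value below (n + 1) * SLOTS.
    """
    base = event_id * SLOTS
    for candidate in range(base, base + SLOTS):
        if candidate not in stored_ids:
            return candidate
    raise ValueError(f"more than {SLOTS} notifications share ID {event_id}")


def external_id(stored_id):
    """Recover the original HMS notification ID for outside consumers."""
    return stored_id // SLOTS


# Three notifications with ID 1, then one with ID 2.
stored = set()
for hms_id in (1, 1, 1, 2):
    stored.add(storage_id(hms_id, stored))

print(sorted(stored))                            # stored IDs
print([external_id(s) for s in sorted(stored)])  # IDs seen by consumers
```

With this layout the stored values stay unique, so they can keep a primary-key 
constraint, and max() over the stored column divided by SLOTS still gives the 
last processed notification ID.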

> Sentry should handle the case of multiple notifications with the same ID
> ------------------------------------------------------------------------
>
>                 Key: SENTRY-1895
>                 URL: https://issues.apache.org/jira/browse/SENTRY-1895
>             Project: Sentry
>          Issue Type: Sub-task
>          Components: Sentry
>    Affects Versions: 2.0.0
>            Reporter: Alexander Kolbasov
>            Assignee: Sergio Peña
>             Fix For: 2.0.0
>
>
> As shown in HIVE-16886, notification IDs generated by Hive may be non-unique 
> and there may be cases with different events sharing the same ID. This creates 
> various problems for Sentry/Hive interaction and we should find some 
> short-term solution until it is fixed in Hive.
> The issue was addressed in SENTRY-1803 by removing the primary-key constraint 
> on the notification ID, which allows duplicate IDs to be stored. But this 
> creates other problems:
> 1. We are using the primary key constraint to prevent multiple instances of 
> Sentry from processing the same notifications multiple times.
> 2. We are using max(notificationId) to find the last processed event. When 
> the field is a primary key, this operation is an index scan, but when it 
> isn't, it is a full table scan which is more expensive.
> We also have a few other problems caused by duplicate IDs which are not 
> related and not addressed by SENTRY-1803:
> 1. There is a synchronization mechanism between HMS and Sentry which ensures 
> that a given event is processed. This doesn't work in the presence of 
> duplicate IDs.
> 2. Some events may be missed due to the way they are processed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
