[
https://issues.apache.org/jira/browse/SENTRY-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16136937#comment-16136937
]
Alexander Kolbasov commented on SENTRY-1895:
--------------------------------------------
Some ideas on what we can do here.
1. Just drop duplicates and wait for HIVE-16886 to be fixed.
2. [~spena] suggested an interesting approach of using other fields from the
notification as a key. We can compute an MD5 or SHA-1 checksum of the event and use
it as a primary key. This would solve the uniqueness problem above, but would
not solve the performance problem caused by the table scan. Still, this is better
than the original SENTRY-1803 fix.
3. I have another idea inspired by the BASIC language, which used line numbers like
10, 20, ... so that lines added later could use 11, 25, etc. If
we assume that we can't have more than 10 duplicates (a somewhat arbitrary
assumption, but it may be OK in practice), we can store notification IDs multiplied
by 10. When we see a duplicate we store it as an increment, so for example if
we have 3 notifications with index 1 we'll store them as 10, 11, 12. This is
used only for storage in the DB. For all outside consumers we divide by 10 (integer
division) and return the original, possibly duplicate, values. This approach restores the uniqueness constraint
and helps us to account for missing events from HMS. It doesn't solve the issue
with HMS synchronization, but this will be addressed when HIVE-16886 is fixed.
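
To make options 2 and 3 concrete, here is a minimal Python sketch. It is only an
illustration, not Sentry code: the names (SCALE, checksum_key, to_stored_id,
to_external_id, NotificationStore) and the choice of event fields fed into the
checksum are all hypothetical, and an in-memory dict stands in for the DB table.

```python
import hashlib

# Assumption taken from option 3: no more than 10 events share one ID.
SCALE = 10


def checksum_key(event_id, event_type, db_name, table_name, event_time):
    """Option 2: derive a key from several event fields (field choice is hypothetical)."""
    fields = (str(event_id), event_type, db_name, table_name, str(event_time))
    # Use a separator that is unlikely to occur in the fields themselves.
    joined = "\x1f".join(fields)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def to_stored_id(notification_id, duplicate_index):
    """Option 3: map an HMS notification ID to a unique storage key."""
    if not 0 <= duplicate_index < SCALE:
        raise ValueError("duplicate index out of range for SCALE")
    return notification_id * SCALE + duplicate_index


def to_external_id(stored_id):
    """Option 3: recover the ID that outside consumers should see."""
    return stored_id // SCALE


class NotificationStore:
    """In-memory stand-in for a table whose primary key is the stored ID."""

    def __init__(self):
        self._rows = {}  # stored_id -> event payload

    def add(self, notification_id, event):
        # Take the first free slot in [id * SCALE, id * SCALE + SCALE).
        for duplicate_index in range(SCALE):
            stored_id = to_stored_id(notification_id, duplicate_index)
            if stored_id not in self._rows:
                self._rows[stored_id] = event
                return stored_id
        raise ValueError("more than SCALE events share this notification ID")

    def last_processed_id(self):
        # max() over the unique key corresponds to max(notificationId) on an index.
        return to_external_id(max(self._rows)) if self._rows else None
```

Note that this sketch treats every repeated ID as a new event. It does not try
to tell a genuine duplicate ID apart from the same event being delivered twice.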
> Sentry should handle the case of multiple notifications with the same ID
> ------------------------------------------------------------------------
>
> Key: SENTRY-1895
> URL: https://issues.apache.org/jira/browse/SENTRY-1895
> Project: Sentry
> Issue Type: Sub-task
> Components: Sentry
> Affects Versions: 2.0.0
> Reporter: Alexander Kolbasov
> Assignee: Sergio Peña
> Fix For: 2.0.0
>
>
> As shown in HIVE-16886, notification IDs generated by Hive may be non-unique,
> and there may be cases with different events sharing the same ID. This creates
> various problems for Sentry/Hive interaction and we should find some short-term
> solution until it is fixed in Hive.
> The issue was addressed in SENTRY-1803 by removing the primary-key constraint
> on the notification ID, which allows multiple rows with the same ID. But this
> creates other problems:
> 1. We are using the primary key constraint to prevent multiple instances of
> Sentry from processing the same notifications multiple times.
> 2. We are using max(notificationId) to find the last processed event. When
> the field is a primary key, this operation is an index scan, but when it
> isn't, it is a full table scan which is more expensive.
> We also have a few other problems caused by duplicate IDs which are
> unrelated to, and not addressed by, SENTRY-1803:
> 1. There is a synchronization mechanism between HMS and Sentry which ensures
> that a given event is processed. This doesn't work in the presence of
> duplicate IDs.
> 2. Some events may be missed due to the way they are processed.