[ 
https://issues.apache.org/jira/browse/SENTRY-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111014#comment-16111014
 ] 

Sergio Peña commented on SENTRY-1803:
-------------------------------------

[~akolb] Processing events with the same EVENT_ID should not be a problem for 
Sentry. The issue is that if a duplicated event is not included in the response 
to the request Sentry made to HMS, it will never be processed, because the next 
request only asks for events with an ID larger than the last ID processed, 
which is the duplicated one. Below is an example:
{noformat}
1. Sentry requests ID > 0
2. HMS returns ID = 1, 2 and 3
3. Sentry processes ID = 1, 2 and 3 (3 is latest processed)

At this point, after the HMS response, HMS could write a new event with the 
duplicated ID = 3.
Sentry won't be able to fetch that event in the next request because it will 
request events with ID > 3

4. Sentry requests ID > 3
5. HMS returns ID = 4 and 5
6. Sentry processes ID = 4 and 5 (5 is latest processed)
{noformat}
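
The sequence above can be reproduced with a small simulation. This is only a 
sketch in Python, not Sentry code: the Event class, the fetch_after() helper 
and the event names are hypothetical stand-ins for the HMS notification log 
and the "ID > N" request.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Event:
    event_id: int  # notification ID, which may be duplicated (HIVE-16886)
    name: str      # hypothetical label, only used to tell events apart


def fetch_after(log: List[Event], last_id: int) -> List[Event]:
    """Hypothetical stand-in for the HMS request: events with ID > last_id."""
    return [e for e in log if e.event_id > last_id]


def simulate_missed_duplicate() -> List[str]:
    log = [Event(1, "a"), Event(2, "b"), Event(3, "c")]
    processed: List[str] = []
    last_id = 0

    # Steps 1-3: Sentry requests ID > 0 and processes 1, 2 and 3.
    batch = fetch_after(log, last_id)
    processed += [e.name for e in batch]
    last_id = max(e.event_id for e in batch)

    # After the response, HMS writes a second event with ID = 3,
    # followed by the events with ID = 4 and 5.
    log += [Event(3, "c-dup"), Event(4, "d"), Event(5, "e")]

    # Steps 4-6: Sentry requests ID > 3, so the duplicate is never returned.
    batch = fetch_after(log, last_id)
    processed += [e.name for e in batch]
    return processed


if __name__ == "__main__":
    print(simulate_missed_duplicate())
```

The returned list never contains "c-dup", which is the event lost in the 
scenario described above.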

During my investigation, I found that HMS has another type of ID named NL_ID. 
This ID is auto-incremented by the datastore identity, and it looks consecutive 
and free of duplicates. However, in the tests we did with HMS HA, I found that 
NL_ID has holes. Also, the Hive community thinks that datastore identity can 
behave differently on Oracle, for example incrementing in steps of 10 
(1, 10, 20).

Now, I think we can work around the duplicated IDs in Sentry by re-requesting 
the same ID from HMS for a period of time and processing all new events 
fetched. This assumes that HMS writes the duplicated ID within a shorter time 
than the time frame Sentry uses. See this example:
{noformat}
# Period of time is 5ms (ts is a timestamp in ms)
1. Sentry requests ID > 0 with ts = 0
2. HMS returns ID = 1 (ts=1), 2 (ts=1), 3 (ts=2)
3. Sentry processes ID = 1, 2, 3
4. Sentry requests ID > 0 again because the latest ID processed is still in the 
5ms time range
5. HMS returns ID = 1 (ts=1), 2 (ts=1), 3 (ts=2), 4 (ts=2)
6. Sentry processes ID = 4 only because 1,2,3 were already processed before 
7. Sentry requests ID > 0 again because the latest ID processed is still in the 
5ms time range
8. HMS returns ID = 1 (ts=1), 2 (ts=1), 3 (ts=2), 3 (ts=2), 4 (ts=2), 5 (ts=5)
9. Sentry processes ID = 5 and 3 (duplicated)
10. Sentry requests ID > 2 because the 5ms time range 
{noformat}
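
A rough sketch of that idea in Python follows. It is not an implementation 
proposal, and it relies on several assumptions that are not settled above: 
events are told apart by the pair (EVENT_ID, NL_ID), timestamps do not decrease 
as IDs grow, and the base of the request only advances past IDs whose timestamp 
is at least one window old. The WindowedFollower class and its poll() method 
are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Event:
    event_id: int  # notification ID, possibly duplicated
    nl_id: int     # assumed unique per stored row (holes are fine)
    ts: int        # event timestamp in ms


class WindowedFollower:
    """Re-requests recent IDs for window_ms so late duplicates are seen."""

    def __init__(self, window_ms: int) -> None:
        self.window_ms = window_ms
        self.base_id = 0  # the next request asks for ID > base_id
        self.seen: Set[Tuple[int, int]] = set()

    def poll(self, log: List[Event], now_ms: int) -> List[Event]:
        # Hypothetical stand-in for the HMS request "ID > base_id".
        batch = [e for e in log if e.event_id > self.base_id]

        # Process only the events not handled in an earlier poll.
        fresh = [e for e in batch if (e.event_id, e.nl_id) not in self.seen]
        self.seen.update((e.event_id, e.nl_id) for e in fresh)

        # Advance the base past IDs that have left the time window and
        # forget their dedupe keys, since they will not be fetched again.
        cutoff = now_ms - self.window_ms
        expired = [e.event_id for e in batch if e.ts <= cutoff]
        if expired:
            self.base_id = max(expired)
            self.seen = {k for k in self.seen if k[0] > self.base_id}
        return fresh
```

Replaying the example with a 5ms window, a poll at ts=5 returns the duplicated 
ID = 3 together with ID = 5, and under these assumptions a poll at ts=6 moves 
the base to 2, so the following request asks for ID > 2.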

Of course, 5ms is too low; it is just an example. Before digging into more 
detail about the implementation, I'd like to know your opinion. This is the 
best way I could think of to fetch those duplicated events. 

> HMSFollower should handle the case of multiple notifications with the same ID
> -----------------------------------------------------------------------------
>
>                 Key: SENTRY-1803
>                 URL: https://issues.apache.org/jira/browse/SENTRY-1803
>             Project: Sentry
>          Issue Type: Sub-task
>          Components: Sentry
>    Affects Versions: 2.0.0
>            Reporter: Alexander Kolbasov
>            Assignee: Sergio Peña
>             Fix For: 2.0.0
>
>
> According to HIVE-16886, it is possible that HMSFollower will encounter 
> multiple events with the same notification ID. It should do something sane in 
> this case.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
