[
https://issues.apache.org/jira/browse/SENTRY-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111014#comment-16111014
]
Sergio Peña commented on SENTRY-1803:
-------------------------------------
[~akolb] Processing events with the same EVENT_ID should not be a problem for
Sentry. The issue is that if the duplicated events are not all returned in the
same response from HMS, the missing ones will never be processed, because
Sentry's next request only asks for events with an ID larger than the last
(duplicated) ID it processed. Below is an example:
{noformat}
1. Sentry requests ID > 0
2. HMS returns ID = 1, 2 and 3
3. Sentry processes ID = 1, 2 and 3 (3 is latest processed)
At this moment after the HMS response, HMS could have written a new event with
the duplicated ID = 3.
Sentry won't be able to fetch that event in the next request because it will
request events ID > 3
4. Sentry requests ID > 3
5. HMS returns ID = 4 and 5
6. Sentry processes ID = 4 and 5 (5 is latest processed)
{noformat}
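The sequence above can be reproduced with a small simulation. This is only a sketch, not Sentry or HMS code: `Event`, `fetch_after` and `poll` are hypothetical names, and the in-memory list stands in for the HMS notification log.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_id: int
    payload: str


def fetch_after(log, last_id):
    """Hypothetical stand-in for the HMS request: events with ID > last_id."""
    return [e for e in log if e.event_id > last_id]


def poll(log, last_id, processed):
    """Process everything newer than last_id and return the new last ID."""
    for event in fetch_after(log, last_id):
        processed.append(event)
        last_id = max(last_id, event.event_id)
    return last_id


hms_log = [Event(1, "a"), Event(2, "b"), Event(3, "c")]
processed = []

last_id = poll(hms_log, 0, processed)        # steps 1-3: processes 1, 2, 3

hms_log.append(Event(3, "c-dup"))            # HMS writes a second event with ID 3
hms_log += [Event(4, "d"), Event(5, "e")]

last_id = poll(hms_log, last_id, processed)  # steps 4-6: requests ID > 3

# The duplicated ID 3 is never returned by an "ID > 3" request.
missed = [e for e in hms_log if e not in processed]
```

After the second poll, `missed` holds only the late event with ID 3, which is the event Sentry would silently skip.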
During my investigation, I found that HMS has another type of ID named NL_ID.
This is an ID auto-incremented by datastore identity. It looks consecutive
and free of duplicates. However, while looking at the tests we did with HMS HA,
I found that NL_ID has holes. Also, the Hive community thinks that datastore
identity can behave differently on Oracle, for example by incrementing in
steps of 10 (1, 10, 20).
Now, I think we can work around the duplicated IDs in Sentry by requesting the
same ID from HMS for a period of time and processing any new events fetched.
This assumes that HMS writes the duplicated ID within the time frame Sentry
uses. See this example:
{noformat}
# Period of time is 5ms (ts is a timestamp in ms)
1. Sentry requests ID > 0 with ts = 0
2. HMS returns ID = 1 (ts=1), 2 (ts=1), 3 (ts=2)
3. Sentry processes ID = 1, 2, 3
4. Sentry requests ID > 0 again because the latest ID processed is still in the
5ms time range
5. HMS returns ID = 1 (ts=1), 2 (ts=1), 3 (ts=2), 4 (ts=2)
6. Sentry processes ID = 4 only because 1,2,3 were already processed before
7. Sentry requests ID > 0 again because the latest ID processed is still in the
5ms time range
8. HMS returns ID = 1 (ts=1), 2 (ts=1), 3 (ts=2), 3 (ts=2), 4 (ts=2), 5 (ts=5)
9. Sentry processes ID = 5 and 3 (duplicated)
10. Sentry requests ID > 2 because IDs 1 and 2 are now outside the 5ms time range
{noformat}
Of course, 5ms is too low, but that was just an example. Before digging into
more detail about the implementation, I'd first like to know your opinion. This
is the best way I could think of to fetch those duplicated events.
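A rough sketch of the time-window idea follows. It is one possible reading of the example, not a proposed implementation: `WindowedFollower`, `resume_id` and the per-event `key` used to tell two events with the same ID apart are all hypothetical, and a real version would need some field from the HMS notification that actually differs between duplicates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_id: int
    ts: int    # event timestamp in ms
    key: str   # hypothetical field that distinguishes events sharing an ID


class WindowedFollower:
    def __init__(self, window_ms):
        self.window_ms = window_ms
        self.seen = {}        # (event_id, key) -> ts of events already processed
        self.processed = []

    def resume_id(self, now):
        """Return N for the next 'ID > N' request."""
        cutoff = now - self.window_ms
        recent = [eid for (eid, _), ts in self.seen.items() if ts >= cutoff]
        if recent:
            # Re-request everything still inside the window.
            return min(recent) - 1
        return max((eid for eid, _ in self.seen), default=0)

    def poll(self, log, now):
        start = self.resume_id(now)
        for event in (e for e in log if e.event_id > start):
            ident = (event.event_id, event.key)
            if ident not in self.seen:   # skip events processed earlier
                self.seen[ident] = event.ts
                self.processed.append(event)


follower = WindowedFollower(window_ms=5)
log = [Event(1, 1, "a"), Event(2, 1, "b"), Event(3, 2, "c")]
follower.poll(log, now=2)                     # processes 1, 2, 3

log.append(Event(4, 2, "d"))
follower.poll(log, now=3)                     # refetches from 0, processes 4 only

log += [Event(3, 2, "c-dup"), Event(5, 5, "e")]
follower.poll(log, now=5)                     # picks up the duplicated 3 and 5

processed_ids = [e.event_id for e in follower.processed]
next_start = follower.resume_id(now=7)        # IDs 1 and 2 have left the window
```

In this sketch `seen` grows without bound; entries older than the window could be pruned, since they are never re-requested.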
> HMSFollower should handle the case of multiple notifications with the same ID
> -----------------------------------------------------------------------------
>
> Key: SENTRY-1803
> URL: https://issues.apache.org/jira/browse/SENTRY-1803
> Project: Sentry
> Issue Type: Sub-task
> Components: Sentry
> Affects Versions: 2.0.0
> Reporter: Alexander Kolbasov
> Assignee: Sergio Peña
> Fix For: 2.0.0
>
>
> According to HIVE-16886, it is possible that HMSFollower will encounter
> multiple events with the same notification ID. It should do something sane in
> this case.