[
https://issues.apache.org/jira/browse/SENTRY-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16139069#comment-16139069
]
Sergio Peña commented on SENTRY-1895:
-------------------------------------
[~akolb] I think Lina has a point. I was looking at the code, and if we re-use
the NOTIFICATION_ID column from SENTRY_PATH_CHANGE to store the hash, the code
changes would be easier than persisting it in SENTRY_HMS_NOTIFICATION_ID.
Here's why:
Lina's proposal:
* Code changes are easier if we do persist the HASH on the SENTRY_PATH_CHANGE
table. The hash calculation may happen on the
DeltaTransactionBlock#persistUpdate() method because the Update object brings
the seqNum + authObj + locations.
* Just a unique index (removed on SENTRY-1803) will be added to the
NOTIFICATION_ID column from the SENTRY_PATH_CHANGE table.
* In an upgrade, we can just remove the hash calculation but continue using
the ID in the NOTIFICATION_ID because we have the unique constraint.
SENTRY-1885 proposes to remove such column, but we can do it in a major Sentry
version.
Sergio and Sasha's proposal:
* Code changes are easy, but more of them are needed because we need to pass
the hash value to DeltaTransactionBlock#persistUpdate() so that it is persisted
on the SENTRY_HMS_NOTIFICATION_ID. We also need to persist it in
ObjectStore#persistLastProcessedNotificationID().
* The hash value would be calculated in the HMSFollower class because it is the
one that fetches notifications and passes the info to the
NotificationProcessor and ObjectStore.
* We will need to add a new column to SENTRY_HMS_NOTIFICATION_ID in which to
persist the hash. This needs changes to the MSentryHmsNotification class as
well.
* In an upgrade, we will remove the hash calculation and all the parameters
that accept the hash value. We can use the same ID to persist it instead of the
hash value. We can create another JIRA to remove this column in a major Sentry
version.
I was also wondering whether the information from the Update object can be
hashed and be as unique as the NotificationEvent, and it seems it can:
{noformat}
1, CREATE DATABASE db1; (hash: 1, ("db1", "/db1"))
2, DROP DATABASE db1; (hash: 2, ("db1", "/db1"))
3, CREATE DATABASE db1; (hash: 3, ("db1", "/db1"))
1, ALTER TABLE db1.tbl1 RENAME TO tbl2; (hash: 1, ("db1.tbl1", "/db1/tbl1"),
("db1.tbl2", "/db1/tbl12"))
1, ALTER TABLE db1.tbl1 RENAME TO tbl3; (hash: 1, ("db1.tbl1", "/db1/tbl1"),
("db1.tbl3", "/db1/tbl13"))
2, CREATE DATABASE db1.tbl1; (hash: 2, ("db1.tbl1", "/db1/tbl1"))
3, DROP DATABASE db1.tbl3; (hash: 3, ("db1.tbl3"))
4, ALTER TABLE db1.tbl1 RENAME TO tbl3; (hash: 4, ("db1.tbl1", "/db1/tbl1"),
("db1.tbl3", "/db1/tbl13"))
1, ALTER TABLE db1.tbl1 SET LOCATION "/db1/tbl2" (hash: 1, ("db1.tbl1",
D:"/db1/tbl1", A:"/db1/tbl2"))
2, ALTER TABLE db1.tbl1 SET LOCATION "/db1/tbl1" (hash: 2, ("db1.tbl1",
D:"/db1/tbl2", A:"/db1/tbl1"))
3, ALTER TABLE db1.tbl1 SET LOCATION "/db1/tbl2" (hash: 3, ("db1.tbl1",
D:"/db1/tbl1", A:"/db1/tbl2"))
{noformat}
Another example: if for some reason Hive creates two similar ALTER
notifications, we will still have the same problem even if we hash from the
NotificationEvent, because all values will be identical if they happen to be
executed at the same time. But I think one of them will fail on HMS.
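A minimal sketch of what hashing the Update contents could look like, assuming
the hash covers seqNum + authObj + locations as described above. The class name
UpdateHashSketch, the method signature, and the choice of SHA-256 are
assumptions for illustration; none of this is existing Sentry code.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

// Hypothetical helper for illustration only; not part of Sentry.
final class UpdateHashSketch {
    private UpdateHashSketch() {}

    /** Returns a hex SHA-256 digest over seqNum, authObj and all locations. */
    static String hash(long seqNum, String authObj, List<String> locations) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            feed(md, Long.toString(seqNum));
            feed(md, authObj);
            for (String location : locations) {
                feed(md, location);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    // Length-prefix each field so that ("ab", "c") and ("a", "bc") differ.
    private static void feed(MessageDigest md, String field) {
        byte[] bytes = field.getBytes(StandardCharsets.UTF_8);
        md.update(Integer.toString(bytes.length).getBytes(StandardCharsets.UTF_8));
        md.update((byte) ':');
        md.update(bytes);
    }
}
```

Under this sketch, updates that differ in seqNum or in any location get
different hashes (barring a SHA-256 collision), while two updates with
identical seqNum, authObj and locations hash to the same value, which is the
duplicate-ALTER case described above.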
[~akolb] any thoughts?
> Sentry should handle the case of multiple notifications with the same ID
> ------------------------------------------------------------------------
>
> Key: SENTRY-1895
> URL: https://issues.apache.org/jira/browse/SENTRY-1895
> Project: Sentry
> Issue Type: Sub-task
> Components: Sentry
> Affects Versions: 2.0.0
> Reporter: Alexander Kolbasov
> Assignee: Sergio Peña
> Fix For: 2.0.0
>
>
> As shown in HIVE-16886, notification IDs generated by Hive may be non-unique
> and there may be cases with different events sharing the same ID. This creates
> various problems for Sentry/Hive interaction and we should find some
> short-term solution until it is fixed in Hive.
> The issue was addressed in SENTRY-1803 by removing a primary-key constraint
> on the notification Id which allows for multiple keys. But this creates other
> problems:
> 1. We are using the primary key constraint to prevent multiple instances of
> Sentry from processing the same notifications multiple times.
> 2. We are using max(notificationId) to find the last processed event. When
> the field is a primary key, this operation is an index scan, but when it
> isn't, it is a full table scan which is more expensive.
> We also have a few other problems caused by duplicate IDs which are not
> related and not addressed by SENTRY-1803:
> 1. There is a synchronization mechanism between HMS and Sentry which ensures
> that a given event is processed. This doesn't work in the presence of
> duplicate IDs.
> 2. Some events may be missed due to the way they are processed.