[ https://issues.apache.org/jira/browse/SENTRY-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16096961#comment-16096961 ]

Na Li commented on SENTRY-1855:
-------------------------------

1) Summary

The current approach updates the changeID (the primary key for permission changes 
and path changes) manually. In a single transaction, the code reads the max 
changeID from the DB, increases it by one, and saves the value in the new change 
entry. If two threads add changes to the DB at the same time, a collision happens 
(the primary key does not allow multiple entries with the same value) and one 
transaction fails. The failed transaction then goes through multiple retries; if 
the retry count reaches the max value, the transaction fails for good.
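
For illustration, the collision-prone pattern looks roughly like the JDO sketch 
below. This is a simplified sketch, not the actual SentryStore code; the class 
wrapper, method names, and accessors are hypothetical.

{code}
import javax.jdo.PersistenceManager;
import javax.jdo.Query;
import org.apache.sentry.provider.db.service.model.MSentryPermChange;

// Simplified sketch of the manual changeID allocation described above (not the
// actual SentryStore code; method and accessor names are hypothetical). Each
// writer computes max(changeID) + 1 inside its own transaction, so two
// concurrent transactions can compute the same ID and one insert must fail
// with a duplicate-key error.
class ManualChangeIdSketch {
  long nextChangeId(PersistenceManager pm) {
    Query query = pm.newQuery(MSentryPermChange.class);
    query.setResult("max(this.changeID)");
    Long max = (Long) query.execute();     // read current max under this transaction
    return (max == null) ? 1L : max + 1L;  // a concurrent transaction may pick the same value
  }

  void persistChange(PersistenceManager pm, MSentryPermChange change) {
    change.setChangeID(nextChangeId(pm));  // manual primary-key assignment
    pm.makePersistent(change);             // duplicate-key exception if two writers collide
  }
}
{code}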

In our stress testing on a single Sentry server, with 15 clients doing 
grant/revoke operations concurrently, we saw multiple transaction failures, and 
the exponential-backoff retry increases the latency of every transaction in 
Sentry. We have a serious performance issue when saving permission and path updates.

2) Potential solutions

2.1) Find out why we have collisions on a single Sentry server despite 
synchronization on saving updates. Once the cause is found, fix it.
+ Follows the existing approach. Does not introduce a big change to the code base.
- Needs time to investigate why application-level synchronization on a single 
Sentry server does not prevent key collisions.
- Does not scale. All updates are serialized, with little concurrency.
- Still has key collision exceptions and transaction failures when more than one 
Sentry server is deployed.
- Transaction failure on collision increases the time to execute a transaction.
- It is confusing to customers that transactions fail during normal operation, 
which increases support cases.

2.2) Auto-increment changeID and send delta changes as much as possible
This is the approach taken by the attached patch; it achieves a 5x or greater 
performance increase over the current approach. It contains the following changes:
a) Revert SENTRY-1795, so the changeID is auto-incremented by the DB. This avoids 
key collisions and is the main reason for the performance improvement (see the 
sketch after this list).
b) Revert SENTRY-1824; there is no need to synchronize when the changeID is 
auto-incremented.
c) Get a continuous delta list from SentryStore even when the delta list has a 
hole (for example, if the list is 1,2,3,5,6, return 1,2,3; if the hole is at the 
front of the list, return a full snapshot). A sketch of this rule follows the 
pros and cons below.
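
For change a), a hypothetical illustration of a DB-generated key in JDO. Sentry 
defines its model through JDO metadata rather than annotations; this sketch shows 
the annotation equivalent, and the class and payload field are illustrative:

{code}
import javax.jdo.annotations.IdGeneratorStrategy;
import javax.jdo.annotations.PersistenceCapable;
import javax.jdo.annotations.Persistent;
import javax.jdo.annotations.PrimaryKey;

// Hypothetical annotation-based equivalent of an auto-incremented changeID:
// the datastore assigns the next key itself, so writers never read
// max(changeID) and duplicate-key collisions cannot happen.
@PersistenceCapable
class PermChangeSketch {
  @PrimaryKey
  @Persistent(valueStrategy = IdGeneratorStrategy.NATIVE)  // let the DB pick the key
  private long changeID;

  @Persistent
  private String changeJson;  // serialized delta payload (illustrative field)
}
{code}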
+ Relatively small changes. Verified to work with good performance when there are 
no transaction failures.
+ When the hole in the delta list is temporary (a transaction still in flight), 
returning the continuous delta list deals with the hole effectively. Most likely, 
the hole will have disappeared by the next time HDFS requests changes.
- When there is a transaction failure (the hole in the changeID sequence is 
permanent), Sentry sends back a full snapshot, which is very expensive and may 
exhaust memory for a big customer. If we could detect a permanent hole, we would 
not need to send the full snapshot.
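
A minimal sketch of the continuous-delta rule in change c), assuming the deltas 
are already sorted by changeID (the helper and class names are hypothetical, not 
an existing SentryStore method):

{code}
import java.util.ArrayList;
import java.util.List;
import org.apache.sentry.provider.db.service.model.MSentryPermChange;

// Hypothetical helper: return only the prefix of deltas whose changeIDs are
// consecutive starting from the first requested ID. An empty result means the
// hole is at the front of the list, so the caller falls back to a full snapshot.
class ContiguousDeltaSketch {
  static List<MSentryPermChange> getContiguousDeltas(
      List<MSentryPermChange> sortedDeltas, long firstExpectedId) {
    List<MSentryPermChange> result = new ArrayList<>();
    long expected = firstExpectedId;
    for (MSentryPermChange change : sortedDeltas) {
      if (change.getChangeID() != expected) {
        break;              // hole found: for 1,2,3,5,6 we stop after 3
      }
      result.add(change);
      expected++;
    }
    return result;          // empty => hole at the front => send a full snapshot
  }
}
{code}
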
2.3) Use timestamp to sort the changes

a) Use a timestamp, such as MSentryPermChange.createTimeMs or 
MSentryPathChange.createTimeMs, to sort the entries. If more than one entry has 
the same timestamp, use the changeID to break the tie.
b) HDFS asks for updates using these timestamp values instead of the changeID. 
The Sentry server sends back the changes at and after that timestamp. HDFS keeps 
the list of changeIDs associated with the requested timestamp and skips entries 
it has already processed. This handles the situation where more than one entry 
has the same timestamp and some were sent in a previous request while others 
need to be sent in the next request (see the sketch at the end of this section).
c) The changeID remains the primary key, used only to uniquely identify an 
entry; it is not required to be sequential or consecutive.
d) Purge the change entries in the DB using the timestamp instead of the 
changeID. For example, keep 3 polling intervals' worth of entries so that HDFS 
can get the changes before they are purged.
+ Sentry sends a full snapshot to HDFS only once, when HDFS first starts, and 
afterwards always sends delta changes.
+ High concurrency. Scales well with a large number of clients.
- Relatively big code change to the API between the Sentry server and the Sentry 
plugin in HDFS.
- No easy way to detect that HDFS has received and processed all updates.
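
To make the ordering and de-duplication in a) and b) concrete, here is a minimal 
sketch of the client-side selection (hypothetical logic, not an existing Sentry 
API; the class and accessor names are assumptions):

{code}
import java.util.Comparator;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import org.apache.sentry.provider.db.service.model.MSentryPermChange;

// Hypothetical client-side logic: entries are ordered by createTimeMs with
// changeID as the tie-breaker, and changeIDs already processed at the requested
// timestamp are skipped so entries sharing a timestamp are never applied twice.
class TimestampOrderingSketch {
  static List<MSentryPermChange> selectNewChanges(List<MSentryPermChange> changes,
      long sinceTimeMs, Set<Long> alreadyProcessedIds) {
    return changes.stream()
        .filter(c -> c.getCreateTimeMs() >= sinceTimeMs)             // at and after the timestamp
        .filter(c -> !alreadyProcessedIds.contains(c.getChangeID())) // skip already-sent entries
        .sorted(Comparator.comparingLong(MSentryPermChange::getCreateTimeMs)
            .thenComparingLong(MSentryPermChange::getChangeID))      // tie-break on changeID
        .collect(Collectors.toList());
  }
}
{code}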

3) Decisions

My suggestion is that we take approach 2.2) in the short term and accept the hit 
of a full snapshot when there is a transaction failure, and that we take approach 
2.3) as the long-term solution.

> PERM/PATH transactions can fail to commit to the sentry database under load
> ---------------------------------------------------------------------------
>
>                 Key: SENTRY-1855
>                 URL: https://issues.apache.org/jira/browse/SENTRY-1855
>             Project: Sentry
>          Issue Type: Sub-task
>          Components: Sentry
>    Affects Versions: sentry-ha-redesign
>            Reporter: Alexander Kolbasov
>            Assignee: Na Li
>             Fix For: sentry-ha-redesign
>
>         Attachments: SENTRY-1855.01-sentry-ha-redesign.patch
>
>
> Looking at the latest stress runs, we noticed that some of the transactions 
> could fail to commit to the database (with a duplicate-key exception) after 
> exhausting all the retries.
> The problem becomes more evident as the number of clients connecting to Sentry 
> to issue permission updates grows. We were able to reproduce it consistently 
> with 15 clients doing 100 operations each.
> In the past we introduced exponential backoff (SENTRY-1821), so as part of the 
> test run we increased the defaults to a 750ms sleep and 20 retries. But even 
> after this, the cluster still shows transaction failures, and the change also 
> increases the latency of every transaction in Sentry.
> We need to revisit this and come up with a better way to solve this problem.
> {code}
> 2017-07-13 13:18:14,449 ERROR 
> org.apache.sentry.provider.db.service.persistent.TransactionManager: The 
> transaction has reached max retry number, Exception thrown when executing 
> query
> javax.jdo.JDOException: Exception thrown when executing query
>       at 
> org.datanucleus.api.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:596)
>       at org.datanucleus.api.jdo.JDOQuery.execute(JDOQuery.java:252)
>       at 
> org.apache.sentry.provider.db.service.persistent.SentryStore.getRole(SentryStore.java:294)
>       at 
> org.apache.sentry.provider.db.service.persistent.SentryStore.alterSentryRoleGrantPrivilegeCore(SentryStore.java:645)
>       at 
> org.apache.sentry.provider.db.service.persistent.SentryStore.access$500(SentryStore.java:101)
>       at 
> org.apache.sentry.provider.db.service.persistent.SentryStore$11.execute(SentryStore.java:601)
>       at 
> org.apache.sentry.provider.db.service.persistent.TransactionManager.executeTransaction(TransactionManager.java:159)
>       at 
> org.apache.sentry.provider.db.service.persistent.TransactionManager.access$100(TransactionManager.java:63)
>       at 
> org.apache.sentry.provider.db.service.persistent.TransactionManager$2.call(TransactionManager.java:213)
> --
>       at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:971)
>       at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3887)
>       at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3823)
>       at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2435)
>       at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2582)
>       at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2530)
>       at 
> com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1907)
>       at 
> com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2141)
>       at 
> com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:1773)
>       ... 33 more
> 2017-07-13 13:18:14,450 ERROR 
> org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor: 
> Unknown error for request: 
> TAlterSentryRoleGrantPrivilegeRequest(protocol_version:2, 
> requestorUserName:hive, roleName:2017_07_12_15_06_38_1_2_805, 
> privileges:[TSentryPrivilege(privilegeScope:DATABASE, serverName:server1, 
> dbName:2017_07_12_15_06_38_1_2, tableName:, URI:, action:*, 
> createTime:1499904401222, grantOption:FALSE, columnName:)]), message: The 
> transaction has reached max retry number, Exception thrown when executing 
> query
> java.lang.Exception: The transaction has reached max retry number, Exception 
> thrown when executing query
>       at 
> org.apache.sentry.provider.db.service.persistent.TransactionManager$ExponentialBackoff.execute(TransactionManager.java:255)
>       at 
> org.apache.sentry.provider.db.service.persistent.TransactionManager.executeTransactionBlocksWithRetry(TransactionManager.java:209)
>       at 
> org.apache.sentry.provider.db.service.persistent.SentryStore.execute(SentryStore.java:3330)
>       at 
> org.apache.sentry.provider.db.service.persistent.SentryStore.alterSentryRoleGrantPrivilege(SentryStore.java:593)
>       at 
> org.apache.sentry.provider.db.service.persistent.SentryStore.alterSentryRoleGrantPrivileges(SentryStore.java:633)
>       at 
> org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.alter_sentry_role_grant_privilege(SentryPolicyStoreProcessor.java:256)
>       at 
> org.apache.sentry.provider.db.service.thrift.SentryPolicyService$Processor$alter_sentry_role_grant_privilege.getResult(SentryPolicyService.java:997)
>       at 
> org.apache.sentry.provider.db.service.thrift.SentryPolicyService$Processor$alter_sentry_role_grant_privilege.getResult(SentryPolicyService.java:982)
>       at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
>       at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
> {code}


