Sasha, I realize a serious issue in approach 2.2) that it does not function correctly. So I re-iterate the approaches and their pros and cons.
This leaves us the option 2.1) and 2.3). The question is if the performance of approach 2.1) acceptable for now, and we can work on the long term solution 2.3) later? 2.1) Find out why we have collision on a single sentry server with synchronization on saving updates. Once find the cause, fix it. + Follow existing approach. Does not introduce big change to the code base. + Function correctly. There is no hole in changeID - Need time to investigate why synchronization at application level on a single sentry server does not prevent key collision. - Does not scale. All updates are serialized, not much concurrency. - Still have key collision exception and transaction failure when more than one Sentry servers are deployed. - Transaction failure at collision increases time to execute a transaction. - It is confusing to customer that there are transaction failure in normal operation. Increase support cases 2.2) Auto-increment changeID and send delta changes as much as possible The patch that achieves 5 times or more performance increase than the current approach. It contains the following changes - revert sentry-1795 (so the changeID is auto-increment. This avoids key collision. This is main reason we have performance improvement) - revert sentry-1824 (no need to synchronize when changeID is auto-increment) - get continuous delta list from SentryStore even when the delta list has hole (for example, the list is 1,2,3,5,6, return 1,2,3. If the hole is at the front of list, return full snapshot) + Relative small changes. Verified working with good performance when there is no transaction failure + When the hole in delta list is temporary (transaction in flight), return the continuous delta list is effective to deal with the hole. Most likely, the hole will disappear next time HDFS requests for changes. - When there is transaction failure (the hole in changeID is permanent), sends back full snapshot, which is expensive. If we can detect permanent hole, then we don't need to send full snapshot, which is very expensive, and may exhaust memory for big customer - When there is a temporary hole at the start of the list, even sending full snapshot could skip the transaction in flight. It does not function correctly. For example, the list is 1,2,3,5,6, and transaction with changeID 4 is not done. First, Sentry returns 1,2,3. next round, 4 is still not done, so sentry gets list 5,6 with a hole at the front of the list. Sentry sends back the full snapshot with changeID 6. then next time, HDFS will ask for changes after 6, and transaction with changeID 4 will be skipped even thought it may have done. 2.3) Use timestamp to sort the changes a) use timestamp, such as MSentryPermChange.createTimeMs or MSentryPathChange.createTimeMs to sort the entries. If there are more than one entry having same timestamp, use changeID to break the tie. b) HDFS asks for updates using these timestamp values instead of changeID. Sentry server sends back changes at and after this timestamp. HDFS keeps the list of changeIDs associated with the requesting timestamp and skip entries already processed. This is to handle the situation when more than one entry having same timestamp, and some are sent in previous request, and some need to be send in next request. c) changeID is primary key, only used to uniquely identify the entry, and not required to be sequential nor consecutive. d) Purge the change entries in DB using timestamp instead of changeID. For example, keep 3 polling intervals of entries to allow HDFS getting the changes before they are purged. + Sentry only sends full snapshot to HDFS at the first time when HDFS starts, and then always sends delta changes to HDFS. For path changes, if Sentry got a new full snapshot from HMS, it sends full path snapshot. + High concurrency. Scale well with large number of clients - Relative big code change for API between sentry server and sentry plugin at HDFS. - No easy way to detect that HDFS has received and processed all updates - If DB admin changes the DB server time backwards, this will mess up the ordering using timestamp Thanks, Lina On Fri, Jul 21, 2017 at 5:53 PM, Alexander Kolbasov <ak...@cloudera.com> wrote: > Lina, Thank you for sending a detailed summary! > > > > On Jul 22, 2017, at 12:29 AM, Na Li <lina...@cloudera.com <mailto: > lina...@cloudera.com>> wrote: > > > > Hello, > > > > Can you provide your feedback on this issue? We need to make decision > soon. > > > > 1) Summary > > > > The current approach updates changeID (primary key for permission change > and path change) manually. In a single transaction, the code reads the max > changeID from DB, increases it by one, and save the value in new change > entry. If two threads are adding changes into DB at the same time, > collision happens (primary key does not allow multiple entries having the > same value), and one transaction fails. Then the failed transaction goes > through multiple re-tries. If retry count reaches max value, the > transaction fails. > > > > In our stress testing on a single sentry server, with 15 clients doing > grant/revoke operations concurrently, we saw multiple transaction failures, > and the exponential-back-off retry increases the latency of every > transaction in sentry. We have serious performance issue on saving > permission and path updates. > > How important is the performance issue? Do we expect high rate of > permission updates in realistic scenarios? Do we have any idea about > current supported rate of permission updates? > > Note that in our single sentry server case the code that updates the DB is > serialized, and, thus, we shouldn’t see any collisions at the DB level at > all. This means that there is some other bug hiding here, still > undiscovered. The retries and exponential backoff shouldn’t apply in the > single-server case at all. > > > > > 2) Potential solutions > > > > 2.1) Find out why we have collision on a single sentry server with > synchronization on saving updates. Once find the cause, fix it. > > +1 on that > > > + Follow existing approach. Does not introduce big change to the code > base. > > - Need time to investigate why synchronization at application level on a > single sentry server does not prevent key collision. > > - Does not scale. All updates are serialized, not much concurrency. > > - Still have key collision exception and transaction failure when more > than one Sentry servers are deployed. > > - Transaction failure at collision increases time to execute a > transaction. > > - It is confusing to customer that there are transaction failure in > normal operation. Increase support cases > > > > 2.2) CDH-transactionFailure.01-cdh5-1.5.1.patch > > The patch that achieves 5 times or more performance increase than the > current approach. > > It contains the following changes > > revert sentry-1795 (so the changeID is auto-increment. This avoids key > collision. This is main reason we have performance improvement) > > revert sentry-1824 (no need to synchronize when changeID is > auto-increment) > > get continuous delta list from SentryStore even when the delta list has > hole (for example, the list is 1,2,3,5,6, return 1,2,3. If the hole is at > the front of list, return full snapshot) > > Please document the algorithm you are suggesting in more detail. How do > you propose to handle synchronization between NN and Sentry server using > auto-incremented IDs? > > Performance is an important consideration once we achieve correctness. It > would be great to have *correct* solution that also performs well. > > + Relative small changes. Verified working with good performance when > there is no transaction failure > > + When the hole in delta list is temporary (transaction in flight), > return the continuous delta list is effective to deal with the hole. Most > likely, the hole will disappear next time HDFS requests for changes. > > > > - When there is transaction failure (the hole in changeID is permanent), > sends back full snapshot, which is expensive. If we can detect permanent > hole, then we don't need to send full snapshot, which is very expensive, > and may exhaust memory for big customer > > > > Right, sending full snapshot is extremely expensive. How do you propose to > detect such permanent hole? > > 2.3) Use timestamp to sort the changes > > > > a) use timestamp, such as MSentryPermChange.createTimeMs or > MSentryPathChange.createTimeMs to sort the entries. If there are more than > one entry having same timestamp, use changeID to break the tie. > > IMO we should use either timestamps or changeIDs, not both. > > > b) HDFS asks for updates using these timestamp values instead of > changeID. > > I think that this approach may cause some updates to be lost. > > > > c) changeID is primary key, only used to uniquely identify the entry, > and not required to be sequential nor consecutive. > > d) Purge the change entries in DB using timestamp instead of changeID. > For example, keep 3 polling intervals of entries to allow HDFS getting the > changes before they are purged. > > Suppose that you do this purging. Later you restart NN and it asks for all > updates since time T, but a bunch of updates since this time were purged, > so they never reach NN and are permanently lost. And it is impossible to > detect that this happened. > > > + Sentry only sends full snapshot to HDFS at the first time when HDFS > starts, and then always sends delta changes to HDFS > > What if HDFS wasn’t able to talk to Sentry for some time and in the mean > time some entries were purged? > > > > + High concurrency. Scale well with large number of clients > > - Relative big code change for API between sentry server and sentry > plugin at HDFS. > > - No easy way to detect that HDFS has received and processed all updates > > > > 3) Decisions > > > > My suggestion is that we take approach 2.2) for short term and take the > hit of full snapshot when there is transaction failure. And take approach > 2.3) as long term solution. > > We can’t make the decision since we do not have enough information about > approach 2.2 - how does it actually work? This may be a really good > solution if we can prove that it works. > > - Sasha > > > On Jul 22, 2017, at 12:42 AM, Na Li <lina...@cloudera.com> wrote: > > > > Hello, > > > > Can you provide your feedback on this issue? We need to make decision > soon. > > > > 1) Summary > > > > The current approach updates changeID (primary key for permission change > > and path change) manually. In a single transaction, the code reads the > max > > changeID from DB, increases it by one, and save the value in new change > > entry. If two threads are adding changes into DB at the same time, > > collision happens (primary key does not allow multiple entries having the > > same value), and one transaction fails. Then the failed transaction goes > > through multiple re-tries. If retry count reaches max value, the > > transaction fails. > > > > In our stress testing on a single sentry server, with 15 clients doing > > grant/revoke operations concurrently, we saw multiple transaction > failures, > > and the exponential-back-off retry increases the latency of every > > transaction in sentry. We have serious performance issue on saving > > permission and path updates. > > > > 2) Potential solutions > > > > 2.1) Find out why we have collision on a single sentry server with > > synchronization on saving updates. Once find the cause, fix it. > > + Follow existing approach. Does not introduce big change to the code > base. > > - Need time to investigate why synchronization at application level on a > > single sentry server does not prevent key collision. > > - Does not scale. All updates are serialized, not much concurrency. > > - Still have key collision exception and transaction failure when more > than > > one Sentry servers are deployed. > > - Transaction failure at collision increases time to execute a > transaction. > > - It is confusing to customer that there are transaction failure in > normal > > operation. Increase support cases > > > > 2.2) Auto-increment changeID and send delta changes as much as possible > > > > The patch that achieves 5 times or more performance increase than the > > current approach. > > > > It contains the following changes > > > > - revert sentry-1795 (so the changeID is auto-increment. This avoids > key > > collision. This is main reason we have performance improvement) > > - revert sentry-1824 (no need to synchronize when changeID is > > auto-increment) > > - get continuous delta list from SentryStore even when the delta list > > has hole (for example, the list is 1,2,3,5,6, return 1,2,3. If the > hole is > > at the front of list, return full snapshot) > > > > + Relative small changes. Verified working with good performance when > there > > is no transaction failure > > > > + When the hole in delta list is temporary (transaction in flight), > return > > the continuous delta list is effective to deal with the hole. Most > likely, > > the hole will disappear next time HDFS requests for changes. > > > > - When there is transaction failure (the hole in changeID is permanent), > > sends back full snapshot, which is expensive. If we can detect permanent > > hole, then we don't need to send full snapshot, which is very expensive, > > and may exhaust memory for big customer > > > > 2.3) Use timestamp to sort the changes > > > > a) use timestamp, such as MSentryPermChange.createTimeMs or > > MSentryPathChange.createTimeMs to sort the entries. If there are more > than > > one entry having same timestamp, use changeID to break the tie. > > b) HDFS asks for updates using these timestamp values instead of > changeID. > > Sentry server sends back changes at and after this timestamp. HDFS keeps > > the list of changeIDs associated with the requesting timestamp and skip > > entries already processed. This is to handle the situation when more than > > one entry having same timestamp, and some are sent in previous request, > and > > some need to be send in next request. > > c) changeID is primary key, only used to uniquely identify the entry, and > > not required to be sequential nor consecutive. > > d) Purge the change entries in DB using timestamp instead of changeID. > For > > example, keep 3 polling intervals of entries to allow HDFS getting the > > changes before they are purged. > > + Sentry only sends full snapshot to HDFS at the first time when HDFS > > starts, and then always sends delta changes to HDFS > > + High concurrency. Scale well with large number of clients > > - Relative big code change for API between sentry server and sentry > plugin > > at HDFS. > > - No easy way to detect that HDFS has received and processed all updates > > > > 3) Decisions > > > > My suggestion is that we take approach 2.2) for short term and take the > hit > > of full snapshot when there is transaction failure. And take approach > 2.3) > > as long term solution. > > > > Thanks, > > > > Lina > >