Lina,

> On Jul 24, 2017, at 10:20 PM, Na Li <lina...@cloudera.com> wrote:
>
> Sasha,
>
> I found the following approach that could improve the performance of the current approach and of the second approach a lot, with a minor code change.
How do you know that it improves performance? Do you see the improvement in your experiments? If you do see improvements, let's do the change.

> Right now, the execution order of the transaction blocks in a transaction is
> 1. Delta transaction block to save the perm change and path change
> 2. The actual transaction
>
> We can change the execution order to be
> 1. The actual transaction
> 2. Delta transaction block to save the perm change and path change
>
> The benefit of this update is:
> The actual transaction has a chance to fail, and its execution time may vary a lot.

Why would it fail? Most of our transactions fail currently because of the conflict on the primary key for delta updates, which IMO happens at commit time.

> The delta transaction has a very low chance to fail, and its execution time is near constant. As a result, the changeID is set after the actual transaction is done.

Hmm, I don't get that. I think changeID is modified in the same transaction, so the whole thing should complete.

> A. If it is applied to approach 2.1) (manually increase changeID: read the current max value and increase it by 1), the time window exposed to key collisions is small (the delta transaction execution time). Before, the time window is the sum of the delta transaction and the actual transaction. So key collisions should happen much less often.

Can you run some experiments and see whether this is the case or not? This may or may not have some impact - difficult to tell.

Thanks,

- Sasha

> B. If it is applied to approach 2.2) (auto-increment the changeID), an actual transaction failure does not create a permanent hole in changeID, and the near-constant execution time of the delta transaction makes a temporary hole rare. Without permanent holes, any hole that does appear is temporary. We should always wait for the hole to be filled and always send delta changes. If the hole is never filled, the missing transaction was most likely purged without being received by HDFS.
>
> Thanks,
>
> Lina
>
> On Mon, Jul 24, 2017 at 1:03 PM, Na Li <lina...@cloudera.com> wrote:
>
>> Approach 2.2 uses auto-increment of changeID. When there is a temporary hole at the start of the list, even sending a full snapshot could skip the transaction in flight. It does not function correctly.
>>
>> For example, the list is 1,2,3,5,6, and the transaction with changeID 4 is not done.
>>
>> First, Sentry returns 1,2,3.
>>
>> Next round, 4 is still not done, so Sentry gets the list 5,6 with a hole at the front of the list. Sentry sends back the full snapshot with changeID 6.
>>
>> Then next time, HDFS will ask for changes after 6, and the transaction with changeID 4 will be skipped even though it may have completed.
>>
>> On Mon, Jul 24, 2017 at 12:02 PM, Alexander Kolbasov <ak...@cloudera.com> wrote:
>>
>>> Lina,
>>>
>>> can you describe what problem you have discovered with approach 2.2?
>>>
>>> On Mon, Jul 24, 2017 at 6:37 PM, Na Li <lina...@cloudera.com> wrote:
>>>
>>>> Sasha,
>>>>
>>>> I realized a serious issue in approach 2.2): it does not function correctly. So I re-iterate the approaches and their pros and cons.
>>>>
>>>> This leaves us with options 2.1) and 2.3). The question is whether the performance of approach 2.1) is acceptable for now, so that we can work on the long-term solution 2.3) later.
>>>>
>>>> 2.1) Find out why we have collisions on a single Sentry server with synchronization on saving updates. Once we find the cause, fix it.
>>>> + Follows the existing approach. Does not introduce a big change to the code base.
>>>> + Functions correctly. There is no hole in changeID.
>>>> - Need time to investigate why synchronization at the application level on a single Sentry server does not prevent key collisions.
>>>> - Does not scale. All updates are serialized, not much concurrency.
>>>> - Still has key collision exceptions and transaction failures when more than one Sentry server is deployed.
>>>> - Transaction failure at collision increases the time to execute a transaction.
>>>> - It is confusing to customers that there are transaction failures in normal operation. Increases support cases.
>>>>
>>>> 2.2) Auto-increment changeID and send delta changes as much as possible
>>>>
>>>> This is the patch that achieves a 5x or more performance increase over the current approach.
>>>>
>>>> It contains the following changes:
>>>>
>>>> - revert SENTRY-1795 (so the changeID is auto-incremented; this avoids key collisions and is the main reason for the performance improvement)
>>>> - revert SENTRY-1824 (no need to synchronize when changeID is auto-incremented)
>>>> - get a continuous delta list from SentryStore even when the delta list has a hole (for example, if the list is 1,2,3,5,6, return 1,2,3; if the hole is at the front of the list, return a full snapshot)
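>>>>
>>>> The continuous-list logic is roughly the following (a simplified sketch over bare changeID values; the real SentryStore deals in MSentryPermChange/MSentryPathChange deltas and has its own full-snapshot path):
>>>>
>>>>   import java.util.ArrayList;
>>>>   import java.util.List;
>>>>
>>>>   // Longest gap-free prefix of the (sorted) changeIDs newer than 'lastSeen'.
>>>>   // An empty result means the hole is at the front of the list, so the
>>>>   // caller falls back to sending a full snapshot.
>>>>   static List<Long> continuousPrefix(List<Long> sortedChangeIds, long lastSeen) {
>>>>     List<Long> result = new ArrayList<>();
>>>>     long expected = lastSeen + 1;
>>>>     for (long id : sortedChangeIds) {
>>>>       if (id != expected) {
>>>>         break;      // e.g. 1,2,3,5,6 with lastSeen=0 returns only 1,2,3
>>>>       }
>>>>       result.add(id);
>>>>       expected++;
>>>>     }
>>>>     return result;
>>>>   }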
>>>>
>>>> + Relatively small change. Verified working with good performance when there is no transaction failure.
>>>>
>>>> + When the hole in the delta list is temporary (transaction in flight), returning the continuous delta list is an effective way to deal with the hole. Most likely, the hole will have disappeared the next time HDFS requests changes.
>>>>
>>>> - When there is a transaction failure (the hole in changeID is permanent), Sentry sends back a full snapshot, which is expensive. If we could detect a permanent hole, we would not need to send the full snapshot, which is very expensive and may exhaust memory for big customers.
>>>>
>>>> - When there is a temporary hole at the start of the list, even sending a full snapshot could skip the transaction in flight. It does not function correctly.
>>>>
>>>> For example, the list is 1,2,3,5,6, and the transaction with changeID 4 is not done.
>>>>
>>>> First, Sentry returns 1,2,3.
>>>>
>>>> Next round, 4 is still not done, so Sentry gets the list 5,6 with a hole at the front of the list. Sentry sends back the full snapshot with changeID 6.
>>>>
>>>> Then next time, HDFS will ask for changes after 6, and the transaction with changeID 4 will be skipped even though it may have completed.
>>>>
>>>> 2.3) Use timestamps to sort the changes
>>>>
>>>> a) Use a timestamp, such as MSentryPermChange.createTimeMs or MSentryPathChange.createTimeMs, to sort the entries. If more than one entry has the same timestamp, use changeID to break the tie.
>>>> b) HDFS asks for updates using these timestamp values instead of changeID. The Sentry server sends back changes at and after this timestamp. HDFS keeps the list of changeIDs associated with the requested timestamp and skips entries it has already processed. This handles the situation where more than one entry has the same timestamp, and some were sent in the previous request while others need to be sent in the next request.
>>>> c) changeID remains the primary key, used only to uniquely identify the entry; it is not required to be sequential or consecutive.
>>>> d) Purge the change entries in the DB using the timestamp instead of changeID. For example, keep 3 polling intervals' worth of entries to allow HDFS to fetch the changes before they are purged.
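>>>>
>>>> For a) and b), the ordering and the HDFS-side de-duplication would look roughly like this (illustrative sketch only; "Change" stands in for MSentryPermChange/MSentryPathChange and the parameter names are made up):
>>>>
>>>>   import java.util.ArrayList;
>>>>   import java.util.Comparator;
>>>>   import java.util.List;
>>>>   import java.util.Set;
>>>>
>>>>   static List<Change> newChanges(List<Change> atOrAfterLastTime, long lastTimeMs,
>>>>                                  Set<Long> processedIdsAtLastTime) {
>>>>     // a) order by createTimeMs, break ties with changeID
>>>>     atOrAfterLastTime.sort(Comparator.comparingLong(Change::getCreateTimeMs)
>>>>                                      .thenComparingLong(Change::getChangeId));
>>>>     List<Change> result = new ArrayList<>();
>>>>     for (Change c : atOrAfterLastTime) {
>>>>       // b) skip entries at the boundary timestamp that were already applied
>>>>       if (c.getCreateTimeMs() == lastTimeMs
>>>>           && processedIdsAtLastTime.contains(c.getChangeId())) {
>>>>         continue;
>>>>       }
>>>>       result.add(c);
>>>>     }
>>>>     return result;
>>>>   }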
>>>>
>>>> + Sentry only sends a full snapshot to HDFS the first time HDFS starts, and afterwards always sends delta changes to HDFS. For path changes, if Sentry gets a new full snapshot from HMS, it sends a full path snapshot.
>>>> + High concurrency. Scales well with a large number of clients.
>>>> - Relatively big code change for the API between the Sentry server and the Sentry plugin in HDFS.
>>>> - No easy way to detect that HDFS has received and processed all updates.
>>>> - If a DB admin changes the DB server time backwards, this will mess up the timestamp-based ordering.
>>>>
>>>> Thanks,
>>>>
>>>> Lina
>>>>
>>>> On Fri, Jul 21, 2017 at 5:53 PM, Alexander Kolbasov <ak...@cloudera.com> wrote:
>>>>
>>>>> Lina, Thank you for sending a detailed summary!
>>>>>
>>>>>> On Jul 22, 2017, at 12:29 AM, Na Li <lina...@cloudera.com> wrote:
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> Can you provide your feedback on this issue? We need to make a decision soon.
>>>>>>
>>>>>> 1) Summary
>>>>>>
>>>>>> The current approach updates changeID (the primary key for permission changes and path changes) manually. In a single transaction, the code reads the max changeID from the DB, increases it by one, and saves the value in the new change entry. If two threads are adding changes to the DB at the same time, a collision happens (the primary key does not allow multiple entries with the same value), and one transaction fails. The failed transaction then goes through multiple retries. If the retry count reaches the max value, the transaction fails for good.
>>>>>>
>>>>>> In our stress testing on a single Sentry server, with 15 clients doing grant/revoke operations concurrently, we saw multiple transaction failures, and the exponential-back-off retries increase the latency of every transaction in Sentry. We have a serious performance issue with saving permission and path updates.
>>>>>
>>>>> How important is the performance issue? Do we expect a high rate of permission updates in realistic scenarios? Do we have any idea about the currently supported rate of permission updates?
>>>>>
>>>>> Note that in our single Sentry server case the code that updates the DB is serialized, and, thus, we shouldn't see any collisions at the DB level at all. This means that there is some other bug hiding here, still undiscovered. The retries and exponential backoff shouldn't apply in the single-server case at all.
>>>>>
>>>>>> 2) Potential solutions
>>>>>>
>>>>>> 2.1) Find out why we have collisions on a single Sentry server with synchronization on saving updates. Once we find the cause, fix it.
>>>>>
>>>>> +1 on that
>>>>>
>>>>>> + Follows the existing approach. Does not introduce a big change to the code base.
>>>>>> - Need time to investigate why synchronization at the application level on a single Sentry server does not prevent key collisions.
>>>>>> - Does not scale. All updates are serialized, not much concurrency.
>>>>>> - Still has key collision exceptions and transaction failures when more than one Sentry server is deployed.
>>>>>> - Transaction failure at collision increases the time to execute a transaction.
>>>>>> - It is confusing to customers that there are transaction failures in normal operation. Increases support cases.
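>>>>>>
>>>>>> (For context, the manual allocation described in the summary is essentially the following. This is a simplified JDBC-style sketch with illustrative table and column names, not the actual JDO-based SentryStore code; two transactions running it concurrently can read the same MAX and then collide on the primary key.)
>>>>>>
>>>>>>   // java.sql.* assumed; 'conn' is an open Connection with auto-commit off,
>>>>>>   // 'changeJson' is the serialized delta.
>>>>>>   try (PreparedStatement max = conn.prepareStatement(
>>>>>>            "SELECT MAX(CHANGE_ID) FROM SENTRY_PERM_CHANGE");
>>>>>>        ResultSet rs = max.executeQuery()) {
>>>>>>     rs.next();
>>>>>>     long nextId = rs.getLong(1) + 1;   // read current max, add one
>>>>>>     try (PreparedStatement ins = conn.prepareStatement(
>>>>>>              "INSERT INTO SENTRY_PERM_CHANGE (CHANGE_ID, CREATE_TIME_MS, CHANGE_JSON) VALUES (?, ?, ?)")) {
>>>>>>       ins.setLong(1, nextId);          // primary key: a duplicate value fails the insert
>>>>>>       ins.setLong(2, System.currentTimeMillis());
>>>>>>       ins.setString(3, changeJson);
>>>>>>       ins.executeUpdate();
>>>>>>     }
>>>>>>   }
>>>>>>   conn.commit();   // one of two colliding writers fails here or at the insert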
>>>>>>
>>>>>> 2.2) CDH-transactionFailure.01-cdh5-1.5.1.patch
>>>>>> The patch that achieves a 5x or more performance increase over the current approach.
>>>>>> It contains the following changes:
>>>>>> revert SENTRY-1795 (so the changeID is auto-incremented; this avoids key collisions and is the main reason for the performance improvement)
>>>>>> revert SENTRY-1824 (no need to synchronize when changeID is auto-incremented)
>>>>>> get a continuous delta list from SentryStore even when the delta list has a hole (for example, if the list is 1,2,3,5,6, return 1,2,3; if the hole is at the front of the list, return a full snapshot)
>>>>>
>>>>> Please document the algorithm you are suggesting in more detail. How do you propose to handle synchronization between NN and Sentry server using auto-incremented IDs?
>>>>>
>>>>> Performance is an important consideration once we achieve correctness. It would be great to have a *correct* solution that also performs well.
>>>>>
>>>>>> + Relatively small change. Verified working with good performance when there is no transaction failure.
>>>>>> + When the hole in the delta list is temporary (transaction in flight), returning the continuous delta list is an effective way to deal with the hole. Most likely, the hole will have disappeared the next time HDFS requests changes.
>>>>>>
>>>>>> - When there is a transaction failure (the hole in changeID is permanent), Sentry sends back a full snapshot, which is expensive. If we could detect a permanent hole, we would not need to send the full snapshot, which is very expensive and may exhaust memory for big customers.
>>>>>
>>>>> Right, sending a full snapshot is extremely expensive. How do you propose to detect such a permanent hole?
>>>>>
>>>>>> 2.3) Use timestamps to sort the changes
>>>>>>
>>>>>> a) Use a timestamp, such as MSentryPermChange.createTimeMs or MSentryPathChange.createTimeMs, to sort the entries. If more than one entry has the same timestamp, use changeID to break the tie.
>>>>>
>>>>> IMO we should use either timestamps or changeIDs, not both.
>>>>>
>>>>>> b) HDFS asks for updates using these timestamp values instead of changeID.
>>>>>
>>>>> I think that this approach may cause some updates to be lost.
>>>>>
>>>>>> c) changeID remains the primary key, used only to uniquely identify the entry; it is not required to be sequential or consecutive.
>>>>>> d) Purge the change entries in the DB using the timestamp instead of changeID. For example, keep 3 polling intervals' worth of entries to allow HDFS to fetch the changes before they are purged.
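>>>>>>
>>>>>> (For d), the purge criterion would be roughly the following; names are illustrative and pollIntervalMs stands for the HDFS plugin's polling interval:)
>>>>>>
>>>>>>   // keep about 3 polling intervals of history, purge everything older
>>>>>>   long cutoffMs = System.currentTimeMillis() - 3 * pollIntervalMs;
>>>>>>   try (PreparedStatement purge = conn.prepareStatement(
>>>>>>            "DELETE FROM SENTRY_PERM_CHANGE WHERE CREATE_TIME_MS < ?")) {
>>>>>>     purge.setLong(1, cutoffMs);
>>>>>>     purge.executeUpdate();
>>>>>>   }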
>>>>>
>>>>> Suppose that you do this purging. Later you restart the NN and it asks for all updates since time T, but a bunch of updates since that time were purged, so they never reach the NN and are permanently lost. And it is impossible to detect that this happened.
>>>>>
>>>>>> + Sentry only sends a full snapshot to HDFS the first time HDFS starts, and afterwards always sends delta changes to HDFS.
>>>>>
>>>>> What if HDFS wasn't able to talk to Sentry for some time and in the meantime some entries were purged?
>>>>>
>>>>>> + High concurrency. Scales well with a large number of clients.
>>>>>> - Relatively big code change for the API between the Sentry server and the Sentry plugin in HDFS.
>>>>>> - No easy way to detect that HDFS has received and processed all updates.
>>>>>>
>>>>>> 3) Decisions
>>>>>>
>>>>>> My suggestion is that we take approach 2.2) for the short term and take the hit of a full snapshot when there is a transaction failure, and take approach 2.3) as the long-term solution.
>>>>>
>>>>> We can't make the decision since we do not have enough information about approach 2.2 - how does it actually work? This may be a really good solution if we can prove that it works.
>>>>>
>>>>> - Sasha