Lina, I don’t understand why it would be any different, but there is no harm in trying.
> On Jul 24, 2017, at 11:50 PM, Na Li <lina...@cloudera.com> wrote:
>
> Sasha,
>
> "How do you know that it improves performance? Do you see the improvement
> in your experiments? If you do see improvements, let’s do the change."
> [Lina] I have not run the test. Vamsee will run the test on a cluster 30
> minutes later today for the current approach with re-ordered transactions.
> So we can see.
>
> "Why would it fail? Most of our transactions fail currently because of the
> conflict on the primary key for delta updates which IMO is happening at
> commit time."
> [Lina] I know. We have three types of failure: whole-transaction failure
> (the one you mentioned), delta transaction block failure, and TransactionBlock
> failure. I am looking at additional individual transaction block failures,
> besides the key-conflict failure of the whole transaction.
>
> "Hmm, I don’t get that. I think changeID is modified in the same
> transaction, so the whole thing should complete."
> [Lina] I think the changeID value is set in DeltaTransactionBlock after getting
> max(changeID in DB). In my proposed approach, the order of events is
> 0) transaction starts
> 1) PM executes TransactionBlock
> 2) PM executes DeltaTransactionBlock (where the changeID value is set on the
> MSentryPathChange instance or MSentryPermChange instance)
> 3) it is saved into the DB at transaction commit.
>
> In the current approach, the order of events is
> 0) transaction starts
> 1) PM executes DeltaTransactionBlock (where the changeID value is set on the
> MSentryPathChange instance or MSentryPermChange instance)
> 2) PM executes TransactionBlock
> 3) it is saved into the DB at transaction commit.
>
> Reducing the time between reading max(changeID in DB) and transaction
> commit reduces the chance of a key conflict. That is the whole point of
> re-ordering the blocks.
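The conflict-window argument above can be sketched as a toy model (not Sentry code; the block names and durations are invented for illustration): the window runs from the max(changeID) read until commit, so executing DeltaTransactionBlock last shrinks it to the delta block's near-constant duration.

```python
def conflict_window(blocks, reading_block):
    """Simulated time from the max(changeID) read (taken at the start of
    reading_block) until commit, which happens after all blocks finish."""
    elapsed = 0.0
    window_start = 0.0
    for name, duration in blocks:
        if name == reading_block:
            window_start = elapsed  # max(changeID) is read here
        elapsed += duration
    return elapsed - window_start   # commit closes the window

ACTUAL_COST = 0.200  # actual transaction block: long and variable (assumed)
DELTA_COST = 0.005   # delta block: short and near constant (assumed)

# Current order: delta block first -> window spans the whole transaction.
current = conflict_window([("delta", DELTA_COST), ("actual", ACTUAL_COST)], "delta")
# Proposed order: delta block last -> window is just the delta block.
proposed = conflict_window([("actual", ACTUAL_COST), ("delta", DELTA_COST)], "delta")
```

Under these assumed durations the proposed ordering cuts the window by roughly the cost of the actual transaction block.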
>
> Thanks,
>
> Lina
>
> On Mon, Jul 24, 2017 at 3:52 PM, Alexander Kolbasov <ak...@cloudera.com> wrote:
>
>> Lina,
>>
>>> On Jul 24, 2017, at 10:20 PM, Na Li <lina...@cloudera.com> wrote:
>>>
>>> Sasha,
>>>
>>> I found the following approach that could improve the performance of the
>>> current approach and the second approach a lot, with a minor code change.
>>
>> How do you know that it improves performance? Do you see the improvement
>> in your experiments? If you do see improvements, let’s do the change.
>>
>>> Right now, the execution order of the transaction blocks in a transaction is
>>> 1. Delta transaction block to save the perm change and path change
>>> 2. The actual transaction
>>>
>>> We can change the execution order to be
>>> 1. The actual transaction
>>> 2. Delta transaction block to save the perm change and path change
>>>
>>> The benefit of this update is:
>>> The actual transaction has a chance to fail and its execution time may vary a lot.
>>
>> Why would it fail? Most of our transactions fail currently because of the
>> conflict on the primary key for delta updates which IMO is happening at
>> commit time.
>>
>>> The delta transaction has a very low chance to fail and its execution time is
>>> near constant. As a result, the changeID is set after the actual transaction is
>>> done.
>>
>> Hmm, I don’t get that. I think changeID is modified in the same
>> transaction, so the whole thing should complete.
>>
>>> A. If it is applied to approach 2.1) (manually increase changeID: read the
>>> current max value and increase it by 1), the time window exposed to key
>>> collision is small (the delta transaction execution time). Before, the time
>>> window was the sum of the delta transaction and the actual transaction. So key
>>> collisions should happen much less.
>>
>> Can you run some experiments and see whether this is the case or not? This
>> may or may not have some impact - difficult to tell.
>>
>> Thanks,
>>
>> - Sasha
>>
>>> B.
>>> If it is applied to approach 2.2) (auto-increment the changeID), an
>>> actual transaction failure does not create a permanent hole in changeID.
>>> The near-constant execution time of the delta transaction makes it rare to have a
>>> temporary hole. Without permanent holes, if there is a hole, it will be a
>>> temporary one. We should always wait for the hole to be filled and always
>>> send delta changes. If the hole is never filled, the missing transaction was
>>> likely purged without being received by HDFS.
>>>
>>> Thanks,
>>>
>>> Lina
>>>
>>> On Mon, Jul 24, 2017 at 1:03 PM, Na Li <lina...@cloudera.com> wrote:
>>>
>>>> Approach 2.2 uses auto-increment of changeID. When there is a
>>>> temporary hole at the start of the list, even sending a full snapshot could
>>>> skip the transaction in flight. It does not function correctly.
>>>>
>>>> For example, the list is 1,2,3,5,6, and the transaction with changeID 4 is
>>>> not done.
>>>>
>>>> First, Sentry returns 1,2,3.
>>>>
>>>> Next round, 4 is still not done, so Sentry gets the list 5,6 with a hole at
>>>> the front of the list. Sentry sends back the full snapshot with changeID 6.
>>>>
>>>> Then next time, HDFS will ask for changes after 6, and the transaction with
>>>> changeID 4 will be skipped even though it may have completed.
>>>>
>>>> On Mon, Jul 24, 2017 at 12:02 PM, Alexander Kolbasov <ak...@cloudera.com> wrote:
>>>>
>>>>> Lina,
>>>>>
>>>>> can you describe what problem you have discovered with approach 2.2?
>>>>>
>>>>> On Mon, Jul 24, 2017 at 6:37 PM, Na Li <lina...@cloudera.com> wrote:
>>>>>
>>>>>> Sasha,
>>>>>>
>>>>>> I realized a serious issue with approach 2.2): it does not function
>>>>>> correctly. So I re-iterate the approaches and their pros and cons.
>>>>>>
>>>>>> This leaves us with options 2.1) and 2.3). The question is whether the
>>>>>> performance of approach 2.1) is acceptable for now, so that we can work
>>>>>> on the long-term solution 2.3) later?
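The "continuous delta list" rule of approach 2.2 and the failure case just described can be sketched as follows (a minimal illustration with invented names, not the actual SentryStore code):

```python
def continuous_deltas(available, last_sent):
    """Return ("deltas", prefix) for the gap-free run right after last_sent,
    or ("full_snapshot", max_id) when the hole is at the front of the list."""
    pending = sorted(i for i in available if i > last_sent)
    if not pending or pending[0] != last_sent + 1:
        return ("full_snapshot", max(available))  # hole at the front
    prefix = [pending[0]]
    for i in pending[1:]:
        if i != prefix[-1] + 1:
            break                                 # stop at the first hole
        prefix.append(i)
    return ("deltas", prefix)

db = {1, 2, 3, 5, 6}                    # changeID 4 is still in flight
first = continuous_deltas(db, 0)        # -> ("deltas", [1, 2, 3])
second = continuous_deltas(db, 3)       # -> ("full_snapshot", 6)
# HDFS now resumes from 6, so a later commit of changeID 4 is skipped.
```

This makes the objection concrete: the full snapshot's sequence number (6) leapfrogs the in-flight changeID 4, which is then lost even if it eventually commits.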
>>>>>> >>>>>> 2.1) Find out why we have collision on a single sentry server with >>>>>> synchronization on saving updates. Once find the cause, fix it. >>>>>> + Follow existing approach. Does not introduce big change to the code >>>>> base. >>>>>> + Function correctly. There is no hole in changeID >>>>>> - Need time to investigate why synchronization at application level >> on a >>>>>> single sentry server does not prevent key collision. >>>>>> - Does not scale. All updates are serialized, not much concurrency. >>>>>> - Still have key collision exception and transaction failure when more >>>>> than >>>>>> one Sentry servers are deployed. >>>>>> - Transaction failure at collision increases time to execute a >>>>> transaction. >>>>>> - It is confusing to customer that there are transaction failure in >>>>> normal >>>>>> operation. Increase support cases >>>>>> >>>>>> 2.2) Auto-increment changeID and send delta changes as much as >> possible >>>>>> >>>>>> The patch that achieves 5 times or more performance increase than the >>>>>> current approach. >>>>>> >>>>>> It contains the following changes >>>>>> >>>>>> - revert sentry-1795 (so the changeID is auto-increment. This avoids >>>>> key >>>>>> collision. This is main reason we have performance improvement) >>>>>> - revert sentry-1824 (no need to synchronize when changeID is >>>>>> auto-increment) >>>>>> - get continuous delta list from SentryStore even when the delta >> list >>>>>> has hole (for example, the list is 1,2,3,5,6, return 1,2,3. If the >>>>> hole >>>>>> is >>>>>> at the front of list, return full snapshot) >>>>>> >>>>>> + Relative small changes. Verified working with good performance when >>>>> there >>>>>> is no transaction failure >>>>>> >>>>>> + When the hole in delta list is temporary (transaction in flight), >>>>> return >>>>>> the continuous delta list is effective to deal with the hole. Most >>>>> likely, >>>>>> the hole will disappear next time HDFS requests for changes. 
>>>>>>
>>>>>> - When there is a transaction failure (the hole in changeID is permanent),
>>>>>> it sends back a full snapshot, which is expensive. If we could detect a permanent
>>>>>> hole, then we wouldn't need to send the full snapshot, which is very expensive
>>>>>> and may exhaust memory for big customers.
>>>>>>
>>>>>> - When there is a temporary hole at the start of the list, even sending a
>>>>>> full snapshot could skip the transaction in flight. It does not function
>>>>>> correctly.
>>>>>>
>>>>>> For example, the list is 1,2,3,5,6, and the transaction with changeID 4 is
>>>>>> not done.
>>>>>>
>>>>>> First, Sentry returns 1,2,3.
>>>>>>
>>>>>> Next round, 4 is still not done, so Sentry gets the list 5,6 with a hole at
>>>>>> the front of the list. Sentry sends back the full snapshot with changeID 6.
>>>>>>
>>>>>> Then next time, HDFS will ask for changes after 6, and the transaction with
>>>>>> changeID 4 will be skipped even though it may have completed.
>>>>>>
>>>>>> 2.3) Use timestamps to sort the changes
>>>>>>
>>>>>> a) Use a timestamp, such as MSentryPermChange.createTimeMs or
>>>>>> MSentryPathChange.createTimeMs, to sort the entries. If there is more than
>>>>>> one entry with the same timestamp, use changeID to break the tie.
>>>>>> b) HDFS asks for updates using these timestamp values instead of changeID.
>>>>>> The Sentry server sends back changes at and after this timestamp. HDFS keeps
>>>>>> the list of changeIDs associated with the requested timestamp and skips
>>>>>> entries already processed. This handles the situation where more than
>>>>>> one entry has the same timestamp, and some were sent in the previous request
>>>>>> and some need to be sent in the next request.
>>>>>> c) changeID is the primary key, used only to uniquely identify the entry,
>>>>>> and not required to be sequential or consecutive.
>>>>>> d) Purge the change entries in the DB using timestamps instead of changeID.
>>>>>> For example, keep 3 polling intervals of entries to allow HDFS to get the
>>>>>> changes before they are purged.
>>>>>> + Sentry only sends a full snapshot to HDFS the first time HDFS
>>>>>> starts, and then always sends delta changes to HDFS. For path changes, if
>>>>>> Sentry gets a new full snapshot from HMS, it sends the full path snapshot.
>>>>>> + High concurrency. Scales well with a large number of clients.
>>>>>> - Relatively big code change for the API between the Sentry server and the
>>>>>> Sentry plugin in HDFS.
>>>>>> - No easy way to detect that HDFS has received and processed all updates.
>>>>>> - If a DB admin changes the DB server time backwards, this will mess up the
>>>>>> ordering using timestamps.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Lina
>>>>>>
>>>>>> On Fri, Jul 21, 2017 at 5:53 PM, Alexander Kolbasov <ak...@cloudera.com> wrote:
>>>>>>
>>>>>>> Lina, Thank you for sending a detailed summary!
>>>>>>>
>>>>>>>> On Jul 22, 2017, at 12:29 AM, Na Li <lina...@cloudera.com> wrote:
>>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> Can you provide your feedback on this issue? We need to make a decision soon.
>>>>>>>>
>>>>>>>> 1) Summary
>>>>>>>>
>>>>>>>> The current approach updates changeID (the primary key for permission change
>>>>>>>> and path change) manually. In a single transaction, the code reads the max
>>>>>>>> changeID from the DB, increases it by one, and saves the value in a new change
>>>>>>>> entry. If two threads are adding changes into the DB at the same time, a
>>>>>>>> collision happens (the primary key does not allow multiple entries having the
>>>>>>>> same value), and one transaction fails. Then the failed transaction goes
>>>>>>>> through multiple retries. If the retry count reaches the max value, the
>>>>>>>> transaction fails.
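The read path of approach 2.3 a)-b) might look like the following hedged sketch (invented names, not proposed code): entries are ordered by (createTimeMs, changeID), HDFS requests by timestamp, and HDFS-side dedup of already-seen changeIDs at the boundary timestamp handles ties split across requests.

```python
from collections import namedtuple

# Each change row carries its primary key and creation timestamp.
Change = namedtuple("Change", ["change_id", "create_time_ms"])

def changes_since(entries, since_ms, seen_ids_at_since):
    """Return changes at or after since_ms ordered by (timestamp, changeID),
    skipping changeIDs HDFS already processed at exactly since_ms."""
    picked = [e for e in entries
              if e.create_time_ms > since_ms
              or (e.create_time_ms == since_ms
                  and e.change_id not in seen_ids_at_since)]
    return sorted(picked, key=lambda e: (e.create_time_ms, e.change_id))

entries = [Change(9, 100), Change(7, 100), Change(8, 100), Change(12, 150)]
first = changes_since(entries, 100, set())      # ids 7, 8, 9, 12 in order
later = changes_since(entries, 100, {7, 8, 9})  # only the entry at t=150
```

Note the sketch assumes monotonic DB time; as pointed out below in the thread, a backwards clock change breaks this ordering.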
>>>>>>>>
>>>>>>>> In our stress testing on a single Sentry server, with 15 clients doing
>>>>>>>> grant/revoke operations concurrently, we saw multiple transaction failures,
>>>>>>>> and the exponential-back-off retry increases the latency of every
>>>>>>>> transaction in Sentry. We have a serious performance issue on saving
>>>>>>>> permission and path updates.
>>>>>>>
>>>>>>> How important is the performance issue? Do we expect a high rate of
>>>>>>> permission updates in realistic scenarios? Do we have any idea about the
>>>>>>> currently supported rate of permission updates?
>>>>>>>
>>>>>>> Note that in our single-Sentry-server case the code that updates the DB is
>>>>>>> serialized, and, thus, we shouldn’t see any collisions at the DB level at
>>>>>>> all. This means that there is some other bug hiding here, still
>>>>>>> undiscovered. The retries and exponential backoff shouldn’t apply in the
>>>>>>> single-server case at all.
>>>>>>>
>>>>>>>> 2) Potential solutions
>>>>>>>>
>>>>>>>> 2.1) Find out why we have collisions on a single Sentry server with
>>>>>>>> synchronization on saving updates. Once we find the cause, fix it.
>>>>>>>
>>>>>>> +1 on that
>>>>>>>
>>>>>>>> + Follows the existing approach. Does not introduce a big change to the code base.
>>>>>>>> - Needs time to investigate why synchronization at the application level on a
>>>>>>>> single Sentry server does not prevent key collisions.
>>>>>>>> - Does not scale. All updates are serialized, not much concurrency.
>>>>>>>> - Still has key collision exceptions and transaction failures when more
>>>>>>>> than one Sentry server is deployed.
>>>>>>>> - Transaction failure at collision increases the time to execute a transaction.
>>>>>>>> - It is confusing to customers that there are transaction failures in
>>>>>>>> normal operation.
>>>>>>>> Increases support cases.
>>>>>>>>
>>>>>>>> 2.2) CDH-transactionFailure.01-cdh5-1.5.1.patch
>>>>>>>> The patch achieves a 5-times-or-more performance increase over the
>>>>>>>> current approach.
>>>>>>>> It contains the following changes:
>>>>>>>> revert sentry-1795 (so the changeID is auto-incremented. This avoids key
>>>>>>>> collisions. This is the main reason we have a performance improvement)
>>>>>>>> revert sentry-1824 (no need to synchronize when changeID is
>>>>>>>> auto-incremented)
>>>>>>>> get a continuous delta list from SentryStore even when the delta list has a
>>>>>>>> hole (for example, if the list is 1,2,3,5,6, return 1,2,3. If the hole is at
>>>>>>>> the front of the list, return the full snapshot)
>>>>>>>
>>>>>>> Please document the algorithm you are suggesting in more detail. How do
>>>>>>> you propose to handle synchronization between the NN and the Sentry server
>>>>>>> using auto-incremented IDs?
>>>>>>>
>>>>>>> Performance is an important consideration once we achieve correctness. It
>>>>>>> would be great to have a *correct* solution that also performs well.
>>>>>>>> + Relatively small changes. Verified working with good performance when
>>>>>>>> there is no transaction failure.
>>>>>>>> + When the hole in the delta list is temporary (transaction in flight),
>>>>>>>> returning the continuous delta list is an effective way to deal with the hole.
>>>>>>>> Most likely, the hole will disappear the next time HDFS requests changes.
>>>>>>>>
>>>>>>>> - When there is a transaction failure (the hole in changeID is permanent),
>>>>>>>> it sends back a full snapshot, which is expensive. If we could detect a
>>>>>>>> permanent hole, then we wouldn't need to send the full snapshot, which is very
>>>>>>>> expensive and may exhaust memory for big customers.
>>>>>>>
>>>>>>> Right, sending a full snapshot is extremely expensive. How do you propose to
>>>>>>> detect such a permanent hole?
>>>>>>>> 2.3) Use timestamps to sort the changes
>>>>>>>>
>>>>>>>> a) Use a timestamp, such as MSentryPermChange.createTimeMs or
>>>>>>>> MSentryPathChange.createTimeMs, to sort the entries. If there is more than
>>>>>>>> one entry with the same timestamp, use changeID to break the tie.
>>>>>>>
>>>>>>> IMO we should use either timestamps or changeIDs, not both.
>>>>>>>
>>>>>>>> b) HDFS asks for updates using these timestamp values instead of changeID.
>>>>>>>
>>>>>>> I think that this approach may cause some updates to be lost.
>>>>>>>
>>>>>>>> c) changeID is the primary key, used only to uniquely identify the entry,
>>>>>>>> and not required to be sequential or consecutive.
>>>>>>>> d) Purge the change entries in the DB using timestamps instead of changeID.
>>>>>>>> For example, keep 3 polling intervals of entries to allow HDFS to get the
>>>>>>>> changes before they are purged.
>>>>>>>
>>>>>>> Suppose that you do this purging. Later you restart the NN and it asks for all
>>>>>>> updates since time T, but a bunch of updates since this time were purged,
>>>>>>> so they never reach the NN and are permanently lost. And it is impossible to
>>>>>>> detect that this happened.
>>>>>>>
>>>>>>>> + Sentry only sends a full snapshot to HDFS the first time HDFS
>>>>>>>> starts, and then always sends delta changes to HDFS.
>>>>>>>
>>>>>>> What if HDFS wasn’t able to talk to Sentry for some time and in the mean
>>>>>>> time some entries were purged?
>>>>>>>
>>>>>>>> + High concurrency. Scales well with a large number of clients.
>>>>>>>> - Relatively big code change for the API between the Sentry server and the
>>>>>>>> Sentry plugin in HDFS.
>>>>>>>> - No easy way to detect that HDFS has received and processed all updates.
>>>>>>>>
>>>>>>>> 3) Decisions
>>>>>>>>
>>>>>>>> My suggestion is that we take approach 2.2) for the short term and take the
>>>>>>>> hit of a full snapshot when there is a transaction failure.
>>>>>>>> And take approach 2.3) as the long-term solution.
>>>>>>>
>>>>>>> We can’t make the decision since we do not have enough information about
>>>>>>> approach 2.2 - how does it actually work? This may be a really good
>>>>>>> solution if we can prove that it works.
>>>>>>>
>>>>>>> - Sasha