Lina,

> On Jul 24, 2017, at 10:20 PM, Na Li <lina...@cloudera.com> wrote:
>
> Sasha,
>
> I found the following approach that could improve the performance of the current approach and of the second approach a lot, with a minor code change.
How do you know that it improves performance? Do you see the improvement in your experiments? If you do see improvements, let's do the change.

> Right now, the execution order of the transaction blocks in a transaction is
> 1. Delta transaction block to save the perm change and path change
> 2. The actual transaction
>
> We can change the execution order to be
> 1. The actual transaction
> 2. Delta transaction block to save the perm change and path change
>
> The benefit of this update is:
> The actual transaction has a chance to fail, and its execution time may vary a lot.

Why would it fail? Most of our transactions fail currently because of the conflict on the primary key for delta updates, which IMO happens at commit time.

> The delta transaction has a very low chance to fail, and its execution time is near constant. As a result, the changeID is set after the actual transaction is done.

Hmm, I don't get that. I think changeID is modified in the same transaction, so the whole thing should complete.

> A. If it is applied to approach 2.1) (manually increase changeID: read the current max value and increase it by 1), the time window exposed to key collisions is small (the delta transaction execution time). Before, the time window is the sum of the delta transaction and the actual transaction. So key collisions should happen much less often.

Can you run some experiments and see whether this is the case or not? This may or may not have some impact - difficult to tell.

Thanks,

- Sasha

> B. If it is applied to approach 2.2) (auto-increment the changeID), an actual transaction failure does not create a permanent hole in changeID, and the near-constant execution time of the delta transaction makes a temporary hole rare. Without permanent holes, any hole that does appear is temporary. We should always wait for the hole to be filled and always send delta changes. If the hole is never filled, the missing transaction was most likely purged without being received by HDFS.
>
> Thanks,
>
> Lina
>
> On Mon, Jul 24, 2017 at 1:03 PM, Na Li <lina...@cloudera.com> wrote:
>
>> Approach 2.2 uses auto-increment of changeID. When there is a temporary hole at the start of the list, even sending a full snapshot could skip the transaction in flight. It does not function correctly.
>>
>> For example, the list is 1,2,3,5,6, and the transaction with changeID 4 is not done.
>>
>> First, Sentry returns 1,2,3.
>>
>> Next round, 4 is still not done, so Sentry gets the list 5,6 with a hole at the front of the list. Sentry sends back the full snapshot with changeID 6.
>>
>> Then next time, HDFS will ask for changes after 6, and the transaction with changeID 4 will be skipped even though it may have completed.
>>
>> On Mon, Jul 24, 2017 at 12:02 PM, Alexander Kolbasov <ak...@cloudera.com> wrote:
>>
>>> Lina,
>>>
>>> can you describe what problem you have discovered with approach 2.2?
>>>
>>> On Mon, Jul 24, 2017 at 6:37 PM, Na Li <lina...@cloudera.com> wrote:
>>>
>>>> Sasha,
>>>>
>>>> I realized a serious issue in approach 2.2): it does not function correctly. So I re-iterate the approaches and their pros and cons.
>>>>
>>>> This leaves us with options 2.1) and 2.3). The question is whether the performance of approach 2.1) is acceptable for now, so that we can work on the long-term solution 2.3) later.
>>>>
>>>> 2.1) Find out why we have collisions on a single Sentry server with synchronization on saving updates. Once we find the cause, fix it.
>>>> + Follows the existing approach. Does not introduce a big change to the code base.
>>>> + Functions correctly. There is no hole in changeID.
>>>> - Need time to investigate why synchronization at the application level on a single Sentry server does not prevent key collisions.
>>>> - Does not scale. All updates are serialized, not much concurrency.
>>>> - Still has key collision exceptions and transaction failures when more than one Sentry server is deployed.
>>>> - Transaction failure at collision increases the time to execute a transaction.
>>>> - It is confusing to customers that there are transaction failures in normal operation. Increases support cases.
>>>>
>>>> 2.2) Auto-increment changeID and send delta changes as much as possible
>>>>
>>>> This is the patch that achieves a 5x or more performance increase over the current approach.
>>>>
>>>> It contains the following changes:
>>>>
>>>> - revert SENTRY-1795 (so the changeID is auto-incremented; this avoids key collisions and is the main reason for the performance improvement)
>>>> - revert SENTRY-1824 (no need to synchronize when changeID is auto-incremented)
>>>> - get a continuous delta list from SentryStore even when the delta list has a hole (for example, if the list is 1,2,3,5,6, return 1,2,3; if the hole is at the front of the list, return a full snapshot)
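>>>>
>>>> The continuous-list logic is roughly the following (a simplified sketch over bare changeID values; the real SentryStore deals in MSentryPermChange/MSentryPathChange deltas and has its own full-snapshot path):
>>>>
>>>>   import java.util.ArrayList;
>>>>   import java.util.List;
>>>>
>>>>   // Longest gap-free prefix of the (sorted) changeIDs newer than 'lastSeen'.
>>>>   // An empty result means the hole is at the front of the list, so the
>>>>   // caller falls back to sending a full snapshot.
>>>>   static List<Long> continuousPrefix(List<Long> sortedChangeIds, long lastSeen) {
>>>>     List<Long> result = new ArrayList<>();
>>>>     long expected = lastSeen + 1;
>>>>     for (long id : sortedChangeIds) {
>>>>       if (id != expected) {
>>>>         break;      // e.g. 1,2,3,5,6 with lastSeen=0 returns only 1,2,3
>>>>       }
>>>>       result.add(id);
>>>>       expected++;
>>>>     }
>>>>     return result;
>>>>   }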
>>>>
>>>> + Relatively small change. Verified working with good performance when there is no transaction failure.
>>>>
>>>> + When the hole in the delta list is temporary (transaction in flight), returning the continuous delta list is an effective way to deal with the hole. Most likely, the hole will have disappeared the next time HDFS requests changes.
>>>>
>>>> - When there is a transaction failure (the hole in changeID is permanent), Sentry sends back a full snapshot, which is expensive. If we could detect a permanent hole, we would not need to send the full snapshot, which is very expensive and may exhaust memory for big customers.
>>>>
>>>> - When there is a temporary hole at the start of the list, even sending a full snapshot could skip the transaction in flight. It does not function correctly.
>>>>
>>>> For example, the list is 1,2,3,5,6, and the transaction with changeID 4 is not done.
>>>>
>>>> First, Sentry returns 1,2,3.
>>>>
>>>> Next round, 4 is still not done, so Sentry gets the list 5,6 with a hole at the front of the list. Sentry sends back the full snapshot with changeID 6.
>>>>
>>>> Then next time, HDFS will ask for changes after 6, and the transaction with changeID 4 will be skipped even though it may have completed.
>>>>
>>>> 2.3) Use timestamps to sort the changes
>>>>
>>>> a) Use a timestamp, such as MSentryPermChange.createTimeMs or MSentryPathChange.createTimeMs, to sort the entries. If more than one entry has the same timestamp, use changeID to break the tie.
>>>> b) HDFS asks for updates using these timestamp values instead of changeID. The Sentry server sends back changes at and after this timestamp. HDFS keeps the list of changeIDs associated with the requested timestamp and skips entries it has already processed. This handles the situation where more than one entry has the same timestamp, and some were sent in the previous request while others need to be sent in the next request.
>>>> c) changeID remains the primary key, used only to uniquely identify the entry; it is not required to be sequential or consecutive.
>>>> d) Purge the change entries in the DB using the timestamp instead of changeID. For example, keep 3 polling intervals' worth of entries to allow HDFS to fetch the changes before they are purged.
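>>>>
>>>> For a) and b), the ordering and the HDFS-side de-duplication would look roughly like this (illustrative sketch only; "Change" stands in for MSentryPermChange/MSentryPathChange and the parameter names are made up):
>>>>
>>>>   import java.util.ArrayList;
>>>>   import java.util.Comparator;
>>>>   import java.util.List;
>>>>   import java.util.Set;
>>>>
>>>>   static List<Change> newChanges(List<Change> atOrAfterLastTime, long lastTimeMs,
>>>>                                  Set<Long> processedIdsAtLastTime) {
>>>>     // a) order by createTimeMs, break ties with changeID
>>>>     atOrAfterLastTime.sort(Comparator.comparingLong(Change::getCreateTimeMs)
>>>>                                      .thenComparingLong(Change::getChangeId));
>>>>     List<Change> result = new ArrayList<>();
>>>>     for (Change c : atOrAfterLastTime) {
>>>>       // b) skip entries at the boundary timestamp that were already applied
>>>>       if (c.getCreateTimeMs() == lastTimeMs
>>>>           && processedIdsAtLastTime.contains(c.getChangeId())) {
>>>>         continue;
>>>>       }
>>>>       result.add(c);
>>>>     }
>>>>     return result;
>>>>   }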
>>>>
>>>> + Sentry only sends a full snapshot to HDFS the first time HDFS starts, and afterwards always sends delta changes to HDFS. For path changes, if Sentry gets a new full snapshot from HMS, it sends a full path snapshot.
>>>> + High concurrency. Scales well with a large number of clients.
>>>> - Relatively big code change for the API between the Sentry server and the Sentry plugin in HDFS.
>>>> - No easy way to detect that HDFS has received and processed all updates.
>>>> - If a DB admin changes the DB server time backwards, this will mess up the timestamp-based ordering.
>>>>
>>>> Thanks,
>>>>
>>>> Lina
>>>>
>>>> On Fri, Jul 21, 2017 at 5:53 PM, Alexander Kolbasov <ak...@cloudera.com> wrote:
>>>>
>>>>> Lina, Thank you for sending a detailed summary!
>>>>>
>>>>>> On Jul 22, 2017, at 12:29 AM, Na Li <lina...@cloudera.com> wrote:
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> Can you provide your feedback on this issue? We need to make a decision soon.
>>>>>>
>>>>>> 1) Summary
>>>>>>
>>>>>> The current approach updates changeID (the primary key for permission changes and path changes) manually. In a single transaction, the code reads the max changeID from the DB, increases it by one, and saves the value in the new change entry. If two threads are adding changes to the DB at the same time, a collision happens (the primary key does not allow multiple entries with the same value), and one transaction fails. The failed transaction then goes through multiple retries. If the retry count reaches the max value, the transaction fails for good.
>>>>>>
>>>>>> In our stress testing on a single Sentry server, with 15 clients doing grant/revoke operations concurrently, we saw multiple transaction failures, and the exponential-back-off retries increase the latency of every transaction in Sentry. We have a serious performance issue with saving permission and path updates.
>>>>>
>>>>> How important is the performance issue? Do we expect a high rate of permission updates in realistic scenarios? Do we have any idea about the currently supported rate of permission updates?
>>>>>
>>>>> Note that in our single Sentry server case the code that updates the DB is serialized, and, thus, we shouldn't see any collisions at the DB level at all. This means that there is some other bug hiding here, still undiscovered. The retries and exponential backoff shouldn't apply in the single-server case at all.
>>>>>
>>>>>> 2) Potential solutions
>>>>>>
>>>>>> 2.1) Find out why we have collisions on a single Sentry server with synchronization on saving updates. Once we find the cause, fix it.
>>>>>
>>>>> +1 on that
>>>>>
>>>>>> + Follows the existing approach. Does not introduce a big change to the code base.
>>>>>> - Need time to investigate why synchronization at the application level on a single Sentry server does not prevent key collisions.
>>>>>> - Does not scale. All updates are serialized, not much concurrency.
>>>>>> - Still has key collision exceptions and transaction failures when more than one Sentry server is deployed.
>>>>>> - Transaction failure at collision increases the time to execute a transaction.
>>>>>> - It is confusing to customers that there are transaction failures in normal operation. Increases support cases.
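>>>>>>
>>>>>> (For context, the manual allocation described in the summary is essentially the following. This is a simplified JDBC-style sketch with illustrative table and column names, not the actual JDO-based SentryStore code; two transactions running it concurrently can read the same MAX and then collide on the primary key.)
>>>>>>
>>>>>>   // java.sql.* assumed; 'conn' is an open Connection with auto-commit off,
>>>>>>   // 'changeJson' is the serialized delta.
>>>>>>   try (PreparedStatement max = conn.prepareStatement(
>>>>>>            "SELECT MAX(CHANGE_ID) FROM SENTRY_PERM_CHANGE");
>>>>>>        ResultSet rs = max.executeQuery()) {
>>>>>>     rs.next();
>>>>>>     long nextId = rs.getLong(1) + 1;   // read current max, add one
>>>>>>     try (PreparedStatement ins = conn.prepareStatement(
>>>>>>              "INSERT INTO SENTRY_PERM_CHANGE (CHANGE_ID, CREATE_TIME_MS, CHANGE_JSON) VALUES (?, ?, ?)")) {
>>>>>>       ins.setLong(1, nextId);          // primary key: a duplicate value fails the insert
>>>>>>       ins.setLong(2, System.currentTimeMillis());
>>>>>>       ins.setString(3, changeJson);
>>>>>>       ins.executeUpdate();
>>>>>>     }
>>>>>>   }
>>>>>>   conn.commit();   // one of two colliding writers fails here or at the insert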
>>>>>>
>>>>>> 2.2) CDH-transactionFailure.01-cdh5-1.5.1.patch
>>>>>> The patch that achieves a 5x or more performance increase over the current approach.
>>>>>> It contains the following changes:
>>>>>> revert SENTRY-1795 (so the changeID is auto-incremented; this avoids key collisions and is the main reason for the performance improvement)
>>>>>> revert SENTRY-1824 (no need to synchronize when changeID is auto-incremented)
>>>>>> get a continuous delta list from SentryStore even when the delta list has a hole (for example, if the list is 1,2,3,5,6, return 1,2,3; if the hole is at the front of the list, return a full snapshot)
>>>>>
>>>>> Please document the algorithm you are suggesting in more detail. How do you propose to handle synchronization between NN and Sentry server using auto-incremented IDs?
>>>>>
>>>>> Performance is an important consideration once we achieve correctness. It would be great to have a *correct* solution that also performs well.
>>>>>
>>>>>> + Relatively small change. Verified working with good performance when there is no transaction failure.
>>>>>> + When the hole in the delta list is temporary (transaction in flight), returning the continuous delta list is an effective way to deal with the hole. Most likely, the hole will have disappeared the next time HDFS requests changes.
>>>>>>
>>>>>> - When there is a transaction failure (the hole in changeID is permanent), Sentry sends back a full snapshot, which is expensive. If we could detect a permanent hole, we would not need to send the full snapshot, which is very expensive and may exhaust memory for big customers.
>>>>>
>>>>> Right, sending a full snapshot is extremely expensive. How do you propose to detect such a permanent hole?
>>>>>
>>>>>> 2.3) Use timestamps to sort the changes
>>>>>>
>>>>>> a) Use a timestamp, such as MSentryPermChange.createTimeMs or MSentryPathChange.createTimeMs, to sort the entries. If more than one entry has the same timestamp, use changeID to break the tie.
>>>>>
>>>>> IMO we should use either timestamps or changeIDs, not both.
>>>>>
>>>>>> b) HDFS asks for updates using these timestamp values instead of changeID.
>>>>>
>>>>> I think that this approach may cause some updates to be lost.
>>>>>
>>>>>> c) changeID remains the primary key, used only to uniquely identify the entry; it is not required to be sequential or consecutive.
>>>>>> d) Purge the change entries in the DB using the timestamp instead of changeID. For example, keep 3 polling intervals' worth of entries to allow HDFS to fetch the changes before they are purged.
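>>>>>>
>>>>>> (For d), the purge criterion would be roughly the following; names are illustrative and pollIntervalMs stands for the HDFS plugin's polling interval:)
>>>>>>
>>>>>>   // keep about 3 polling intervals of history, purge everything older
>>>>>>   long cutoffMs = System.currentTimeMillis() - 3 * pollIntervalMs;
>>>>>>   try (PreparedStatement purge = conn.prepareStatement(
>>>>>>            "DELETE FROM SENTRY_PERM_CHANGE WHERE CREATE_TIME_MS < ?")) {
>>>>>>     purge.setLong(1, cutoffMs);
>>>>>>     purge.executeUpdate();
>>>>>>   }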
>>>>>
>>>>> Suppose that you do this purging. Later you restart the NN and it asks for all updates since time T, but a bunch of updates since that time were purged, so they never reach the NN and are permanently lost. And it is impossible to detect that this happened.
>>>>>
>>>>>> + Sentry only sends a full snapshot to HDFS the first time HDFS starts, and afterwards always sends delta changes to HDFS.
>>>>>
>>>>> What if HDFS wasn't able to talk to Sentry for some time and in the meantime some entries were purged?
>>>>>
>>>>>> + High concurrency. Scales well with a large number of clients.
>>>>>> - Relatively big code change for the API between the Sentry server and the Sentry plugin in HDFS.
>>>>>> - No easy way to detect that HDFS has received and processed all updates.
>>>>>>
>>>>>> 3) Decisions
>>>>>>
>>>>>> My suggestion is that we take approach 2.2) for the short term and take the hit of a full snapshot when there is a transaction failure, and take approach 2.3) as the long-term solution.
>>>>>
>>>>> We can't make the decision since we do not have enough information about approach 2.2 - how does it actually work? This may be a really good solution if we can prove that it works.
>>>>>
>>>>> - Sasha