Sasha,

"How do you know that it improves performance? Do you see the improvement
in your experiments? If you do see improvements, let’s do the change."
[Lina] I have not run the test yet. Vamsee will run the test on the cluster
later today, about 30 minutes from now, for the current approach with the
transaction blocks re-ordered, so we will see.

"Why would it fail? Most of our transactions fail currently because of the
conflict on the primary key for delta updates which IMO is happening at
commit time."
[Lina] I know. We have three types of failure: failure of the whole
transaction (the one you mentioned), failure of the DeltaTransactionBlock,
and failure of the TransactionBlock. I am looking at the failures of the
individual transaction blocks, in addition to the key-conflict failure of
the whole transaction.

"Hmm, I don’t get that. I think changeID is modified in the same
transaction, so the whole thing should complete."
[Lina] I think the changeID value is set in the DeltaTransactionBlock after
reading max(changeID) from the DB. In my proposed approach, the order of
events is:
0) the transaction starts,
1) the PM executes the TransactionBlock,
2) the PM executes the DeltaTransactionBlock (where the changeID value is
set on the MSentryPathChange or MSentryPermChange instance),
3) everything is saved into the DB at transaction commit.

In the current approach, the order of events is:

0) the transaction starts,
1) the PM executes the DeltaTransactionBlock (where the changeID value is
set on the MSentryPathChange or MSentryPermChange instance),
2) the PM executes the TransactionBlock,
3) everything is saved into the DB at transaction commit.

Reducing the time between reading max(changeID) from the DB and the
transaction commit reduces the chance of a key conflict. That is the whole
point of re-ordering the blocks.
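
To make the proposed ordering concrete, here is a minimal sketch in Java.
The TransactionBlock interface and the executeReordered() wrapper below are
illustrative stand-ins based on the names used in this thread, not the
exact Sentry signatures:

    // Illustrative sketch only: runs the actual work first, then the
    // delta block, inside one JDO transaction.
    import javax.jdo.PersistenceManager;
    import javax.jdo.Transaction;

    interface TransactionBlock<T> {
      T execute(PersistenceManager pm) throws Exception;
    }

    final class ReorderedExecutor {
      static <T> T executeReordered(PersistenceManager pm,
                                    TransactionBlock<T> actualBlock,
                                    TransactionBlock<?> deltaBlock)
          throws Exception {
        Transaction tx = pm.currentTransaction();
        tx.begin();                           // 0) transaction starts
        try {
          T result = actualBlock.execute(pm); // 1) actual work: slow, may fail
          deltaBlock.execute(pm);             // 2) reads max(changeID) and sets
                                              //    it on the MSentryPermChange
                                              //    or MSentryPathChange instance
          tx.commit();                        // 3) saved to the DB at commit;
                                              //    the window between the
                                              //    max(changeID) read and the
                                              //    commit is now just step 2
          return result;
        } finally {
          if (tx.isActive()) {
            tx.rollback();                    // any failure aborts everything
          }
        }
      }
    }

In the current approach, steps 1) and 2) are simply swapped, so the
max(changeID) read happens before the slow, variable-length actual work.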



Thanks,

Lina

On Mon, Jul 24, 2017 at 3:52 PM, Alexander Kolbasov <ak...@cloudera.com>
wrote:

> Lina,
>
> > On Jul 24, 2017, at 10:20 PM, Na Li <lina...@cloudera.com> wrote:
> >
> > Sasha,
> >
> > I found the following approach that could improve the performance of the
> > current approach and of the second approach a lot with a minor code change.
>
> How do you know that it improves performance? Do you see the improvement
> in your experiments? If you do see improvements, let’s do the change.
>
> >
> > Right now, the execution order of the transaction blocks in a
> transaction is
> > 1. Delta transaction block to save the perm change and path change
> > 2. The actual transaction
> >
> > We can change the execution order to be
> > 1. The actual transaction
> > 2. Delta transaction block to save the perm change and path change
> >
> > The benefit of this update is:
> > The actual transaction has a chance to fail, and its execution time may
> > vary a lot.
>
> Why would it fail? Most of our transactions fail currently because of the
> conflict on the primary key for delta updates which IMO is happening at
> commit time.
>
> > The delta transaction has a very low chance to fail, and its execution
> > time is near constant. As a result, the changeID is set after the actual
> > transaction is done.
>
> Hmm, I don’t get that. I think changeID is modified in the same
> transaction, so the whole thing should complete.
>
> >
> > A. If it is applied to approach 2.1) (manually increase the changeID:
> > read the current max value and increase it by 1), the time window exposed
> > to key collision is small (just the delta transaction execution time).
> > Before, the time window was the sum of the delta transaction and actual
> > transaction times, so key collisions should happen much less often.
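> >
> > For reference, the manual increment in 2.1) looks roughly like this (an
> > illustrative sketch, not the exact SentryStore code; getMaxChangeID and
> > the change variable are stand-ins):
> >
> >   long nextId = getMaxChangeID(pm) + 1; // two threads can read the same max
> >   change.setChangeID(nextId);           // ...and then pick the same key
> >   pm.makePersistent(change);            // the PK conflict surfaces at commit
> >
> > With the re-ordering, only these steps sit between the max(changeID) read
> > and the commit.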
>
> Can you run some experiments and see whether this is the case or not? This
> may or may not have some impact - difficult to tell.
>
> Thanks,
>
> - Sasha
>
> >
> > B. If it is applied to approach 2.2) (auto-increment the changeID), an
> > actual transaction failure does not create a permanent hole in the
> > changeID sequence. The near-constant execution time of the delta
> > transaction makes temporary holes rare. Without permanent holes, any hole
> > that appears is temporary, so we should always wait for the hole to be
> > filled and always send delta changes. If a hole is never filled, the
> > missing transaction was likely purged without being received by HDFS.
> >
> > Thanks,
> >
> > Lina
> >
> >
> > On Mon, Jul 24, 2017 at 1:03 PM, Na Li <lina...@cloudera.com> wrote:
> >
> >> Approach 2.2 uses auto-increment for the changeID. When there is a
> >> temporary hole at the start of the list, even sending a full snapshot
> >> could skip a transaction in flight, so it does not function correctly.
> >>
> >>  For example, the list is 1,2,3,5,6, and the transaction with changeID 4
> >> is not done yet.
> >>
> >>  First, Sentry returns 1,2,3.
> >>
> >>  Next round, 4 is still not done, so Sentry gets the list 5,6 with a hole
> >> at the front of the list. Sentry sends back the full snapshot with
> >> changeID 6.
> >>
> >>  Then next time, HDFS will ask for changes after 6, and the transaction
> >> with changeID 4 will be skipped even though it may have completed.
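> >>
> >>  To make the continuous-delta-list logic concrete, a rough Java sketch
> >> (illustrative only; the real SentryStore code differs, and this assumes
> >> java.util.List and java.util.ArrayList):
> >>
> >>    // Keep only the contiguous prefix of the sorted changeIDs that were
> >>    // fetched (e.g. 1,2,3,5,6), starting right after the last ID HDFS
> >>    // has seen. An empty result with a non-empty input means a hole at
> >>    // the front, which triggers the full snapshot.
> >>    static List<Long> contiguousPrefix(List<Long> sortedIds, long lastSeen) {
> >>      List<Long> result = new ArrayList<>();
> >>      long expected = lastSeen + 1;
> >>      for (long id : sortedIds) {
> >>        if (id != expected) {
> >>          break;              // hole found; stop at the gap
> >>        }
> >>        result.add(id);
> >>        expected++;
> >>      }
> >>      return result;          // with lastSeen=0 and 1,2,3,5,6 -> 1,2,3
> >>    }
> >>
> >>  In the example above, the second round has lastSeen=3 and the list 5,6,
> >> so the prefix is empty and the full-snapshot path is taken, which is
> >> exactly where changeID 4 gets skipped.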
> >>
> >> On Mon, Jul 24, 2017 at 12:02 PM, Alexander Kolbasov <
> ak...@cloudera.com>
> >> wrote:
> >>
> >>> Lina,
> >>>
> >>> can you describe what problem have you discovered with approach 2.2?
> >>>
> >>>
> >>> On Mon, Jul 24, 2017 at 6:37 PM, Na Li <lina...@cloudera.com> wrote:
> >>>
> >>>> Sasha,
> >>>>
> >>>> I realized a serious issue with approach 2.2): it does not function
> >>>> correctly. So I reiterate the approaches and their pros and cons below.
> >>>>
> >>>> This leaves us with options 2.1) and 2.3). The question is whether the
> >>>> performance of approach 2.1) is acceptable for now, so that we can work
> >>>> on the long-term solution 2.3) later.
> >>>>
> >>>> 2.1) Find out why we have collisions on a single sentry server with
> >>>> synchronization on saving updates. Once we find the cause, fix it.
> >>>> + Follows the existing approach. Does not introduce a big change to the
> >>>> code base.
> >>>> + Functions correctly. There is no hole in the changeID sequence.
> >>>> - Needs time to investigate why application-level synchronization on a
> >>>> single sentry server does not prevent key collisions.
> >>>> - Does not scale. All updates are serialized, not much concurrency.
> >>>> - Still has key collision exceptions and transaction failures when more
> >>>> than one Sentry server is deployed.
> >>>> - A transaction failure at collision increases the time to execute a
> >>>> transaction.
> >>>> - It is confusing to customers that there are transaction failures in
> >>>> normal operation, which increases support cases.
> >>>>
> >>>> 2.2) Auto-increment changeID and send delta changes as much as
> possible
> >>>>
> >>>> The patch achieves a 5x or more performance increase over the current
> >>>> approach.
> >>>>
> >>>> It contains the following changes
> >>>>
> >>>>   - revert sentry-1795 (so the changeID is auto-incremented; this
> >>>>   avoids key collisions and is the main reason for the performance
> >>>>   improvement)
> >>>>   - revert sentry-1824 (no need to synchronize when the changeID is
> >>>>   auto-incremented)
> >>>>   - get a continuous delta list from SentryStore even when the delta
> >>>>   list has a hole (for example, if the list is 1,2,3,5,6, return 1,2,3;
> >>>>   if the hole is at the front of the list, return a full snapshot)
> >>>>
> >>>> + Relatively small changes. Verified to work with good performance when
> >>>> there are no transaction failures.
> >>>>
> >>>> + When the hole in the delta list is temporary (a transaction in
> >>>> flight), returning the continuous delta list deals with the hole
> >>>> effectively. Most likely, the hole will disappear by the next time HDFS
> >>>> requests changes.
> >>>>
> >>>> - When there is a transaction failure (the hole in the changeID
> >>>> sequence is permanent), Sentry sends back a full snapshot, which is
> >>>> expensive and may exhaust memory for big customers. If we could detect
> >>>> a permanent hole, we would not need to send the full snapshot.
> >>>>
> >>>> - When there is a temporary hole at the start of the list, even sending
> >>>> a full snapshot could skip a transaction in flight. It does not
> >>>> function correctly.
> >>>>
> >>>>  For example, the list is 1,2,3,5,6, and the transaction with changeID
> >>>> 4 is not done yet.
> >>>>
> >>>>  First, Sentry returns 1,2,3.
> >>>>
> >>>>  Next round, 4 is still not done, so Sentry gets the list 5,6 with a
> >>>> hole at the front of the list. Sentry sends back the full snapshot with
> >>>> changeID 6.
> >>>>
> >>>>  Then next time, HDFS will ask for changes after 6, and the transaction
> >>>> with changeID 4 will be skipped even though it may have completed.
> >>>>
> >>>> 2.3) Use timestamps to sort the changes
> >>>>
> >>>> a) Use a timestamp, such as MSentryPermChange.createTimeMs or
> >>>> MSentryPathChange.createTimeMs, to sort the entries. If more than one
> >>>> entry has the same timestamp, use the changeID to break the tie.
> >>>> b) HDFS asks for updates using these timestamp values instead of the
> >>>> changeID. The Sentry server sends back the changes at and after this
> >>>> timestamp. HDFS keeps the list of changeIDs associated with the
> >>>> requesting timestamp and skips entries it has already processed. This
> >>>> handles the situation where more than one entry has the same timestamp,
> >>>> some were sent in the previous request, and some need to be sent in the
> >>>> next request.
> >>>> c) The changeID remains the primary key, used only to uniquely identify
> >>>> an entry; it is not required to be sequential or consecutive.
> >>>> d) Purge the change entries in the DB using the timestamp instead of
> >>>> the changeID. For example, keep 3 polling intervals of entries to allow
> >>>> HDFS to get the changes before they are purged.
> >>>> + Sentry sends a full snapshot to HDFS only the first time HDFS starts,
> >>>> and then always sends delta changes. For path changes, if Sentry gets a
> >>>> new full snapshot from HMS, it sends a full path snapshot.
> >>>> + High concurrency. Scales well with a large number of clients.
> >>>> - Relatively big code change to the API between the sentry server and
> >>>> the sentry plugin in HDFS.
> >>>> - No easy way to detect that HDFS has received and processed all
> >>>> updates.
> >>>> - If a DB admin moves the DB server time backwards, it will mess up the
> >>>> ordering by timestamp.
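> >>>>
> >>>> The ordering in a) could be expressed as a comparator like the
> >>>> following sketch (assumes java.util.Comparator; the long-valued getter
> >>>> names on MSentryPermChange are assumed here, not taken from the actual
> >>>> model classes):
> >>>>
> >>>>   // Sort by createTimeMs first; break ties with the changeID.
> >>>>   Comparator<MSentryPermChange> DELIVERY_ORDER =
> >>>>       Comparator.comparingLong(MSentryPermChange::getCreateTimeMs)
> >>>>                 .thenComparingLong(MSentryPermChange::getChangeID);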
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Lina
> >>>>
> >>>> On Fri, Jul 21, 2017 at 5:53 PM, Alexander Kolbasov <
> ak...@cloudera.com
> >>>>
> >>>> wrote:
> >>>>
> >>>>> Lina, Thank you for sending a detailed summary!
> >>>>>
> >>>>>
> >>>>>> On Jul 22, 2017, at 12:29 AM, Na Li <lina...@cloudera.com <mailto:
> >>>>> lina...@cloudera.com>> wrote:
> >>>>>>
> >>>>>> Hello,
> >>>>>>
> >>>>>> Can you provide your feedback on this issue? We need to make
> >>> decision
> >>>>> soon.
> >>>>>>
> >>>>>> 1) Summary
> >>>>>>
> >>>>>> The current approach updates the changeID (the primary key for
> >>>>>> permission changes and path changes) manually. In a single
> >>>>>> transaction, the code reads the max changeID from the DB, increases
> >>>>>> it by one, and saves the value in the new change entry. If two
> >>>>>> threads are adding changes to the DB at the same time, a collision
> >>>>>> happens (the primary key does not allow multiple entries with the
> >>>>>> same value), and one transaction fails. The failed transaction then
> >>>>>> goes through multiple retries. If the retry count reaches the
> >>>>>> maximum, the transaction fails permanently.
> >>>>>>
> >>>>>> In our stress testing on a single sentry server, with 15 clients
> >>>>>> doing grant/revoke operations concurrently, we saw multiple
> >>>>>> transaction failures, and the exponential-backoff retry increases the
> >>>>>> latency of every transaction in sentry. We have a serious performance
> >>>>>> issue with saving permission and path updates.
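> >>>>>>
> >>>>>> The retry in question follows the usual exponential-backoff pattern,
> >>>>>> roughly like this sketch (illustrative, not the actual Sentry retry
> >>>>>> code; the initial delay and the exception type are assumptions, and
> >>>>>> it assumes java.util.concurrent.Callable and
> >>>>>> javax.jdo.JDODataStoreException):
> >>>>>>
> >>>>>>   static <T> T withRetries(Callable<T> tx, int maxRetries)
> >>>>>>       throws Exception {
> >>>>>>     long delayMs = 100;           // assumed initial backoff
> >>>>>>     for (int i = 0; i < maxRetries; i++) {
> >>>>>>       try {
> >>>>>>         return tx.call();         // may hit the PK conflict at commit
> >>>>>>       } catch (JDODataStoreException e) {
> >>>>>>         Thread.sleep(delayMs);    // every retry adds latency
> >>>>>>         delayMs *= 2;             // exponential backoff
> >>>>>>       }
> >>>>>>     }
> >>>>>>     throw new Exception("retry count reached the maximum");
> >>>>>>   }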
> >>>>>
> >>>>> How important is the performance issue? Do we expect high rate of
> >>>>> permission updates in realistic scenarios? Do we have any idea about
> >>>>> current supported rate of permission updates?
> >>>>>
> >>>>> Note that in our single sentry server case the code that updates the
> >>> DB
> >>>> is
> >>>>> serialized, and, thus, we shouldn’t see any collisions at the DB
> >>> level at
> >>>>> all. This means that there is some other bug hiding here, still
> >>>>> undiscovered. The retries and exponential backoff shouldn’t apply in
> >>> the
> >>>>> single-server case at all.
> >>>>>
> >>>>>>
> >>>>>> 2) Potential solutions
> >>>>>>
> >>>>>> 2.1) Find out why we have collisions on a single sentry server with
> >>>>>> synchronization on saving updates. Once we find the cause, fix it.
> >>>>>
> >>>>> +1 on that
> >>>>>
> >>>>>> + Follows the existing approach. Does not introduce a big change to
> >>>>>> the code base.
> >>>>>> - Needs time to investigate why application-level synchronization on
> >>>>>> a single sentry server does not prevent key collisions.
> >>>>>> - Does not scale. All updates are serialized, not much concurrency.
> >>>>>> - Still has key collision exceptions and transaction failures when
> >>>>>> more than one Sentry server is deployed.
> >>>>>> - A transaction failure at collision increases the time to execute a
> >>>>>> transaction.
> >>>>>> - It is confusing to customers that there are transaction failures in
> >>>>>> normal operation, which increases support cases.
> >>>>>>
> >>>>>> 2.2) CDH-transactionFailure.01-cdh5-1.5.1.patch
> >>>>>> The patch achieves a 5x or more performance increase over the
> >>>>>> current approach.
> >>>>>> It contains the following changes:
> >>>>>> revert sentry-1795 (so the changeID is auto-incremented; this avoids
> >>>>>> key collisions and is the main reason for the performance improvement)
> >>>>>> revert sentry-1824 (no need to synchronize when the changeID is
> >>>>>> auto-incremented)
> >>>>>> get a continuous delta list from SentryStore even when the delta list
> >>>>>> has a hole (for example, if the list is 1,2,3,5,6, return 1,2,3; if
> >>>>>> the hole is at the front of the list, return a full snapshot)
> >>>>>
> >>>>> Please document the algorithm you are suggesting in more detail. How
> >>> do
> >>>>> you propose to handle synchronization between NN and Sentry server
> >>> using
> >>>>> auto-incremented IDs?
> >>>>>
> >>>>> Performance is an important consideration once we achieve
> >>>>> correctness. It would be great to have a *correct* solution that also
> >>>>> performs well.
> >>>>>> + Relatively small changes. Verified to work with good performance
> >>>>>> when there are no transaction failures.
> >>>>>> + When the hole in the delta list is temporary (a transaction in
> >>>>>> flight), returning the continuous delta list deals with the hole
> >>>>>> effectively. Most likely, the hole will disappear by the next time
> >>>>>> HDFS requests changes.
> >>>>>>
> >>>>>> - When there is a transaction failure (the hole in the changeID
> >>>>>> sequence is permanent), Sentry sends back a full snapshot, which is
> >>>>>> expensive and may exhaust memory for big customers. If we could
> >>>>>> detect a permanent hole, we would not need to send the full snapshot.
> >>>>>>
> >>>>>
> >>>>> Right, sending a full snapshot is extremely expensive. How do you
> >>>>> propose to detect such a permanent hole?
> >>>>>> 2.3) Use timestamps to sort the changes
> >>>>>>
> >>>>>> a) use a timestamp, such as MSentryPermChange.createTimeMs or
> >>>>>> MSentryPathChange.createTimeMs, to sort the entries. If more than one
> >>>>>> entry has the same timestamp, use the changeID to break the tie.
> >>>>>
> >>>>> IMO we should use either timestamps or changeIDs, not both.
> >>>>>
> >>>>>> b) HDFS asks for updates using these timestamp values instead of
> >>>>> changeID.
> >>>>>
> >>>>> I think that this approach may cause some updates to be lost.
> >>>>>
> >>>>>
> >>>>>> c) the changeID is the primary key, used only to uniquely identify
> >>>>>> an entry; it is not required to be sequential or consecutive.
> >>>>>> d) purge the change entries in the DB using the timestamp instead of
> >>>>>> the changeID. For example, keep 3 polling intervals of entries to
> >>>>>> allow HDFS to get the changes before they are purged.
> >>>>>
> >>>>> Suppose that you do this purging. Later you restart NN and it asks
> for
> >>>> all
> >>>>> updates since time T, but a bunch of updates since this time were
> >>> purged,
> >>>>> so they never reach NN and are permanently lost. And it is impossible
> >>> to
> >>>>> detect that this happened.
> >>>>>
> >>>>>> + Sentry sends a full snapshot to HDFS only the first time HDFS
> >>>>>> starts, and then always sends delta changes to HDFS
> >>>>>
> >>>>> What if HDFS wasn’t able to talk to Sentry for some time and in the
> >>>>> meantime some entries were purged?
> >>>>>
> >>>>>
> >>>>>> + High concurrency. Scales well with a large number of clients.
> >>>>>> - Relatively big code change to the API between the sentry server and
> >>>>>> the sentry plugin in HDFS.
> >>>>>> - No easy way to detect that HDFS has received and processed all
> >>>>>> updates.
> >>>>>>
> >>>>>> 3) Decisions
> >>>>>>
> >>>>>> My suggestion is that we take approach 2.2) for the short term and
> >>>>>> take the hit of a full snapshot when there is a transaction failure,
> >>>>>> and take approach 2.3) as the long-term solution.
> >>>>>
> >>>>> We can’t make the decision since we do not have enough information
> >>> about
> >>>>> approach 2.2 - how does it actually work? This may be a really good
> >>>>> solution if we can prove that it works.
> >>>>>
> >>>>> - Sasha
> >>>>
> >>>
> >>
> >>
>
>
