[ https://issues.apache.org/jira/browse/IGNITE-9322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16729634#comment-16729634 ]
Vladimir Ozerov edited comment on IGNITE-9322 at 12/27/18 2:16 PM:
-------------------------------------------------------------------

Hi. We conducted a review of the current design with [~Pavlukhin], [~gvvinblade], [~agoncharuk], [~amashenkov] and [~agura]. Here is the outcome of the discussion:
# We decided that deadlock detection should be triggered after a new "deadlock detection timeout" expires. It will be added to {{TransactionConfiguration}}. The default value of this timeout will be equal to the "failure detection timeout", which is 10 s.
# We will not trigger deadlock detection on the normal transaction timeout, as the current deadlock detector does. The reason is that our new algorithm does not know whether a deadlock has happened until it receives a probe with a cycle. As this may never happen, we would need to wait for some additional timeout *in addition* to the transaction timeout to be more or less confident that we are not in a deadlock, which is not convenient. An alternative is to notify the probe originator explicitly that the probe failed to find a deadlock, but unfortunately that is not possible with our algorithm either: a probe may travel a multi-way path in case of concurrent lock acquisition (e.g. {{GridNearTxQueryEnlistRequest}}), so we do not know exactly when probing is completely finished.
# We need to add a counter to the locking engine: when the next key is locked, the counter is advanced. The current value of the counter should be added to the probe. If a cycle is detected and the probe returns to us, we must double-check whether the original lock counter is still the same. If yes, start a rollback; otherwise, ignore the probe. This is needed to avoid rolling back a presumably deadlocked transaction which in reality is no longer in a deadlock. One case when this may happen is a concurrent timeout or rollback of another deadlocked transaction.
# When a probe travels between nodes, we must collect info about all transactions it passes through. This is needed to detect a cycle when the probe originator is not a part of the deadlock. E.g. consider a situation when nodes A, B, C are in a deadlock. Then node D tries to acquire a key locked by A and eventually triggers a probe. In this case it will travel (D -> A -> B -> C -> A). Info about all passed transactions is needed to stop the probe once we reach A for the second time.
# Info about a transaction we must collect:
#* {{xid}} - to know which transaction to kill
#* {{near_node_id}} - to know which node to ask for the kill
#* {{lock_counter}} - lock counter of the current transaction, to avoid false-positive rollbacks as explained in p.3
#* {{start_time}} - start time of the current transaction, to choose a victim as explained below
# The victim transaction should be chosen based on either a "youngest dies" or an "oldest dies" policy. We need to investigate the literature to find out which algorithm is better. In any case, the victim will be chosen via comparison of TX infos. First we compare {{start_time}}; if it is equal, we compare {{xid}}, as it is both comparable and unique inside a cluster. Note that due to clock drift this algorithm may choose a transaction which is not actually the youngest/oldest; it is a best-effort approach which will work correctly most of the time. But it has another critical property: given N different transactions (t1, t2 ... tN), every node will choose the same victim, thus avoiding killing more than one transaction.
# Consider an optimization: delay the probe start for a given {{(xid, lock_counter)}} pair if another probe passed through this pair recently. This way we will decrease the number of messages when several nodes participating in a deadlock want to start probing nearly simultaneously. To achieve this we must save the timestamp of the last observed probe for the given {{(xid, lock_counter)}}. When the counter is advanced, this timestamp is cleared.
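The probe bookkeeping in points 3-5 above could be sketched as follows. This is only a minimal illustration of the described protocol, not Ignite code: the class and field names ({{TxInfo}}, {{DeadlockProbe}}, {{Result}}) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

/** Hypothetical per-transaction info carried by a probe (point 5). */
class TxInfo {
    final UUID xid;         // which transaction to kill
    final UUID nearNodeId;  // which node to ask for the kill
    final long lockCounter; // guards against stale wait-for edges (point 3)
    final long startTime;   // used for victim selection (point 6)

    TxInfo(UUID xid, UUID nearNodeId, long lockCounter, long startTime) {
        this.xid = xid;
        this.nearNodeId = nearNodeId;
        this.lockCounter = lockCounter;
        this.startTime = startTime;
    }
}

/** A probe accumulates info of every transaction it passes through (point 4). */
class DeadlockProbe {
    enum Result {
        CONTINUE, // first visit: record the tx and forward the probe
        DEADLOCK, // revisit with a matching lock counter: a confirmed cycle
        STALE     // revisit, but the counter advanced: drop the probe
    }

    private final List<TxInfo> path = new ArrayList<>();

    /** Called when the probe reaches transaction {@code tx}. */
    Result visit(TxInfo tx) {
        for (TxInfo seen : path) {
            if (seen.xid.equals(tx.xid))
                return seen.lockCounter == tx.lockCounter
                    ? Result.DEADLOCK : Result.STALE;
        }
        path.add(tx);
        return Result.CONTINUE;
    }
}
```

Note that storing the full path is exactly what makes the (D -> A -> B -> C -> A) case of point 4 work: the probe stops at the second visit to A even though the originator D is not in the cycle.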
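The victim selection rule of point 6 ({{start_time}} first, {{xid}} as a tie-breaker) is essentially a total order over TX infos, which is what makes every node pick the same victim. A sketch with hypothetical names (a real implementation would compare the fields carried in the probe):

```java
import java.util.Comparator;
import java.util.UUID;

/** Hypothetical holder for the fields compared during victim selection. */
class TxVictimKey {
    final long startTime; // tx start time; clock drift is tolerated (best effort)
    final UUID xid;       // unique and comparable inside a cluster

    TxVictimKey(long startTime, UUID xid) {
        this.startTime = startTime;
        this.xid = xid;
    }

    /**
     * Total order: start_time first, xid as a tie-breaker. Because the order
     * is total, every node that sees the same set of transactions deterministically
     * picks the same victim, so at most one transaction in a cycle is killed.
     */
    static final Comparator<TxVictimKey> ORDER =
        Comparator.comparingLong((TxVictimKey k) -> k.startTime)
                  .thenComparing(k -> k.xid);

    /** "Youngest dies" policy: the greatest key (latest start) is the victim. */
    static TxVictimKey youngest(TxVictimKey a, TxVictimKey b) {
        return ORDER.compare(a, b) >= 0 ? a : b;
    }
}
```

An "oldest dies" policy would simply take the minimum under the same order; either way the determinism argument is unchanged.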
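The probe-suppression optimization of point 7 amounts to a map from {{(xid, lock_counter)}} to the last-observed-probe timestamp, cleared when the counter advances. A minimal sketch, with hypothetical names and a made-up suppression window passed in by the caller:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.UUID;

/** Delays probe starts for pairs a probe passed through recently (point 7). */
class ProbeThrottle {
    /** Hypothetical suppression window; a real value would need tuning. */
    private final long delayMs;

    private final Map<Key, Long> lastProbeTs = new HashMap<>();

    ProbeThrottle(long delayMs) {
        this.delayMs = delayMs;
    }

    /** Records that a probe passed through the pair at {@code now}. */
    void onProbeObserved(UUID xid, long lockCounter, long now) {
        lastProbeTs.put(new Key(xid, lockCounter), now);
    }

    /** The counter advanced: the old pair is gone, clear its timestamp. */
    void onLockCounterAdvanced(UUID xid, long oldLockCounter) {
        lastProbeTs.remove(new Key(xid, oldLockCounter));
    }

    /** Should this node start its own probe for the pair right now? */
    boolean mayStartProbe(UUID xid, long lockCounter, long now) {
        Long ts = lastProbeTs.get(new Key(xid, lockCounter));
        return ts == null || now - ts >= delayMs;
    }

    /** Map key: the (xid, lock_counter) pair. */
    private static final class Key {
        final UUID xid;
        final long lockCounter;

        Key(UUID xid, long lockCounter) {
            this.xid = xid;
            this.lockCounter = lockCounter;
        }

        @Override public boolean equals(Object o) {
            if (!(o instanceof Key))
                return false;
            Key k = (Key)o;
            return lockCounter == k.lockCounter && xid.equals(k.xid);
        }

        @Override public int hashCode() {
            return Objects.hash(xid, lockCounter);
        }
    }
}
```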
Future improvements which we consider out of scope of this ticket:
# Add SQL views and JMX beans to track the deadlock detection process: number of probes sent, number of probes received, number of detected deadlocks, etc.
# Add detailed deadlock info to the exception and/or node logs to simplify debugging for users. This may contain transaction labels, locked keys (which might be unsafe from a security perspective), locked statements, etc.
# Investigate other victim selection algorithms, e.g. "least/most acquired locks".
# It seems that we can reuse the same mechanism for "classical" caches and deprecate the existing deadlock detector. This is a good candidate for AI 3.0 scope.

> MVCC: implement deadlock detector
> ---------------------------------
>
>                 Key: IGNITE-9322
>                 URL: https://issues.apache.org/jira/browse/IGNITE-9322
>             Project: Ignite
>          Issue Type: Task
>          Components: mvcc
>            Reporter: Vladimir Ozerov
>            Assignee: Ivan Pavlukhin
>            Priority: Major
>
> Deadlocks are not uncommon during SQL execution.
> We need to implement a distributed deadlock detection protocol for MVCC. Essentially, nodes should exchange some map of tx wait lists and try to find a loop. If a loop is found, then one of the problematic transactions should be rolled back.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)