[
https://issues.apache.org/jira/browse/IGNITE-12451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17200154#comment-17200154
]
Mirza Aliev edited comment on IGNITE-12451 at 9/22/20, 3:29 PM:
----------------------------------------------------------------
I see that your solution is based on a _tryLock_ idea, we also came to that
because real-time locks graph analyzing brought a significant performance drop.
I see that you introduced _hasLockCollisions_ method, but IMHO it gives a lot
of false-positive results, this method does not search for cycles in locks
acquisition order. Also, it does not guarantee that system won't hang because
there is a chance that the retry-lock-acquisition loop also will lead to
deadlocks.
So, I believe that the right way to handle deadlock and let the system continue
to work is the following: we need to throw an exception when _tryLock_ timeouts
and a user will see _CachePartialUpdateCheckedException_. This solution is
almost developed in my PR and will be ready in a few days.
[~alex_pl] what do you think?
was (Author: maliev):
I see that your solution is based on a _tryLock_ idea, we also came to that
because real-time locks graph analyzing brought a significant performance drop.
I see that you introduced _hasLockCollisions_ method, but IMHO it gives a lot
of false-positive results, this method does not search for cycles in locks
acquisition order. Also, it does not guarantee that system won't hang because
there is a chance that the retry-lock-acquisition loop also will lead to
deadlocks.
So, I believe that the right way to handle deadlock and let the system continue
to work is the following: we need to throw an exception when _tryLock_ timeouts
and a user will see _CachePartialUpdateCheckedException_. __ This solution is
almost developed in my PR and will be ready in a few days.
[~alex_pl] what do you think?
> Introduce deadlock detection for cache entry reentrant locks
> ------------------------------------------------------------
>
> Key: IGNITE-12451
> URL: https://issues.apache.org/jira/browse/IGNITE-12451
> Project: Ignite
> Issue Type: Improvement
> Affects Versions: 2.7.6
> Reporter: Ivan Rakov
> Assignee: Mirza Aliev
> Priority: Major
> Fix For: 2.10
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Aside from IGNITE-12365, we still have possible threat of cache-entry-level
> deadlock in case of careless usage of JCache mass operations (putAll,
> removeAll):
> 1. If two different user threads will perform putAll on the same two keys in
> reverse order (primary node for which is the same), there's a chance that
> sys-stripe threads will be deadlocked.
> 2. Even without direct contract violation from user side, HashMap can be
> passed as argument for putAll. Even if user threads have called mass
> operations with two keys in the same order, HashMap iteration order is not
> strictly defined, which may cause the same deadlock.
> Local deadlock detection should mitigate this issue. We can create a wrapper
> for ReentrantLock with logic that performs cycle detection in wait-for graph
> in case we are waiting for lock acquisition for too long. Exception will be
> thrown from one of the threads in such case, failing user operation, but
> letting the system make progress.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)