[
https://issues.apache.org/jira/browse/IGNITE-19142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17706112#comment-17706112
]
Roman Puchkovskiy commented on IGNITE-19142:
--------------------------------------------
Thanks!
> IncomingSnapshotCopier.cancel() blocks forever if called from multiple threads
> ------------------------------------------------------------------------------
>
> Key: IGNITE-19142
> URL: https://issues.apache.org/jira/browse/IGNITE-19142
> Project: Ignite
> Issue Type: Bug
> Reporter: Roman Puchkovskiy
> Assignee: Roman Puchkovskiy
> Priority: Major
> Labels: ignite-3
> Attachments: threads_report.txt
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> One of the test runs hung forever. Thread dump and process state analysis
> showed the following:
> # Some thread A invoked IncomingSnapshotCopier.cancel() (twice) on the same
> copier, which made its busyLock block any operations
> # Another thread B, trying to process an InstallSnapshotRequest, found that
> the leader has changed (probably, due to the cluster being shut down) and
> triggered 'interrupt download snapshots'
> # As a result, thread B called IncomingSnapshotCopier.cancel() on the
> same copier, but its busyLock.block() (which internally just takes a write
> lock) blocked this thread forever (as the lock was taken by thread A in step 1)
> # Before trying to cancel the snapshot downloading, thread B took
> Node.writeLock. As the thread is now blocked forever, it cannot release it,
> so every operation on the JRaft Node is blocked, including shutdown
> # So the cluster hangs forever on stop, making the tests hang forever as well
> We should make IncomingSnapshotCopier.cancel() idempotent even when called
> from different threads.
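
One possible way to get the idempotency asked for above is to guard the blocking step with an atomic flag, so that only the first caller ever touches the lock. The sketch below is an illustration only, not the actual Ignite code: the class CancellableCopier, the fields cancelGuard and blockCount, and the use of a plain ReentrantReadWriteLock in place of the real busyLock are all assumptions made for the example.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical stand-in for a snapshot copier; NOT the real IncomingSnapshotCopier. */
class CancellableCopier {
    /** Assumption: models a busy lock whose block() takes a write lock and never releases it. */
    private final ReentrantReadWriteLock busyLock = new ReentrantReadWriteLock();

    /** Hypothetical guard: flips to true exactly once, for the first cancel() caller. */
    private final AtomicBoolean cancelGuard = new AtomicBoolean();

    /** Hypothetical counter, used here only to observe how many times the lock was taken. */
    private final AtomicInteger blockCount = new AtomicInteger();

    void cancel() {
        // Only the thread that wins the CAS proceeds; every later call,
        // from any thread, returns immediately without touching the lock.
        if (!cancelGuard.compareAndSet(false, true)) {
            return;
        }

        // Taken once and intentionally never released, mimicking a "block" operation.
        busyLock.writeLock().lock();
        blockCount.incrementAndGet();

        // Cleanup of the in-progress download would go here.
    }

    int blockCount() {
        return blockCount.get();
    }

    public static void main(String[] args) throws InterruptedException {
        CancellableCopier copier = new CancellableCopier();

        copier.cancel();           // thread A, first call: takes the write lock
        copier.cancel();           // thread A, second call: no-op

        Thread threadB = new Thread(copier::cancel);
        threadB.start();
        threadB.join();            // returns promptly: thread B never waits on the lock

        System.out.println("blockCount = " + copier.blockCount()); // prints "blockCount = 1"
    }
}
```

Without the guard, thread B in this model would park on writeLock().lock() indefinitely, which matches the scenario in steps 1 to 3. One design caveat of this approach: a caller that loses the compare-and-set returns without waiting for the winner's cleanup to finish, so it fits only if callers do not rely on cancel() being complete when it returns.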