[ 
https://issues.apache.org/jira/browse/IGNITE-19239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilya Shishkov updated IGNITE-19239:
-----------------------------------
    Description: 
Error messages about checkpoint read lock acquisition timeouts and blocked system-critical threads may appear during the snapshot restore process (just after the caches start):
{quote} 
[2023-04-06T10:55:46,561][ERROR]\[ttl-cleanup-worker-#475%node%][CheckpointTimeoutLock]
 Checkpoint read lock acquisition has been timed out. 
{quote} 

{quote} 
[2023-04-06T10:55:47,487][ERROR]\[tcp-disco-msg-worker-[crd]\-#23%node%\-#446%node%][G]
 Blocked system-critical thread has been detected. This can lead to 
cluster-wide undefined behaviour \[workerName=db-checkpoint-thread, 
threadName=db-checkpoint-thread-#457%snapshot.BlockingThreadsOnSnapshotRestoreReproducerTest0%,
 {color:red}blockedFor=100s{color}] 
{quote} 

There is also an active exchange process, which finishes with timings like the following (the timings are approximately equal to the blocking time of the threads): 
{quote} 
[2023-04-06T10:55:52,211][INFO 
]\[exchange-worker-#450%node%][GridDhtPartitionsExchangeFuture] Exchange 
timings [startVer=AffinityTopologyVersion [topVer=1, minorTopVer=5], 
resVer=AffinityTopologyVersion [topVer=1, minorTopVer=5], stage="Waiting in 
exchange queue" (0 ms), ..., stage="Restore partition states" 
({color:red}100163 ms{color}), ..., stage="Total time" ({color:red}100334 
ms{color})] 
{quote} 
 
Usually, such errors and long-lasting thread blocking indicate that the cluster 
is in an emergency state or will crash very soon.

So, there are two possible ways to solve the problem:
# If these errors do not affect restoring from a snapshot and are false positives, they are merely confusing, so we should remove them from the logs.
# If these errors are not false positives, their root cause has to be investigated and fixed.

 

How to reproduce:
 # Set the checkpoint frequency to a value less than the failure detection timeout.
 # Ensure that restoring the partition states of the cache groups takes longer than the failure detection timeout, i.e. this is relevant for sufficiently large caches (see the configuration sketch below).

Reproducer: [^BlockingThreadsOnSnapshotRestoreReproducerTest.patch]
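
For illustration only, below is a minimal sketch (not the attached reproducer) of a node configuration and restore call that match the reproduction conditions above: the checkpoint frequency is set well below the failure detection timeout and persistence is enabled. The class name, snapshot name and concrete timeout values are assumptions, not taken from the patch.
{code:java}
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cluster.ClusterState;
import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class SnapshotRestoreBlockingSketch {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration()
            // Failure detection timeout deliberately larger than the checkpoint frequency.
            .setFailureDetectionTimeout(120_000L)
            .setDataStorageConfiguration(new DataStorageConfiguration()
                // Step 1: checkpoint frequency below the failure detection timeout.
                .setCheckpointFrequency(3_000L)
                .setDefaultDataRegionConfiguration(new DataRegionConfiguration()
                    .setPersistenceEnabled(true)));

        try (Ignite ignite = Ignition.start(cfg)) {
            ignite.cluster().state(ClusterState.ACTIVE);

            // Step 2: restore a previously created snapshot ("mySnapshot" is a placeholder).
            // With sufficiently large caches the "Restore partition states" stage is expected
            // to exceed the failure detection timeout, reproducing the errors quoted above.
            ignite.snapshot().restoreSnapshot("mySnapshot", null).get();
        }
    }
}
{code}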


> Checkpoint read lock acquisition timeouts during snapshot restore
> -----------------------------------------------------------------
>
>                 Key: IGNITE-19239
>                 URL: https://issues.apache.org/jira/browse/IGNITE-19239
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Ilya Shishkov
>            Priority: Minor
>              Labels: iep-43, ise
>         Attachments: BlockingThreadsOnSnapshotRestoreReproducerTest.patch
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
