[ 
https://issues.apache.org/jira/browse/IGNITE-12523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey N. Gura updated IGNITE-12523:
------------------------------------
    Description: 
Hundreds of threads are building indexes. The checkpoint-thread tries to 
acquire the write lock but cannot, because some threads hold the read lock. 
Moreover, other threads are trying to acquire the read lock too. The failure 
types SYSTEM_WORKER_BLOCKED and SYSTEM_CRITICAL_OPERATION_TIMEOUT are ignored.

The checkpoint-thread is treated as a blocked critical system worker, so the 
failure processor takes a thread dump.

Threads that are waiting on the read lock report 
SYSTEM_CRITICAL_OPERATION_TIMEOUT and also trigger a thread dump.

Thread dump generation takes from 500 to 1000 ms.

All this activity leads to a stop-the-world pause and triggers other timeouts. 
It can take a long time because many threads are active, and half of that time 
is spent on thread dump generation.

The root cause here is the checkpoint read-write lock. After a discussion with 
[~agoncharuk], it seems that only an implementation of fuzzy checkpoint could 
solve the problem, but that requires a big effort.

*Solution*

- A new system property, IGNITE_DUMP_THREADS_ON_FAILURE_THROTTLING_TIMEOUT, is 
added. Its default value is the failure detection timeout.

- Each call of the FailureProcessor#process(FailureContext, FailureHandler) 
method checks the throttling timeout before generating a thread dump.

- There is no need to check whether the failure type is ignored. Throttling is 
useful in all cases where the context is not invalidated 
(FailureProcessor.failureCtx != null).

- For a throttled thread dump, an info message is logged: "Thread dump is hidden 
due to throttling settings. Set 
IGNITE_DUMP_THREADS_ON_FAILURE_THROTTLING_TIMEOUT property to 0 to see all 
thread dumps".
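
The throttling check described in the list above could look roughly like the 
following. This is only a minimal sketch, not the actual patch: the class 
ThreadDumpThrottle and its methods tryAcquire and fromSystemProperty are 
hypothetical names, and reading the property with Long.getLong is an 
assumption about how the value is obtained. Only the property name and the 
"0 disables throttling" behavior come from the description.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper illustrating thread dump throttling; not Ignite code. */
class ThreadDumpThrottle {
    /** Property name taken from the issue description. */
    static final String PROP = "IGNITE_DUMP_THREADS_ON_FAILURE_THROTTLING_TIMEOUT";

    /** Minimal interval between two thread dumps, in milliseconds. */
    private final long timeoutMs;

    /** Time of the last allowed dump; Long.MIN_VALUE means "no dump yet". */
    private final AtomicLong lastDumpMs = new AtomicLong(Long.MIN_VALUE);

    ThreadDumpThrottle(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    /**
     * Assumption: the value is read as a JVM system property and falls back
     * to the failure detection timeout when the property is not set.
     */
    static ThreadDumpThrottle fromSystemProperty(long failureDetectionTimeoutMs) {
        return new ThreadDumpThrottle(Long.getLong(PROP, failureDetectionTimeoutMs));
    }

    /**
     * @param nowMs Current time in milliseconds.
     * @return true if a thread dump may be generated now, false if throttled.
     */
    boolean tryAcquire(long nowMs) {
        // A timeout of 0 (or less) switches throttling off: every dump is allowed.
        if (timeoutMs <= 0)
            return true;

        while (true) {
            long last = lastDumpMs.get();

            // A dump was allowed less than timeoutMs ago: throttle this one.
            if (last != Long.MIN_VALUE && nowMs - last < timeoutMs)
                return false;

            // Only one of the concurrently failing threads wins the CAS;
            // the others re-read the timestamp and get throttled.
            if (lastDumpMs.compareAndSet(last, nowMs))
                return true;
        }
    }
}
```

A caller would check tryAcquire(System.currentTimeMillis()) before taking the 
dump and log the info message when it returns false. The compare-and-set loop 
is used so that hundreds of threads reporting a failure at the same moment 
produce a single dump per interval without taking an additional lock.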


> Continuously generated thread dumps in failure processor slow down the whole 
> system
> -----------------------------------------------------------------------------------
>
>                 Key: IGNITE-12523
>                 URL: https://issues.apache.org/jira/browse/IGNITE-12523
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Andrey N. Gura
>            Assignee: Andrey N. Gura
>            Priority: Major
>             Fix For: 2.9
>
>          Time Spent: 10m
>  Remaining Estimate: 0h