Andrey N. Gura created IGNITE-12523:
---------------------------------------
Summary: Continuously generated thread dumps in failure processor
slow down the whole system
Key: IGNITE-12523
URL: https://issues.apache.org/jira/browse/IGNITE-12523
Project: Ignite
Issue Type: Improvement
Reporter: Andrey N. Gura
Assignee: Andrey N. Gura
Fix For: 2.9
A lot of threads (hundreds) build indexes. The checkpoint-thread tries to
acquire the write lock but can't, because some threads hold the read lock.
Moreover, some other threads are trying to acquire the read lock too. Failure
types SYSTEM_WORKER_BLOCKED and SYSTEM_CRITICAL_OPERATION_TIMEOUT are ignored.
The checkpoint-thread is treated as a blocked critical system worker, so the
failure processor generates a thread dump.
Threads that are waiting on the read lock report
SYSTEM_CRITICAL_OPERATION_TIMEOUT, and each report also generates a thread dump.
Thread dump generation takes from 500 to 1000 ms.
All this activity leads to a stop-the-world pause and triggers other timeouts.
It can last a long time because many threads are active and half of the time is
spent on thread dump generation.
The root cause here is the checkpoint read-write lock. Discussed with
Alexey Goncharuk ([~agoncharuk]), and it seems that only an implementation of
fuzzy checkpoints could solve the problem. But that requires a big effort.
*Solution*
Andrey Gura
December 20, 2019, 3:18 PM
Final solution and implementation:
- New system property IGNITE_DUMP_THREADS_ON_FAILURE_THROTTLING_TIMEOUT added.
Default value is failure detection timeout.
- Each call of the FailureProcessor#process(FailureContext, FailureHandler)
method checks the throttling timeout before thread dump generation.
- There is no need to check that failure type is ignored. Throttling will be
useful for all cases when context is not invalidated
(FailureProcessor.failureCtx != null).
- For a throttled thread dump we log the info message “Thread dump is hidden due
to throttling settings. Set IGNITE_DUMP_THREADS_ON_FAILURE_THROTTLING_TIMEOUT
property to 0 to see all thread dumps”.
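
The throttling check described above can be sketched roughly as follows. This
is only an illustration under stated assumptions, not the code committed to
Ignite: the class ThreadDumpThrottle, its methods and the injected clock are
hypothetical names; only the property name and the log message come from this
issue. The default timeout is passed in by the caller, since the issue only
says it equals the failure detection timeout.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Hypothetical helper illustrating the throttling idea; not Ignite's actual code. */
final class ThreadDumpThrottle {
    /** Property name taken from the issue text. */
    static final String TIMEOUT_PROP = "IGNITE_DUMP_THREADS_ON_FAILURE_THROTTLING_TIMEOUT";

    /** Info message taken from the issue text. */
    static final String THROTTLED_MSG = "Thread dump is hidden due to throttling settings. Set "
        + TIMEOUT_PROP + " property to 0 to see all thread dumps";

    /** Marker meaning that no thread dump has been generated yet. */
    private static final long NEVER = Long.MIN_VALUE;

    private final long timeoutMs;
    private final LongSupplier clock;
    private final AtomicLong lastDumpTs = new AtomicLong(NEVER);

    /** The clock is injected (an assumption of this sketch) so the logic can be tested. */
    ThreadDumpThrottle(long timeoutMs, LongSupplier clock) {
        this.timeoutMs = timeoutMs;
        this.clock = clock;
    }

    /** Reads the timeout from the system property, falling back to the given default. */
    static ThreadDumpThrottle fromSystemProperty(long dfltTimeoutMs) {
        long timeout = Long.getLong(TIMEOUT_PROP, dfltTimeoutMs);

        return new ThreadDumpThrottle(timeout, System::currentTimeMillis);
    }

    /**
     * Returns true if the caller may generate a thread dump now,
     * false if the dump should be skipped and THROTTLED_MSG logged instead.
     */
    boolean tryAcquire() {
        // A timeout of 0 (or less) disables throttling: every dump is allowed.
        if (timeoutMs <= 0)
            return true;

        long now = clock.getAsLong();
        long last = lastDumpTs.get();

        // The previous dump is too recent.
        if (last != NEVER && now - last < timeoutMs)
            return false;

        // Only one of several concurrent callers wins the right to dump.
        return lastDumpTs.compareAndSet(last, now);
    }
}
```

A caller in the failure processor would then do something like: if
tryAcquire() returns true, generate the thread dump; otherwise log
THROTTLED_MSG at info level. The compare-and-set is one possible way to make
sure that hundreds of threads reporting a timeout at the same moment produce
only one dump per throttling interval.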
--
This message was sent by Atlassian Jira
(v8.3.4#803005)