[
https://issues.apache.org/jira/browse/IGNITE-15398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17409526#comment-17409526
]
Mirza Aliev edited comment on IGNITE-15398 at 9/7/21, 7:31 AM:
---------------------------------------------------------------
I've investigated and found the following:
* The {{JRaft-Request-Processor}} leak was fixed earlier; currently only
{{NioEventLoopGroup}} threads leak
* The {{NioEventLoopGroup}} from {{ClientHandlerModule}} wasn't shut down
properly because we missed the module in the stop flow
* The {{NioEventLoopGroup}} from {{RestModule}} wasn't shut down properly
because {{RestModule#stop}} didn't contain a mechanism for stopping the Netty
channel
* In general, the OOM is not connected to the thread leakage. The main reason
is that we do not stop {{MetaStorageManager}} properly: namely, we do not stop
{{MetaStorageServiceImpl$WatchProcessor$Watcher}}, hence a lot of objects are
held by the {{Watcher}} thread and the GC can't collect them. Ticket for
stopping {{MetaStorageManager}}:
https://issues.apache.org/jira/browse/IGNITE-15444
!screenshot-1.png! !screenshot-2.png!
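
The stop-flow problem from the list above can be sketched roughly as follows. This is a minimal JDK-only illustration, not the actual Ignite code: {{Component}}, {{EventLoopComponent}} and {{stopAll}} are hypothetical names, and a plain {{ExecutorService}} stands in for a Netty {{NioEventLoopGroup}}. The idea is that every started component has to be registered in the stop flow, and its {{stop}} has to shut down the thread pool it owns.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class StopFlowSketch {
    /** Hypothetical lifecycle contract; not an Ignite interface. */
    interface Component {
        void start();
        void stop() throws Exception;
    }

    /** Owns a thread pool, standing in for a module that owns a NioEventLoopGroup. */
    static class EventLoopComponent implements Component {
        private ExecutorService pool;

        @Override public void start() {
            pool = Executors.newFixedThreadPool(2);
            // Submit a no-op task so that at least one worker thread is created.
            pool.submit(() -> { });
        }

        @Override public void stop() throws Exception {
            // Without this shutdown the worker threads stay parked after "stop".
            pool.shutdown();
            if (!pool.awaitTermination(5, TimeUnit.SECONDS))
                pool.shutdownNow();
        }

        boolean isTerminated() {
            return pool != null && pool.isTerminated();
        }
    }

    /** Stops components in reverse start order; a component missing from the list leaks. */
    static void stopAll(List<? extends Component> started) throws Exception {
        for (int i = started.size() - 1; i >= 0; i--)
            started.get(i).stop();
    }

    public static void main(String[] args) throws Exception {
        List<EventLoopComponent> started = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            EventLoopComponent c = new EventLoopComponent();
            c.start();
            started.add(c);
        }
        stopAll(started);
        for (EventLoopComponent c : started)
            System.out.println("terminated=" + c.isTerminated());
    }
}
```

With Netty itself, the analogous shutdown call on an event loop group is {{EventLoopGroup#shutdownGracefully()}}.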
> NioEventLoopGroup threads leakage
> ---------------------------------
>
> Key: IGNITE-15398
> URL: https://issues.apache.org/jira/browse/IGNITE-15398
> Project: Ignite
> Issue Type: Bug
> Reporter: Andrey Mashenkov
> Assignee: Mirza Aliev
> Priority: Blocker
> Labels: ignite-3
> Fix For: 3.0.0-alpha3
>
> Attachments: screenshot-1.png, screenshot-2.png
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> I ran a simple test and hit an OOM in 7 of 100 iterations.
> Thread leakage seems to be the reason.
> Use the JVM arg `-Xmx512M` to run the test; otherwise more iterations may be
> required.
> {code:java}
> @RepeatedTest(100)
> public void nodeRestart100Test() throws Exception {
>     // Start the nodes, then stop them in reverse start order.
>     List<Ignite> grid = startGrid();
>     IgniteUtils.closeAll(Lists.reverse(grid));
> }
> {code}
> The thread dump shows a huge number of parked NioEventLoopGroup and
> JRaft-Request-Processor threads.
> Further investigation shows that most of the NioEventLoopGroup threads are
> acceptor threads created in the startEndpoint() method of the RestModule and
> ClientModule classes.
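
One way to check for a leak like the one described above, without reading a full thread dump, is to count live threads by name prefix before and after a restart cycle. The following is a minimal JDK-only sketch: the class and method names are hypothetical, and the `demo-loop-` prefix is an illustrative assumption, so substitute the prefix that appears in your own thread dump.

```java
import java.util.concurrent.CountDownLatch;

public class ThreadLeakCheck {
    /** Counts live threads whose name starts with the given prefix. */
    static long countThreads(String prefix) {
        return Thread.getAllStackTraces().keySet().stream()
            .filter(Thread::isAlive)
            .filter(t -> t.getName().startsWith(prefix))
            .count();
    }

    public static void main(String[] args) throws Exception {
        String prefix = "demo-loop-"; // assumed prefix, for the demo only
        long before = countThreads(prefix);

        // Simulate a component whose thread is still blocked because it was never stopped.
        CountDownLatch release = new CountDownLatch(1);
        Thread leaked = new Thread(() -> {
            try {
                release.await();
            } catch (InterruptedException ignored) {
                // Exit on interrupt.
            }
        }, prefix + "1");
        leaked.start();

        System.out.println("extra threads while blocked: " + (countThreads(prefix) - before));

        // Simulate a proper stop: release the thread and wait for it to exit.
        release.countDown();
        leaked.join();
        System.out.println("extra threads after stop: " + (countThreads(prefix) - before));
    }
}
```

In a restart test, the same count could be taken before starting the nodes and again after closing them; a difference that grows with each iteration points at a component missing from the stop flow.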