[
https://issues.apache.org/jira/browse/DIRMINA-1111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844762#comment-16844762
]
Guus der Kinderen commented on DIRMINA-1111:
--------------------------------------------
Although we're not exactly sure how, we've "resolved" the issue by modifying
the snippet below:
{code:java|title=Original code for org.apache.mina.core.polling.AbstractPollingIoProcessor.Processor#clearWriteRequestQueue}
// The first unwritten empty buffer must be
// forwarded to the filter chain.
if (buf.hasRemaining()) {
    buf.reset();
    failedRequests.add(req);
} else {
    IoFilterChain filterChain = session.getFilterChain();
    filterChain.fireMessageSent(req);
}{code}
As mentioned, this triggered {{InvalidMarkException}}s. After we added a simple
{{try/catch}} around the reset, the CPU spikes went away. We're still suffering
from a different issue (all clients reconnecting periodically), but we're now
working on the assumption that the CPU spike is a result, not the cause, of that.
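A minimal, self-contained sketch of that workaround, using plain {{java.nio.ByteBuffer}} (whose mark/reset semantics {{IoBuffer}} mirrors); the class and helper names are ours, not MINA's:

```java
import java.nio.ByteBuffer;
import java.nio.InvalidMarkException;

public class GuardedReset {
    // Reset only when a mark exists; swallow the InvalidMarkException otherwise.
    static boolean resetIfMarked(ByteBuffer buf) {
        try {
            buf.reset();
            return true;
        } catch (InvalidMarkException e) {
            return false; // no mark was ever set; leave the position untouched
        }
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(8);
        System.out.println(resetIfMarked(buf)); // fresh buffer, no mark -> false

        buf.mark();
        System.out.println(resetIfMarked(buf)); // mark set -> true
    }
}
```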
Looking further into the {{clearWriteRequestQueue}} snippet, we noticed that
it is often called from exception handlers. As a result, the state of the
buffer it operates on is likely unpredictable. The call to {{reset()}}
assumes that a mark is set, but there are various scenarios where this does
not hold:
* the buffer could be completely unused (a buffer fresh from the
constructor will cause an {{InvalidMarkException}} here);
* the buffer might have been flipped, but not yet read ({{flip()}} discards any mark).
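Both scenarios above can be reproduced with plain {{java.nio.ByteBuffer}}, which follows the same mark/reset contract as {{IoBuffer}}; the class and method names below are ours, for illustration only:

```java
import java.nio.ByteBuffer;
import java.nio.InvalidMarkException;

public class MarkScenarios {
    // Returns true when reset() fails because no mark is set.
    static boolean resetThrows(ByteBuffer buf) {
        try {
            buf.reset();
            return false;
        } catch (InvalidMarkException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Scenario 1: a buffer fresh from the constructor has no mark.
        ByteBuffer fresh = ByteBuffer.allocate(16);
        System.out.println(resetThrows(fresh)); // true

        // Scenario 2: flip() invalidates any mark, even one set earlier.
        ByteBuffer flipped = ByteBuffer.allocate(16);
        flipped.put((byte) 1);
        flipped.mark();   // a mark is set here...
        flipped.flip();   // ...and discarded here
        System.out.println(resetThrows(flipped)); // true

        // Only a mark() after the last flip/rewind/clear makes reset() safe.
        ByteBuffer marked = ByteBuffer.allocate(16);
        marked.mark();
        System.out.println(resetThrows(marked)); // false
    }
}
```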
We're uncertain if the reset is needed at all, but if it is, we suggest
explicitly checking whether a mark has been set. If it hasn't, we don't believe
a reset is needed.
{code:java|title=Proposed fix for org.apache.mina.core.polling.AbstractPollingIoProcessor.Processor#clearWriteRequestQueue}
// The first unwritten empty buffer must be
// forwarded to the filter chain.
if (buf.hasRemaining()) {
    // markValue() returns -1 when no mark has been set
    if (buf.markValue() >= 0) {
        buf.reset();
    }
    failedRequests.add(req);
} else {
    IoFilterChain filterChain = session.getFilterChain();
    filterChain.fireMessageSent(req);
}{code}
> 100% CPU (epoll bug) on 2.1.x, Linux only
> -----------------------------------------
>
> Key: DIRMINA-1111
> URL: https://issues.apache.org/jira/browse/DIRMINA-1111
> Project: MINA
> Issue Type: Bug
> Affects Versions: 2.1.2
> Reporter: Guus der Kinderen
> Priority: Major
> Attachments: image-2019-05-21-11-37-41-931.png
>
>
> We're getting
> [reports|https://discourse.igniterealtime.org/t/openfire-4-3-2-cpu-goes-at-100-after-a-few-hours-on-linux/85119/13]
> of a bug that causes 100% CPU usage on Linux (the problem does not occur on
> Windows).
> This problem occurs in 2.1.2, but does _not_ occur in 2.0.21.
> Is this a regression of the epoll selector bug in DIRMINA-678 ?
> A stack trace of one of the spinning threads:
> {code}"NioProcessor-3" #55 prio=5 os_prio=0 tid=0x00007f0408002000 nid=0x2a6a runnable [0x00007f0476dee000]
>    java.lang.Thread.State: RUNNABLE
>         at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>         at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
>         at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
>         at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
>         - locked <0x00000004c486b748> (a sun.nio.ch.Util$3)
>         - locked <0x00000004c486b738> (a java.util.Collections$UnmodifiableSet)
>         - locked <0x00000004c420ccb0> (a sun.nio.ch.EPollSelectorImpl)
>         at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
>         at org.apache.mina.transport.socket.nio.NioProcessor.select(NioProcessor.java:112)
>         at org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.run(AbstractPollingIoProcessor.java:616)
>         at org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:64)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
>    Locked ownable synchronizers:
>         - <0x00000004c4d03530> (a java.util.concurrent.ThreadPoolExecutor$Worker){code}
> More info is available at
> https://discourse.igniterealtime.org/t/openfire-4-3-2-cpu-goes-at-100-after-a-few-hours-on-linux/85119/13
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)