[
https://issues.apache.org/jira/browse/DIRMINA-1111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844762#comment-16844762
]
Guus der Kinderen edited comment on DIRMINA-1111 at 5/21/19 12:04 PM:
----------------------------------------------------------------------
Although we're not exactly sure how, we've "resolved" the issue by modifying
the snippet below:
{code:java|title=Original code for org.apache.mina.core.polling.AbstractPollingIoProcessor.Processor#clearWriteRequestQueue}
// The first unwritten empty buffer must be
// forwarded to the filter chain.
if (buf.hasRemaining()) {
    buf.reset();
    failedRequests.add(req);
} else {
    IoFilterChain filterChain = session.getFilterChain();
    filterChain.fireMessageSent(req);
}{code}
As mentioned, this triggered {{InvalidMarkException}}s. After we added a simple
{{try/catch}} around the reset, the CPU spikes went away. We have not
investigated why this resolves the CPU spikes, but we assume that the thrown
exception prevents keys from being consumed.
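For reference, the workaround we applied amounts to roughly the sketch below. This is only what we are running locally, not a proposed patch; the variable names follow the original snippet, and the fully qualified exception type is used so no extra import is needed.
{code:java|title=Sketch of the try/catch workaround in clearWriteRequestQueue}
// The first unwritten empty buffer must be
// forwarded to the filter chain.
if (buf.hasRemaining()) {
    try {
        buf.reset();
    } catch (java.nio.InvalidMarkException e) {
        // No mark was ever set on this buffer; ignore the failure and
        // keep draining the write request queue instead of bailing out.
    }
    failedRequests.add(req);
} else {
    IoFilterChain filterChain = session.getFilterChain();
    filterChain.fireMessageSent(req);
}{code}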
We're still suffering from a different issue (all clients unexpectedly
reconnect periodically), but we're now working on the assumption that the CPU
spike is a result, not the cause, of this.
Looking further into the {{clearWriteRequestQueue}} snippet, we noticed that
it is often called from exception handlers. The state of the buffer it operates
on is therefore likely unpredictable. The call to {{reset()}} assumes that a
mark has been set, but there are various scenarios where that is not the case:
* the buffer could have been completely unused (a buffer fresh from the
constructor will cause an {{InvalidMarkException}} here; see the standalone
illustration after this list)
* the buffer might have been flipped, but not yet read.
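The first scenario is easy to reproduce in isolation with a plain {{java.nio.ByteBuffer}}; as far as we can tell, {{IoBuffer}} follows the same mark/reset contract, so the class below is only an illustration of the underlying behaviour, not MINA code.
{code:java|title=Standalone illustration: reset() without a mark}
import java.nio.ByteBuffer;
import java.nio.InvalidMarkException;

public class FreshBufferResetDemo {
    public static void main(String[] args) {
        // A buffer fresh from allocation has no mark set yet.
        ByteBuffer fresh = ByteBuffer.allocate(16);
        try {
            fresh.reset(); // no mark is defined, so this throws
        } catch (InvalidMarkException e) {
            System.out.println("reset() without a mark: " + e);
        }
    }
}{code}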
We're uncertain whether the reset is needed at all, but if it is, we suggest
explicitly checking whether a mark has been set. If no mark is set, we don't
believe a reset is needed.
{code:java|title=Proposed fix for org.apache.mina.core.polling.AbstractPollingIoProcessor.Processor#clearWriteRequestQueue}
// The first unwritten empty buffer must be
// forwarded to the filter chain.
if (buf.hasRemaining()) {
    // markValue() returns -1 when no mark has been set,
    // so only reset when a mark is actually present.
    if (buf.markValue() >= 0) {
        buf.reset();
    }
    failedRequests.add(req);
} else {
    IoFilterChain filterChain = session.getFilterChain();
    filterChain.fireMessageSent(req);
}{code}
> 100% CPU (epoll bug) on 2.1.x, Linux only
> -----------------------------------------
>
> Key: DIRMINA-1111
> URL: https://issues.apache.org/jira/browse/DIRMINA-1111
> Project: MINA
> Issue Type: Bug
> Affects Versions: 2.1.2
> Reporter: Guus der Kinderen
> Priority: Major
> Attachments: image-2019-05-21-11-37-41-931.png
>
>
> We're getting
> [reports|https://discourse.igniterealtime.org/t/openfire-4-3-2-cpu-goes-at-100-after-a-few-hours-on-linux/85119/13]
> of a bug that causes 100% CPU usage on Linux (the problem does not occur on
> Windows).
> This problem occurs in 2.1.2, but does _not_ occur in 2.0.21.
> Is this a regression of the epoll selector bug in DIRMINA-678?
> A stack trace of one of the spinning threads:
> {code}"NioProcessor-3" #55 prio=5 os_prio=0 tid=0x00007f0408002000 nid=0x2a6a
> runnable [0x00007f0476dee000]
> java.lang.Thread.State: RUNNABLE
> at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
> at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
> at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
> at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
> - locked <0x00000004c486b748> (a sun.nio.ch.Util$3)
> - locked <0x00000004c486b738> (a java.util.Collections$UnmodifiableSet)
> - locked <0x00000004c420ccb0> (a sun.nio.ch.EPollSelectorImpl)
> at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
> at
> org.apache.mina.transport.socket.nio.NioProcessor.select(NioProcessor.java:112)
> at
> org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.run(AbstractPollingIoProcessor.java:616)
> at
> org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:64)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Locked ownable synchronizers:
> - <0x00000004c4d03530> (a
> java.util.concurrent.ThreadPoolExecutor$Worker){code}
> More info is available at
> https://discourse.igniterealtime.org/t/openfire-4-3-2-cpu-goes-at-100-after-a-few-hours-on-linux/85119/13
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)