[
https://issues.apache.org/jira/browse/DIRMINA-1111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844762#comment-16844762
]
Guus der Kinderen commented on DIRMINA-1111:
--------------------------------------------
Although we're not exactly sure how, we've "resolved" the issue by modifying
the snippet below:
{code:java|title=Original code for org.apache.mina.core.polling.AbstractPollingIoProcessor.Processor#clearWriteRequestQueue}
// The first unwritten empty buffer must be
// forwarded to the filter chain.
if (buf.hasRemaining()) {
    buf.reset();
    failedRequests.add(req);
} else {
    IoFilterChain filterChain = session.getFilterChain();
    filterChain.fireMessageSent(req);
}{code}
As mentioned, this triggered {{InvalidMarkException}}s. After we added a simple
{{try/catch}} around the reset, the CPU spikes went away. We're still suffering
from a different issue (all clients reconnecting periodically), but we're now
working on the assumption that the CPU spike is a result, not the cause, of that.
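A minimal, self-contained sketch of that workaround, using plain {{java.nio.ByteBuffer}} (whose mark/reset semantics {{IoBuffer}} mirrors); the class and helper names are ours, not MINA's:

```java
import java.nio.ByteBuffer;
import java.nio.InvalidMarkException;

public class GuardedReset {
    // Reset only when a mark exists; swallow the InvalidMarkException otherwise.
    static boolean resetIfMarked(ByteBuffer buf) {
        try {
            buf.reset();
            return true;
        } catch (InvalidMarkException e) {
            return false; // no mark was ever set; leave the position untouched
        }
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(8);
        System.out.println(resetIfMarked(buf)); // fresh buffer, no mark -> false

        buf.mark();
        System.out.println(resetIfMarked(buf)); // mark set -> true
    }
}
```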
Looking further into the {{clearWriteRequestQueue}} snippet, we noticed that
it is often called from exception handlers. As a result, the state of the
buffer it operates on is likely unpredictable. The call to {{reset()}}
assumes that a mark is set, but there are various scenarios where this does
not hold:
* the buffer could be completely unused (a buffer fresh from the
constructor will cause an {{InvalidMarkException}} here);
* the buffer might have been flipped, but not yet read ({{flip()}} discards any mark).
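Both scenarios above can be reproduced with plain {{java.nio.ByteBuffer}}, which follows the same mark/reset contract as {{IoBuffer}}; the class and method names below are ours, for illustration only:

```java
import java.nio.ByteBuffer;
import java.nio.InvalidMarkException;

public class MarkScenarios {
    // Returns true when reset() fails because no mark is set.
    static boolean resetThrows(ByteBuffer buf) {
        try {
            buf.reset();
            return false;
        } catch (InvalidMarkException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Scenario 1: a buffer fresh from the constructor has no mark.
        ByteBuffer fresh = ByteBuffer.allocate(16);
        System.out.println(resetThrows(fresh)); // true

        // Scenario 2: flip() invalidates any mark, even one set earlier.
        ByteBuffer flipped = ByteBuffer.allocate(16);
        flipped.put((byte) 1);
        flipped.mark();   // a mark is set here...
        flipped.flip();   // ...and discarded here
        System.out.println(resetThrows(flipped)); // true

        // Only a mark() after the last flip/rewind/clear makes reset() safe.
        ByteBuffer marked = ByteBuffer.allocate(16);
        marked.mark();
        System.out.println(resetThrows(marked)); // false
    }
}
```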
We're uncertain if the reset is needed at all, but if it is, we suggest
explicitly checking whether a mark has been set. If it hasn't, we don't believe
a reset is needed.
{code:java|title=Proposed fix for org.apache.mina.core.polling.AbstractPollingIoProcessor.Processor#clearWriteRequestQueue}
// The first unwritten empty buffer must be
// forwarded to the filter chain.
if (buf.hasRemaining()) {
    // markValue() returns -1 when no mark has been set
    if (buf.markValue() >= 0) {
        buf.reset();
    }
    failedRequests.add(req);
} else {
    IoFilterChain filterChain = session.getFilterChain();
    filterChain.fireMessageSent(req);
}{code}
> 100% CPU (epoll bug) on 2.1.x, Linux only
> -----------------------------------------
>
> Key: DIRMINA-1111
> URL: https://issues.apache.org/jira/browse/DIRMINA-1111
> Project: MINA
> Issue Type: Bug
> Affects Versions: 2.1.2
> Reporter: Guus der Kinderen
> Priority: Major
> Attachments: image-2019-05-21-11-37-41-931.png
>
>
> We're getting
> [reports|https://discourse.igniterealtime.org/t/openfire-4-3-2-cpu-goes-at-100-after-a-few-hours-on-linux/85119/13]
> of a bug that causes 100% CPU usage on Linux (the problem does not occur on
> Windows).
> This problem occurs in 2.1.2, but does _not_ occur in 2.0.21.
> Is this a regression of the epoll selector bug in DIRMINA-678 ?
> A stack trace of one of the spinning threads:
> {code}"NioProcessor-3" #55 prio=5 os_prio=0 tid=0x00007f0408002000 nid=0x2a6a runnable [0x00007f0476dee000]
>    java.lang.Thread.State: RUNNABLE
>         at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>         at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
>         at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
>         at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
>         - locked <0x00000004c486b748> (a sun.nio.ch.Util$3)
>         - locked <0x00000004c486b738> (a java.util.Collections$UnmodifiableSet)
>         - locked <0x00000004c420ccb0> (a sun.nio.ch.EPollSelectorImpl)
>         at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
>         at org.apache.mina.transport.socket.nio.NioProcessor.select(NioProcessor.java:112)
>         at org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.run(AbstractPollingIoProcessor.java:616)
>         at org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:64)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
>    Locked ownable synchronizers:
>         - <0x00000004c4d03530> (a java.util.concurrent.ThreadPoolExecutor$Worker){code}
> More info is available at
> https://discourse.igniterealtime.org/t/openfire-4-3-2-cpu-goes-at-100-after-a-few-hours-on-linux/85119/13
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)