[
https://issues.apache.org/jira/browse/DIRMINA-1111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844762#comment-16844762
]
Guus der Kinderen edited comment on DIRMINA-1111 at 5/21/19 12:04 PM:
----------------------------------------------------------------------
Although we're not exactly sure how, we've "resolved" the issue by modifying
the snippet below:
{code:java|title=Original code for org.apache.mina.core.polling.AbstractPollingIoProcessor.Processor#clearWriteRequestQueue}
// The first unwritten empty buffer must be
// forwarded to the filter chain.
if (buf.hasRemaining()) {
    buf.reset();
    failedRequests.add(req);
} else {
    IoFilterChain filterChain = session.getFilterChain();
    filterChain.fireMessageSent(req);
}{code}
As mentioned, this triggered {{InvalidMarkException}}s. After we added a simple
{{try/catch}} around the reset, the CPU spikes went away. We have not
investigated why this resolves the CPU spikes, but we assume that the thrown
exception prevents keys from being consumed.
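For reference, the workaround we applied amounts to roughly the sketch below. This is only what we are running locally, not a proposed patch; the variable names follow the original snippet, and the fully qualified exception type is used so no extra import is needed.
{code:java|title=Sketch of the try/catch workaround in clearWriteRequestQueue}
// The first unwritten empty buffer must be
// forwarded to the filter chain.
if (buf.hasRemaining()) {
    try {
        buf.reset();
    } catch (java.nio.InvalidMarkException e) {
        // No mark was ever set on this buffer; ignore the failure and
        // keep draining the write request queue instead of bailing out.
    }
    failedRequests.add(req);
} else {
    IoFilterChain filterChain = session.getFilterChain();
    filterChain.fireMessageSent(req);
}{code}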
We're still suffering from a different issue (all clients unexpectedly
reconnect periodically), but we're now working on the assumption that the CPU
spike is a result, not the cause, of this.
Looking further into the {{clearWriteRequestQueue}} snippet, we noticed that
it is often called from exception handlers. The state of the buffer it operates
on is therefore likely unpredictable. The call to {{reset()}} assumes that a
mark has been set, but there are various scenarios where that is not the case:
* the buffer could have been completely unused (a buffer fresh from the
constructor will cause an {{InvalidMarkException}} here; see the standalone
illustration after this list)
* the buffer might have been flipped, but not yet read.
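The first scenario is easy to reproduce in isolation with a plain {{java.nio.ByteBuffer}}; as far as we can tell, {{IoBuffer}} follows the same mark/reset contract, so the class below is only an illustration of the underlying behaviour, not MINA code.
{code:java|title=Standalone illustration: reset() without a mark}
import java.nio.ByteBuffer;
import java.nio.InvalidMarkException;

public class FreshBufferResetDemo {
    public static void main(String[] args) {
        // A buffer fresh from allocation has no mark set yet.
        ByteBuffer fresh = ByteBuffer.allocate(16);
        try {
            fresh.reset(); // no mark is defined, so this throws
        } catch (InvalidMarkException e) {
            System.out.println("reset() without a mark: " + e);
        }
    }
}{code}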
We're uncertain whether the reset is needed at all, but if it is, we suggest
explicitly checking whether a mark has been set. If no mark is set, we don't
believe a reset is needed.
{code:java|title=Proposed fix for org.apache.mina.core.polling.AbstractPollingIoProcessor.Processor#clearWriteRequestQueue}
// The first unwritten empty buffer must be
// forwarded to the filter chain.
if (buf.hasRemaining()) {
    // markValue() returns -1 when no mark has been set,
    // so only reset when a mark is actually present.
    if (buf.markValue() >= 0) {
        buf.reset();
    }
    failedRequests.add(req);
} else {
    IoFilterChain filterChain = session.getFilterChain();
    filterChain.fireMessageSent(req);
}{code}
> 100% CPU (epoll bug) on 2.1.x, Linux only
> -----------------------------------------
>
> Key: DIRMINA-1111
> URL: https://issues.apache.org/jira/browse/DIRMINA-1111
> Project: MINA
> Issue Type: Bug
> Affects Versions: 2.1.2
> Reporter: Guus der Kinderen
> Priority: Major
> Attachments: image-2019-05-21-11-37-41-931.png
>
>
> We're getting
> [reports|https://discourse.igniterealtime.org/t/openfire-4-3-2-cpu-goes-at-100-after-a-few-hours-on-linux/85119/13]
> of a bug that causes 100% CPU usage on Linux (the problem does not occur on
> Windows).
> This problem occurs in 2.1.2, but does _not_ occur in 2.0.21.
> Is this a regression of the epoll selector bug in DIRMINA-678?
> A stack trace of one of the spinning threads:
> {code}"NioProcessor-3" #55 prio=5 os_prio=0 tid=0x00007f0408002000 nid=0x2a6a
> runnable [0x00007f0476dee000]
> java.lang.Thread.State: RUNNABLE
> at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
> at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
> at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
> at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
> - locked <0x00000004c486b748> (a sun.nio.ch.Util$3)
> - locked <0x00000004c486b738> (a java.util.Collections$UnmodifiableSet)
> - locked <0x00000004c420ccb0> (a sun.nio.ch.EPollSelectorImpl)
> at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
> at
> org.apache.mina.transport.socket.nio.NioProcessor.select(NioProcessor.java:112)
> at
> org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.run(AbstractPollingIoProcessor.java:616)
> at
> org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:64)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Locked ownable synchronizers:
> - <0x00000004c4d03530> (a
> java.util.concurrent.ThreadPoolExecutor$Worker){code}
> More info is available at
> https://discourse.igniterealtime.org/t/openfire-4-3-2-cpu-goes-at-100-after-a-few-hours-on-linux/85119/13
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)