Corentin Chary created CASSANDRA-13651:
------------------------------------------
Summary: Large amount of CPU used by epoll_wait(.., .., .., 0)
Key: CASSANDRA-13651
URL: https://issues.apache.org/jira/browse/CASSANDRA-13651
Project: Cassandra
Issue Type: Bug
Reporter: Corentin Chary
Fix For: 4.x
I was trying to profile Cassandra under my workload and I kept seeing this
backtrace:
{code}
epollEventLoopGroup-2-3 State: RUNNABLE CPU usage on sample: 240ms
io.netty.channel.epoll.Native.epollWait0(int, long, int, int) Native.java (native)
io.netty.channel.epoll.Native.epollWait(int, EpollEventArray, int) Native.java:111
io.netty.channel.epoll.EpollEventLoop.epollWait(boolean) EpollEventLoop.java:230
io.netty.channel.epoll.EpollEventLoop.run() EpollEventLoop.java:254
io.netty.util.concurrent.SingleThreadEventExecutor$5.run() SingleThreadEventExecutor.java:858
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run() DefaultThreadFactory.java:138
java.lang.Thread.run() Thread.java:745
{code}
At first I thought that the profiler might not be able to profile native code
properly, but I went further and realized that most of the CPU was used by
epoll_wait() calls with a timeout of zero.
Here is the output of perf on this system, which confirms that most of the
overhead comes from calls with timeout == 0.
{code}
Samples: 11M of event 'syscalls:sys_enter_epoll_wait', Event count (approx.): 11594448
Overhead  Trace output
  90.06%  epfd: 0x00000047, events: 0x7f5588c0c000, maxevents: 0x00002000, timeout: 0x00000000
   5.77%  epfd: 0x000000b5, events: 0x7fca419ef000, maxevents: 0x00001000, timeout: 0x00000000
   1.98%  epfd: 0x000000b5, events: 0x7fca419ef000, maxevents: 0x00001000, timeout: 0x000003e8
   0.04%  epfd: 0x00000003, events: 0x2f6af77b9c00, maxevents: 0x00000020, timeout: 0x00000000
   0.04%  epfd: 0x0000002b, events: 0x121ebf63ac00, maxevents: 0x00000040, timeout: 0x00000000
   0.03%  epfd: 0x00000026, events: 0x7f51f80019c0, maxevents: 0x00000020, timeout: 0x00000000
   0.02%  epfd: 0x00000003, events: 0x7fe4d80019d0, maxevents: 0x00000020, timeout: 0x00000000
{code}
Running this time with {{perf record -ag}} for call traces:
{code}
#  Children      Self       sys       usr  Trace output
#  ........  ........  ........  ........  ............
#
      8.61%     8.61%     0.00%     8.61%  epfd: 0x000000a7, events: 0x7fca452d6000, maxevents: 0x00001000, timeout: 0x00000000
            |
            ---0x1000200af313
               |
                --8.61%--0x7fca6117bdac
                         0x7fca60459804
                         epoll_wait

      2.98%     2.98%     0.00%     2.98%  epfd: 0x000000a7, events: 0x7fca452d6000, maxevents: 0x00001000, timeout: 0x000003e8
            |
            ---0x1000200af313
               0x7fca6117b830
               0x7fca60459804
               epoll_wait
{code}
That looks like a lot of CPU used to wait for nothing. I'm not sure whether perf
reports a per-CPU or a per-system percentage, but either way that would still be
at least 10% of Cassandra's total CPU usage.
I went further and found the cause: we schedule a lot of Message::Flusher tasks
with a deadline of 10 µs (5 per message, I think), but netty+epoll only supports
millisecond-granularity timeouts and converts everything below that to 0.
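As a hedged, standalone illustration of that rounding (the expression mirrors the timeout computation from netty's {{EpollEventLoop}} shown in the diff below; the class name and scaffolding here are made up):
{code}
public class TimeoutTruncation {
    public static void main(String[] args) {
        // The 10 microsecond delay that Message::Flusher uses when rescheduling.
        long delayNanos = 10_000L;
        // Netty 4.1 rounds the remaining delay to milliseconds before calling epoll_wait.
        long timeoutMillis = (delayNanos + 500_000L) / 1_000_000L;
        // Prints "timeoutMillis = 0": any delay below ~0.5 ms becomes a
        // zero-timeout epoll_wait, i.e. a non-blocking poll in a tight loop.
        System.out.println("timeoutMillis = " + timeoutMillis);
    }
}
{code}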
I added some traces to netty (4.1):
{code}
diff --git a/transport-native-epoll/src/main/java/io/netty/channel/epoll/EpollEventLoop.java b/transport-native-epoll/src/main/java/io/netty/channel/epoll/EpollEventLoop.java
index 909088fde..8734bbfd4 100644
--- a/transport-native-epoll/src/main/java/io/netty/channel/epoll/EpollEventLoop.java
+++ b/transport-native-epoll/src/main/java/io/netty/channel/epoll/EpollEventLoop.java
@@ -208,10 +208,15 @@ final class EpollEventLoop extends SingleThreadEventLoop {
         long currentTimeNanos = System.nanoTime();
         long selectDeadLineNanos = currentTimeNanos + delayNanos(currentTimeNanos);
         for (;;) {
-            long timeoutMillis = (selectDeadLineNanos - currentTimeNanos + 500000L) / 1000000L;
+            long timeoutNanos = selectDeadLineNanos - currentTimeNanos + 500000L;
+            long timeoutMillis = timeoutNanos / 1000000L;
+            System.out.printf("timeoutNanos: %d, timeoutMillis: %d | deadline: %d - now: %d | hastask: %d\n",
+                    timeoutNanos, timeoutMillis, selectDeadLineNanos, currentTimeNanos, hasTasks() ? 1 : 0);
             if (timeoutMillis <= 0) {
                 if (selectCnt == 0) {
                     int ready = Native.epollWait(epollFd.intValue(), events, 0);
+                    System.out.printf("ready: %d\n", ready);
                     if (ready > 0) {
                         return ready;
                     }
{code}
And this gives:
{code}
timeoutNanos: 1000500000, timeoutMillis: 1000 | deadline: 2001782341816510 - now: 2001781341816510 | hastask: 0
timeoutNanos: 1000500000, timeoutMillis: 1000 | deadline: 2001782342087239 - now: 2001781342087239 | hastask: 0
timeoutNanos: 1000500000, timeoutMillis: 1000 | deadline: 2001782342166947 - now: 2001781342166947 | hastask: 0
timeoutNanos: 508459, timeoutMillis: 0 | deadline: 2001781342297987 - now: 2001781342289528 | hastask: 0
ready: 0
timeoutNanos: 508475, timeoutMillis: 0 | deadline: 2001781342357719 - now: 2001781342349244 | hastask: 0
ready: 0
timeoutNanos: 509327, timeoutMillis: 0 | deadline: 2001781342394822 - now: 2001781342385495 | hastask: 0
ready: 0
timeoutNanos: 509339, timeoutMillis: 0 | deadline: 2001781342430192 - now: 2001781342420853 | hastask: 0
ready: 0
timeoutNanos: 509510, timeoutMillis: 0 | deadline: 2001781342461588 - now: 2001781342452078 | hastask: 0
ready: 0
timeoutNanos: 509493, timeoutMillis: 0 | deadline: 2001781342495044 - now: 2001781342485551 | hastask: 0
ready: 0
{code}
The nanosecond timeouts all come from {{eventLoop.schedule(this, 10000,
TimeUnit.NANOSECONDS);}} in {{Message::Flusher}}.
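For context, here is a hedged, simplified sketch of that rescheduling pattern (the counter and the 10000 ns delay are from the Cassandra source, but {{flushPendingItems()}} is a hypothetical stand-in for the real flush logic):
{code}
import io.netty.channel.EventLoop;
import java.util.concurrent.TimeUnit;

final class FlusherSketch implements Runnable {
    private final EventLoop eventLoop;
    private int runsWithNoWork = 0;

    FlusherSketch(EventLoop eventLoop) {
        this.eventLoop = eventLoop;
    }

    // Hypothetical stand-in for the real flush logic in Message::Flusher.
    private boolean flushPendingItems() {
        return false;
    }

    @Override
    public void run() {
        if (flushPendingItems()) {
            runsWithNoWork = 0;
        } else if (++runsWithNoWork > 5) {
            return; // park until new work is enqueued
        }
        // Re-arm with a 10 microsecond delay. That is far below netty's 1 ms
        // timer granularity, so the event loop keeps calling
        // epoll_wait(..., timeout=0) in a tight loop until the deadline passes.
        eventLoop.schedule(this, 10000, TimeUnit.NANOSECONDS);
    }
}
{code}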
Knowing that, I'm not sure what would be best to do, and I have a hard time
understanding Message::Flusher, but it looks to me like scheduling fewer tasks
would probably help, and I didn't see anything obvious that could be done on the
netty side.
Changing {{if (++runsWithNoWork > 5)}} to 2 seems to help a little bit, but the
effect isn't really significant.