[ https://issues.apache.org/jira/browse/CASSANDRA-13651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16468728#comment-16468728 ]

Michael Burman commented on CASSANDRA-13651:
--------------------------------------------

I've done a follow-up patch for this as well, available here:

[https://github.com/burmanm/cassandra/commit/eb8af9a47b00a836fbb4e155aa64024250470ba7]

I could not find a use case where (at least with epoll) the timer would be 
beneficial. It eats CPU all the time in any case, and I verified this by writing 
a small looping C program that did the same thing using epoll & timerfd directly 
(around 4% CPU usage from the C program and around 6% from the Netty version 
when both are doing nothing but rescheduling the job). So, I did the following 
to the Flusher:
 * Removed the automatic rescheduling by the Flusher
 * Instead, after every write we schedule the flush to run on the next event loop invocation. If multiple writes complete before that, they're flushed together (see the sketch below).
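
Roughly, the idea looks like this (an illustrative sketch only, not the actual patch from the commit above; the class name and the {{writeCompleted()}} hook are made up for the example):

{code}
// Sketch: coalesce flushes onto the next event loop invocation instead of
// rescheduling a timer. All names here are illustrative, not the real patch.
import io.netty.channel.Channel;
import io.netty.channel.EventLoop;
import java.util.HashSet;
import java.util.Set;

final class EventLoopFlusher implements Runnable
{
    private final EventLoop eventLoop;
    private final Set<Channel> dirty = new HashSet<>();
    private boolean scheduled = false;

    EventLoopFlusher(EventLoop eventLoop)
    {
        this.eventLoop = eventLoop;
    }

    // Called after every successful write, on the event loop thread. The actual
    // flush is deferred to the next event loop invocation, so any writes that
    // complete before then get flushed together.
    void writeCompleted(Channel channel)
    {
        dirty.add(channel);
        if (!scheduled)
        {
            scheduled = true;
            eventLoop.execute(this);
        }
    }

    @Override
    public void run()
    {
        scheduled = false;
        for (Channel channel : dirty)
            channel.flush();
        dirty.clear();
    }
}
{code}

The point of the sketch is that no timer is involved at all: the flush piggybacks on the next event loop invocation, so epoll is never asked to wait with a sub-millisecond timeout.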

In my tests this improves throughput slightly, but more importantly it also 
decreases latency. On Azure D8s_v3 instances, latency is reduced by 0.3 ms 
under both moderate and more punishing loads. Obviously, the profiler shows 
that we spend more time in writevAddresses(), but epollWait0 is reduced to 
next to nothing during the stress run.

At this point I would suggest removing the batching completely and returning 
to this issue once we can sanely batch the encode work as well (and have that 
pipeline flush at the end).

 

> Large amount of CPU used by epoll_wait(.., .., .., 0)
> -----------------------------------------------------
>
>                 Key: CASSANDRA-13651
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13651
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Corentin Chary
>            Assignee: Corentin Chary
>            Priority: Major
>             Fix For: 4.x
>
>         Attachments: cpu-usage.png
>
>
> I was trying to profile Cassandra under my workload and I kept seeing this 
> backtrace:
> {code}
> epollEventLoopGroup-2-3 State: RUNNABLE CPU usage on sample: 240ms
> io.netty.channel.epoll.Native.epollWait0(int, long, int, int) Native.java (native)
> io.netty.channel.epoll.Native.epollWait(int, EpollEventArray, int) Native.java:111
> io.netty.channel.epoll.EpollEventLoop.epollWait(boolean) EpollEventLoop.java:230
> io.netty.channel.epoll.EpollEventLoop.run() EpollEventLoop.java:254
> io.netty.util.concurrent.SingleThreadEventExecutor$5.run() SingleThreadEventExecutor.java:858
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run() DefaultThreadFactory.java:138
> java.lang.Thread.run() Thread.java:745
> {code}
> At first I thought that the profiler might not be able to profile native code 
> properly, but I went further and realized that most of the CPU was used by 
> {{epoll_wait()}} calls with a timeout of zero.
> Here is the output of perf on this system, which confirms that most of the 
> overhead was with timeout == 0.
> {code}
> Samples: 11M of event 'syscalls:sys_enter_epoll_wait', Event count (approx.): 11594448
> Overhead  Trace output
>   90.06%  epfd: 0x00000047, events: 0x7f5588c0c000, maxevents: 0x00002000, timeout: 0x00000000
>    5.77%  epfd: 0x000000b5, events: 0x7fca419ef000, maxevents: 0x00001000, timeout: 0x00000000
>    1.98%  epfd: 0x000000b5, events: 0x7fca419ef000, maxevents: 0x00001000, timeout: 0x000003e8
>    0.04%  epfd: 0x00000003, events: 0x2f6af77b9c00, maxevents: 0x00000020, timeout: 0x00000000
>    0.04%  epfd: 0x0000002b, events: 0x121ebf63ac00, maxevents: 0x00000040, timeout: 0x00000000
>    0.03%  epfd: 0x00000026, events: 0x7f51f80019c0, maxevents: 0x00000020, timeout: 0x00000000
>    0.02%  epfd: 0x00000003, events: 0x7fe4d80019d0, maxevents: 0x00000020, timeout: 0x00000000
> {code}
> Running this time with perf record -ag for call traces:
> {code}
> # Children      Self       sys       usr  Trace output
> # ........  ........  ........  ........  ....................................................................................
> #
>      8.61%     8.61%     0.00%     8.61%  epfd: 0x000000a7, events: 0x7fca452d6000, maxevents: 0x00001000, timeout: 0x00000000
>             |
>             ---0x1000200af313
>                |
>                 --8.61%--0x7fca6117bdac
>                           0x7fca60459804
>                           epoll_wait
>      2.98%     2.98%     0.00%     2.98%  epfd: 0x000000a7, events: 0x7fca452d6000, maxevents: 0x00001000, timeout: 0x000003e8
>             |
>             ---0x1000200af313
>                0x7fca6117b830
>                0x7fca60459804
>                epoll_wait
> {code}
> That looks like a lot of CPU used to wait for nothing. I'm not sure whether perf 
> reports a per-CPU or a per-system percentage, but either way that would still be 
> at least 10% of Cassandra's total CPU usage.
> I went further and found the code responsible for all of this: we schedule a lot 
> of {{Message::Flusher}} tasks with a deadline of 10 usec (5 per message, I think), 
> but netty+epoll only supports timeouts of one millisecond or more and converts 
> everything below that to 0.
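> As a quick worked example (my own illustration, using the millisecond conversion 
> that appears in the netty code below), a 10 usec deadline always collapses to a 
> timeout of 0:
> {code}
> // Illustration only: netty's EpollEventLoop converts the epoll timeout to whole
> // milliseconds, so a 10 usec Flusher deadline rounds down to a timeout of 0.
> public class TimeoutRounding
> {
>     public static void main(String[] args)
>     {
>         long deadlineDeltaNanos = 10_000L;                                  // 10 usec deadline
>         long timeoutMillis = (deadlineDeltaNanos + 500_000L) / 1_000_000L;  // netty's rounding
>         System.out.println(timeoutMillis);                                  // prints 0 -> epoll_wait(..., 0)
>     }
> }
> {code}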
> I added some traces to netty (4.1):
> {code}
> diff --git a/transport-native-epoll/src/main/java/io/netty/channel/epoll/EpollEventLoop.java b/transport-native-epoll/src/main/java/io/netty/channel/epoll/EpollEventLoop.java
> index 909088fde..8734bbfd4 100644
> --- a/transport-native-epoll/src/main/java/io/netty/channel/epoll/EpollEventLoop.java
> +++ b/transport-native-epoll/src/main/java/io/netty/channel/epoll/EpollEventLoop.java
> @@ -208,10 +208,15 @@ final class EpollEventLoop extends SingleThreadEventLoop {
>          long currentTimeNanos = System.nanoTime();
>          long selectDeadLineNanos = currentTimeNanos + delayNanos(currentTimeNanos);
>          for (;;) {
> -            long timeoutMillis = (selectDeadLineNanos - currentTimeNanos + 500000L) / 1000000L;
> +            long timeoutNanos = selectDeadLineNanos - currentTimeNanos + 500000L;
> +            long timeoutMillis =  timeoutNanos / 1000000L;
> +            System.out.printf("timeoutNanos: %d, timeoutMillis: %d | deadline: %d - now: %d | hastask: %d\n",
> +                    timeoutNanos, timeoutMillis,
> +                    selectDeadLineNanos, currentTimeNanos, hasTasks() ? 1 : 0);
>              if (timeoutMillis <= 0) {
>                  if (selectCnt == 0) {
>                      int ready = Native.epollWait(epollFd.intValue(), events, 0);
> +                    System.out.printf("ready: %d\n", ready);
>                      if (ready > 0) {
>                          return ready;
>                      }
> {code}
> And this gives:
> {code}
> timeoutNanos: 1000500000, timeoutMillis: 1000 | deadline: 2001782341816510 - now: 2001781341816510 | hastask: 0
> timeoutNanos: 1000500000, timeoutMillis: 1000 | deadline: 2001782342087239 - now: 2001781342087239 | hastask: 0
> timeoutNanos: 1000500000, timeoutMillis: 1000 | deadline: 2001782342166947 - now: 2001781342166947 | hastask: 0
> timeoutNanos: 508459, timeoutMillis: 0 | deadline: 2001781342297987 - now: 2001781342289528 | hastask: 0
> ready: 0
> timeoutNanos: 508475, timeoutMillis: 0 | deadline: 2001781342357719 - now: 2001781342349244 | hastask: 0
> ready: 0
> timeoutNanos: 509327, timeoutMillis: 0 | deadline: 2001781342394822 - now: 2001781342385495 | hastask: 0
> ready: 0
> timeoutNanos: 509339, timeoutMillis: 0 | deadline: 2001781342430192 - now: 2001781342420853 | hastask: 0
> ready: 0
> timeoutNanos: 509510, timeoutMillis: 0 | deadline: 2001781342461588 - now: 2001781342452078 | hastask: 0
> ready: 0
> timeoutNanos: 509493, timeoutMillis: 0 | deadline: 2001781342495044 - now: 2001781342485551 | hastask: 0
> ready: 0
> {code}
> The nanosecond timeouts all come from {{eventLoop.schedule(this, 10000, TimeUnit.NANOSECONDS);}} in {{Message::Flusher}}.
> Knowing that, I'm not sure what would be best to do. I have a hard time 
> understanding {{Message::Flusher}}, but to me it looks like scheduling fewer 
> tasks would probably help, and I couldn't see anything obvious that could be 
> done on the netty side.
> Changing {{if (++runsWithNoWork > 5)}} to 2 seems to help a little bit, but not 
> significantly.
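> For context, here is how I read the reschedule pattern (a simplified sketch only, 
> not the actual Cassandra code; {{FlushItem}} and the field names are illustrative):
> {code}
> // Simplified paraphrase of the Flusher reschedule loop as I read it; FlushItem
> // and the field names are illustrative, not the actual Cassandra code.
> import io.netty.channel.Channel;
> import io.netty.channel.EventLoop;
> import java.util.HashSet;
> import java.util.Queue;
> import java.util.Set;
> import java.util.concurrent.ConcurrentLinkedQueue;
> import java.util.concurrent.TimeUnit;
> import java.util.concurrent.atomic.AtomicBoolean;
> 
> final class FlusherSketch implements Runnable
> {
>     static final class FlushItem
>     {
>         final Channel channel;
>         final Object response;
>         FlushItem(Channel channel, Object response) { this.channel = channel; this.response = response; }
>     }
> 
>     private final EventLoop eventLoop;
>     private final Queue<FlushItem> queued = new ConcurrentLinkedQueue<>();
>     private final AtomicBoolean running = new AtomicBoolean(false);
>     private int runsWithNoWork = 0;
> 
>     FlusherSketch(EventLoop eventLoop) { this.eventLoop = eventLoop; }
> 
>     void enqueue(FlushItem item)
>     {
>         queued.add(item);
>         if (running.compareAndSet(false, true))
>             eventLoop.execute(this);     // start the flush loop if it is idle
>     }
> 
>     public void run()
>     {
>         boolean doneWork = false;
>         Set<Channel> channels = new HashSet<>();
>         FlushItem item;
>         while ((item = queued.poll()) != null)
>         {
>             item.channel.write(item.response);
>             channels.add(item.channel);
>             doneWork = true;
>         }
>         for (Channel channel : channels)
>             channel.flush();
> 
>         if (doneWork)
>         {
>             runsWithNoWork = 0;
>         }
>         else if (++runsWithNoWork > 5)   // give up only after several idle runs
>         {
>             running.set(false);
>             return;
>         }
>         // This 10 usec delay is what the epoll loop rounds down to a 0 ms timeout.
>         eventLoop.schedule(this, 10000, TimeUnit.NANOSECONDS);
>     }
> }
> {code}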


