Yuto Kawamura created KAFKA-7504:
------------------------------------

             Summary: Broker performance degradation caused by call of sendfile 
reading disk in network thread
                 Key: KAFKA-7504
                 URL: https://issues.apache.org/jira/browse/KAFKA-7504
             Project: Kafka
          Issue Type: Improvement
          Components: core
    Affects Versions: 0.10.2.1
            Reporter: Yuto Kawamura
            Assignee: Yuto Kawamura


h2. Environment

OS: CentOS6

Kernel version: 2.6.32-XX
Kafka version: 0.10.2.1, 0.11.1.2 (but reproduces with latest build from trunk 
(2.2.0-SNAPSHOT)
h2. Phenomenon

Response time of Produce request (99th ~ 99.9th %ile) degrading to 50x ~ 100x 
more than usual.
Normally 99th %ile is lower than 20ms, but when this issue occurs it marks 50ms 
to 200ms.

At the same time we could see two more things in metrics:

1. Disk read coincidence from the volume assigned to log.dirs.
2. Raise in network threads utilization (by 
`kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent`)

As we didn't see increase of requests in metrics, we suspected blocking in 
event loop ran by network thread as the cause of raising network thread 
utilization.
Reading through Kafka broker source code, we understand that the only disk IO 
performed in network thread is reading log data through calling sendfile(2) 
(via FileChannel#transferTo).
To probe that the calls of sendfile(2) are blocking network thread for some 
moments, I ran following SystemTap script to inspect duration of sendfile 
syscalls.
{code:java}
# Systemtap script to measure syscall duration
global s
global records

probe syscall.$1 {
    s[tid()] = gettimeofday_us()
}

probe syscall.$1.return {
    elapsed = gettimeofday_us() - s[tid()]
    delete s[tid()]
    records <<< elapsed
}

probe end {
    print(@hist_log(records))
}{code}
{code:java}
$ stap -v syscall-duration.stp sendfile
# value (us)
value |---------------------------------------- count
    0 |                                             0
    1 |                                            71
    2 |@@@                                       6171
   16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  29472
   32 |@@@                                       3418
 2048 |                                             0
...
 8192 |                                             3{code}
As you can see there were some cases taking more than few milliseconds, implies 
that it blocks network thread for that long and applying the same latency for 
all other request/response processing.
h2. Hypothesis

Gathering the above observations, I made the following hypothesis.

Let's say network-thread-1 multiplexing 3 connections.
- producer-A
- follower-B (broker replica fetch)
- consumer-C

Broker receives requests from each of those clients, [Produce, FetchFollower, 
FetchConsumer].


They are processed well by request handler threads, and now the response queue 
of the network-thread contains 3 responses in following order: [FetchConsumer, 
Produce, FetchFollower].

network-thread-1 takes 3 responses and processes them sequentially 
([https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/network/SocketServer.scala#L632]).
Ideally processing of these 3 responses completes in microseconds as in it just 
copies ready responses into client socket's buffer with non-blocking manner.
However, Kafka uses sendfile(2) for transferring log data to client sockets. 
The target data might be in page cache, but old data which has written a bit 
far ago and never read since then, are likely not.
If the target data isn't in page cache, kernel first needs to load the target 
page into cache. This takes more than few milliseconds to tens of milliseconds 
depending on disk hardware and current load being applied to it.
Linux kernel doesn't considers the moment loading data from disk into page 
cache as "blocked", hence it awaits completion of target data loading rather 
than returning EAGAIN.
Thus, if this occurs when processing the first FetchConsumer request, the 
latency from reading disk applies to all following responses as well. Not just 
those which has been received by network thread but also those in response 
queue of that network thread.
h2. Experiment

For reproducing this issue, I did experiment on our test cluster.

Kafka cluster consist of 3 broker machines with following spec:
CPU: Intel(R) Xeon(R) 2.20GHz x 20 cores (HT) * 2
Memory: 256GiB
Network: 10Gbps
Disk: HDD x 12 RAID 1+0

Essential broker configs are below:
{code:java}
num.network.threads=10
num.io.threads=30
num.replica.fetchers=3{code}
And created two topics both with replicas=3, 
[retention.ms|http://retention.ms/]=2days and min.insync.replicas=2:
- {{large-topic}}: partitions=60
- {{regular-topic}}: partitions=30

 

Ran producer which produces 1kb data into {{large-topic}} and accumulated to 
have around 3GiB data for each partition, 60GiB data in total for each broker 
and stop the producer.

To simulate the situation that old data dropped from page cache, run {{echo 1 > 
/proc/sys/vm/drop_caches}} in each broker machine.

Start another producer which produces 512 bytes data into topic 
{{regular-topic}}.
Throughput is 56k messages/sec. 99.9th %ile response time for Produce is very 
good:

!https://wiki.linecorp.com/download/attachments/1273840268/image2018-10-14_2-3-1.png?version=1&modificationDate=1539450181748&api=v2|height=208!

!https://wiki.linecorp.com/download/attachments/1273840268/image2018-10-14_2-3-24.png?version=1&modificationDate=1539450205125&api=v2|height=209!

 

Then start consumer which reads from head of {{large-topic}} which is already 
not in page cache.

Response time of Produce degrade and throughput decreases:

!https://wiki.linecorp.com/download/attachments/1273840268/image2018-10-14_2-6-13.png?version=1&modificationDate=1539450373678&api=v2|height=250!
   
!https://wiki.linecorp.com/download/attachments/1273840268/image2018-10-14_2-8-33.png?version=1&modificationDate=1539450513480&api=v2|height=216!

Disk read is occurring and network threads got busier:

!https://wiki.linecorp.com/download/attachments/1273840268/image2018-10-14_2-9-47.png?version=1&modificationDate=1539450588087&api=v2|height=206!
   
!https://wiki.linecorp.com/download/attachments/1273840268/image2018-10-14_2-9-58.png?version=1&modificationDate=1539450598971&api=v2|height=250!
h2. Solution

For fixing this issue, we've considered several ideas, but finally decided to 
take following approach.

The essential problem is blocking in the middle of event loop. If it does 
something which is not purely computation (like blocking IO), the latency from 
it directly applies for other irrelevant requests, leaving CPU cores stale 
which can do work for other requests while awaiting completion of disk IO.
To fix this, there should be two possible ways:
A. Make call of sendfile(2) not to wait even on waiting pages loaded from disks 
to memory.
B. Make sure that data is always available in memory when the response object 
passed to network-thread.

For A. Make call of sendfile(2) not to wait even on waiting pages loaded from 
disks to memory, we must have to have support from kernel.
Actually FreeBSD recently added such, nonblocking style sendfile system call 
cooperating with community: 
[https://www.nginx.com/blog/nginx-and-netflix-contribute-new-sendfile2-to-freebsd/]
 
It might be useful for case like this, but this time it wasn't an option 
because:
1. Linux doesn't have such support in syscall
2. Even with kernel support, we can't use it until JDK supports it. Otherwise 
we have to go towards introducing JNI code.

So we sought approach B and tried some ways.
The basic idea is to read the target data once in request handler thread so the 
page cache of target data gets populated and stays for a while until it gets 
transferred to client socket through sendfile(2) called from network thread.

The easiest way is to do read(2) on target data. However this is risky from the 
overhead point of view as in it would leads overhead of copying data from 
kernel to userland (which is what Kafka trying to avoid leveraging 
sendfile(2)). Also allocated buffer for read()ing data would cause heap 
pressure unless we implement it in optimal way (perhaps using off-heap).

While we tried several other ways (like calling readahead(2) syscall with 
checking completion of load calling mincore(2) in loop through JNI), we finally 
turned out that calling sendfile for target data with setting destination to 
/dev/null does exactly what we want.
When calling it setting the destination fd which is pointing /dev/null, linux 
kernel loads data from disk into page cache, but skips (not exactly the whole 
part) copying loaded data to the destination device.
To implement this, we just need Java code as in we can expect 
FileChannel#transferTo calls sendfile internally.

We patched our kafka broker to do this warmup, and it perfectly fixed this 
problem. The patched version is based on 0.10.2.1 but I'm now testing the same 
patch on latest build from trunk(905f8135078127e08633400277c5829b10971d42) and 
seems that it works effectively for now.

To avoid additional overhead calling one more syscall (sendfile for /dev/null) 
when processing all Fetch requests, I added bit optimization that is skipping 
the last log segment of topic partitions assuming it contains only newly 
written data which are still hot and has its data in page cache.
By this optimization, since most of Fetch requests are requesting data of the 
tail of topic-partition in our case, just few of Fetch requests less than 1% 
are the subject of this warmup operation. Hence we haven't seen any overhead in 
CPU utilization since we deployed this patch to our production system.

 

To test its effect in the last experiment scenario, I deployed our patch to 
experiment cluster and tested again.

When starting catch up read consumer, network thread got busier again, but 
utilization was bit lower. And per-client request-time metric showed that this 
time producer (normal-client) didn't affected:

!https://wiki.linecorp.com/download/attachments/1273840268/image2018-10-14_2-24-48.png?version=1&modificationDate=1539451488884&api=v2|height=250!
   
!https://wiki.linecorp.com/download/attachments/1273840268/image2018-10-14_2-26-29.png?version=1&modificationDate=1539451590155&api=v2|height=202!

The same amount of disk read was occurring. 
!https://wiki.linecorp.com/download/attachments/1273840268/image2018-10-14_2-28-30.png?version=1&modificationDate=1539451711083&api=v2|height=210!

However there was no Produce time degradation and throughput was stable: 
!https://wiki.linecorp.com/download/attachments/1273840268/image2018-10-14_2-30-2.png?version=1&modificationDate=1539451802301&api=v2|height=204!
   
!https://wiki.linecorp.com/download/attachments/1273840268/image2018-10-14_2-30-25.png?version=1&modificationDate=1539451825198&api=v2|height=234!
h2. Suggestion

The performance degradation explained in this issue can happen very easy in any 
Kafka deployments. Either one of following clients can cause Fetch for old 
data, and almost all clients connecting to the same network thread which 
processes that Fetch can be affected. 
- Consumers delaying and processing offset far behind the latest offset.
- Consumers which starts periodically and consumes accumulated topic data at 
once.
- Follower broker which is attempting to restore its replica from current 
leaders.

This patch can fix performance degradation in this scenario. However we still 
have one concern. While the whole patch completes in pure Java code, its 
semantics heavily assumes underlying implementation in kernel. We haven't 
tested this in any OSes others than linux, so not sure if this effects well in 
other platforms as well.
Hence I think its better to add configuration which controls behavior of kafka 
broker wether to allow to do this or not.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to