[
https://issues.apache.org/jira/browse/KAFKA-545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488366#comment-13488366
]
Jay Kreps commented on KAFKA-545:
---------------------------------
The initial set of results shows the following:
Big impacts on the write path are
1. flush
2. FileChannel overhead
3. CRC calculation
Big impacts on the read path are
1. FileMessageSet.searchFor
2. MessageSet iteration (mostly an artifact of the test)
Notable is that the index lookup doesn't show up at all.
Since our read path is an order of magnitude faster than our write path, it
makes sense to focus on writes first. The most important thing here is to
handle flush better.
The important thing to know is that on Linux fsync holds a global lock on
writes to the file, so a flush effectively blocks all appends (even though we
only really want to flush the data already written). We are primarily
targeting Linux here since it is so common.
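To make this concrete, here is a minimal, illustrative sketch (plain Java NIO, not Kafka code) of the effect: one thread appends to a FileChannel and reports its slowest append each second while the main thread calls force() every 2 seconds. If fsync really does hold a global write lock, the appends issued while force() is running should show a matching latency spike. The file name and record size are arbitrary choices for the sketch.

import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class FsyncAppendBlockingSketch {
    public static void main(String[] args) throws Exception {
        final FileChannel channel =
            new RandomAccessFile("append-blocking-test.dat", "rw").getChannel();

        // Appender: write 1KB records in a loop and report the slowest append each second.
        Thread appender = new Thread(() -> {
            ByteBuffer record = ByteBuffer.allocate(1024);
            long position = 0;
            double maxMs = 0;
            long lastReport = System.currentTimeMillis();
            try {
                while (true) {
                    record.clear();
                    long start = System.nanoTime();
                    position += channel.write(record, position);
                    maxMs = Math.max(maxMs, (System.nanoTime() - start) / 1000000.0);
                    if (System.currentTimeMillis() - lastReport > 1000) {
                        System.out.println("max append ms: " + maxMs);
                        maxMs = 0;
                        lastReport = System.currentTimeMillis();
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
        appender.setDaemon(true);
        appender.start();

        // Flusher (main thread): fsync the same channel every 2 seconds.
        while (true) {
            Thread.sleep(2000);
            long start = System.nanoTime();
            channel.force(true);
            System.out.println("flush completed in " + (System.nanoTime() - start) / 1000000.0 + " ms");
        }
    }
}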
Originally we thought that we could just disable flush and depend on
replication for durability, with the flush to disk happening via Linux's
background pdflush process. However, recovery demands a "known stable point"
from which to recover. Currently we guarantee that all but the last log segment
have been flushed. The problem with this is that disabling the time- or
size-based flush just defers the global lock until the segment is full, and
under high load that could mean suddenly blocking to write out 2GB of data--a
huge multi-second latency spike.
There are a number of possible solutions to this dilemma:
1. Memory map the FileMessageSet. It is possible that the locking
characteristics of mmap/msync are different from fsync. This would have the
added advantage of getting rid of the overhead of the write call, which in Java
is pretty high, and thus would make FileMessageSet.searchFor and iteration much
faster. There are two disadvantages. First, we would have to pre-allocate file
segments, which would likely confuse people a bit; we already do this with the
index files, but those are smaller. Second, memory mapping eats up process
address space, which would likely make running on a 32-bit OS infeasible
(since you would only have 2GB of address space). A minimal sketch of this
approach is included after this list.
2. Change the recovery to recover from an arbitrary point in the log, and write
out a "high water mark" in a similar way to how we do for the ISR. This would
let us just avoid syncing the active segment.
3. Move the flush into a background thread. This wouldn't help if one were
using a time-based or size-based flush, but it would help for the flush at the
time of segment roll, since at that point there are guaranteed to be no more
writes on the segment. This would require recovering the last two segments.
This is kind of hacky but easy to implement.
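Here is the sketch referenced in option 1 above: a minimal, illustrative example (plain Java NIO, not Kafka code) of pre-allocating a segment, mapping it, appending by memory copy, and using MappedByteBuffer.force() as the msync analogue. The file name and sizes are arbitrary.

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;

public class MmapSegmentSketch {
    public static void main(String[] args) throws Exception {
        // mmap requires knowing the mapped length up front, so the segment file
        // is pre-allocated to a fixed size (64MB here, just for the sketch).
        long segmentSize = 64 * 1024 * 1024;
        RandomAccessFile file = new RandomAccessFile("segment.log", "rw");
        file.setLength(segmentSize);
        FileChannel channel = file.getChannel();

        // Map the whole segment into the process address space. On a 32-bit JVM
        // this is what would exhaust the ~2GB of usable address space.
        MappedByteBuffer segment = channel.map(FileChannel.MapMode.READ_WRITE, 0, segmentSize);

        // An append is now a plain memory copy rather than a write() system call,
        // which is also what would speed up searchFor and iteration on the read side.
        byte[] message = "a test message".getBytes(StandardCharsets.UTF_8);
        segment.put(message);

        // force() is the msync analogue; the open question is whether it blocks
        // concurrent writers to the mapping the way fsync blocks appends to the channel.
        segment.force();

        channel.close();
        file.close();
    }
}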
So there are two questions this raises:
1. Does mmap/msync actually have different locking than fsync? Mmap might be
the preferable solution since it solves lots of performance bottlenecks all at
once.
2. Does fsync block reads? This is a question Jun raised. We don't actually
care too much about adding a little bit of latency to consumers, but with
synchronous replication fetch latency is a lower bound on produce latency. So
blocking reads may be as bad as blocking writes. It is unlikely that an in-sync
replica would be a full segment behind, but nonetheless.
I started to investigate these two questions, beginning with (2). It doesn't
seem that fsync blocks reads, which is intuitive. To test this I wrote a test
that has one thread continually appending to a file, one thread calling fsync
every 2 seconds, and one thread doing random reads. I measured the max read
time over each batch of 100k reads. Here is a trace of the output:
flushing
flush completed in 0.35281
0.226049
1.347807
0.02298
1.562114
0.041638
2.119723
0.027258
5.329019
flushing
8.32156
flush completed in 81.123215
0.04145
1.473818
0.06444
1.733412
0.050216
1.437777
0.04984
1.612728
0.026001
1.858957
0.041096
0.390903
flushing
12.527883
flush completed in 67.416953
0.055656
1.450987
0.029861
1.469376
0.047733
1.313674
0.024264
1.71214
0.027112
0.023717
1.228905
0.029688
1.215998
flushing
flush completed in 59.289193
18.217726
1.549095
0.029295
1.367316
0.047124
1.389574
0.034049
0.030982
1.129182
0.048443
1.070381
0.040149
1.07179
flushing
flush completed in 59.340792
18.668898
0.933095
0.041071
1.197376
0.035512
1.34228
0.042432
0.024397
0.835786
0.026552
1.496774
0.036751
1.148597
flushing
flush completed in 59.308117
16.34416
0.854841
0.053005
1.013405
0.08081
0.051634
1.218344
0.015086
1.447114
0.019883
1.128675
0.041854
1.148591
flushing
15.110585
flush completed in 47.303732
1.018977
0.015041
0.036324
1.293796
0.051184
1.291538
0.013544
1.211112
0.014241
1.520512
0.027815
1.246593
flushing
0.016121
flush completed in 59.38031
22.635984
0.051233
0.054701
0.712837
0.01345
1.004364
0.017261
1.216081
0.019825
^C[jkreps@jkreps-ld kafka-jbod]$ java -server -Xmx128M -Xms128M
-XX:+UseConcMarkSweepGC -cp
project/boot/scala-2.8.0/lib/scala-library.jar:core/target/scala_2.8.0/test-classes
kafka.TestFileChannelReadLocking 100000
flushing
flush completed in 0.528026
0.217155
1.620644
0.034775
0.032913
1.867401
0.026142
2.367951
0.05301
flushing
3.941371
flush completed in 252.11533
17.678379
1.226629
0.043844
2.254627
0.05084
1.654637
0.028077
0.029319
1.217127
0.029779
1.251271
0.281076
0.919776
flushing
0.050014
1.156282
flush completed in 238.852776
5.953432
0.038438
0.966883
0.048641
0.907416
0.037052
1.595778
0.023821
0.923264
0.047909
0.921312
0.058346
0.058062
flushing
0.906503
0.090477
flush completed in 239.282906
1.504453
0.014805
1.276596
0.051536
1.200947
0.052367
0.068161
1.585333
0.051904
1.052337
0.063165
1.502294
0.01266
flushing
0.839178
0.048978
flush completed in 290.738005
1.414586
0.069402
0.056503
1.0008
0.050265
0.955949
0.050486
1.014454
0.048935
1.210959
0.054599
1.313663
0.058152
flushing
0.062286
1.036941
flush completed in 242.879275
11.401024
1.390022
0.07374
0.92633
0.013332
1.015606
0.04448
1.687692
0.014552
0.018272
1.339258
0.051723
^C
As you can see, there is some locking happening, but it is not held for the
duration of the flush. I tried varying the amount of data being written and
the max read time remains constant. My guess is that the locking is at the
page level, which is what we see with pdflush. This should be acceptable since
the latency is bounded by the time to flush one page, regardless of the flush
size.
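For reference, the shape of the test was roughly the following (an illustrative reconstruction in Java, not the actual kafka.TestFileChannelReadLocking source; the 100000 argument in the command line above is presumably the number of reads per sample):

import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.Random;

public class FileChannelReadLockingSketch {
    public static void main(String[] args) throws Exception {
        final int readsPerSample = args.length > 0 ? Integer.parseInt(args[0]) : 100000;
        final FileChannel channel =
            new RandomAccessFile("read-locking-test.dat", "rw").getChannel();

        // Appender thread: continually append 1KB records to the end of the file.
        Thread appender = new Thread(() -> {
            ByteBuffer record = ByteBuffer.allocate(1024);
            long position = 0;
            try {
                while (true) {
                    record.clear();
                    position += channel.write(record, position);
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
        appender.setDaemon(true);
        appender.start();

        // Flusher thread: fsync every 2 seconds and report how long the flush took.
        Thread flusher = new Thread(() -> {
            try {
                while (true) {
                    Thread.sleep(2000);
                    System.out.println("flushing");
                    long start = System.nanoTime();
                    channel.force(true);
                    System.out.println("flush completed in " + (System.nanoTime() - start) / 1000000.0);
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
        flusher.setDaemon(true);
        flusher.start();

        // Reader (main thread): random reads, printing the max latency over each batch.
        ByteBuffer readBuffer = ByteBuffer.allocate(1024);
        Random random = new Random();
        while (true) {
            double maxMs = 0;
            for (int i = 0; i < readsPerSample; i++) {
                long size = channel.size();
                if (size < readBuffer.capacity())
                    continue;
                long offset = (long) (random.nextDouble() * (size - readBuffer.capacity()));
                readBuffer.clear();
                long start = System.nanoTime();
                channel.read(readBuffer, offset);
                maxMs = Math.max(maxMs, (System.nanoTime() - start) / 1000000.0);
            }
            System.out.println(maxMs);
        }
    }
}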
I am working on testing mmap.
> Add a Performance Suite for the Log subsystem
> ---------------------------------------------
>
> Key: KAFKA-545
> URL: https://issues.apache.org/jira/browse/KAFKA-545
> Project: Kafka
> Issue Type: New Feature
> Affects Versions: 0.8
> Reporter: Jay Kreps
> Priority: Blocker
> Labels: features
> Attachments: KAFKA-545-draft.patch
>
>
> We have had several performance concerns or potential improvements for the
> logging subsystem. To evaluate these in a data-driven way, it would be good to
> have a single-machine performance test that isolates the performance of the
> log.
> The performance optimizations we would like to evaluate include
> - Special casing appends in a follower which already have the correct offset
> to avoid decompression and recompression
> - Memory mapping either all or some of the segment files to improve the
> performance of small appends and lookups
> - Supporting multiple data directories and avoiding RAID
> A standalone tool is nice because it isolates the component and makes
> profiling more intelligible.
> This test would drive load against Log/LogManager, controlled by a set of
> command line options. This command line program could then be scripted up
> into a suite of tests covering variations in message size, message set
> size, compression, number of partitions, etc.
> Here is a proposed usage for the tool:
> ./bin/kafka-log-perf-test.sh
> Option          Description
> ------          -----------
> --partitions    The number of partitions to write to
> --dir           The directory in which to write the log
> --message-size  The size of the messages
> --set-size      The number of messages per write
> --compression   Compression algorithm
> --messages      The number of messages to write
> --readers       The number of reader threads reading the data
> The tool would capture latency and throughput for the append() and read()
> operations.