[
https://issues.apache.org/jira/browse/KAFKA-545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488366#comment-13488366
]
Jay Kreps commented on KAFKA-545:
---------------------------------
The initial set of results shows the following:
Big impacts on the write path are
1. flush
2. FileChannel overhead
3. CRC calculation
Big impacts on the read path are
1. FileMessageSet.searchFor
2. MessageSet iteration (mostly an artifact of the test)
Notable is that the index lookup doesn't show up at all.
Since our read path is an order of magnitude faster than our write path, it
makes sense to focus on writes first. The most important thing here is to
handle flush better.
The important thing to know is that on Linux fsync holds a global lock on
writes to the file, so a flush effectively blocks all appends (even though we
only really want to flush the data already written). We are primarily
targeting Linux here since it is so common.
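To make this concrete, here is a minimal, illustrative sketch (plain Java NIO, not Kafka code) of the effect: one thread appends to a FileChannel and reports its slowest append each second while the main thread calls force() every 2 seconds. If fsync really does hold a global write lock, the appends issued while force() is running should show a matching latency spike. The file name and record size are arbitrary choices for the sketch.

import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class FsyncAppendBlockingSketch {
    public static void main(String[] args) throws Exception {
        final FileChannel channel =
            new RandomAccessFile("append-blocking-test.dat", "rw").getChannel();

        // Appender: write 1KB records in a loop and report the slowest append each second.
        Thread appender = new Thread(() -> {
            ByteBuffer record = ByteBuffer.allocate(1024);
            long position = 0;
            double maxMs = 0;
            long lastReport = System.currentTimeMillis();
            try {
                while (true) {
                    record.clear();
                    long start = System.nanoTime();
                    position += channel.write(record, position);
                    maxMs = Math.max(maxMs, (System.nanoTime() - start) / 1000000.0);
                    if (System.currentTimeMillis() - lastReport > 1000) {
                        System.out.println("max append ms: " + maxMs);
                        maxMs = 0;
                        lastReport = System.currentTimeMillis();
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
        appender.setDaemon(true);
        appender.start();

        // Flusher (main thread): fsync the same channel every 2 seconds.
        while (true) {
            Thread.sleep(2000);
            long start = System.nanoTime();
            channel.force(true);
            System.out.println("flush completed in " + (System.nanoTime() - start) / 1000000.0 + " ms");
        }
    }
}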
Originally we thought that we could just disable flush and depend on
replication for durability, with the flush to disk happening via Linux's
background pdflush process. However, recovery demands a "known stable point"
from which to recover. Currently we guarantee that all but the last log segment
have been flushed. The problem with this is that disabling the time- or
size-based flush just defers the global lock until the segment is full, and
under high load that could mean suddenly blocking to write out 2GB of data--a
huge multi-second latency spike.
There are a number of possible solutions to this dilemma:
1. Memory map the FileMessageSet. It is possible that the locking
characteristics of mmap/msync are different from fsync. This would have the
added advantage of getting rid of the overhead of the write call, which in Java
is pretty high, and thus would make FileMessageSet.searchFor and iteration much
faster. There are two disadvantages. First, we would have to pre-allocate file
segments, which would likely confuse people a bit; we already do this with the
index files, but those are smaller. Second, memory mapping eats up process
address space, which would likely make running on a 32-bit OS infeasible
(since you would only have 2GB of address space). A minimal sketch of this
approach is included after this list.
2. Change the recovery to recover from an arbitrary point in the log, and write
out a "high water mark" in a similar way to how we do for the ISR. This would
let us just avoid syncing the active segment.
3. Move the flush into a background thread. This wouldn't help if one were
using a time-based or size-based flush, but it would help for the flush at the
time of segment roll, since at that point there are guaranteed to be no more
writes on the segment. This would require recovering the last two segments.
This is kind of hacky but easy to implement.
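Here is the sketch referenced in option 1 above: a minimal, illustrative example (plain Java NIO, not Kafka code) of pre-allocating a segment, mapping it, appending by memory copy, and using MappedByteBuffer.force() as the msync analogue. The file name and sizes are arbitrary.

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;

public class MmapSegmentSketch {
    public static void main(String[] args) throws Exception {
        // mmap requires knowing the mapped length up front, so the segment file
        // is pre-allocated to a fixed size (64MB here, just for the sketch).
        long segmentSize = 64 * 1024 * 1024;
        RandomAccessFile file = new RandomAccessFile("segment.log", "rw");
        file.setLength(segmentSize);
        FileChannel channel = file.getChannel();

        // Map the whole segment into the process address space. On a 32-bit JVM
        // this is what would exhaust the ~2GB of usable address space.
        MappedByteBuffer segment = channel.map(FileChannel.MapMode.READ_WRITE, 0, segmentSize);

        // An append is now a plain memory copy rather than a write() system call,
        // which is also what would speed up searchFor and iteration on the read side.
        byte[] message = "a test message".getBytes(StandardCharsets.UTF_8);
        segment.put(message);

        // force() is the msync analogue; the open question is whether it blocks
        // concurrent writers to the mapping the way fsync blocks appends to the channel.
        segment.force();

        channel.close();
        file.close();
    }
}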
So there are two questions this raises:
1. Does mmap/msync actually have different locking than fsync? Mmap might be
the preferable solution since it solves lots of performance bottlenecks all at
once.
2. Does fsync block reads? This is a question Jun raised. We don't actually
care too much about adding a little bit of latency to consumers, but with
synchronous replication fetch latency is a lower bound on produce latency. So
blocking reads may be as bad as blocking writes. It is unlikely that an in-sync
replica would be a full segment behind, but nonetheless.
I started to investigate these two questions, beginning with (2). It doesn't
seem that fsync blocks reads, which is intuitive. To test this I wrote a test
that has one thread continually appending to a file, one thread calling fsync
every 2 seconds, and one thread doing random reads. I measured the max read
time over each batch of 100k reads. Here is a trace of the output:
flushing
flush completed in 0.35281
0.226049
1.347807
0.02298
1.562114
0.041638
2.119723
0.027258
5.329019
flushing
8.32156
flush completed in 81.123215
0.04145
1.473818
0.06444
1.733412
0.050216
1.437777
0.04984
1.612728
0.026001
1.858957
0.041096
0.390903
flushing
12.527883
flush completed in 67.416953
0.055656
1.450987
0.029861
1.469376
0.047733
1.313674
0.024264
1.71214
0.027112
0.023717
1.228905
0.029688
1.215998
flushing
flush completed in 59.289193
18.217726
1.549095
0.029295
1.367316
0.047124
1.389574
0.034049
0.030982
1.129182
0.048443
1.070381
0.040149
1.07179
flushing
flush completed in 59.340792
18.668898
0.933095
0.041071
1.197376
0.035512
1.34228
0.042432
0.024397
0.835786
0.026552
1.496774
0.036751
1.148597
flushing
flush completed in 59.308117
16.34416
0.854841
0.053005
1.013405
0.08081
0.051634
1.218344
0.015086
1.447114
0.019883
1.128675
0.041854
1.148591
flushing
15.110585
flush completed in 47.303732
1.018977
0.015041
0.036324
1.293796
0.051184
1.291538
0.013544
1.211112
0.014241
1.520512
0.027815
1.246593
flushing
0.016121
flush completed in 59.38031
22.635984
0.051233
0.054701
0.712837
0.01345
1.004364
0.017261
1.216081
0.019825
^C[jkreps@jkreps-ld kafka-jbod]$ java -server -Xmx128M -Xms128M
-XX:+UseConcMarkSweepGC -cp
project/boot/scala-2.8.0/lib/scala-library.jar:core/target/scala_2.8.0/test-classes
kafka.TestFileChannelReadLocking 100000
flushing
flush completed in 0.528026
0.217155
1.620644
0.034775
0.032913
1.867401
0.026142
2.367951
0.05301
flushing
3.941371
flush completed in 252.11533
17.678379
1.226629
0.043844
2.254627
0.05084
1.654637
0.028077
0.029319
1.217127
0.029779
1.251271
0.281076
0.919776
flushing
0.050014
1.156282
flush completed in 238.852776
5.953432
0.038438
0.966883
0.048641
0.907416
0.037052
1.595778
0.023821
0.923264
0.047909
0.921312
0.058346
0.058062
flushing
0.906503
0.090477
flush completed in 239.282906
1.504453
0.014805
1.276596
0.051536
1.200947
0.052367
0.068161
1.585333
0.051904
1.052337
0.063165
1.502294
0.01266
flushing
0.839178
0.048978
flush completed in 290.738005
1.414586
0.069402
0.056503
1.0008
0.050265
0.955949
0.050486
1.014454
0.048935
1.210959
0.054599
1.313663
0.058152
flushing
0.062286
1.036941
flush completed in 242.879275
11.401024
1.390022
0.07374
0.92633
0.013332
1.015606
0.04448
1.687692
0.014552
0.018272
1.339258
0.051723
^C
As you can see, there is some locking happening, but it is not held for the
duration of the flush. I tried varying the amount of data being written and
the max read time remains constant. My guess is that the locking is at the
page level, which is what we see with pdflush. This should be acceptable since
the latency is bounded by the time to flush one page, regardless of the flush
size.
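For reference, the shape of the test was roughly the following (an illustrative reconstruction in Java, not the actual kafka.TestFileChannelReadLocking source; the 100000 argument in the command line above is presumably the number of reads per sample):

import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.Random;

public class FileChannelReadLockingSketch {
    public static void main(String[] args) throws Exception {
        final int readsPerSample = args.length > 0 ? Integer.parseInt(args[0]) : 100000;
        final FileChannel channel =
            new RandomAccessFile("read-locking-test.dat", "rw").getChannel();

        // Appender thread: continually append 1KB records to the end of the file.
        Thread appender = new Thread(() -> {
            ByteBuffer record = ByteBuffer.allocate(1024);
            long position = 0;
            try {
                while (true) {
                    record.clear();
                    position += channel.write(record, position);
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
        appender.setDaemon(true);
        appender.start();

        // Flusher thread: fsync every 2 seconds and report how long the flush took.
        Thread flusher = new Thread(() -> {
            try {
                while (true) {
                    Thread.sleep(2000);
                    System.out.println("flushing");
                    long start = System.nanoTime();
                    channel.force(true);
                    System.out.println("flush completed in " + (System.nanoTime() - start) / 1000000.0);
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
        flusher.setDaemon(true);
        flusher.start();

        // Reader (main thread): random reads, printing the max latency over each batch.
        ByteBuffer readBuffer = ByteBuffer.allocate(1024);
        Random random = new Random();
        while (true) {
            double maxMs = 0;
            for (int i = 0; i < readsPerSample; i++) {
                long size = channel.size();
                if (size < readBuffer.capacity())
                    continue;
                long offset = (long) (random.nextDouble() * (size - readBuffer.capacity()));
                readBuffer.clear();
                long start = System.nanoTime();
                channel.read(readBuffer, offset);
                maxMs = Math.max(maxMs, (System.nanoTime() - start) / 1000000.0);
            }
            System.out.println(maxMs);
        }
    }
}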
I am working on testing mmap.
> Add a Performance Suite for the Log subsystem
> ---------------------------------------------
>
> Key: KAFKA-545
> URL: https://issues.apache.org/jira/browse/KAFKA-545
> Project: Kafka
> Issue Type: New Feature
> Affects Versions: 0.8
> Reporter: Jay Kreps
> Priority: Blocker
> Labels: features
> Attachments: KAFKA-545-draft.patch
>
>
> We have had several performance concerns or potential improvements for the
> logging subsystem. To evaluate these in a data-driven way, it would be good to
> have a single-machine performance test that isolates the performance of the
> log.
> The performance optimizations we would like to evaluate include
> - Special casing appends in a follower which already have the correct offset
> to avoid decompression and recompression
> - Memory mapping either all or some of the segment files to improve the
> performance of small appends and lookups
> - Supporting multiple data directories and avoiding RAID
> A standalone tool is nice because it isolates the component and makes
> profiling more intelligible.
> This test would drive load against Log/LogManager, controlled by a set of
> command line options. This command line program could then be scripted up
> into a suite of tests covering variations in message size, message set
> size, compression, number of partitions, etc.
> Here is a proposed usage for the tool:
> ./bin/kafka-log-perf-test.sh
> Option          Description
> ------          -----------
> --partitions    The number of partitions to write to
> --dir           The directory in which to write the log
> --message-size  The size of the messages
> --set-size      The number of messages per write
> --compression   Compression algorithm
> --messages      The number of messages to write
> --readers       The number of reader threads reading the data
> The tool would capture latency and throughput for the append() and read()
> operations.