On Thu, Sep 13, 2012 at 12:25:36AM +0200, Mark Nelson wrote:
> On 09/12/2012 03:08 PM, Dieter Kasper wrote:
> > On Mon, Sep 10, 2012 at 10:39:58PM +0200, Mark Nelson wrote:
> >> On 09/10/2012 03:15 PM, Mike Ryan wrote:
> >>> *Disclaimer*: these results are an investigation into potential
> >>> bottlenecks in RADOS.
> > I appreciate this investigation very much !
> >
> >>> The test setup is wholly unrealistic, and these
> >>> numbers SHOULD NOT be used as an indication of the performance of OSDs,
> >>> messaging, RADOS, or ceph in general.
> >>>
> >>>
> >>> Executive summary: rados bench has some internal bottleneck. Once that's
> >>> cleared up, we're still having some issues saturating a single
> >>> connection to an OSD. Having 2-3 connections in parallel alleviates that
> >>> (either by having > 1 OSD or by having multiple bencher clients).
> >>>
> >>>
> >>> I've run three separate tests: msbench, smalliobench, and rados bench.
> >>> In all cases I was trying to determine where bottleneck(s) exist. All
> >>> the tests were run on a machine with 192 GB of RAM. The backing stores
> >>> for all OSDs and journals are RAMdisks. The stores are running XFS.
> >>>
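For anyone trying to reproduce this setup: the exact commands aren't shown, but RAM-backed
XFS stores and journals can be set up roughly along these lines (device count, sizes, and
mount points below are assumptions, not the configuration actually used):

  # two RAM-backed block devices, 16 GB each (brd's rd_size is in KB)
  modprobe brd rd_nr=2 rd_size=16777216
  # OSD data store on XFS, matching the description above
  mkfs.xfs -f /dev/ram0
  mkdir -p /var/lib/ceph/osd/ceph-0
  mount /dev/ram0 /var/lib/ceph/osd/ceph-0
  # journal on the second ramdisk: point "osd journal" in ceph.conf at /dev/ram1
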
> >>> smalliobench: I ran tests varying the number of OSDs and bencher
> >>> clients. In all cases, the number of PGs per OSD is 100.
> >>>
> >>> OSDs   Benchers   Throughput (mbyte/sec)
> >>>   1       1          510
> >>>   1       2          800
> >>>   1       3          850
> >>>   2       1          640
> >>>   2       2          660
> >>>   2       3          670
> >>>   3       1          780
> >>>   3       2          820
> >>>   3       3          870
> >>>   4       1          850
> >>>   4       2          970
> >>>   4       3          990
> >>>
> >>> Note: these numbers are fairly fuzzy. I eyeballed them, and they're only
> >>> really accurate to within about 10 mbyte/sec. The small IO bencher was run
> >>> with 100 ops in flight, 4 mbyte IOs, and 4 mbyte files.
> >>>
> >>> msbench: ran tests trying to determine the max throughput of the raw
> >>> messaging layer. Varied the number of concurrently connected msbench
> >>> clients and measured aggregate throughput. Take-away: a messaging client
> >>> can very consistently push 400-500 mbyte/sec through a single socket.
> >>>
> >>> Clients   Throughput (mbyte/sec)
> >>>    1          520
> >>>    2          880
> >>>    3         1300
> >>>    4         1900
> >>>
> >>> Finally, rados bench, which seems to have a bottleneck of its own. Running
> >>> varying numbers of rados bench clients, each client seems to get 250
> >>> mbyte/sec until the aggregate rate reaches around 1000 mbyte/sec
> >>> (approximately line speed as measured by iperf). These were run on a pool
> >>> with 100 PGs per OSD.
> >>>
> >>> Clients   Throughput (mbyte/sec)
> >>>    1          250
> >>>    2          500
> >>>    3          750
> >>>    4         1000   (very fuzzy, probably 1000 +/- 75)
> >>>    5         1000   (seems to level out here)
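The exact rados bench invocations aren't shown; a multi-client run against a pool sized to
roughly 100 PGs per OSD might look something like this (pool name, PG count, duration, and
the iperf server address are assumptions; -t sets concurrent ops, -b the object size in bytes):

  ceph osd pool create pbench 400 400              # ~100 PGs/OSD with 4 OSDs
  iperf -c <iperf-server> -t 20                    # raw line-speed baseline
  for i in 1 2 3 4; do
      rados bench -p pbench 20 write -t 16 -b 4194304 &
  done
  wait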
> >>
> >> Hi guys,
> >>
> >> Some background on all of this:
> >>
> >> We've been doing some performance testing at Inktank and noticed that
> >> performance with a single rados bench instance was plateauing at
> >> 600-700 MB/s.
> >
> > 4 nodes with 10GbE interconnect; journals in RAM disk; replica=2
> >
> > # rados bench -p pbench 20 write
> > Maintaining 16 concurrent writes of 4194304 bytes for at least 20 seconds.
> > sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
> > 0 0 0 0 0 0 - 0
> > 1 16 288 272 1087.81 1088 0.051123 0.0571643
> > 2 16 579 563 1125.85 1164 0.045729 0.0561784
> > 3 16 863 847 1129.19 1136 0.042012 0.0560869
> > 4 16 1150 1134 1133.87 1148 0.05466 0.0559281
> > 5 16 1441 1425 1139.87 1164 0.036852 0.0556809
> > 6 16 1733 1717 1144.54 1168 0.054594 0.0556124
> > 7 16 2007 1991 1137.59 1096 0.04454 0.0556698
> > 8 16 2290 2274 1136.88 1132 0.046777 0.0560103
> > 9 16 2580 2564 1139.44 1160 0.073328 0.0559353
> > 10 16 2871 2855 1141.88 1164 0.034091 0.0558576
> > 11 16 3158 3142 1142.43 1148 0.250688 0.0558404
> > 12 16 3445 3429 1142.88 1148 0.046941 0.0558071
> > 13 16 3726 3710 1141.42 1124 0.054092 0.0559
> > 14 16 4014 3998 1142.17 1152 0.03531 0.0558533
> > 15 16 4298 4282 1141.75 1136 0.040005 0.0559383
> > 16 16 4582 4566 1141.39 1136 0.048431 0.0559162
> > 17 16 4859 4843 1139.42 1108 0.045805 0.0559891
> > 18 16 5145 5129 1139.66 1144 0.046805 0.0560177
> > 19 16 5422 5406 1137.99 1108 0.037295 0.0561341
> > 2012-09-08 14:36:32.460311 min lat: 0.029503 max lat: 0.47757 avg lat: 0.0561424
> > sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
> > 20 16 5701 5685 1136.89 1116 0.041493 0.0561424
> > Total time run: 20.197129
> > Total writes made: 5702
> > Write size: 4194304
> > Bandwidth (MB/sec): 1129.269
> >
> > Stddev Bandwidth: 23.7487
> > Max bandwidth (MB/sec): 1168
> > Min bandwidth (MB/sec): 1088
> > Average Latency: 0.0564675
> > Stddev Latency: 0.0327582
> > Max latency: 0.47757
> > Min latency: 0.029503
> >
> >
> > Best Regards,
> > -Dieter
> >
>
> Well look at that! :) Now I've gotta figure out what the difference is.
> How fast are the CPUs in your rados bench machine there?
One CPU socket in each node:
model name : Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz
Logical CPUs: 12
MemTotal: 32856332 kB
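(Those figures look like straight /proc output; the same info can be pulled with:

  grep -m1 'model name' /proc/cpuinfo     # CPU model
  grep -c ^processor /proc/cpuinfo        # logical CPU count
  grep MemTotal /proc/meminfo             # total RAM
)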
>
> Also, I should mention that at these speeds, we noticed that crc32c
> calculations were actually having a pretty big effect.
perf report
Events: 39K cycles
+  26.29%  ceph-osd  ceph-osd              [.] 0x45e60b
+   4.74%  ceph-osd  [kernel.kallsyms]     [k] copy_user_generic_string
+   3.37%  ceph-mon  ceph-mon              [.] MHeartbeat::decode_payload()
+   2.88%  ceph-osd  [kernel.kallsyms]     [k] futex_wake
+   2.61%  swapper   [kernel.kallsyms]     [k] intel_idle
+   2.34%  ceph-osd  [kernel.kallsyms]     [k] __memcpy
+   1.71%  ceph-osd  libc-2.11.3.so        [.] memcpy
+   1.70%  ceph-osd  [kernel.kallsyms]     [k] __copy_user_nocache
+   1.66%  ceph-osd  [kernel.kallsyms]     [k] futex_requeue
+   1.33%  ceph-mon  ceph-mon              [.] MOSDOpReply::~MOSDOpReply()
+   1.18%  ceph-mon  libc-2.11.3.so        [.] memcpy
+   1.16%  ceph-mon  ceph-mon              [.] MOSDPGInfo::decode_payload()
+   0.97%  ceph-osd  [kernel.kallsyms]     [k] futex_wake_op
+   0.86%  ceph-mon  ceph-mon              [.] MExportDirDiscoverAck::print(std::ostream&) const
+   0.79%  ceph-osd  [kernel.kallsyms]     [k] _raw_spin_lock
+   0.74%  ceph-mon  ceph-mon              [.] MOSDPing::decode_payload()
+   0.52%  ceph-osd  libtcmalloc.so.0.3.0  [.] operator new(unsigned long)
+   0.51%  ceph-mon  ceph-mon              [.] MDiscover::print(std::ostream&) const
+   0.48%  ceph-osd  [xfs]                 [k] xfs_bmap_add_extent
+   0.43%  ceph-mon  [kernel.kallsyms]     [k] copy_user_generic_string
+   0.39%  ceph-osd  [kernel.kallsyms]     [k] iov_iter_fault_in_readable
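
The record step isn't shown above; a system-wide profile like this is typically gathered
with something along the lines of:

  perf record -a -g -- sleep 30    # sample all CPUs while the benchmark runs
  perf report                      # then browse the samples, as above

Installing the ceph debug symbols would also resolve that 26% hot spot at 0x45e60b in
ceph-osd, which may well turn out to be the crc32c work Mark mentions.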
Regards,
-Dieter
> Turning them off
> gave us a 10% performance boost. We're looking at faster
> implementations now.
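For reference: if memory serves, the messenger CRCs in this generation of Ceph can be
switched off via the ms nocrc option (option name from memory, worth double-checking):

  [global]
      ms nocrc = true    # assumed option name; disables messenger crc32c

The E5-2630s above are Sandy Bridge parts with SSE4.2, so a hardware crc32c path (the
SSE4.2 crc32 instruction uses the Castagnoli polynomial) should be an option on this
hardware; a quick check:

  grep -o -m1 sse4_2 /proc/cpuinfo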
>
> Mark