Thanks for the tuning tips, Bob. I'll play with them after solidifying some
of my other fixes (another 24-48 hours before my migration to 1024
placement groups is finished).

Glad you enjoy ridewithgps; shoot me an email if you have any
questions/ideas/needs :)

On Fri, Feb 5, 2016 at 4:42 PM, Bob R <b...@drinksbeer.org> wrote:

> Cullen,
>
> We operate a 4-node cluster; each node has 2x E5-2630, 64GB RAM and 10x 4TB
> spinners. We recently replaced the 2x m550 journals with a single P3700 NVMe
> drive per server and didn't see the performance gains we were hoping for.
> After making the changes below we're now seeing significantly better 4k
> performance. Unfortunately we pushed all of them at once, so I wasn't able
> to break down the improvement per option, but you might want to take a look
> at some of these.
>
> before:
> [cephuser@ceph03 ~]$ rados -p one bench 120 rand -t 64
> Total time run:       120.001910
> Total reads made:     1530642
> Read size:            4096
> Bandwidth (MB/sec):   49.8
> Average IOPS:         12755
> Stddev IOPS:          1272
> Max IOPS:             14087
> Min IOPS:             8165
> Average Latency:      0.005
> Max latency:          0.307
> Min latency:          0.000411
>
> after:
> [cephuser@ceph03 ~]$ rados -p one bench 120 rand -t 64
> Total time run:       120.004069
> Total reads made:     4285054
> Read size:            4096
> Bandwidth (MB/sec):   139
> Average IOPS:         35707
> Stddev IOPS:          6282
> Max IOPS:             40917
> Min IOPS:             3815
> Average Latency:      0.00178
> Max latency:          1.73
> Min latency:          0.000239
>
> [bobr@bobr ~]$ diff ceph03-before ceph03-after
> 6,8c6,8
> <     "debug_lockdep": "0\/1",
> <     "debug_context": "0\/1",
> <     "debug_crush": "1\/1",
> ---
> >     "debug_lockdep": "0\/0",
> >     "debug_context": "0\/0",
> >     "debug_crush": "0\/0",
> 15,17c15,17
> <     "debug_buffer": "0\/1",
> <     "debug_timer": "0\/1",
> <     "debug_filer": "0\/1",
> ---
> >     "debug_buffer": "0\/0",
> >     "debug_timer": "0\/0",
> >     "debug_filer": "0\/0",
> 19,21c19,21
> <     "debug_objecter": "0\/1",
> <     "debug_rados": "0\/5",
> <     "debug_rbd": "0\/5",
> ---
> >     "debug_objecter": "0\/0",
> >     "debug_rados": "0\/0",
> >     "debug_rbd": "0\/0",
> 26c26
> <     "debug_osd": "0\/5",
> ---
> >     "debug_osd": "0\/0",
> 29c29
> <     "debug_filestore": "1\/3",
> ---
> >     "debug_filestore": "0\/0",
> 31,32c31,32
> <     "debug_journal": "1\/3",
> <     "debug_ms": "0\/5",
> ---
> >     "debug_journal": "0\/0",
> >     "debug_ms": "0\/0",
> 34c34
> <     "debug_monc": "0\/10",
> ---
> >     "debug_monc": "0\/0",
> 36,37c36,37
> <     "debug_tp": "0\/5",
> <     "debug_auth": "1\/5",
> ---
> >     "debug_tp": "0\/0",
> >     "debug_auth": "0\/0",
> 39,41c39,41
> <     "debug_finisher": "1\/1",
> <     "debug_heartbeatmap": "1\/5",
> <     "debug_perfcounter": "1\/5",
> ---
> >     "debug_finisher": "0\/0",
> >     "debug_heartbeatmap": "0\/0",
> >     "debug_perfcounter": "0\/0",
> 132c132
> <     "ms_dispatch_throttle_bytes": "104857600",
> ---
> >     "ms_dispatch_throttle_bytes": "1048576000",
> 329c329
> <     "objecter_inflight_ops": "1024",
> ---
> >     "objecter_inflight_ops": "10240",
> 506c506
> <     "osd_op_threads": "4",
> ---
> >     "osd_op_threads": "20",
> 510c510
> <     "osd_disk_threads": "4",
> ---
> >     "osd_disk_threads": "1",
> 697c697
> <     "filestore_max_inline_xattr_size": "0",
> ---
> >     "filestore_max_inline_xattr_size": "254",
> 701c701
> <     "filestore_max_inline_xattrs": "0",
> ---
> >     "filestore_max_inline_xattrs": "6",
> 708c708
> <     "filestore_max_sync_interval": "5",
> ---
> >     "filestore_max_sync_interval": "10",
> 721,724c721,724
> <     "filestore_queue_max_ops": "1000",
> <     "filestore_queue_max_bytes": "209715200",
> <     "filestore_queue_committing_max_ops": "1000",
> <     "filestore_queue_committing_max_bytes": "209715200",
> ---
> >     "filestore_queue_max_ops": "500",
> >     "filestore_queue_max_bytes": "1048576000",
> >     "filestore_queue_committing_max_ops": "5000",
> >     "filestore_queue_committing_max_bytes": "1048576000",
> 758,761c758,761
> <     "journal_max_write_bytes": "10485760",
> <     "journal_max_write_entries": "100",
> <     "journal_queue_max_ops": "300",
> <     "journal_queue_max_bytes": "33554432",
> ---
> >     "journal_max_write_bytes": "1048576000",
> >     "journal_max_write_entries": "1000",
> >     "journal_queue_max_ops": "3000",
> >     "journal_queue_max_bytes": "1048576000",
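>
> In case it's useful, this is roughly how we rolled a few of these out for
> testing before persisting them in ceph.conf (a sketch only; values are what
> worked for us, and some options need an OSD restart to actually apply):
>
>   # silence most debug logging on every OSD at runtime
>   ceph tell osd.* injectargs '--debug-ms 0/0 --debug-osd 0/0 --debug-filestore 0/0'
>
>   # then make it permanent, e.g. in the [osd] section of ceph.conf
>   [osd]
>   debug ms = 0/0
>   debug osd = 0/0
>   debug filestore = 0/0
>   filestore max sync interval = 10
>   journal max write entries = 1000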
>
> Good luck,
> Bob
>
> PS. thanks for ridewithgps :)
>
>
> On Thu, Feb 4, 2016 at 7:56 PM, Christian Balzer <ch...@gol.com> wrote:
>
>>
>> Hello,
>>
>> On Thu, 4 Feb 2016 08:44:25 -0800 Cullen King wrote:
>>
>> > Replies in-line:
>> >
>> > On Wed, Feb 3, 2016 at 9:54 PM, Christian Balzer
>> > <c-bal...@fusioncom.co.jp> wrote:
>> >
>> > >
>> > > Hello,
>> > >
>> > > On Wed, 3 Feb 2016 17:48:02 -0800 Cullen King wrote:
>> > >
>> > > > Hello,
>> > > >
>> > > > I've been trying to nail down a nasty performance issue related to
>> > > > scrubbing. I am mostly using radosgw with a handful of buckets
>> > > > containing millions of various sized objects. When ceph scrubs, both
>> > > > regular and deep, radosgw blocks on external requests, and my
>> > > > cluster has a bunch of requests that have blocked for > 32 seconds.
>> > > > Frequently OSDs are marked down.
>> > > >
>> > > From my own (painful) experiences let me state this:
>> > >
>> > > 1. When your cluster runs out of steam during deep-scrubs, drop what
>> > > you're doing and order more HW (OSDs).
>> > > Because this is a sign that it would also be in trouble when doing
>> > > recoveries.
>> > >
>> >
>> > When I've initiated recoveries from working on the hardware the cluster
>> > hasn't had a problem keeping up. It seems that it only has a problem with
>> > scrubbing, meaning it feels like the IO pattern is drastically
>> > different. I would think that with scrubbing I'd see something closer to
>> > bursty sequential reads, rather than just thrashing the drives with a
>> > more random IO pattern, especially given our low cluster utilization.
>> >
>> It's probably more pronounced when phasing in/out entire OSDs, where it
>> also has to read the entire (primary) data off it.
>>
>> >
>> > >
>> > > 2. If you cluster is inconvenienced by even mere scrubs, you're really
>> > > in trouble.
>> > > Threaten the penny pincher with bodily violence and have that new HW
>> > > phased in yesterday.
>> > >
>> >
>> > I am the penny pincher, biz owner, dev and ops guy for
>> > http://ridewithgps.com :) More hardware isn't an issue, it just feels
>> > pretty crazy to have performance this low on a 12-OSD system. Granted,
>> > that feeling isn't backed by anything concrete! In general, I like to
>> > understand the problem before I solve it with hardware, though I am
>> > definitely not averse to it. I already ordered 6 more 4tb drives along
>> > with the new journal SSDs, anticipating the need.
>> >
>> > As you can see from the output of ceph status, we are not space hungry by
>> > any means.
>> >
>>
>> Well, in Ceph having just one OSD pegged to its max will (eventually) impact
>> everything that needs to read/write primary PGs on it.
>>
>> More below.
>>
>> >
>> > >
>> > > > According to atop, the OSDs being deep scrubbed are reading at only
>> > > > 5mb/s to 8mb/s, and a scrub of a 6.4gb placement group takes 10-20
>> > > > minutes.
>> > > >
>> > > > Here's a screenshot of atop from a node:
>> > > > https://s3.amazonaws.com/rwgps/screenshots/DgSSRyeF.png
>> > > >
>> > > This looks familiar.
>> > > Basically at this point in time the competing read requests for all the
>> > > objects clash with write requests and completely saturate your HDD
>> > > (about 120 IOPS and 85% busy according to your atop screenshot).
>> > >
>> >
>> > In your experience would the scrub operation benefit from a bigger
>> > readahead? Meaning is it more sequential than random reads? I already
>> > bumped /sys/block/sd{x}/queue/read_ahead_kb to 512kb.
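>> >
>> > For reference, roughly what I ran (device letters are just examples, and
>> > the setting doesn't survive a reboot without a udev rule or similar):
>> >
>> >   # as root, on each OSD node
>> >   for dev in sdb sdc sdd sde; do
>> >       echo 512 > /sys/block/$dev/queue/read_ahead_kb
>> >   done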
>> >
>> I played with that a long time ago (in benchmark scenarios) and didn't see
>> any noticeable improvement.
>> Deep-scrub might benefit (fragmentation could hurt it, though), regular
>> scrub not so much.
>>
>> > About half of our reads are on objects with an average size of 40kb (map
>> > thumbnails), and the other half are on photo thumbs with a size between
>> > 10kb and 150kb.
>> >
>>
>> Noted, see below.
>>
>> > After doing a little more researching, I came across this:
>> >
>> >
>> > http://tracker.ceph.com/projects/ceph/wiki/Optimize_Newstore_for_massive_small_object_storage
>> >
>> > Sounds like I am probably running into issues with lots of random read
>> > IO, combined with known issues around small files. To give an idea, I
>> > have about 15 million small map thumbnails stored in my two largest
>> > buckets, and I am pushing out about 30 requests per second right now
>> > from those two buckets.
>> >
>> This is certainly a factor, but that knowledge of a future improvement
>> won't help you with your current problem of course. ^_-
>>
>> >
>> >
>> > > There are ceph configuration options that can mitigate this to some
>> > > extent and which I don't see in your config, like
>> > > "osd_scrub_load_threshold" and "osd_scrub_sleep", along with the
>> > > various IO priority settings.
>> > > However the points above still stand.
>> > >
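>> > > Untested as written, but something along these lines in ceph.conf would
>> > > be my starting point (values purely illustrative, not recommendations):
>> > >
>> > >   [osd]
>> > >   # skip starting new scrubs while the load average is above this
>> > >   osd scrub load threshold = 2.5
>> > >   # sleep this many seconds between scrub chunks
>> > >   osd scrub sleep = 0.1
>> > >   # deprioritize the disk/scrub thread (only honoured with CFQ)
>> > >   osd disk thread ioprio class = idle
>> > >   osd disk thread ioprio priority = 7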
>> >
>> > Yes, I have a running series of notes of config options to try out, just
>> > wanted to touch base with other community members before shooting in the
>> > dark.
>> >
>> osd_scrub_sleep is probably the most effective immediately available
>> option for you to prevent slow, stalled IO.
>> At the obvious cost of scrubs taking even longer.
>> There is of course also the option to disable scrubs entirely until your
>> HW has been upgraded.
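>>
>> A rough sketch of both, runtime only (adjust the sleep value to taste):
>>
>>   # throttle scrubbing on all OSDs
>>   ceph tell osd.* injectargs '--osd-scrub-sleep 0.1'
>>
>>   # or park scrubbing entirely until the new HW is in place ...
>>   ceph osd set noscrub
>>   ceph osd set nodeep-scrub
>>   # ... and re-enable it afterwards
>>   ceph osd unset noscrub
>>   ceph osd unset nodeep-scrub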
>>
>> >
>> > >
>> > > XFS defragmentation might help, significantly if your FS is badly
>> > > fragmented. But again, this is only a temporary band-aid.
>> > >
>> > > > First question: is this a reasonable speed for scrubbing, given a
>> > > > very lightly used cluster? Here's some cluster details:
>> > > >
>> > > > deploy@drexler:~$ ceph --version
>> > > > ceph version 0.94.1-5-g85a68f9
>> > > > (85a68f9a8237f7e74f44a1d1fbbd6cb4ac50f8e8)
>> > > >
>> > > >
>> > > > 2x Xeon E5-2630 per node, 64gb of ram per node.
>> > > >
>> > > More memory can help by keeping hot objects in the page cache (so the
>> > > actual disks need not be read and can write at their full IOPS
>> > > capacity). A lot of memory (and the correct sysctl settings) will also
>> > > allow for a large SLAB space, keeping all those directory entries and
>> > > other bits in memory without having to go to disk to get them.
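>> > >
>> > > The usual sysctl suspects, purely as an illustration (the right values
>> > > depend on your RAM and workload, so treat these as a starting point):
>> > >
>> > >   # e.g. /etc/sysctl.d/90-ceph-osd.conf
>> > >   # keep dentries/inodes (SLAB) cached more aggressively
>> > >   vm.vfs_cache_pressure = 10
>> > >   # don't swap OSD processes out at the first opportunity
>> > >   vm.swappiness = 1
>> > >   # keep some reserve so the kernel doesn't stall under memory pressure
>> > >   vm.min_free_kbytes = 262144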
>> > >
>> > > You seem to be just fine CPU wise.
>> > >
>> >
>> > I thought about bumping each node up to 128gb of ram as another cheap
>> > insurance policy. I'll try that after the other changes. I'd like to know
>> > why, so I'll try to change one thing at a time, though I am also just
>> > eager to have this thing stable.
>> >
>>
>> For me everything was sweet and dandy as long as all the really hot objects
>> fit in the page cache and the FS bits were all in SLAB (no need to
>> go to disk for an "ls -R").
>>
>> Past that point it all went to molasses land "quickly".
>>
>> >
>> > >
>> > > >
>> > > > deploy@drexler:~$ ceph status
>> > > >     cluster 234c6825-0e2b-4256-a710-71d29f4f023e
>> > > >      health HEALTH_WARN
>> > > >             118 requests are blocked > 32 sec
>> > > >      monmap e1: 3 mons at {drexler=
>> > > > 10.0.0.36:6789/0,lucy=10.0.0.38:6789/0,paley=10.0.0.34:6789/0}
>> > > >             election epoch 296, quorum 0,1,2 paley,drexler,lucy
>> > > >      mdsmap e19989: 1/1/1 up {0=lucy=up:active}, 1 up:standby
>> > > >      osdmap e1115: 12 osds: 12 up, 12 in
>> > > >       pgmap v21748062: 1424 pgs, 17 pools, 3185 GB data, 20493
>> > > > kobjects 10060 GB used, 34629 GB / 44690 GB avail
>> > > >                 1422 active+clean
>> > > >                    1 active+clean+scrubbing+deep
>> > > >                    1 active+clean+scrubbing
>> > > >   client io 721 kB/s rd, 33398 B/s wr, 53 op/s
>> > > >
>> > > You want to avoid having scrubs going on willy-nilly in parallel and at
>> > > high peak times, even IF your cluster is capable of handling them.
>> > >
>> > > Depending on how busy your cluster is and its usage pattern, you may do
>> > > what I did.
>> > > Kick off a deep scrub of all OSDs ("ceph osd deep-scrub \*") at around
>> > > 01:00 on a Saturday morning.
>> > > If your cluster is fast enough, it will finish before 07:00 (without
>> > > killing your client performance) and all regular scrubs will now
>> > > happen in that time frame as well (given default settings).
>> > > If your cluster isn't fast enough, see my initial 2 points. ^o^
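>> > >
>> > > For example an /etc/cron.d entry roughly like this (timing and the
>> > > escaping of '*' are just how I'd write it, adjust to your quiet hours):
>> > >
>> > >   # m h dom mon dow user command
>> > >   0 1 *  *   6   root  ceph osd deep-scrub \*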
>> > >
>> >
>> > The problem is our cluster is the image and upload store for our site,
>> > which is a reasonably busy international site. We have about 60% of
>> > our customers in North America, and 30% or so in Europe and Asia. We
>> > definitely would be better off with more scrubs between 11pm and 7am
>> > (-8 to 0 GMT), though we can't afford to slam the cluster.
>> >
>> > I suppose that our cluster is a much more random mix of reads than many
>> > others using ceph as a RBD store. Operating systems probably have a
>> > stronger mix of sequential reads, whereas our users are concurrently
>> > viewing different pages with different images, a more random workload.
>> >
>> > It sounds like we have to keep cluster storage utilization below 25% in
>> > order to have reasonable performance. I guess this makes sense, as we
>> > have much greater random IO needs than storage needs.
>> >
>> In your use case (and most others) random IOPS tends to be the bottleneck
>> long before either space or sequential bandwidth becomes an issue.
>>
>> More spindles, more IOPS. See below. ^o^
>>
>> >
>> > >
>> > > > deploy@drexler:~$ ceph osd tree
>> > > > ID WEIGHT   TYPE NAME        UP/DOWN REWEIGHT PRIMARY-AFFINITY
>> > > > -1 43.67999 root default
>> > > > -2 14.56000     host paley
>> > > >  0  3.64000         osd.0         up  1.00000          1.00000
>> > > >  3  3.64000         osd.3         up  1.00000          1.00000
>> > > >  6  3.64000         osd.6         up  1.00000          1.00000
>> > > >  9  3.64000         osd.9         up  1.00000          1.00000
>> > > > -3 14.56000     host lucy
>> > > >  1  3.64000         osd.1         up  1.00000          1.00000
>> > > >  4  3.64000         osd.4         up  1.00000          1.00000
>> > > >  7  3.64000         osd.7         up  1.00000          1.00000
>> > > > 11  3.64000         osd.11        up  1.00000          1.00000
>> > > > -4 14.56000     host drexler
>> > > >  2  3.64000         osd.2         up  1.00000          1.00000
>> > > >  5  3.64000         osd.5         up  1.00000          1.00000
>> > > >  8  3.64000         osd.8         up  1.00000          1.00000
>> > > > 10  3.64000         osd.10        up  1.00000          1.00000
>> > > >
>> > > >
>> > > > My OSDs are 4tb 7200rpm Hitachi DeskStars, using XFS, with Samsung
>> > > > 850 Pro journals (very slow, ordered s3700 replacements, but
>> > > > shouldn't pose problems for reading as far as I understand things).
>> > >
>> > > Just to make sure, these are genuine DeskStars?
>> > > I'm asking both because AFAIK they are out of production and their
>> > > direct successors, the Toshiba DT drives (can) have a nasty firmware
>> > > bug that totally ruins their performance (from ~8 hours per week to
>> > > permanently until power-cycled).
>> > >
>> >
>> > These are original DeskStars. I didn't realize they weren't in production;
>> > I just grabbed 6 more of the Hitachi DeskStar NAS edition 4tb drives,
>> > which are readily available. I probably should have ordered 6tb drives,
>> > as I'd end up with better seek times due to them not being fully
>> > utilized - the data would reside closer to the center of the platters.
>> >
>> Ah, Deskstar NAS, yes, these still are in production.
>>
>> I'd get more, smaller, faster HDDs instead.
>> HW cache on your controller can also help (depends on the model/FW whether
>> it is used efficiently in JBOD mode).
>>
>> And since your space utilization is small (though that can and will change
>> over time, of course), you may very well benefit from going SSD.
>>
>> SSD pools make sense if you think you can fit (economically) a set of your
>> high-access data, like the thumbnails, on them.
>>
>> SSD cache tiers are a bit more dubious when it comes to rewards, but that
>> depends a lot on the hot data set.
>> Plenty of discussion in here about that.
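>>
>> If you do go down the cache tier road, the basic plumbing looks roughly
>> like this (pool names are placeholders here, and you would want to set
>> hit_set and target limits sensibly before trusting it with real traffic):
>>
>>   ceph osd tier add thumbs-pool ssd-cache
>>   ceph osd tier cache-mode ssd-cache writeback
>>   ceph osd tier set-overlay thumbs-pool ssd-cache
>>   ceph osd pool set ssd-cache hit_set_type bloom
>>   ceph osd pool set ssd-cache target_max_bytes 500000000000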
>>
>> Regards,
>>
>> Christian
>> >
>> > >
>> > > Regards,
>> > >
>> > > Christian
>> > > > MONs are co-located with OSD nodes, but the nodes are fairly beefy and
>> > > > have very low load.
>> > > > Drives are on an expander backplane, with an LSI SAS3008 controller.
>> > > >
>> > > > I have a fairly standard config as well:
>> > > >
>> > > > https://gist.github.com/kingcu/aae7373eb62ceb7579da
>> > > >
>> > > > I know that I don't have a ton of OSDs, but I'd expect a little
>> > > > better performance than this. Check out munin for my three nodes:
>> > > >
>> > > >
>> > > > http://munin.ridewithgps.com/ridewithgps.com/drexler.ridewithgps.com/index.html#disk
>> > > >
>> > > > http://munin.ridewithgps.com/ridewithgps.com/paley.ridewithgps.com/index.html#disk
>> > > >
>> > > > http://munin.ridewithgps.com/ridewithgps.com/lucy.ridewithgps.com/index.html#disk
>> > > >
>> > > >
>> > > > Any input would be appreciated, before I start trying to
>> > > > micro-optimize config params, as well as upgrading to Infernalis.
>> > > >
>> > > >
>> > > > Cheers,
>> > > >
>> > > > Cullen
>> > >
>> > >
>> > > --
>> > > Christian Balzer        Network/Systems Engineer
>> > > ch...@gol.com           Global OnLine Japan/Rakuten Communications
>> > > http://www.gol.com/
>> > >
>>
>>
>> --
>> Christian Balzer        Network/Systems Engineer
>> ch...@gol.com           Global OnLine Japan/Rakuten Communications
>> http://www.gol.com/
>>
>
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
