Not reasonable, as you say:
vm.min_free_kbytes = 90112
We're in recovery post-expansion (48 -> 54 OSDs) right now, but free -t shows:
# free -t
              total        used        free      shared  buff/cache   available
Mem:      693097104   378383384    36870080      369292   277843640   250931372
Swap:       1048572         956     1047616
Total:    694145676   378384340    37917696
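For scale, a quick sketch of the unit mismatch (values from this thread; the 2-4 GB floor is Warren's suggestion, not a kernel default):

```python
# Unit sanity check for vm.min_free_kbytes, which is expressed in kB.
current_kb = 90112                # vm.min_free_kbytes on this box
suggested_kb = 2 * 1024 * 1024    # low end of the suggested 2-4 GB range, in kB

print(f"current reserve : {current_kb / 1024:.0f} MiB")             # 88 MiB
print(f"suggested floor : {suggested_kb / (1024 * 1024):.0f} GiB")  # 2 GiB
print(f"shortfall       : {suggested_kb // current_kb}x")
```

So the box is reserving roughly 88 MiB where the advice above is for 2-4 GB, about a 23x gap at the low end.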
On Fri, Jun 24, 2016 at 12:24 PM, Warren Wang - ISD
<[email protected]> wrote:
> Oops, that reminds me, do you have min_free_kbytes set to something
> reasonable like at least 2-4GB?
>
> Warren Wang
>
>
>
> On 6/24/16, 10:23 AM, "Wade Holler" <[email protected]> wrote:
>
>>On vm.vfs_cache_pressure = 1: we had this initially, and I still
>>think it is the best choice for most configs. However, with our large
>>memory footprint, vfs_cache_pressure=1 increased the likelihood of
>>hitting an issue where our write response time would double; a drop
>>of caches would then return response time to normal. I don't claim to
>>totally understand this, and I only have speculation at the moment.
>>Again, thanks for this suggestion; I do think it is best for boxes that
>>don't have very large memory.
>>
>>@Christian - reformatting to btrfs or ext4 is an option in my test
>>cluster. I thought about that but needed to sort out xfs first (that's
>>what production will run right now). You all have helped me do that, and
>>thank you again. I will circle back and test btrfs under the same
>>conditions. I suspect it will behave similarly, but it's only a
>>day and a half's work or so to test.
>>
>>Best Regards,
>>Wade
>>
>>
>>On Thu, Jun 23, 2016 at 8:09 PM, Somnath Roy <[email protected]>
>>wrote:
>>> Oops , typo , 128 GB :-)...
>>>
>>> -----Original Message-----
>>> From: Christian Balzer [mailto:[email protected]]
>>> Sent: Thursday, June 23, 2016 5:08 PM
>>> To: [email protected]
>>> Cc: Somnath Roy; Warren Wang - ISD; Wade Holler; Blair Bethwaite; Ceph
>>>Development
>>> Subject: Re: [ceph-users] Dramatic performance drop at certain number
>>>of objects in pool
>>>
>>>
>>> Hello,
>>>
>>> On Thu, 23 Jun 2016 22:24:59 +0000 Somnath Roy wrote:
>>>
>>>> Or even vm.vfs_cache_pressure = 0 if you have sufficient memory to
>>>> *pin* inodes/dentries in memory. We have been using that for a long
>>>> time now (with 128 TB node memory) and it seems to help, especially
>>>> for the random write workload, by saving the xattr reads in between.
>>>>
>>> 128TB node memory, really?
>>> Can I have some of those, too? ^o^
>>> And here I was thinking that Wade's 660GB machines were on the
>>>excessive side.
>>>
>>> There's something to be said (and optimized) when your storage nodes
>>>have as much or more RAM than your compute nodes...
>>>
>>> As for Warren, well spotted.
>>> I personally use vm.vfs_cache_pressure = 1; this avoids the potential
>>>fireworks if your memory is really needed elsewhere, while keeping
>>>things in memory normally.
>>>
>>> Christian
>>>
>>>> Thanks & Regards
>>>> Somnath
>>>>
>>>> -----Original Message-----
>>>> From: ceph-users [mailto:[email protected]] On Behalf
>>>> Of Warren Wang - ISD Sent: Thursday, June 23, 2016 3:09 PM
>>>> To: Wade Holler; Blair Bethwaite
>>>> Cc: Ceph Development; [email protected]
>>>> Subject: Re: [ceph-users] Dramatic performance drop at certain number
>>>> of objects in pool
>>>>
>>>> vm.vfs_cache_pressure = 100
>>>>
>>>> Go the other direction on that. You'll want to keep it low to help
>>>> keep inode/dentry info in memory. We use 10, and haven't had a problem.
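The VM knobs discussed in this thread could be persisted together in a sysctl drop-in. A sketch combining values mentioned above (the filename is illustrative, and the min_free_kbytes value is the upper end of Warren's suggested 2-4 GB range):

```ini
# /etc/sysctl.d/90-ceph-osd.conf -- illustrative filename
vm.swappiness = 1             # Wade's value from earlier in the thread
vm.vfs_cache_pressure = 10    # Warren's value; 1 (Christian) and 0 (Somnath) also discussed
vm.min_free_kbytes = 4194304  # 4 GB in kB, upper end of Warren's suggestion
```

Loaded at boot, or applied immediately with `sysctl --system`.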
>>>>
>>>>
>>>> Warren Wang
>>>>
>>>>
>>>>
>>>>
>>>> On 6/22/16, 9:41 PM, "Wade Holler" <[email protected]> wrote:
>>>>
>>>> >Blairo,
>>>> >
>>>> >We'll speak in pre-replication numbers; replication for this pool is 3.
>>>> >
>>>> >23.3 Million Objects / OSD
>>>> >pg_num 2048
>>>> >16 OSDs / Server
>>>> >3 Servers
>>>> >660 GB RAM Total, 179 GB Used (free -t) / Server
>>>> >vm.swappiness = 1
>>>> >vm.vfs_cache_pressure = 100
>>>> >
>>>> >Workload is native librados with python. ALL 4k objects.
>>>> >
>>>> >Best Regards,
>>>> >Wade
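A rough sketch of what Wade's numbers above work out to per PG and per OSD (assuming 48 OSDs total, i.e. before the 48 -> 54 expansion mentioned elsewhere in the thread; all figures pre-replication as stated):

```python
# Back-of-envelope on the cluster stats quoted above.
objects_per_osd = 23.3e6
osds = 16 * 3          # 16 OSDs/server x 3 servers
pg_num = 2048
replication = 3

total_objects = objects_per_osd * osds              # ~1.12 billion
objects_per_pg = total_objects / pg_num             # ~546k objects per PG
pg_replicas_per_osd = pg_num * replication / osds   # PG replicas landing on each OSD

print(f"total objects   : {total_objects / 1e9:.2f} billion")
print(f"objects per PG  : {objects_per_pg / 1e6:.2f} million")
print(f"PG replicas/OSD : {pg_replicas_per_osd:.0f}")
```

Half a million-plus 4k objects per PG gives a sense of how many directory entries each PG's on-disk tree accumulates under FileStore.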
>>>> >
>>>> >
>>>> >On Wed, Jun 22, 2016 at 9:33 PM, Blair Bethwaite
>>>> ><[email protected]> wrote:
>>>> >> Wade, good to know.
>>>> >>
>>>> >> For the record, what does this work out to roughly per OSD? And how
>>>> >> much RAM and how many PGs per OSD do you have?
>>>> >>
>>>> >> What's your workload? I wonder whether for certain workloads (e.g.
>>>> >> RBD) it's better to increase default object size somewhat before
>>>> >> pushing the split/merge up a lot...
>>>> >>
>>>> >> Cheers,
>>>> >>
>>>> >> On 23 June 2016 at 11:26, Wade Holler <[email protected]> wrote:
>>>> >>> Based on everyone's suggestions, the first modification to 50 / 16
>>>> >>> enabled our config to get to ~645 million objects before the
>>>> >>> behavior in question was observed (~330 million was the previous
>>>> >>> ceiling). A subsequent modification to 50 / 24 has enabled us to
>>>> >>> get to 1.1 billion+.
>>>> >>>
>>>> >>> Thank you all very much for your support and assistance.
>>>> >>>
>>>> >>> Best Regards,
>>>> >>> Wade
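For context on what "50 / 16" and "50 / 24" change: FileStore splits a PG subdirectory once it holds more than filestore_split_multiple * abs(filestore_merge_threshold) * 16 files (per the Ceph FileStore config reference). A sketch, assuming the pairs above are split_multiple / merge_threshold, since the thread doesn't spell out the order:

```python
# How the split/merge settings move the files-per-directory ceiling.
def split_point(split_multiple: int, merge_threshold: int) -> int:
    """Max files in a FileStore subdirectory before it splits into children."""
    return split_multiple * abs(merge_threshold) * 16

print(split_point(2, 10))    # defaults: 320 files/dir
print(split_point(50, 16))   # first change: 12800
print(split_point(50, 24))   # second change: 19200
```

Raising the ceiling 1.5x (12800 -> 19200) lines up roughly with the observed ceiling moving from ~645 million to 1.1 billion+ objects before degradation.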
>>>> >>>
>>>> >>>
>>>> >>> On Mon, Jun 20, 2016 at 6:58 PM, Christian Balzer <[email protected]>
>>>> >>>wrote:
>>>> >>>>
>>>> >>>> Hello,
>>>> >>>>
>>>> >>>> On Mon, 20 Jun 2016 20:47:32 +0000 Warren Wang - ISD wrote:
>>>> >>>>
>>>> >>>>> Sorry, late to the party here. I agree, up the merge and split
>>>> >>>>>thresholds. We're as high as 50/12. I chimed in on an RH ticket
>>>> >>>>>here.
>>>> >>>>> One of those things you just have to find out as an operator
>>>> >>>>>since it's not well documented :(
>>>> >>>>>
>>>> >>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1219974
>>>> >>>>>
>>>> >>>>> We have over 200 million objects in this cluster, and it's still
>>>> >>>>>doing over 15000 write IOPS all day long with 302 spinning
>>>> >>>>>drives
>>>> >>>>>+ SATA SSD journals. Having enough memory and dropping your
>>>> >>>>>vfs_cache_pressure should also help.
>>>> >>>>>
>>>> >>>> Indeed.
>>>> >>>>
>>>> >>>> Since it was asked in that bug report, and it was also my first
>>>> >>>> suspicion, this is probably a good time to clarify that it isn't
>>>> >>>> the splits themselves that cause the performance degradation, but
>>>> >>>> the resulting inflation of dir entries and exhaustion of SLAB, and
>>>> >>>> thus having to go to disk for things that would normally be in
>>>> >>>> memory.
>>>> >>>>
>>>> >>>> Looking at Blair's graph from yesterday pretty much makes that
>>>> >>>> clear: a purely split-caused degradation should have relented
>>>> >>>> much more quickly.
>>>> >>>>
>>>> >>>>
>>>> >>>>> Keep in mind that if you change the values, it won't take effect
>>>> >>>>> immediately. It only merges them back if the directory is under
>>>> >>>>> the calculated threshold and a write occurs (maybe a read, I
>>>> >>>>> forget).
>>>> >>>>>
>>>> >>>> If it's a read, a plain scrub might do the trick.
>>>> >>>>
>>>> >>>> Christian
>>>> >>>>> Warren
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> From: ceph-users <[email protected]>
>>>> >>>>> on behalf of Wade Holler <[email protected]>
>>>> >>>>> Date: Monday, June 20, 2016 at 2:48 PM
>>>> >>>>> To: Blair Bethwaite <[email protected]>, Wido den Hollander <[email protected]>
>>>> >>>>> Cc: Ceph Development <[email protected]>, "[email protected]" <[email protected]>
>>>> >>>>> Subject: Re: [ceph-users] Dramatic performance drop at certain number of objects in pool
>>>> >>>>>
>>>> >>>>> Thanks everyone for your replies. I sincerely appreciate it. We
>>>> >>>>> are testing with different pg_num and filestore_split_multiple
>>>> >>>>> settings. Early indications are... well, not great. Regardless,
>>>> >>>>> it is nice to understand the symptoms better so we can try to
>>>> >>>>> design around it.
>>>> >>>>>
>>>> >>>>> Best Regards,
>>>> >>>>> Wade
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> On Mon, Jun 20, 2016 at 2:32 AM Blair Bethwaite <[email protected]> wrote:
>>>> >>>>> On 20 June 2016 at 09:21, Blair Bethwaite <[email protected]> wrote:
>>>> >>>>> > slow request issues). If you watch your xfs stats you'll
>>>> >>>>> > likely get further confirmation. In my experience
>>>> >>>>> > xs_dir_lookups balloons
>>>> >>>>>(which
>>>> >>>>> > means directory lookups are missing cache and going to disk).
>>>> >>>>>
>>>> >>>>> Murphy's a bitch. Today we upgraded a cluster to latest Hammer
>>>> >>>>> in preparation for Jewel/RHCS2. Turns out when we last hit this
>>>> >>>>> very problem we had only ephemerally set the new filestore
>>>> >>>>> merge/split values - oops. Here's what started happening when we
>>>> >>>>> upgraded and restarted a bunch of OSDs:
>>>> >>>>>
>>>> >>>>> https://au-east.erc.monash.edu.au/swift/v1/public/grafana-ceph-xs_dir_lookup.png
>>>> >>>>>
>>>> >>>>> Seemed to cause lots of slow requests :-/. We corrected it around
>>>> >>>>> 12:30, and it still took a while to settle.
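The xs_dir_lookup counter Blair graphs comes from the "dir" row of /proc/fs/xfs/stat. A small parser sketch; the sample line and its values are made up, and on a live OSD node you would read the file itself:

```python
# Parse the "dir" row of XFS runtime stats into named counters.
SAMPLE = "dir 1828430 482163 482154 3820923"  # illustrative values

def xfs_dir_stats(stat_text: str) -> dict:
    """Map the XFS 'dir' counters to their documented field names."""
    fields = ("xs_dir_lookup", "xs_dir_create", "xs_dir_remove", "xs_dir_getdents")
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "dir":
            return dict(zip(fields, map(int, parts[1:])))
    raise ValueError("no 'dir' row in XFS stats")

print(xfs_dir_stats(SAMPLE)["xs_dir_lookup"])  # 1828430
```

On a live node, replace SAMPLE with the contents of /proc/fs/xfs/stat and sample the counter over time; a lookup rate that climbs while latency degrades is the cache-miss signature described earlier in the thread.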
>>>> >>>>>
>>>> >>>>> --
>>>> >>>>> Cheers,
>>>> >>>>> ~Blairo
>>>> >>>>>
>>>> >>>>> This email and any files transmitted with it are confidential
>>>> >>>>>and intended solely for the individual or entity to whom they are
>>>> >>>>>addressed.
>>>> >>>>> If you have received this email in error destroy it immediately.
>>>> >>>>>*** Walmart Confidential ***
>>>> >>>>
>>>> >>>>
>>>> >>>> --
>>>> >>>> Christian Balzer Network/Systems Engineer
>>>> >>>> [email protected] Global OnLine Japan/Rakuten Communications
>>>> >>>> http://www.gol.com/
>>>> >>
>>>> >>
>>>> >>
>>>> >> --
>>>> >> Cheers,
>>>> >> ~Blairo
>>>>
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> [email protected]
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>>> PLEASE NOTE:
>>>> The information contained in this electronic mail message is intended
>>>> only for the use of the designated recipient(s) named above. If the
>>>> reader of this message is not the intended recipient, you are hereby
>>>> notified that you have received this message in error and that any
>>>> review, dissemination, distribution, or copying of this message is
>>>> strictly prohibited. If you have received this communication in error,
>>>> please notify the sender by telephone or e-mail (as shown above)
>>>> immediately and destroy any and all copies of this message in your
>>>> possession (whether hard copies or electronically stored copies).
>>>>
>>>
>>>
>>> --
>>> Christian Balzer Network/Systems Engineer
>>> [email protected] Global OnLine Japan/Rakuten Communications
>>> http://www.gol.com/
>
>
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com