Thanks for the reply Gregory. Sorry if this is in the wrong direction or
something; maybe I do not understand.
To test uploads I use bash time with either python-swiftclient or
boto's key.set_contents_from_filename against the radosgw. I was
unaware that radosgw had any kind of throttle settings in its
configuration (I can't seem to find any either). As for RBD mounts, I
test by creating a 1TB mount and writing a file to it through time+cp
or dd. Not the most accurate test, but I think it should be good
enough as a quick functionality check, so for writes it's more about
functionality than performance. I would expect even a basic
functionality test to yield more than 8 MB/s, though. (I've put the
exact commands I'm running below the quoted thread for reference.)

As for checking admin sockets: I have, actually. I set the third
gateway's debug_civetweb to 10 and debug_rgw to 5, but I still do not
see anything that stands out; the log snippet I pasted has these
values set. I did the same for an OSD that is marked as slow (1112),
but all I can see in its log are ticks and heartbeat responses,
nothing that shows any issues. Finally, I did it for the primary
monitor node with debug_mon set to 5 (http://pastebin.com/hhnaFac1).
I do not see anything there that stands out as a failure (like a
fault or timeout error).

What kind of throttler limits do you mean? I don't see any mention of
rgw throttler limits in the ceph.com docs or the admin socket, just
OSD/filesystem throttles like the inode/flusher limits. Do you mean
those? I have not touched these limits on this cluster yet; do you
think it would help?

On 12/18/2014 10:24 PM, Gregory Farnum wrote:
> What kind of uploads are you performing? How are you testing?
> Have you looked at the admin sockets on any daemons yet? Examining the
> OSDs to see if they're behaving differently on the different requests
> is one angle of attack. The other is to look into whether the RGW
> daemons are hitting throttler limits or something that the RBD
> clients aren't.
> -Greg
>
> On Thu, Dec 18, 2014 at 7:35 PM Sean Sullivan <[email protected]> wrote:
>
> Hello Yall!
>
> I can't figure out why my gateways are performing so poorly and I am
> not sure where to start looking. My RBD mounts seem to be performing
> fine (over 300 MB/s) while uploading a 5G file to Swift/S3 takes
> 2m32s (32 MB/s, I believe). If we try a 1G file it's closer to
> 8 MB/s. Testing with nuttcp shows that I can transfer from a client
> with a 10G interface to any node on the ceph cluster at the full 10G,
> and ceph can transfer close to 20G between itself. I am not really
> sure where to start looking, as outside of another issue which I will
> mention below I am clueless.
>
> I have a weird setup:
>
> [osd nodes]
> 60 x 4TB 7200 RPM SATA drives
> 12 x 400GB S3700 SSD drives
> 3 x SAS2308 PCI-Express Fusion-MPT cards (drives are split evenly
> across the 3 cards)
> 512 GB of RAM
> 2 x CPU E5-2670 v2 @ 2.50GHz
> 2 x 10G interfaces LACP bonded for cluster traffic
> 2 x 10G interfaces LACP bonded for public traffic (so a total of
> 4 10G ports)
>
> [monitor nodes and gateway nodes]
> 4 x 300G 15000 RPM SAS drives in RAID 10
> 1 x SAS 2208
> 64G of RAM
> 2 x CPU E5-2630 v2
> 2 x 10G interfaces LACP bonded for public traffic (total of 2 10G
> ports)
>
> Here is a pastebin dump of my details. I am running ceph giant 0.87
> (c51c8f9d80fa4e0168aa52685b8de40e42758578) and kernel
> 3.13.0-40-generic across the entire cluster.
>
> http://pastebin.com/XQ7USGUz -- ceph health detail
> http://pastebin.com/8DCzrnq1 -- /etc/ceph/ceph.conf
> http://pastebin.com/BC3gzWhT -- ceph osd tree
> http://pastebin.com/eRyY4H4c -- /var/log/radosgw/client.radosgw.rgw03.log
> http://paste.ubuntu.com/9565385/ -- crushmap (pastebin wouldn't let me)
>
> We ran into a few issues with density (conntrack limits, pid limit,
> and number of open files), all of which I adjusted by bumping the
> ulimits in /etc/security/limits.d/ceph.conf or sysctl. I am no longer
> seeing any signs of these limits being hit, so I have not included my
> limits or sysctl conf. If you would like these as well, let me know
> and I can include them.
>
> One of the issues I am seeing is that OSDs have started to flop / be
> marked as slow. The cluster was HEALTH_OK with all of the disks added
> for over 3 weeks before this behaviour started. RBD transfers seem to
> be fine for the most part, which makes me think that this has little
> bearing on the gateway issue, but it may be related. Rebooting the
> OSD seems to fix this issue.
>
> I would like to figure out the root cause of both of these issues and
> post the results back here if possible (perhaps it can help other
> people). I am really looking for a place to start, as the gateway
> just logs that it is posting data, and all of the logs (outside of
> the monitors reporting down OSDs) seem to show a fully functioning
> cluster.
>
> Please help. I am in the #ceph room on OFTC every day as 'seapasulli'
> as well.
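
For reference, here is the upload test I am running, more or less. This
uses the swift CLI from python-swiftclient; the endpoint, account, and
key below are placeholders for our real credentials, and 7480 stands in
for whatever port civetweb is bound to on the gateway:

    # create a 5 GiB test file (zeros are fine; radosgw does not compress)
    dd if=/dev/zero of=5G.bin bs=1M count=5120

    # time the upload through the gateway (swift v1 auth, placeholder creds)
    time swift -A http://rgw03:7480/auth/v1.0 -U testacct:testuser \
        -K secretkey upload testcontainer 5G.bin

The boto test is the same idea, just timing
key.set_contents_from_filename() against the S3 endpoint instead.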
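
The RBD-side test is roughly the following. I realize plain time+cp can
be flattered by the page cache, so adding oflag=direct (or
conv=fdatasync) should make it a fairer comparison against the gateway
path:

    # write 5 GiB to the mounted 1TB rbd image, bypassing the page
    # cache, so the figure is comparable to the 5G gateway upload
    dd if=/dev/zero of=/mnt/rbd/ddtest bs=4M count=1280 oflag=direct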
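
If the throttler limits you mean are the ones exposed as perf counters,
this is what I was planning to check next; the socket paths below are
the defaults, so they may need adjusting:

    # pretty-print the gateway's counters and pull out the throttle
    # sections; a 'val' sitting at 'max' (or a growing 'wait' count)
    # would suggest a saturated throttle
    ceph --admin-daemon /var/run/ceph/ceph-client.radosgw.rgw03.asok \
        perf dump | python -m json.tool | grep -A 10 throttle

    # on the slow OSD, dump in-flight and slowest recent ops with
    # per-stage timestamps to see where requests are stalling
    ceph --admin-daemon /var/run/ceph/ceph-osd.1112.asok dump_ops_in_flight
    ceph --admin-daemon /var/run/ceph/ceph-osd.1112.asok dump_historic_ops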
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
