Hi,

Do you get slow requests during the slowness incidents? What about monitor 
elections?
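Something along these lines should show both (the log path and the exact
message string are from memory, so adjust for your setup):

    # any blocked/slow requests right now?
    ceph health detail

    # recent monitor elections show up in the mon logs
    grep -i 'calling new monitor election' /var/log/ceph/ceph-mon.*.log | tail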
Are your MDSs using a lot of CPU? Did you try tuning anything in the MDS? (I 
think the default config is still fairly conservative, and there are options 
to cache more entries, etc.)
What about iostat on the OSDs — are your OSD disks busy reading or writing 
during these incidents?
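e.g. something like this on each OSD host, watching %util and await on the 
OSD data disks:

    # extended per-device stats, 5-second intervals
    iostat -xm 5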
What are you using for OSD journals?
Also check the CPU usage of the mons and OSDs.
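For instance (assuming sysstat is installed; the pgrep pattern is just one way 
to pick up the daemons):

    # per-process CPU of the ceph daemons, 5-second samples
    pidstat -u -p $(pgrep -d, -f 'ceph-(mon|osd)') 5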

Does your hardware provide enough IOPS for what your users need? (e.g. what 
op/s figure does ceph -w report?)
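One rough way to gauge small-write capacity is rados bench against a 
throwaway pool (it generates real load, so not during production hours; 
"testpool" below is a placeholder):

    # ~4 KB writes, 16 in flight, for 60 seconds
    rados bench -p testpool 60 write -b 4096 -t 16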

If disabling deep scrub helps, then it might be that something else is reading 
the disks heavily. One thing to check is updatedb — we had to disable it from 
indexing /var/lib/ceph on our OSDs.
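One way to do that is to add the OSD data directory to the prune list in 
/etc/updatedb.conf (the existing entries vary by distro, so append rather 
than replace wholesale):

    # /etc/updatedb.conf (mlocate)
    PRUNEPATHS = "/tmp /var/spool /media /var/lib/ceph"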

Best Regards,
Dan

-- Dan van der Ster || Data & Storage Services || CERN IT Department --


On 20 Aug 2014, at 16:39, Hugo Mills <h.r.mi...@reading.ac.uk> wrote:

>   We have a ceph system here, and we're seeing performance regularly
> descend into unusability for periods of minutes at a time (or longer).
> This appears to be triggered by writing large numbers of small files.
> 
>   Specifications:
> 
> ceph 0.80.5
> 6 machines running 3 OSDs each (one 4 TB rotational HD per OSD, 2 threads)
> 2 machines running primary and standby MDS
> 3 monitors on the same machines as the OSDs
> Infiniband to about 8 CephFS clients (headless, in the machine room)
> Gigabit ethernet to a further 16 or so CephFS clients (Linux desktop
>   machines, in the analysis lab)
> 
>   The cluster stores home directories of the users and a larger area
> of scientific data (approx 15 TB) which is being processed and
> analysed by the users of the cluster.
> 
>   We have a relatively small number of concurrent users (typically
> 4-6 at most), who use GUI tools to examine their data, and then
> complex sets of MATLAB scripts to process it, with processing often
> being distributed across all the machines using Condor.
> 
>   It's not unusual to see the analysis scripts write out large
> numbers (thousands, possibly tens or hundreds of thousands) of small
> files, often from many client machines at once in parallel. When this
> happens, the ceph cluster becomes almost completely unresponsive for
> tens of seconds (or even for minutes) at a time, until the writes are
> flushed through the system. Given the nature of modern GUI desktop
> environments (often reading and writing small state files in the
> user's home directory), this means that desktop interactivity and
> responsiveness for all the other users of the cluster suffer.
> 
>   1-minute load on the servers typically peaks at about 8 during
> these events (on 4-core machines). Load on the clients also peaks
> high, because of the number of processes waiting for a response from
> the FS. The MDS shows little sign of stress -- it seems to be entirely
> down to the OSDs. ceph -w shows requests blocked for more than 10
> seconds, and in bad cases, ceph -s shows up to many hundreds of
> requests blocked for more than 32s.
> 
>   We've had to turn off scrubbing and deep scrubbing completely --
> except between 01.00 and 04.00 every night -- because it triggers the
> exact same symptoms, even with only 2-3 PGs being scrubbed. If it gets
> up to 7 PGs being scrubbed, as it did on Monday, it's completely
> unusable.
> 
>   Is this problem something that's often seen? If so, what are the
> best options for mitigation or elimination of the problem? I've found
> a few references to issue #6278 [1], but that seems to be referencing
> scrub specifically, not ordinary (if possibly pathological) writes.
> 
>   What are the sorts of things I should be looking at to work out
> where the bottleneck(s) are? I'm a bit lost about how to drill down
> into the ceph system for identifying performance issues. Is there a
> useful guide to tools somewhere?
> 
>   Is an upgrade to 0.84 likely to be helpful? How "development" are
> the development releases, from a stability / dangerous bugs point of
> view?
> 
>   Thanks,
>   Hugo.
> 
> [1] http://tracker.ceph.com/issues/6278
> 
> -- 
> Hugo Mills :: IT Services, University of Reading
> Specialist Engineer, Research Servers

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
