Hello,

On Wed, 20 Aug 2014 15:39:11 +0100 Hugo Mills wrote:

>    We have a ceph system here, and we're seeing performance regularly
> descend into unusability for periods of minutes at a time (or longer).
> This appears to be triggered by writing large numbers of small files.
> 
>    Specifications:
> 
> ceph 0.80.5
> 6 machines running 3 OSDs each (one 4 TB rotational HD per OSD, 2
> threads) 
> 2 machines running primary and standby MDS
> 3 monitors on the same machines as the OSDs
> Infiniband to about 8 CephFS clients (headless, in the machine room)
> Gigabit ethernet to a further 16 or so CephFS clients (Linux desktop
>    machines, in the analysis lab)
> 
Please let us know the CPU and memory specs of the OSD nodes as well,
and the replication factor (I presume 3, if you value that data).
Also the PG and PGP values for the pool(s) you're using.
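
If you don't have those at hand, something along these lines should show them
(using the same pool placeholder as in the commands further down):
"ceph osd pool get <pool-in-question> size"
"ceph osd pool get <pool-in-question> pg_num"
"ceph osd pool get <pool-in-question> pgp_num"
or simply "ceph osd dump | grep pool" to see all pools at once.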

>    The cluster stores home directories of the users and a larger area
> of scientific data (approx 15 TB) which is being processed and
> analysed by the users of the cluster.
> 
>    We have a relatively small number of concurrent users (typically
> 4-6 at most), who use GUI tools to examine their data, and then
> complex sets of MATLAB scripts to process it, with processing often
> being distributed across all the machines using Condor.
> 
>    It's not unusual to see the analysis scripts write out large
> numbers (thousands, possibly tens or hundreds of thousands) of small
> files, often from many client machines at once in parallel. When this
> happens, the ceph cluster becomes almost completely unresponsive for
> tens of seconds (or even for minutes) at a time, until the writes are
> flushed through the system. Given the nature of modern GUI desktop
> environments (often reading and writing small state files in the
> user's home directory), this means that desktop interactiveness and
> responsiveness for all the other users of the cluster suffer.
> 
>    1-minute load on the servers typically peaks at about 8 during
> these events (on 4-core machines). Load on the clients also peaks
> high, because of the number of processes waiting for a response from
> the FS. The MDS shows little sign of stress -- it seems to be entirely
> down to the OSDs. ceph -w shows requests blocked for more than 10
> seconds, and in bad cases, ceph -s shows up to many hundreds of
> requests blocked for more than 32s.
> 
>    We've had to turn off scrubbing and deep scrubbing completely --
> except between 01.00 and 04.00 every night -- because it triggers the
> exact same symptoms, even with only 2-3 PGs being scrubbed. If it gets
> up to 7 PGs being scrubbed, as it did on Monday, it's completely
> unusable.
> 
Note that I know nothing about CephFS specifically, and while there are
probably tunables for it, the slow requests you're seeing and the hardware
described above definitely suggest slow OSDs.

Now with a replication factor of 3, your total sustained cluster performance
is that of just 6 disks, and 4 TB drives are never speed wonders. Add to that
the latency overheads from the network, which should be minimal in your case
though.
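
As a rough back-of-the-envelope calculation (assuming something like 100-150
sustained write IOPS per 7200 RPM SATA drive, which is just a ballpark figure):
18 OSDs / replication 3 = 6 effective spindles
6 x 100-150 IOPS = roughly 600-900 sustained write IOPS cluster-wide,
before the SSD journals absorb any bursts.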

You wrote that your old NFS server (cluster?) had twice the spindles, so if
that means 36 disks it was quite a bit faster.

A cluster I'm just building with 3 nodes, 4 journal SSDs and 8 OSD HDDs
per node can do about 7000 write IOPS (4KB), so I would expect yours to be
worse off.

Having the journals on dedicated partitions instead of files on the rootfs
would not only be faster (though probably not significantly so), but would
also prevent potential failures caused by FS corruption.
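
A minimal ceph.conf sketch of what I mean, assuming OSD 0's journal should
live on /dev/sdb1 (the device name and OSD id are just examples; using
/dev/disk/by-partlabel or by-partuuid links is safer against device renaming):

[osd.0]
    osd journal = /dev/sdb1

Stop the OSD and flush/recreate the journal before switching it over, i.e.
"ceph-osd -i 0 --flush-journal" followed by "ceph-osd -i 0 --mkjournal".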

The SSD journals will compensate for some spikes of high IOPS, but 250000
files is clearly beyond that.

Putting lots of RAM (relatively cheap these days) into the OSD nodes has
the big benefit that reads of hot objects will not have to go to disk and
thus compete with write IOPS.

>    Is this problem something that's often seen? If so, what are the
> best options for mitigation or elimination of the problem? I've found
> a few references to issue #6278 [1], but that seems to be referencing
> scrub specifically, not ordinary (if possibly pathological) writes.
> 
You need to match your cluster to your workload.
Aside from tuning things (which tends to have limited effects), you can
either scale out by adding more servers or scale up by using faster
storage and/or a cache pool.
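
Just to illustrate the cache pool route (firefly does have cache tiering, but
it comes with its own caveats and I haven't used it with CephFS; pool names
are placeholders):
"ceph osd tier add <data-pool> <cache-pool>"
"ceph osd tier cache-mode <cache-pool> writeback"
"ceph osd tier set-overlay <data-pool> <cache-pool>"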

>    What are the sorts of things I should be looking at to work out
> where the bottleneck(s) are? I'm a bit lost about how to drill down
> into the ceph system for identifying performance issues. Is there a
> useful guide to tools somewhere?
> 
Reading/scouring this ML can be quite helpful. 

Watch your OSD nodes (all of them!) with iostat or preferably atop (which
will also show you how your CPUs and network are doing) while running the
tests below.
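For example (the intervals are arbitrary):
"iostat -x 2"
"atop 2"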

To get a baseline do:
"rados -p <pool-in-question> bench 60 write -t 64"
This will test your throughput most of all and due to the 4MB block size
spread the load very equally amongst the OSDs.
During that test you should see all OSDs more or less equally busy. 
If one (or more) are busy all the time while others are bored and the
throughput (cur MB/s column in the rados bench output) drops to zero (or
close to it) during that time, you have found a slow OSD, probably a wonky
disk.
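
Another quick way to spot a laggard, if your version supports it (0.80
should), is:
"ceph osd perf"
which lists commit and apply latencies per OSD; one OSD with consistently
much higher values than the rest is your prime suspect.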

If the above pans out (no suspiciously slow OSD), you can proceed with
something that is likely to trigger your cluster into having a conniption.
Aside from your tarball, something like this should do nicely:
"rados -p <pool-in-question> bench 60 write -t 64 -b 4096"
Or on a client:
"fio --size=800M --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1
--rw=randwrite --name=fiojob --blocksize=4k --iodepth=64"

Lots of small writes will stress your CPUs as well.
Due to the small block size, a more uneven load of the OSDs is to be
expected. Still, watch out for ones that have vastly higher avio (atop) or
svctm (iostat) values than others.

On an otherwise idle cluster you shouldn't get any slow requests from this,
but what I suspect is happening is that those writes are competing with reads,
and that contention slows the spinning rust down even more.
See above about lots of memory for pagecache on the OSD nodes.

>    Is an upgrade to 0.84 likely to be helpful? How "development" are
> the development releases, from a stability / dangerous bugs point of
> view?
>
No. 
Akin to sticking your appendage(s) into a blender. 
Ceph itself is still 0.x level software; using development releases outside a
test/tinkering cluster is very much not advised.
That's why distributions only ship the stable versions.

Christian
 
>    Thanks,
>    Hugo.
> 
> [1] http://tracker.ceph.com/issues/6278
> 


-- 
Christian Balzer        Network/Systems Engineer                
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
