On 01/26/2017 06:16 AM, Florent B wrote:
On 01/24/2017 07:26 PM, Mark Nelson wrote:

My first thought is that PGs are splitting.  You only appear to have
168 PGs for 9 OSDs, which is not nearly enough.  Beyond the poor data
distribution and associated performance imbalance, your PGs will split
very quickly, since by default PGs start splitting at 320 objects each.
Typically this is less of a problem with RBD since by default it uses
4MB objects (and thus there are fewer, bigger objects), but with only
168 PGs you are likely to be heavily splitting by the time you hit
~218GB of data (make sure to take replication into account).
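
Back-of-the-envelope, assuming the stock filestore defaults
(filestore_split_multiple = 2, filestore_merge_threshold = 10, which give
a split threshold of 2 * 10 * 16 = 320 objects per PG) and 4MB objects,
just as a sketch to show where the ballpark number comes from:

  # rough estimate of how much object data fits before PGs start splitting
  pgs=168
  split_threshold=$((2 * 10 * 16))   # filestore_split_multiple * filestore_merge_threshold * 16
  object_mb=4
  echo "$((pgs * split_threshold * object_mb)) MB before splitting starts"
  # -> 215040 MB, i.e. roughly the ~218GB figure above (before replication)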

Normally PG splitting shouldn't be terribly expensive since it's
basically just reading a directory xattr, readdir on a small
directory, then a bunch of link/unlink operations.  When SELinux is
enabled, it appears that link/unlink might require an xattr read on each
object/file to determine if the link/unlink can happen.  That's a ton
of extra seek overhead.  On spinning disks this is especially bad with
XFS since subdirs may not be in the same AG as a parent dir, so after
subsequent splits, the directories become fragmented and those reads
happen all over the disk (not as much of a problem with SSDs though).
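
A quick way to check whether SELinux is actually enforcing on the OSD
nodes, if you want to rule that part in or out:

  # prints Enforcing / Permissive / Disabled
  getenforce
  # or, with more detail:
  sestatus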

Anyway, that's my guess as to what's going on, but it could be
something else.  blktrace and/or xfs's kernel debugging stuff would
probably lend some supporting evidence if this is what's going on.
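
For blktrace, something along these lines will show the raw I/O pattern on
one of the OSD data disks while the slowdown is happening (/dev/sdb is just
a placeholder, substitute whichever device backs the OSD):

  # live block-level trace of an OSD data disk
  blktrace -d /dev/sdb -o - | blkparse -i -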

Mark

Hi Mark,

You're right, it seems that was the problem. I increased pg_num & pgp_num
on every pool and there are no more blocked requests! (after a few days
of backfilling)
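
For reference, it was basically this for each pool (pool name and value
adjusted per pool, pgp_num raised after pg_num):

  ceph osd pool set <pool> pg_num <new-value>
  ceph osd pool set <pool> pgp_num <new-value>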

Maybe the monitors should raise a warning in Ceph status when this
situation occurs, no? There is already a "too few PGs per OSD" warning,
but I never hit it in my cluster.

Thank you a lot, Mark!

No problem. Just a warning though: it's possible you are only delaying the problem until you hit the split/merge thresholds again. If you are only doing RBD, it's possible you'll never hit them (depending on the size of your RBD volumes, block/object sizes, and replication levels). More PGs usually means that PG splitting will be more spread out, but it also means there are more PGs to split in total at some point in the future.
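
One rough way to keep an eye on it: compare objects-per-PG against the
split threshold (320 with the defaults mentioned above, different if you
have tuned filestore_split_multiple / filestore_merge_threshold):

  # object count per pool
  ceph df
  # PG count for a given pool
  ceph osd pool get <pool> pg_num
  # objects_in_pool / pg_num creeping toward the threshold means splitting is near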

One of the things Josh was working on was to let you pre-split PGs. It means a bit slower behavior upfront, but it avoids a lot of initial splitting and should improve performance with lots of objects. Also, once bluestore is production ready, it avoids all of this entirely, since it doesn't store objects in directory hierarchies on a traditional filesystem like filestore does.
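
If/when that lands, my understanding is that the rough shape is "tell the
pool at creation time how many objects to expect" so the directory tree is
split up front. Treat the following as a hypothetical sketch only; the exact
syntax depends on the release, and I believe filestore also needs a negative
merge threshold so the pre-split directories don't just get merged back:

  # hypothetical example: create a pool pre-split for ~10M expected objects
  ceph osd pool create mypool 1024 1024 replicated replicated_rule 10000000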

Mark

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
