Hello Ceph-users,
Florian has been helping us with our proof-of-concept cluster, where
we've been experiencing the issues described in this thread. Thanks for
the replies so far. I wanted to jump in with some extra details.
All of our testing has been with scrubbing turned off, to remove that as
a factor.
Our use case requires a Ceph cluster to indefinitely store ~10 billion
files of 20-60KB each. We’ll begin with 4 billion files migrated from a
legacy storage system. Ongoing writes will be handled by ~10 client
machines and come in at a fairly steady 10-20 million files/day. Every
file (excluding the legacy 4 billion) will be read once by a single
client within hours of its initial write to the cluster. Future read
requests will come from a single server and follow a long-tail
distribution: popular files are read thousands of times a year, but most
are rarely or never read.
Our “production” design has 6 nodes and 24 OSDs (expandable to 48 OSDs),
with SSD journals at a 1:4 SSD:HDD ratio. Each node looks like this:
* 2 x E5-2660 8-core Xeons
* 64GB DDR3-1600 RAM
* 10Gb Ceph-internal network (SFP+)
* LSI 9210-8i controller (IT mode)
* 4 x 8TB OSD HDDs, a mix of two types:
  o Seagate ST8000DM002
  o HGST HDN728080ALE604
  o Mount options = xfs (rw,noatime,attr2,inode64,noquota); see the
    fstab sketch after this list
* 1 x Intel 200GB DC S3700 SSD journal
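For reference, here’s a minimal fstab sketch matching those mount
options (the device and mount point are hypothetical; substitute your
own):

    /dev/sdb1  /var/lib/ceph/osd/ceph-0  xfs  rw,noatime,attr2,inode64,noquota  0  0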
We’re running Kraken 11.2.0 on Ubuntu 16.04. All testing has been done
with replication level 2. We’re using rados bench to shotgun a lot of
objects into our test pools, specifically following these two steps:
ceph osd pool create poolofhopes 2048 2048 replicated "" replicated_ruleset 500000000
rados -p poolofhopes bench -t 32 -b 20000 30000000 write --no-cleanup
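As we understand it, the final argument to the pool create
(expected_num_objects = 500 million) is what triggers filestore to
pre-split the PG collections at creation time. A quick way to spot-check
how deep the pre-split tree goes on an OSD (a sketch; the OSD path and
PG id 1.0 are hypothetical, pick a real one via "ceph pg ls-by-pool
poolofhopes"):

    # Deepest DIR_* nesting level under one PG's collection:
    find /var/lib/ceph/osd/ceph-0/current/1.0_head -type d -name 'DIR_*' \
        | awk -F/ '{ print NF }' | sort -n | tail -1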
We leave the bench running for days at a time and watch the
objects-in-cluster count. We see performance that starts off decent and
degrades over time. There’s a very brief initial surge in write
performance, after which things settle into a downward-trending pattern:
1st hour - 2 million objects/hour
20th hour - 1.9 million objects/hour
40th hour - 1.7 million objects/hour
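For anyone wondering how we derive those rates: we simply sample the
pool’s object count on an interval and difference consecutive samples.
A minimal sketch (it assumes "ceph df --format json" reports per-pool
stats.objects; verify the jq path on your version):

    # Log the object count hourly; objects/hour is the delta between lines.
    while true; do
        n=$(ceph df --format json | \
            jq '.pools[] | select(.name=="poolofhopes") | .stats.objects')
        echo "$(date -u +%FT%TZ) $n" >> poolofhopes_objects.log
        sleep 3600
    done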
This performance is not encouraging for us. We need to be writing 40
million objects per day (20 million files at 2x replication). The rate
we’re seeing at the 40th hour of our bench would be just sufficient to
achieve that. But write rates are still falling, and we’re only at a
fraction of the objects-in-cluster count we need to handle. So the trend
in performance suggests we shouldn’t count on having the write
performance we need for long.
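For clarity, the back-of-envelope math behind those numbers:

    # 20M files/day x 2 replicas = 40M objects/day, ~1.67M objects/hour
    echo $(( 20000000 * 2 )) objects/day        # 40000000
    echo $(( 20000000 * 2 / 24 )) objects/hour  # 1666666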
If we repeat the process of creating a new pool and running the bench,
the same pattern holds: good initial performance that gradually degrades.
https://postimg.org/image/ovymk7n2d/
[caption: 90 million objects written to a brand-new, pre-split pool
(poolofhopes). There are already 330 million objects on the cluster in
other pools.]
Our working theory is that the degradation over time may be related to
inode or dentry lookups that miss cache and lead to additional disk
reads and seek activity. There’s a suggestion that filestore directory
splitting may exacerbate that problem, as additional/longer disk seeks
occur depending on what lands in which XFS allocation group. We have
found pre-split pools useful in one major way: they avoid the periods of
near-zero write performance that we have put down to the active
splitting of directories (the "thundering herd" effect). The overall
downward curve seems to remain the same whether we pre-split or not.
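To make the splitting behaviour concrete, here’s the rule of thumb we’ve
been working from (our own model, not gospel): filestore splits a
directory once it holds more than filestore_split_multiple *
abs(filestore_merge_threshold) * 16 objects, and each split fans out
into 16 hash subdirectories. At the defaults, that predicts roughly:

    awk 'BEGIN {
        split_limit = 2 * 10 * 16;        # split_multiple=2, merge_threshold=10
        objs_per_pg = 500000000 / 2048;   # expected_num_objects / pg_num
        depth = log(objs_per_pg / split_limit) / log(16);
        printf "%d objects/PG -> ~%d levels of DIR_* fan-out\n",
               objs_per_pg, int(depth) + 1;
    }'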
The thundering herd seems to be kept in check by an appropriate
pre-split. Bluestore may or may not be a solution, but its uncertainty
and our concerns about its stability within our fairly tight timeline
don't recommend it to us.
Right now our big question is "how can we avoid the gradual degradation
in write performance over time?".
Thank you, Patrick