Hi Patrick,

You could add more RAM to the servers, which probably would not increase the cost too much.

You could also change the swappiness value, or use something like https://hoytech.com/vmtouch/ to pre-cache inode entries. You could maybe tarball the smaller files before loading them into Ceph.
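Rough sketches of those three ideas below. They are untested, the swappiness value is a guess, and the paths and filenames are placeholders, so adjust for your layout:

  # keep the kernel from swapping out cached metadata too eagerly
  sysctl vm.swappiness=10

  # crawl an OSD's filestore tree so its dentries/inodes get cached;
  # note -t also touches file data into the page cache, and 8TB per OSD
  # won't fit in 64GB of RAM, so this mainly helps the metadata
  vmtouch -t /var/lib/ceph/osd/ceph-0/current

  # bundle a batch of small files into one larger object before ingest
  tar -cf batch-0001.tar /path/to/small-files/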
How are the ten clients accessing Ceph, by the way?

> On May 1, 2017, at 14:23, Patrick Dinnen <pdin...@gmail.com> wrote:
>
> One additional detail: we also did filestore testing using Jewel and saw
> substantially similar results to those on Kraken.
>
>> On Mon, May 1, 2017 at 2:07 PM, Patrick Dinnen <pdin...@gmail.com> wrote:
>> Hello Ceph-users,
>>
>> Florian has been helping with some issues on our proof-of-concept cluster.
>> Thanks for the replies so far; I wanted to jump in with some extra details.
>>
>> All of our testing has been done with scrubbing turned off, to remove that
>> as a factor.
>>
>> Our use case requires a Ceph cluster to indefinitely store ~10 billion
>> files of 20-60KB each. We'll begin with 4 billion files migrated from a
>> legacy storage system. Ongoing writes will be handled by ~10 client
>> machines and will come in at a fairly steady 10-20 million files/day.
>> Every file (excluding the legacy 4 billion) will be read once by a single
>> client within hours of its initial write to the cluster. Future read
>> requests will come from a single server, with a long-tail distribution:
>> popular files will be read thousands of times a year, but most will be
>> read rarely or never.
>>
>> Our "production" design has 6 nodes and 24 OSDs (expandable to 48 OSDs),
>> with SSD journals at a 1:4 ratio to HDDs. Each node looks like this:
>>
>> 2 x E5-2660 8-core Xeons
>> 64GB RAM, DDR3-1600
>> 10Gb ceph-internal network (SFP+)
>> LSI 9210-8i controller (IT mode)
>> 4 x OSD 8TB HDDs, a mix of two types:
>>   Seagate ST8000DM002
>>   HGST HDN728080ALE604
>> Mount options = xfs (rw,noatime,attr2,inode64,noquota)
>> 1 x SSD journal, Intel 200GB DC S3700
>>
>> We're running Kraken 11.2.0 on Ubuntu 16.04. All testing has been done
>> with replication level 2. We're using rados bench to shotgun a lot of
>> files into our test pools, specifically following these two steps:
>>
>> ceph osd pool create poolofhopes 2048 2048 replicated "" replicated_ruleset 500000000
>> rados -p poolofhopes bench -t 32 -b 20000 30000000 write --no-cleanup
>>
>> We leave the bench running for days at a time and watch the
>> objects-in-cluster count. Performance starts off decent and degrades over
>> time: there's a very brief initial surge in write performance, after which
>> things settle into a downward trend.
>>
>> 1st hour - 2.0 million objects/hour
>> 20th hour - 1.9 million objects/hour
>> 40th hour - 1.7 million objects/hour
>>
>> This is not encouraging for us. We need to write 40 million objects per
>> day (20 million files x 2 replicas). The rate we're seeing at the 40th
>> hour of the bench would be sufficient to achieve that, but it is still
>> falling, and we're only at a fraction of the number of objects the cluster
>> will eventually need to hold. So the performance trend suggests we
>> shouldn't count on having the write rate we need for long.
>>
>> If we repeat the process of creating a new pool and running the bench, the
>> same pattern holds: good initial performance that gradually degrades.
>>
>> https://postimg.org/image/ovymk7n2d/
>> [caption: 90 million objects written to a brand-new, pre-split pool
>> (poolofhopes). There are already 330 million objects on the cluster in
>> other pools.]
>> Our working theory is that the degradation over time may be related to
>> inode or dentry lookups that miss cache and lead to additional disk reads
>> and seek activity. One suggestion is that filestore directory splitting
>> may exacerbate the problem, as additional/longer disk seeks occur
>> depending on what lands in which XFS allocation group. We have found
>> pre-split pools useful in one major way: they avoid the periods of
>> near-zero write performance that we put down to the active splitting of
>> directories (the "thundering herd" effect). The overall downward curve
>> seems to remain the same whether we pre-split or not.
>>
>> The thundering herd seems to be kept in check by an appropriate pre-split.
>> Bluestore may or may not be a solution, but questions about its maturity
>> and stability, given our fairly tight timeline, don't recommend it to us.
>> Right now our big question is: "How can we avoid the gradual degradation
>> in write performance over time?"
>>
>> Thank you, Patrick
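P.S. On the splitting theory: besides pre-splitting via expected_num_objects at pool creation (which you are already doing), you can push the split points further out with the filestore thresholds. As I understand it, a filestore subdirectory splits once it holds more than filestore_split_multiple * abs(filestore_merge_threshold) * 16 files, so something like this in ceph.conf would raise that ceiling (illustrative numbers, not a recommendation):

  [osd]
  # split at 8 * 40 * 16 = 5120 files per subdirectory,
  # instead of the default 2 * 10 * 16 = 320
  filestore split multiple = 8
  filestore merge threshold = 40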
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com