Hi Patrick,

You could add more RAM to the servers, which probably would not increase the 
cost too much.

You could change the swappiness value, or use something like 
https://hoytech.com/vmtouch/ to pre-cache inode entries.
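Something along these lines, as a rough sketch only (the OSD path and the 
sysctl values are just examples to experiment with, not recommendations):

# prefer keeping dentries/inodes in cache, and avoid swapping them out
sysctl -w vm.vfs_cache_pressure=10
sysctl -w vm.swappiness=10
# (add to /etc/sysctl.conf to make it persistent)

# warm the dentry/inode cache by walking one OSD's filestore tree
find /var/lib/ceph/osd/ceph-0/current -type f > /dev/null
# or, if the working set is small enough, pull file pages into the page
# cache as well:
vmtouch -t /var/lib/ceph/osd/ceph-0/current

With 8TB OSDs the data obviously won't fit in RAM, so the metadata walk is 
probably the more realistic part.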

You could maybe tarball batches of the smaller files before loading them into 
Ceph, to cut down the object count.
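Roughly like this, as a sketch only (batch naming, paths and pool are 
placeholders):

# bundle a batch of small files into one archive and store it as a single,
# larger RADOS object
tar -cf batch-000001.tar -C /data/incoming batch-000001/
rados -p poolofhopes put batch-000001 batch-000001.tar

The trade-off is that reading one file then means fetching the archive (or 
part of it) and indexing into it on the client side.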

How are the ten clients accessing Ceph by the way?
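On the directory splitting you describe below: if you stay on filestore, the 
split point is controlled by the split/merge settings, so you can push splits 
further out at the cost of bigger directories. A sketch only, with example 
values to experiment with rather than recommendations:

# ceph.conf, [osd] section
# a subdirectory splits at roughly
#   filestore_split_multiple * abs(filestore_merge_threshold) * 16 files
filestore merge threshold = -10
filestore split multiple = 8
# a negative merge threshold also lets pool creation pre-split directories
# when expected_num_objects is given, as you are doing with poolofhopes

That only delays the problem rather than removing it, of course.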

> On May 1, 2017, at 14:23, Patrick Dinnen <pdin...@gmail.com> wrote:
> 
> One additional detail, we also did filestore testing using Jewel and saw 
> substantially similar results to those on Kraken.
> 
>> On Mon, May 1, 2017 at 2:07 PM, Patrick Dinnen <pdin...@gmail.com> wrote:
>> Hello Ceph-users,
>> 
>> Florian has been helping out on our proof-of-concept cluster, which is where 
>> we've been experiencing these issues. Thanks for the replies so far. I wanted 
>> to jump in with some extra details.
>> 
>> All of our testing has been with scrubbing turned off, to remove that as a 
>> factor.
>> 
>> Our use case requires a Ceph cluster to indefinitely store ~10 billion files 
>> 20-60KB in size. We’ll begin with 4 billion files migrated from a legacy 
>> storage system. Ongoing writes will be handled by ~10 client machines and 
>> come in at a fairly steady 10-20 million files/day. Every file (excluding 
>> the legacy 4 billion) will be read once by a single client within hours of 
>> its initial write to the cluster. Future file read requests will come from 
>> a single server with a long-tail distribution: popular files will be read 
>> thousands of times a year, but most will be read rarely or never.
>> 
>> 
>> Our “production” design has 6 nodes and 24 OSDs (expandable to 48 OSDs), with 
>> SSD journals at a 1:4 ratio to HDDs. Each node looks like this:
>> 2 x E5-2660 8-core Xeons
>> 64GB RAM DDR-3 PC1600
>> 10Gb ceph-internal network (SFP+) 
>> LSI 9210-8i controller (IT mode)
>> 4 x 8TB OSD HDDs, a mix of two types:
>> Seagate ST8000DM002
>> HGST HDN728080ALE604
>> Mount options = xfs (rw,noatime,attr2,inode64,noquota) 
>> 1 x SSD journal Intel 200GB DC S3700
>> 
>> Running Kraken 11.2.0 on Ubuntu 16.04. All testing has been done with a 
>> replication level of 2. We’re using rados bench to shotgun a lot of files into 
>> our test pools. Specifically following these two steps: 
>> ceph osd pool create poolofhopes 2048 2048 replicated "" replicated_ruleset 
>> 500000000
>> rados -p poolofhopes bench -t 32 -b 20000 30000000 write --no-cleanup
>> 
>> We leave the bench running for days at a time and watch the cluster's 
>> object count. We see performance that starts off decent and degrades over 
>> time. There’s a very brief initial surge in write performance after which 
>> things settle into the downward trending pattern.
>> 
>> 1st hour - 2 million objects/hour
>> 20th hour - 1.9 million objects/hour 
>> 40th hour - 1.7 million objects/hour
>> 
>> This performance is not encouraging for us. We need to be writing 40 million 
>> objects per day (20 million files at 2x replication). The rates we’re seeing 
>> at the 40th hour of our bench would be sufficient to achieve that. Those 
>> write rates are still falling, though, and we’re only at a fraction of the 
>> number of objects in the cluster that we need to handle. So the trend in 
>> performance suggests we shouldn’t count on having the write performance we 
>> need for much longer.
>> 
>> If we repeat the process of creating a new pool and running the bench, the 
>> same pattern holds: good initial performance that gradually degrades.
>> 
>> https://postimg.org/image/ovymk7n2d/
>> [caption:90 million objects written to a brand new, pre-split pool 
>> (poolofhopes). There are already 330 million objects on the cluster in other 
>> pools.]
>> 
>> Our working theory is that the degradation over time may be related to inode 
>> or dentry lookups that miss the cache and lead to additional disk reads and 
>> seek activity. There’s a suggestion that filestore directory splitting may 
>> exacerbate that problem, as additional/longer disk seeks occur depending on 
>> which XFS allocation group things land in. We have found pre-split pools 
>> useful in one major way: they avoid periods of near-zero write performance 
>> that we have put down to the active splitting of directories (the 
>> "thundering herd" effect). The overall downward curve seems to remain the 
>> same whether we pre-split or not.
>> 
>> The thundering herd seems to be kept in check by an appropriate pre-split. 
>> Bluestore may or may not be a solution, but uncertainty about its stability, 
>> combined with our fairly tight timeline, doesn't recommend it to us. Right 
>> now our big question is "how can we avoid the gradual degradation in write 
>> performance over time?". 
>> 
>> Thank you, Patrick
>> 
>> 
> 
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
