Hi Patrick,

 

Is there any chance that you can graph the XFS stats to see if there is an 
increase in inode/dentry cache misses as the ingest performance drops off? At 
least that might confirm the issue.
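
One way to get numbers to graph is to sample the inode-cache counters that XFS exposes in /proc/fs/xfs/stat. The sketch below is a rough illustration, not a tested monitoring tool. The field order assumed for the "ig" (inode get) line is my reading of the XFS stats layout and should be checked against your kernel's documentation, and the function names are made up for this example.

```python
# Assumed field order for the "ig" (inode get) line of /proc/fs/xfs/stat.
# This is an assumption; verify it against the XFS documentation for your kernel.
IG_FIELDS = ("attempts", "found", "frecycle", "missed", "dup", "reclaims", "attrchg")


def parse_ig(stat_text):
    """Return the inode-cache counters from the text of /proc/fs/xfs/stat."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "ig":
            return dict(zip(IG_FIELDS, map(int, parts[1:])))
    raise ValueError("no 'ig' line found")


def miss_ratio(before, after):
    """Fraction of inode lookups that missed the cache between two samples."""
    attempts = after["attempts"] - before["attempts"]
    missed = after["missed"] - before["missed"]
    return missed / attempts if attempts else 0.0
```

Reading the file once a minute on each OSD node, passing its contents to parse_ig, and plotting miss_ratio for consecutive samples against the ingest rate should show whether the misses climb as performance drops.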

 

The only other thing I can think of would be to try running the OSDs on top of 
something like a bcache set. As your workload is almost entirely writes, 
running the cache in writearound mode (writes bypass the cache and only reads 
populate it) might mean that inodes and dentries have a good chance of being 
cached on SSD. If you have some free capacity on your journals to use for 
bcache, it might be worth a shot. I have done something very similar on a 
single node recently to try and combat excessive dentry/inode lookups, using a 
200GB cache for 12 x 8TB OSDs. Performance is better, but I can’t say exactly 
how much is down to caching of the general data vs. inodes, etc.

 

Kernel 4.10+ supports partitions on bcache devices, so it’s a lot easier to use 
with Ceph.

 

Nick

 

From: ceph-users [mailto:[email protected]] On Behalf Of 
Patrick Dinnen
Sent: 01 May 2017 19:07
To: [email protected]
Subject: [ceph-users] Maintaining write performance under a steady intake of 
small objects

 

Hello Ceph-users,

Florian has been helping with the issues we've been experiencing on our 
proof-of-concept cluster. Thanks for the replies so far. I wanted to jump in 
with some extra details.

All of our testing has been with scrubbing turned off, to remove that as a 
factor.

Our use case requires a Ceph cluster to indefinitely store ~10 billion files, 
each 20-60KB in size. We’ll begin with 4 billion files migrated from a legacy 
storage system. Ongoing writes will be handled by ~10 client machines and come 
in at a fairly steady 10-20 million files/day. Every file (excluding the legacy 
4 billion) will be read once by a single client within hours of its initial 
write to the cluster. Future file read requests will come from a single server 
and follow a long-tail distribution, with popular files read thousands of times 
a year but most read never or virtually never.

Our “production” design has 6 nodes and 24 OSDs (expandable to 48 OSDs), with 
SSD journals at a 1:4 ratio with HDDs. Each node looks like this:

*       2 x E5-2660 8-core Xeons

*       64GB RAM DDR-3 PC1600

*       10Gb ceph-internal network (SFP+) 

*       LSI 9210-8i controller (IT mode)

*       4 x OSD 8TB HDDs, mix of two types

        *       Seagate ST8000DM002

        *       HGST HDN728080ALE604

*       Mount options = xfs (rw,noatime,attr2,inode64,noquota) 

*       1 x SSD journal Intel 200GB DC S3700
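
As a rough sense of scale for the inode and dentry caches on this hardware, the target object count can be divided across the OSDs. This is only a back-of-the-envelope sketch: it assumes one RADOS object per file, perfectly even placement, and the replication level of 2 used in our testing.

```python
FILES = 10_000_000_000   # ~10 billion files, from the use case above
REPLICAS = 2             # replication level used in testing


def copies_per_osd(files, replicas, osds):
    """Object copies each OSD holds, assuming perfectly even placement."""
    return files * replicas / osds


for osds in (24, 48):
    millions = copies_per_osd(FILES, REPLICAS, osds) / 1e6
    print(f"{osds} OSDs: ~{millions:.0f} million object copies per OSD")
```

Under those assumptions each OSD ends up holding several hundred million objects, with the 48-OSD expansion halving the per-OSD figure.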

 

Running Kraken 11.2.0 on Ubuntu 16.04. All testing has been done with a 
replication level of 2. We’re using rados bench to shotgun a lot of files into 
our test pools, specifically following these two steps: 

ceph osd pool create poolofhopes 2048 2048 replicated "" replicated_ruleset 500000000

rados -p poolofhopes bench -t 32 -b 20000 30000000 write --no-cleanup

 

We leave the bench running for days at a time and watch the cluster's object 
count. We see performance that starts off decent and degrades over time. 
There’s a very brief initial surge in write performance, after which things 
settle into the downward-trending pattern.

 

1st hour - 2 million objects/hour

20th hour - 1.9 million objects/hour 

40th hour - 1.7 million objects/hour


This performance is not encouraging for us. We need to be writing 40 million 
objects per day (20 million files, each stored twice). The rates we’re seeing 
at the 40th hour of our bench would be sufficient to achieve that. Those write 
rates are still falling, though, and we’re only at a fraction of the number of 
objects in cluster that we need to handle. So the trend in performance 
suggests we shouldn’t count on having the write performance we need for too 
long.
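
To show how thin the margin is, the hourly rates above can be converted to daily totals and compared with the 40 million/day target. This is just the arithmetic spelled out; it assumes the objects/hour figures from the bench are directly comparable to the target, as the paragraph above treats them.

```python
REQUIRED_PER_DAY = 40_000_000

# Hour of the bench run -> observed objects/hour, from the figures above.
observed_per_hour = {1: 2_000_000, 20: 1_900_000, 40: 1_700_000}


def per_day(rate_per_hour):
    """Daily total if an hourly rate were sustained for 24 hours."""
    return rate_per_hour * 24


# Hourly rate below which the daily target can no longer be met.
break_even_per_hour = REQUIRED_PER_DAY / 24

for hour, rate in observed_per_hour.items():
    daily = per_day(rate)
    print(f"hour {hour}: {daily / 1e6:.1f} million/day "
          f"({daily / REQUIRED_PER_DAY:.0%} of target)")
```

The 40th-hour rate works out to 40.8 million objects/day, only about 2% above the target, and the break-even rate is roughly 1.67 million objects/hour.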


If we repeat the process of creating a new pool and running the bench, the same 
pattern holds: good initial performance that gradually degrades.

 

https://postimg.org/image/ovymk7n2d/

[caption: 90 million objects written to a brand new, pre-split pool 
(poolofhopes). There are already 330 million objects on the cluster in other 
pools.]

 

Our working theory is that the degradation over time may be related to inode or 
dentry lookups that miss cache and lead to additional disk reads and seek 
activity. There’s a suggestion that filestore directory splitting may 
exacerbate that problem, as additional or longer disk seeks occur depending on 
which XFS allocation group things land in. We have found pre-split pools 
useful in one major way: they avoid periods of near-zero write performance 
that we have put down to the active splitting of directories (the "thundering 
herd" effect). The overall downward curve seems to remain the same whether we 
pre-split or not.

 

The thundering herd seems to be kept in check by an appropriate pre-split. 
Bluestore may or may not be a solution, but uncertainty about its stability, 
given our fairly tight timeline, doesn't recommend it to us. Right now our big 
question is "how can we avoid the gradual degradation in write performance 
over time?". 

 

Thank you, Patrick

 

 

_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
