> In particular, when using leveldb, stalls while reading or writing to 
> the store - typically, leveldb is compacting when this happens. This 
> leads to all sorts of timeouts being triggered, but the really annoying 
> one would be the lease timeout, which tends to result in a flapping quorum.
> 
> Also, being unable to sync monitors. Again, stalls on leveldb lead to 
> timeouts being triggered and the sync restarting.
> 
> Once upon a time, this *may* have also translated into large memory 
> consumption. A direct relation was never proved though, and the behaviour 
> went away as ceph became smarter and libs were updated by distros.

My team suffered no small amount of pain due to persistent DB inflation, not 
just during topology churn.  RHCS 1.3.2 addressed that for us.  Before we 
applied that update I saw mon DBs grow as large as 54 GB.

When measuring the size of /var/lib/ceph/mon/store.db, be careful not to 
blindly include *.log or *LOG* files that you may find there.  I set 
leveldb_log = /dev/null to suppress writing those, which were confusing our 
metrics.  I also set mon_compact_on_start = true to compact each mon’s leveldb 
at startup.  This was found anecdotally to be more effective than using ceph 
tell to compact during operation, as there was less contention.  It does mean, 
however, that when the set of DBs across the mons is inflated, one needs to be 
careful not to restart them all within a short window.  It seems that even 
after the mon log reports that compaction is complete, trimming (as of Hammer) 
still runs silently in the background and impacts performance until it 
finishes.  This means that one will see additional shrinkage of the store.db 
directory over time.
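
For concreteness, a minimal sketch of those settings plus on-demand compaction; 
placing the options under [mon] and the mon id "a" are my assumptions, so 
adjust for your own cluster:

    [mon]
        # send leveldb's own log to /dev/null so it doesn't pad the
        # apparent size of store.db or pollute size metrics
        leveldb_log = /dev/null

        # compact each mon's leveldb when the daemon starts
        mon_compact_on_start = true

    # on-demand compaction of one running mon (contends with live traffic)
    ceph tell mon.a compact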

In my clusters of 450+ OSDs, 4 GB is the arbitrary point above which I get 
worried.  Mind you, most of our mon DBs are still on (wince) LFF rotational 
drives, which doesn’t help; I strongly advise faster storage for the DBs.  I 
found that the larger the DBs grow, the slower all mon operations become, 
including peering and especially backfill/recovery.  With a DB that large you 
may find that OSD loss or removal via attrition can cause significant client 
impact.
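
For what it’s worth, a rough way to watch that threshold from the shell; the 
path assumes the default cluster name and a mon id equal to the short 
hostname, both assumptions you may need to adjust:

    # size of the mon store, excluding leveldb's own log files
    store=/var/lib/ceph/mon/ceph-$(hostname -s)/store.db
    size_mb=$(du -sm --exclude='*.log' --exclude='*LOG*' "$store" | awk '{print $1}')
    # 4 GB (4096 MB) is the arbitrary worry threshold mentioned above
    [ "$size_mb" -gt 4096 ] && echo "mon store is ${size_mb} MB; time to look at compaction"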

Inflating during recovery/backfill does still happen; it sounds as though the 
OP doubled the size of his/her cluster in one swoop, which is fairly drastic.  
Early on with Dumpling I trebled the size of a cluster in one operation, and 
still ache from the fallout.  A phased deployment will spread out the impact 
and allow the DBs to preen in between phases.  One approach is to add only one 
or a few drives per OSD server at a time, but across the servers in parallel.  
So if you were adding 10 servers of 12 OSDs each, that would be 6-12 steps of 
10x1 or 10x2 OSDs.  That way the write workload is spread across 10 servers 
instead of funneling into just one, avoiding HBA saturation and the blocked 
requests that can result from it.  Adding the OSDs with 0 CRUSH weight and 
using ceph osd crush reweight to bring them up in phases can also ease the 
process.  Early on we would allow each reweight to fully recover before the 
next step, but I’ve since found that peering is the biggest share of the 
impact, and that upweighting can proceed just as safely once peering from the 
previous adjustment clears up.  This avoids moving some fraction of the data 
more than once.  With Jewel, backfill/recovery is improved so that data which 
doesn’t really need to move isn’t shuffled, but on Hammer this decidedly helps 
avoid a bolus of slow requests as each new OSD comes up and peers.
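
To illustrate the zero-weight approach, a rough sketch; the OSD ids and 
weights below are placeholders, and putting osd_crush_initial_weight in the 
[osd] section is my assumption:

    # have new OSDs register in the CRUSH map with zero weight rather than
    # a size-derived weight (set before creating the new OSDs)
    [osd]
        osd_crush_initial_weight = 0

    # once the new OSDs are created and running, upweight in phases,
    # spreading each step across all of the new servers
    ceph osd crush reweight osd.120 0.5
    ceph osd crush reweight osd.121 0.5

    # wait for peering from this step to settle before the next bump
    ceph -s
    ceph pg stat

    # next step: raise toward each drive's full CRUSH weight
    ceph osd crush reweight osd.120 1.0
    ceph osd crush reweight osd.121 1.0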

— Anthony
