Hi Cephers,

At the University of Zurich we are using Ceph as a storage back-end for our OpenStack installation. Since we recently reached 70% occupancy (mostly caused by the cinder pool, served by 16384 PGs), we are in the process of extending the cluster with additional storage nodes of the same type (except for a slightly more powerful CPU).
We decided to opt for a gradual OSD deployment: we created a temporary "root" bucket called "fresh-install" containing the newly installed nodes, and then moved OSDs from this bucket to the current production root via:

    ceph osd crush set osd.{id} {weight} host={hostname} root={production_root}

Everything seemed nicely planned, but when we started adding a few new OSDs to the cluster, thus triggering a rebalance, one of the OSDs, already at 84% disk use, passed the 85% threshold. This in turn triggered the "near full osd(s)" warning, and more than 20 PGs previously in "wait_backfill" state were marked "wait_backfill+backfill_toofull". Since the OSD kept growing and eventually reached 90% disk use, we decided to reduce its relative weight from 1 to 0.95. That recalculated the crushmap and remapped a few PGs, but did not appear to move any data off the almost-full OSD. Only when, in steps of 0.05, we reached a relative weight of 0.50 was data actually moved and some "backfill_toofull" requests released. However, we had to go down almost to 0.10 of relative weight in order to trigger some additional data movement and have the backfilling process finally finish.

We are now adding new OSDs, but the problem is constantly re-triggered, since we have multiple OSDs above 83% disk use that start growing during the rebalance.

My questions are:
- Is there something wrong in our process of adding new OSDs (some additional details below)?
- We also noticed that the problem tends to cluster around the newly added OSDs; could those two things be correlated?
- Why does reweighting not trigger immediate data movement? What is the logic behind remapped PGs: is there some sort of flat queue of tasks, or are there priorities defined?
- Has somebody experienced this situation, and if so, how was it solved/bypassed?
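For completeness, this is roughly how we perform the reweighting and inspect the affected PGs. The OSD id and weight values are only examples, and these commands obviously need a live cluster; note that "ceph osd reweight" adjusts the relative (0..1) override weight we describe above, which is a different knob from "ceph osd crush reweight":

```shell
# Lower the relative (0..1) weight of the nearly full OSD in small steps
# (osd id 281 and the weight are examples, not our exact values).
ceph osd reweight 281 0.95

# Show which PGs are held up because a backfill target is too full.
ceph health detail | grep -i backfill_toofull

# List stuck PGs (state names/output format may differ slightly by version).
ceph pg dump_stuck unclean | head

# Per-OSD utilisation, to spot OSDs approaching the 85% near-full ratio
# (the %USE column position may vary between releases).
ceph osd df
```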
Cluster details are as follows:
- version: 0.94.9,
- 5 monitors,
- 40 storage hosts, each with 24 x 4 TB disks and 1 OSD per disk (960 OSDs in total),
- osd pool default size = 3,
- journaling is on SSDs.

We have a "host" failure domain. Relevant crushmap details:

# rules
rule sas {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take sas
        step chooseleaf firstn 0 type host
        step emit
}

root sas {
        id -41          # do not change unnecessarily
        # weight 3283.279
        alg straw
        hash 0  # rjenkins1
        item osd-l2-16 weight 87.360
        item osd-l4-06 weight 87.360
        ...
        item osd-k7-41 weight 14.560
        item osd-l4-36 weight 14.560
        item osd-k5-36 weight 14.560
}

host osd-k7-21 {
        id -46          # do not change unnecessarily
        # weight 87.360
        alg straw
        hash 0  # rjenkins1
        item osd.281 weight 3.640
        item osd.282 weight 3.640
        item osd.285 weight 3.640
        ...
}

host osd-k7-41 {
        id -50          # do not change unnecessarily
        # weight 14.560
        alg straw
        hash 0  # rjenkins1
        item osd.900 weight 3.640
        item osd.901 weight 3.640
        item osd.902 weight 3.640
        item osd.903 weight 3.640
}

As mentioned before, we created a temporary bucket called "fresh-install" containing the newly installed nodes, i.e.:

root fresh-install {
        id -34          # do not change unnecessarily
        # weight 218.400
        alg straw
        hash 0  # rjenkins1
        item osd-k5-36-fresh weight 72.800
        item osd-k7-41-fresh weight 72.800
        item osd-l4-36-fresh weight 72.800
}

Then, in steps of 6 OSDs (2 OSDs from each new host), we move OSDs from the "fresh-install" bucket to the "sas" bucket.

Thank you in advance for all the suggestions.

Cheers,
Tyanko
_______________________________________________ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com