On Wed, 23 Sep 2015, Alexander Yang wrote:
> hello,
>     We use Ceph + OpenStack in our private cloud. Our cluster has 5 mons
> and 800 OSDs, with a capacity of about 1 PB, and runs about 700 VMs and
> 1100 volumes.
>     Recently we increased our pg_num, so the cluster now has about 70000
> PGs. My real intention was to give every OSD about 100 PGs, but after
> increasing pg_num I found I was wrong: because the OSDs have different
> CRUSH weights, the PG count per OSD differs, and some OSDs now exceed
> 500 PGs.
>     Now the problem appears whenever I want to change an OSD's weight,
> which means changing the crushmap. Such a change causes about 0.03% of
> the data to migrate, and the mons immediately start an election. This
> hangs the cluster, and when the election ends the original leader is
> still the leader. During the mon election, the VMs on the upper layer
> see many slow requests, so now I don't dare to do any operation that
> changes the crushmap. But I worry about an important thing: if our
> cluster loses a host or even a whole rack, the crushmap will change a
> lot and the data migration will be large too. I worry the cluster will
> hang for a long time and, as a result, all the VMs on the upper layer
> will shut down.
>     In my opinion, when I change the crushmap either *the leader mon has
> to calculate too much information* or *too many clients want to fetch
> the new crushmap from the leader mon*. This must hang the mon thread, so
> the leader mon can't heartbeat to the other mons, the other mons think
> the leader is down, and they begin a new election. I am sorry if my
> guess is wrong.
>     The crushmap is attached. Can anyone give me some advice or
> guidance? Thanks very much!
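[A side note on the pg_num sizing question above: the usual rule of thumb
(an assumption added here, not something stated in this thread) is to aim
for a total PG count of roughly (num_osds * target_pgs_per_osd) /
pool_size, rounded up to a power of two, split across the pools. A quick
back-of-the-envelope check, assuming 3x replicated pools:

    # illustrative numbers only: 800 OSDs, ~100 PGs per OSD, size=3
    num_osds=800
    target_per_osd=100
    size=3
    echo $(( num_osds * target_per_osd / size ))   # ~26666 -> round up to 32768

Also note that CRUSH places PGs on OSDs in proportion to their CRUSH
weight, which is why the OSDs with larger weights end up well above the
average PG count.]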
There were huge improvements made in hammer in terms of mon efficiency
in these cases where it is under load.  I recommend upgrading, as that
will help.

You can also mitigate the problem somewhat by adjusting the mon_lease
and associated settings upward.  Scale all of mon_lease,
mon_lease_renew_interval, mon_lease_ack_timeout, and mon_accept_timeout
by 2x or 3x.

It also sounds like you may be using some older tunables/settings for
your pools or crush rules.  Can you attach the output of 'ceph osd dump'
and 'ceph osd crush dump | tail -n 20'?

sage
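[For the mon lease settings mentioned above, one way to apply the 2x-3x
scaling is in ceph.conf on the monitors. This is only a sketch; the
baseline values in the comments are the usual defaults for Ceph of this
era, so verify your own with 'ceph daemon mon.<id> config show | grep
mon_' before changing anything:

    [mon]
        mon lease = 15                  # default 5
        mon lease renew interval = 9    # default 3
        mon lease ack timeout = 30      # default 10
        mon accept timeout = 30         # default 10

The same values can also be injected into running mons without a
restart, e.g. 'ceph tell mon.<id> injectargs "--mon_lease 15"' on each
monitor, though injected values do not survive a daemon restart. To see
whether old CRUSH tunables are in play (related to the last question
above), 'ceph osd crush show-tunables' prints the current tunables.]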