Re: ceph-mon always election when change crushmap in firefly

2015-09-24 Thread Sage Weil
On Thu, 24 Sep 2015, Alexander Yang wrote:
> I used 'ceph osd crush dump | tail -n 20' and got:
> 
>   "type": 1,
>   "min_size": 1,
>   "max_size": 10,
>   "steps": [
> { "op": "take",
>   "item": -62,
>   "item_name": "BJ-SSD"},
> { "op": "chooseleaf_firstn",
>   "num": 0,
>   "type": "rack"},
> { "op": "emit"}]}],
>   "tunables": { "choose_local_tries": 2,
>   "choose_local_fallback_tries": 5,
>   "choose_total_tries": 19,
>   "chooseleaf_descend_once": 0,
>   "profile": "argonaut",
>   "optimal_tunables": 0,
>   "legacy_tunables": 1,
>   "require_feature_tunables": 0,
>   "require_feature_tunables2": 0}}
> 
> Does it provide some clue?
> 
> In my test environment I can't reproduce this problem; it only appears in
> the production environment. So I need to keep the cluster as steady as
> possible, with as little data migration as possible. Do you have any advice?

It looks like this is an old cluster that has been upgraded--it's still 
using the argonaut (original!) crush tunables.  I suggest moving to

 ceph osd crush tunables firefly

(assuming all your clients are firefly or newer).  If not, then the 
bobtail tunables are a good first step.  This should eliminate the behavior 
you see... but will trigger a fair bit of rebalancing to do the 
transition.  You can use crushtool to test how bad it will be with 
something like

 ceph osd getcrushmap -o cm
 crushtool -i cm --set-choose-local-tries 0 \
   --set-choose-local-fallback-tries 0 --set-choose-total-tries 50 \
   --set-chooseleaf-descend-once 1 --set-chooseleaf-vary-r 1 -o cm.new
 crushtool -i cm --test --num-rep 3 --show-mappings > /tmp/before
 crushtool -i cm.new --test --num-rep 3 --show-mappings > /tmp/after
 wc -l /tmp/before
 diff -u /tmp/before /tmp/after | grep ^+ | wc -l
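
As a rough guide to reading those two numbers, the ratio of changed lines 
to total lines approximates the fraction of the tested PG mappings that 
would move.  A minimal sketch of that arithmetic (assuming /tmp/before and 
/tmp/after were produced as above):

 total=$(wc -l < /tmp/before)
 # count only added mapping lines, skipping the '+++' header of the diff
 changed=$(diff -u /tmp/before /tmp/after | grep -c '^+[^+]')
 echo "roughly $((100 * changed / total))% of the tested mappings change"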

sage


> 
> In addition, I find the cluster replies more slowly than before when I use
> 'ceph -s'.  Before, 'ceph -w' printed a status line every second, but now I
> sometimes wait 3-4 seconds for one line.  Is this connected with that
> problem?
> 
> Thanks for your attention!
> 
> 2015-09-23 20:34 GMT+08:00 Sage Weil :
> 
> > On Wed, 23 Sep 2015, Alexander Yang wrote:
> > > hello,
> > > We use Ceph+OpenStack in our private cloud. Our cluster has 5 mons
> > > and 800 osds, and the capacity is about 1 PB. It runs about 700 VMs
> > > and 1100 volumes.
> > > Recently we increased our pg_num, and the cluster now has about 7
> > > pgs. My real intention was for every osd to have 100 pgs, but after
> > > increasing pg_num I found I was wrong: because different osds have
> > > different crush weights, the pg count per osd varies, and some osds
> > > now exceed 500 pgs.
> > > Now the problem appears: when I need to change the weight of some
> > > osd, which means changing the crushmap, even a change that only causes
> > > about 0.03% of the data to migrate makes the mons keep starting
> > > elections. This hangs the cluster, and when the elections end the
> > > original leader is still the leader. During the mon elections the VMs
> > > on the upper layer see too many slow requests, so now I dare not do
> > > any operation that changes the crushmap. But I worry about an
> > > important thing: if our cluster loses one host or even one rack, the
> > > crushmap will change a lot and the data migration will be large too.
> > > I worry the cluster will hang for a long time and, as a result, all
> > > the VMs on the upper layer will shut down.
> > > My guess is that when I change the crushmap, *the leader mon may
> > > have too much information to calculate*, or *too many clients want to
> > > get the new crushmap from the leader mon*. That must hang the mon
> > > thread, so the leader mon cannot send heartbeats to the other mons,
> > > the other mons think the leader is down, and they begin a new
> > > election. I am sorry if my guess is wrong.
> > > The crushmap is attached. Can anyone give me some advice or
> > > guidance? Thanks very much!
> >
> > There were huge improvements made in hammer in terms of mon efficiency in
> > these cases where it is under load.  I recommend upgrading as that will
> > help.
> >
> > You can also mitigate the problem somewhat by adjusting the mon_lease and
> > associated settings up.  Scale all of mon_lease, mon_lease_renew_interval,
> > mon_lease_ack_timeout, mon_accept_timeout by 2x or 3x.
> >
> > It also sounds like you may be using some older tunables/settings
> > for your pools or crush rules.  Can you attach the output of 'ceph osd
> > dump' and 'ceph osd crush dump | tail -n 20' ?
> >
> > sage
> >
> 

Re: ceph-mon always election when change crushmap in firefly

2015-09-23 Thread Sage Weil
On Wed, 23 Sep 2015, Alexander Yang wrote:
> hello,
> We use Ceph+OpenStack in our private cloud. Our cluster has 5 mons and
> 800 osds, and the capacity is about 1 PB. It runs about 700 VMs and 1100
> volumes.
> Recently we increased our pg_num, and the cluster now has about 7
> pgs. My real intention was for every osd to have 100 pgs, but after
> increasing pg_num I found I was wrong: because different osds have
> different crush weights, the pg count per osd varies, and some osds now
> exceed 500 pgs.
> Now the problem appears: when I need to change the weight of some osd,
> which means changing the crushmap, even a change that only causes about
> 0.03% of the data to migrate makes the mons keep starting elections. This
> hangs the cluster, and when the elections end the original leader is
> still the leader. During the mon elections the VMs on the upper layer see
> too many slow requests, so now I dare not do any operation that changes
> the crushmap. But I worry about an important thing: if our cluster loses
> one host or even one rack, the crushmap will change a lot and the data
> migration will be large too. I worry the cluster will hang for a long
> time and, as a result, all the VMs on the upper layer will shut down.
> My guess is that when I change the crushmap, *the leader mon may have
> too much information to calculate*, or *too many clients want to get the
> new crushmap from the leader mon*. That must hang the mon thread, so the
> leader mon cannot send heartbeats to the other mons, the other mons think
> the leader is down, and they begin a new election. I am sorry if my guess
> is wrong.
> The crushmap is attached. Can anyone give me some advice or guidance?
> Thanks very much!

There were huge improvements made in hammer in terms of mon efficiency in 
these cases where it is under load.  I recommend upgrading as that will 
help.

You can also mitigate the problem somewhat by adjusting the mon_lease and 
associated settings up.  Scale all of mon_lease, mon_lease_renew_interval, 
mon_lease_ack_timeout, mon_accept_timeout by 2x or 3x.
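
For illustration, a minimal sketch of what a 3x bump could look like in the 
[mon] section of ceph.conf.  The numbers assume the stock defaults of 
roughly mon_lease=5, mon_lease_renew_interval=3, mon_lease_ack_timeout=10 
and mon_accept_timeout=10, so check your own values first (for example with 
'ceph daemon mon.<id> config show | grep mon_lease'):

 [mon]
 # roughly 3x the defaults; restart the mons afterwards, or inject the
 # values at runtime with 'ceph tell mon.* injectargs' if that is easier
 mon lease = 15
 mon lease renew interval = 9
 mon lease ack timeout = 30
 mon accept timeout = 30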

It also sounds like you may be using some older tunables/settings 
for your pools or crush rules.  Can you attach the output of 'ceph osd 
dump' and 'ceph osd crush dump | tail -n 20' ?

sage