----- Message from Craig Lewis <[email protected]> ---------
   Date: Thu, 24 Apr 2014 11:20:08 -0700
   From: Craig Lewis <[email protected]>
Subject: Re: [ceph-users] OSD distribution unequally -- osd crashes
     To: Kenneth Waegeman <[email protected]>
     Cc: [email protected]


Your OSDs shouldn't be crashing during a remap. Still, you might try to get that one OSD below 96% first; it could have an effect. If OSDs continue to crash after that, I'd start a new thread about the crashes.


If you have the option to delete some data and reload it later, I'd do that now. If you can't, or you can't delete enough, read up on mon_osd_full_ratio, mon_osd_nearfull_ratio, osd_failsafe_nearfull_ratio, and osd_backfill_full_ratio. You can temporarily change these values, but *be very careful*. Make sure you understand the docs before doing it. Ceph is protecting you; you'll cause yourself a lot more pain if you let the disks get to 100% full.

You can change those params on the fly with:
ceph tell osd.* injectargs '--mon_osd_nearfull_ratio 0.85'

Get the current values with:
ceph --admin-daemon /var/run/ceph/ceph-osd.<ID>.asok config get mon_osd_nearfull_ratio

injectargs changes won't survive a restart. If you need them to last through a reboot or daemon restart, you'll want to add them to ceph.conf too. I didn't; I just reran the injectargs whenever I manually restarted something.
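If you do make them persistent, a minimal sketch of what the ceph.conf addition could look like (the value shown just mirrors the example above; use whatever you actually injected, and remove it again once the cluster is out of trouble):

```ini
[global]
; Temporary relaxation while digging out of a full condition.
; Remove this once utilization is back under control.
mon osd nearfull ratio = 0.85
```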


Once you get IO working again, I'd knock down the weights on those specific OSDs before proceeding with more pg_num and pgp_num changes. reweight-by-utilization is a wrapper around:
ceph osd reweight <OSD> <WEIGHT>

You can get the current weights with:
ceph osd dump | grep ^osd

Assuming all your OSDs are weighted 1.0, I'd drop the weight on all OSDs more than 85% full to 0.95 and see what that looks like. You might need a couple of passes, because some of the migrated PGs will land on other nearfull OSDs. I dropped weights by 0.025 every pass, and I ended up with one OSD weighted 0.875.
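For example, here's a quick sketch for spotting the mounts that need a reweight pass, run against df output like the listing quoted further down in this thread (the sample data and the 85% cutoff are just illustrative; you still have to map each mount back to its OSD id before running ceph osd reweight):

```shell
# Sample df output: device, size, used, avail, use%, mountpoint.
df_output='/dev/sdg        3.7T  3.2T  491G  87% /var/lib/ceph/osd/sdg
/dev/sdj        3.7T  3.4T  303G  92% /var/lib/ceph/osd/sdj
/dev/sdl        3.7T  1.8T  2.0T  48% /var/lib/ceph/osd/sdl
/dev/sdn        3.7T  3.5T  186G  96% /var/lib/ceph/osd/sdn'

# Print every mount over 85% full; these are the OSDs to drop to 0.95.
echo "$df_output" | awk '{ gsub(/%/, "", $5); if ($5 + 0 > 85) print $6, $5 "%" }'
```

In practice you'd feed this from `df` directly instead of a here-string; the awk filter is the whole trick.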

While it's remapping, some PGs will get into the backfill_toofull state. Assuming your reweights are enough, they will clear, but it does slow things down. If you get to the point where all of the backfilling PGs are backfill_toofull, make another pass on the OSD reweights.


Once they're all below 85%, you can start the pg_num and pgp_num expansions. You may need to revisit the OSD weights during this process, and you'll probably want to reset all weights to 1.0 when you're done. Since you have OSDs crashing, note that marking an OSD out and then in will reset its weight to 1.0. You might want to track your weights outside of Ceph, so you can manually reweight if that happens.


Lastly, you'll want to monitor this. If you had noticed when the first OSD hit 86%, things would have been much less painful.


Thanks for the very useful information!

I started by deleting some data, and this helped with the stability of the OSDs. No OSD has crashed in the past 6 hours (before, it was multiple per hour). There are indeed a lot of PGs in backfill_toofull. Now that the OSDs have stopped crashing, the cluster can actually recover and resume the backfilling!




*Craig Lewis*
Senior Systems Engineer
Office +1.714.602.1309
Email [email protected] <mailto:[email protected]>

*Central Desktop. Work together in ways you never thought possible.*
Connect with us Website <http://www.centraldesktop.com/> | Twitter <http://www.twitter.com/centraldesktop> | Facebook <http://www.facebook.com/CentralDesktop> | LinkedIn <http://www.linkedin.com/groups?gid=147417> | Blog <http://cdblog.centraldesktop.com/>

On 4/24/14 02:19, Kenneth Waegeman wrote:

----- Message from Craig Lewis <[email protected]> ---------
  Date: Fri, 18 Apr 2014 14:59:25 -0700
  From: Craig Lewis <[email protected]>
Subject: Re: [ceph-users] OSD distribution unequally
    To: [email protected]


When you increase the number of PGs, don't jump straight to the max value; step up to it. You'll want to end up around 2048, so do 400 -> 512, wait for it to finish, -> 1024, wait, -> 2048.
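A sketch of that stepping, just generating the commands to run at each stage (the pool name is hypothetical, and you'd wait for the cluster to settle between steps rather than piping these straight into a shell):

```shell
pool=data      # hypothetical pool name
target=2048
pg=512
while [ "$pg" -le "$target" ]; do
    # Raise pg_num first, then pgp_num; wait for rebalancing before the next step.
    echo "ceph osd pool set $pool pg_num $pg"
    echo "ceph osd pool set $pool pgp_num $pg"
    pg=$((pg * 2))
done
```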

Thanks, I changed it to 512 and also did the reweight-by-utilisation.
While doing this, some OSDs crashed a few times. I thought it was maybe because they were almost full. When this finished, some PGs were still in the active+remapped state. I read in another mail on the list to try ceph osd crush tunables optimal, so that is running now. But after a few hours, 17 out of 42 OSDs had crashed again. (I don't know if the crashing is connected with the reweight and the PGs stuck in active+remapped.)
From the log file:

2014-04-24 03:46:57.442110 7f4968a11700 -1 osd/PG.cc: In function 'PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0u>::my_context)' thread 7f4968a11700 time 2014-04-24 03:46:57.366010
osd/PG.cc: 5298: FAILED assert(0 == "we got a bad state machine event")

ceph version 0.79-209-g924064f (924064f83b7fb5d4f0961ee712d410ed1855cba0)
1: (PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0x12f) [0x7a99ff]
2: (boost::statechart::detail::inner_constructor<boost::mpl::l_item<mpl_::long_<1l>, PG::RecoveryState::Crashed, boost::mpl::l_end>, boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator> >::construct(boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>* const&, boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>&)+0x26) [0x7ed146]
3: (boost::statechart::simple_state<PG::RecoveryState::Started, PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Start, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xfa) [0x7f6faa]
4: (boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, PG::RecoveryState::RepNotRecovering, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x161) [0x802461]
5: (boost::statechart::simple_state<PG::RecoveryState::RepRecovering, PG::RecoveryState::ReplicaActive, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x140) [0x7f5860]
6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x4b) [0x7f7e6b]
7: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x32f) [0x7b335f]
8: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x330) [0x65e190]
9: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x16) [0x69bed6]
10: (ThreadPool::worker(ThreadPool::WorkThread*)+0x551) [0x9d1e11]
11: (ThreadPool::WorkThread::entry()+0x10) [0x9d4e50]
12: (()+0x79d1) [0x7f49840d79d1]
13: (clone()+0x6d) [0x7f4982df8b6d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.


Many thanks!

Kenneth





Also remember that you don't need a lot of PGs if you don't have much data in the pools. My .rgw.buckets pool has 2k PGs, but the RGW metadata pools only have a couple of MB and 32 PGs each.



On 4/18/14 05:04, Tyler Brekke wrote:
That is rather low, increasing the pg count should help with the data distribution.

Documentation recommends starting with (100 * (num of OSDs)) / (replicas), rounded up to the nearest power of two.

https://ceph.com/docs/master/rados/operations/placement-groups/
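Worked through for this cluster (42 OSDs, replication 3), a sketch of that sizing rule:

```shell
osds=42
replicas=3
raw=$(( (100 * osds) / replicas ))    # (100 * 42) / 3 = 1400

# Round up to the next power of two.
pgs=1
while [ "$pgs" -lt "$raw" ]; do
    pgs=$((pgs * 2))
done
echo "$pgs"    # 2048
```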



On Fri, Apr 18, 2014 at 4:54 AM, Kenneth Waegeman <[email protected]> wrote:


  ----- Message from Tyler Brekke <[email protected]> ---------
     Date: Fri, 18 Apr 2014 04:37:26 -0700
     From: Tyler Brekke <[email protected]>
  Subject: Re: [ceph-users] OSD distribution unequally
       To: Dan Van Der Ster <[email protected]>
       Cc: Kenneth Waegeman <[email protected]>, ceph-users <[email protected]>



      How many placement groups do you have in your pool containing
      the data, and
      what is the replication level of that pool?


  400 pgs per pool, replication factor is 3



      Looks like you have too few placement groups which is causing
      the data
      distribution to be off.

      -Tyler


      On Fri, Apr 18, 2014 at 4:12 AM, Dan Van Der Ster
      <[email protected]> wrote:


           ceph osd reweight-by-utilization

          Is that still in 0.79?

          I'd start with reweight-by-utilization 200 and then adjust
          that number
          down until you get to 120 or so.

          Cheers, Dan
          On Apr 18, 2014 12:49 PM, Kenneth Waegeman
          <[email protected]>
          wrote:
            Hi,

          Some osds of our cluster filled up:
                health HEALTH_ERR 1 full osd(s); 4 near full osd(s)
                monmap e1: 3 mons at
          {ceph001=10.141.8.180:6789/0,ceph002=10.141.8.181:6789/0,ceph003=10.141.8.182:6789/0},
          election epoch 96, quorum 0,1,2
          ceph001,ceph002,ceph003
                mdsmap e93: 1/1/1 up
          {0=ceph001.cubone.os=up:active}, 1 up:standby
                osdmap e1974: 42 osds: 42 up, 42 in
                       flags full
                 pgmap v286626: 1200 pgs, 3 pools, 31096 GB data,
          26259 kobjects
                       94270 GB used, 40874 GB / 131 TB avail
                              1 active+clean+scrubbing+deep
                           1199 active+clean

          I know it is never really uniform, but the differences
          between the OSDs seem very big: one OSD has 96% usage
          while another only has 48%, which is about 1.8 TB of
          difference:
          /dev/sdc        3.7T  1.9T  1.8T  51% /var/lib/ceph/osd/sdc
          /dev/sdd        3.7T  2.5T  1.2T  68% /var/lib/ceph/osd/sdd
          /dev/sde        3.7T  2.3T  1.5T  61% /var/lib/ceph/osd/sde
          /dev/sdf        3.7T  2.7T  975G  74% /var/lib/ceph/osd/sdf
          /dev/sdg        3.7T  3.2T  491G  87% /var/lib/ceph/osd/sdg
          /dev/sdh        3.7T  2.0T  1.8T  53% /var/lib/ceph/osd/sdh
          /dev/sdi        3.7T  2.3T  1.4T  63% /var/lib/ceph/osd/sdi
          /dev/sdj        3.7T  3.4T  303G  92% /var/lib/ceph/osd/sdj
          /dev/sdk        3.7T  2.8T  915G  76% /var/lib/ceph/osd/sdk
          /dev/sdl        3.7T  1.8T  2.0T  48% /var/lib/ceph/osd/sdl
          /dev/sdm        3.7T  2.8T  917G  76% /var/lib/ceph/osd/sdm
          /dev/sdn        3.7T  3.5T  186G  96% /var/lib/ceph/osd/sdn

          We are running 0.79 (well, precisely a patched version of
          it with an MDS fix from another thread :-) )
          I remember hearing something about hashpspool having an
          effect on it, but I read this was already enabled by
          default on the latest versions.
          osd_pool_default_flag_hashpspool indeed has the value
          true, but I don't know how to check this for a specific
          pool.

          Is this behaviour normal? Or what can be wrong?

          Thanks!

          Kind regards,
          Kenneth Waegeman

          _______________________________________________
          ceph-users mailing list
          [email protected] <mailto:[email protected]>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





  ----- End message from Tyler Brekke <[email protected]> -----

  --
  Kind regards,
  Kenneth Waegeman







----- End message from Craig Lewis <[email protected]> -----



----- End message from Craig Lewis <[email protected]> -----

--

Kind regards,
Kenneth Waegeman

