----- Message from Craig Lewis <[email protected]> ---------
   Date: Thu, 24 Apr 2014 11:20:08 -0700
   From: Craig Lewis <[email protected]>
Subject: Re: [ceph-users] OSD distribution unequally -- osd crashes
     To: Kenneth Waegeman <[email protected]>
     Cc: [email protected]


Your OSDs shouldn't be crashing during a remap. Still, you might try to get that one OSD below 96% first; it could have an effect. If OSDs continue to crash after that, I'd start a new thread about the crashes.


If you have the option to delete some data and reload it later, I'd do that now. If you can't, or you can't delete enough, read up on mon_osd_full_ratio, mon_osd_nearfull_ratio, osd_failsafe_nearfull_ratio, and osd_backfill_full_ratio. You can temporarily change these values, but *be very careful*. Make sure you understand the docs before doing it. Ceph is protecting you; you'll cause yourself a lot more pain if you let the disks get to 100% full.

You can change those params on the fly with:
ceph tell osd.* injectargs '--mon_osd_nearfull_ratio 0.85'

Get the current values with:
ceph --admin-daemon /var/run/ceph/ceph-osd.<ID>.asok config get mon_osd_nearfull_ratio

injectargs changes won't survive a restart. If you need them to last through a reboot or daemon restart, you'll want to add them to ceph.conf too. I didn't; I just reran the injectargs whenever I manually restarted something.
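If you do make them persistent, a minimal sketch of what the ceph.conf addition could look like (the value shown just mirrors the example above; use whatever you actually injected, and remove it again once the cluster is out of trouble):

```ini
[global]
; Temporary relaxation while digging out of a full condition.
; Remove this once utilization is back under control.
mon osd nearfull ratio = 0.85
```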


Once you get IO working again, I'd knock down the weights on those specific OSDs before proceeding with more pg_num and pgp_num changes. reweight-by-utilization is a wrapper around:
ceph osd reweight <OSD> <WEIGHT>

You can get the current weights with:
ceph osd dump | grep ^osd

Assuming all your OSDs are weighted 1.0, I'd drop the weight on all OSDs more than 85% full to 0.95 and see what that looks like. You might need a couple of passes, because some of the migrated PGs will land on other nearfull OSDs. I dropped weights by 0.025 every pass, and I ended up with one OSD weighted 0.875.
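For example, here's a quick sketch for spotting the mounts that need a reweight pass, run against df output like the listing quoted further down in this thread (the sample data and the 85% cutoff are just illustrative; you still have to map each mount back to its OSD id before running ceph osd reweight):

```shell
# Sample df output: device, size, used, avail, use%, mountpoint.
df_output='/dev/sdg        3.7T  3.2T  491G  87% /var/lib/ceph/osd/sdg
/dev/sdj        3.7T  3.4T  303G  92% /var/lib/ceph/osd/sdj
/dev/sdl        3.7T  1.8T  2.0T  48% /var/lib/ceph/osd/sdl
/dev/sdn        3.7T  3.5T  186G  96% /var/lib/ceph/osd/sdn'

# Print every mount over 85% full; these are the OSDs to drop to 0.95.
echo "$df_output" | awk '{ gsub(/%/, "", $5); if ($5 + 0 > 85) print $6, $5 "%" }'
```

In practice you'd feed this from `df` directly instead of a here-string; the awk filter is the whole trick.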

While it's remapping, some PGs will get into the backfill_toofull state. Assuming your reweights are enough, they will clear, but it does slow things down. If you get to the point where all of the backfilling PGs are backfill_toofull, make another pass on the OSD reweights.


Once they're all below 85%, you can start the pg_num and pgp_num expansions. You may need to revisit the OSD weights during this process, and you'll probably want to reset all weights to 1.0 when you're done. Since you have OSDs crashing, note that marking an OSD out and then in will reset its weight to 1.0. You might want to track your weights outside of Ceph, so you can manually reweight if that happens.


Lastly, you'll want to monitor this. If you had noticed when the first OSD hit 86%, things would have been much less painful.


Thanks for the very useful information!

I started by deleting some data, and this helped with the stability of the OSDs. No OSD has crashed in the past 6 hours (before, it was multiple per hour). There are indeed a lot of PGs in backfill_toofull. Now that the OSDs have stopped crashing, the cluster can actually recover and resume the backfilling!




*Craig Lewis*
Senior Systems Engineer
Office +1.714.602.1309
Email [email protected] <mailto:[email protected]>

*Central Desktop. Work together in ways you never thought possible.*
Connect with us Website <http://www.centraldesktop.com/> | Twitter <http://www.twitter.com/centraldesktop> | Facebook <http://www.facebook.com/CentralDesktop> | LinkedIn <http://www.linkedin.com/groups?gid=147417> | Blog <http://cdblog.centraldesktop.com/>

On 4/24/14 02:19, Kenneth Waegeman wrote:

----- Message from Craig Lewis <[email protected]> ---------
  Date: Fri, 18 Apr 2014 14:59:25 -0700
  From: Craig Lewis <[email protected]>
Subject: Re: [ceph-users] OSD distribution unequally
    To: [email protected]


When you increase the number of PGs, don't jump straight to the max value; step up to it. You'll want to end up around 2048, so do 400 -> 512, wait for it to finish, -> 1024, wait, -> 2048.
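A sketch of that stepping, just generating the commands to run at each stage (the pool name is hypothetical, and you'd wait for the cluster to settle between steps rather than piping these straight into a shell):

```shell
pool=data      # hypothetical pool name
target=2048
pg=512
while [ "$pg" -le "$target" ]; do
    # Raise pg_num first, then pgp_num; wait for rebalancing before the next step.
    echo "ceph osd pool set $pool pg_num $pg"
    echo "ceph osd pool set $pool pgp_num $pg"
    pg=$((pg * 2))
done
```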

Thanks, I changed it to 512 and also did the reweight-by-utilisation.
While doing this, some OSDs crashed a few times. I thought it was maybe because they were almost full. When this finished, some PGs were still in the active+remapped state. I read in another mail on the list to try ceph osd crush tunables optimal, so that is running now. But after a few hours, 17 out of 42 OSDs had crashed again. (I don't know if the crashing is connected with the reweight and the PGs stuck in active+remapped.)
From the log file:

2014-04-24 03:46:57.442110 7f4968a11700 -1 osd/PG.cc: In function 'PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0u>::my_context)' thread 7f4968a11700 time 2014-04-24 03:46:57.366010
osd/PG.cc: 5298: FAILED assert(0 == "we got a bad state machine event")

ceph version 0.79-209-g924064f (924064f83b7fb5d4f0961ee712d410ed1855cba0)
1: (PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0x12f) [0x7a99ff]
2: (boost::statechart::detail::inner_constructor<boost::mpl::l_item<mpl_::long_<1l>, PG::RecoveryState::Crashed, boost::mpl::l_end>, boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator> >::construct(boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>* const&, boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>&)+0x26) [0x7ed146]
3: (boost::statechart::simple_state<PG::RecoveryState::Started, PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Start, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xfa) [0x7f6faa]
4: (boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, PG::RecoveryState::RepNotRecovering, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x161) [0x802461]
5: (boost::statechart::simple_state<PG::RecoveryState::RepRecovering, PG::RecoveryState::ReplicaActive, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x140) [0x7f5860]
6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x4b) [0x7f7e6b]
7: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x32f) [0x7b335f]
8: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x330) [0x65e190]
9: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x16) [0x69bed6]
10: (ThreadPool::worker(ThreadPool::WorkThread*)+0x551) [0x9d1e11]
11: (ThreadPool::WorkThread::entry()+0x10) [0x9d4e50]
12: (()+0x79d1) [0x7f49840d79d1]
13: (clone()+0x6d) [0x7f4982df8b6d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.


Many thanks!

Kenneth





Also remember that you don't need a lot of PGs if you don't have much data in the pools. My .rgw.buckets pool has 2k PGs, but the RGW metadata pools only have a couple of MB and 32 PGs each.



On 4/18/14 05:04, Tyler Brekke wrote:
That is rather low, increasing the pg count should help with the data distribution.

Documentation recommends starting with (100 * (num of OSDs)) / (replicas), rounded up to the nearest power of two.

https://ceph.com/docs/master/rados/operations/placement-groups/
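Worked through for this cluster (42 OSDs, replication 3), a sketch of that sizing rule:

```shell
osds=42
replicas=3
raw=$(( (100 * osds) / replicas ))    # (100 * 42) / 3 = 1400

# Round up to the next power of two.
pgs=1
while [ "$pgs" -lt "$raw" ]; do
    pgs=$((pgs * 2))
done
echo "$pgs"    # 2048
```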



On Fri, Apr 18, 2014 at 4:54 AM, Kenneth Waegeman <[email protected]> wrote:


  ----- Message from Tyler Brekke <[email protected]> ---------
     Date: Fri, 18 Apr 2014 04:37:26 -0700
     From: Tyler Brekke <[email protected]>
  Subject: Re: [ceph-users] OSD distribution unequally
       To: Dan Van Der Ster <[email protected]>
       Cc: Kenneth Waegeman <[email protected]>, ceph-users <[email protected]>



      How many placement groups do you have in your pool containing
      the data, and
      what is the replication level of that pool?


  400 pgs per pool, replication factor is 3



      Looks like you have too few placement groups which is causing
      the data
      distribution to be off.

      -Tyler


      On Fri, Apr 18, 2014 at 4:12 AM, Dan Van Der Ster
      <[email protected]> wrote:


           ceph osd reweight-by-utilization

          Is that still in 0.79?

          I'd start with reweight-by-utilization 200 and then adjust
          that number
          down until you get to 120 or so.

          Cheers, Dan
          On Apr 18, 2014 12:49 PM, Kenneth Waegeman
          <[email protected]>
          wrote:
            Hi,

          Some osds of our cluster filled up:
                health HEALTH_ERR 1 full osd(s); 4 near full osd(s)
                monmap e1: 3 mons at
          {ceph001=10.141.8.180:6789/0,ceph002=10.141.8.181:6789/0,ceph003=10.141.8.182:6789/0},
          election epoch 96, quorum 0,1,2
          ceph001,ceph002,ceph003
                mdsmap e93: 1/1/1 up
          {0=ceph001.cubone.os=up:active}, 1 up:standby
                osdmap e1974: 42 osds: 42 up, 42 in
                       flags full
                 pgmap v286626: 1200 pgs, 3 pools, 31096 GB data,
          26259 kobjects
                       94270 GB used, 40874 GB / 131 TB avail
                              1 active+clean+scrubbing+deep
                           1199 active+clean

          I know it is never really uniform, but the differences
          between the OSDs seem very big: one OSD has 96% usage
          while another only has 48%, which is about 1.8 TB of
          difference:
          /dev/sdc        3.7T  1.9T  1.8T  51% /var/lib/ceph/osd/sdc
          /dev/sdd        3.7T  2.5T  1.2T  68% /var/lib/ceph/osd/sdd
          /dev/sde        3.7T  2.3T  1.5T  61% /var/lib/ceph/osd/sde
          /dev/sdf        3.7T  2.7T  975G  74% /var/lib/ceph/osd/sdf
          /dev/sdg        3.7T  3.2T  491G  87% /var/lib/ceph/osd/sdg
          /dev/sdh        3.7T  2.0T  1.8T  53% /var/lib/ceph/osd/sdh
          /dev/sdi        3.7T  2.3T  1.4T  63% /var/lib/ceph/osd/sdi
          /dev/sdj        3.7T  3.4T  303G  92% /var/lib/ceph/osd/sdj
          /dev/sdk        3.7T  2.8T  915G  76% /var/lib/ceph/osd/sdk
          /dev/sdl        3.7T  1.8T  2.0T  48% /var/lib/ceph/osd/sdl
          /dev/sdm        3.7T  2.8T  917G  76% /var/lib/ceph/osd/sdm
          /dev/sdn        3.7T  3.5T  186G  96% /var/lib/ceph/osd/sdn

          We are running 0.79 (well, precisely a patched version of
          it with an MDS fix from another thread :-) )
          I remember hearing something about hashpspool having an
          effect on it, but I read this was already enabled by
          default on the latest versions.
          osd_pool_default_flag_hashpspool indeed has the value
          true, but I don't know how to check this for a specific
          pool.

          Is this behaviour normal? Or what can be wrong?

          Thanks!

          Kind regards,
          Kenneth Waegeman

          _______________________________________________
          ceph-users mailing list
          [email protected] <mailto:[email protected]>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





  ----- End message from Tyler Brekke <[email protected]> -----

  --
  Kind regards,
  Kenneth Waegeman







----- End message from Craig Lewis <[email protected]> -----



----- End message from Craig Lewis <[email protected]> -----

--

Kind regards,
Kenneth Waegeman

