----- Message from Craig Lewis <[email protected]> ---------
Date: Thu, 24 Apr 2014 11:20:08 -0700
From: Craig Lewis <[email protected]>
Subject: Re: [ceph-users] OSD distribution unequally -- osd crashes
To: Kenneth Waegeman <[email protected]>
Cc: [email protected]
Your OSDs shouldn't be crashing during a remap. Although... you
might try to get that one OSD below 96% first; it could have an
effect. If OSDs continue to crash after that, I'd start a new
thread about the crashes.
If you have the option to delete some data and reload it later,
I'd do that now. If you can't, or you can't delete enough, read up
on mon_osd_full_ratio, mon_osd_nearfull_ratio,
osd_failsafe_nearfull_ratio, and osd_backfill_full_ratio. You can
temporarily change these values, but *be very careful*. Make sure
you understand the docs before doing it. Ceph is protecting you.
You'll cause yourself a lot more pain if you let the disks get to
100% full.
You can change those params on the fly with:
ceph tell osd.* injectargs '--mon_osd_nearfull_ratio 0.85'
Get the current values with:
ceph --admin-daemon /var/run/ceph/ceph-osd.<ID>.asok config get
mon_osd_nearfull_ratio
injectargs changes won't survive a restart. If you need them to last
through a reboot or daemon restart, you'll want to add them to
ceph.conf too. I didn't; I just reran the injectargs whenever I
manually restarted something.
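For example, something like this in ceph.conf (just a sketch; the
section placement is up to you, and the 0.85 and 0.95 shown here
happen to be the defaults, so set what fits your cluster):

```ini
[global]
    ; nearfull warns, full blocks writes; backfill full stops backfill
    mon osd nearfull ratio = .85
    mon osd full ratio = .95
    osd backfill full ratio = .85
```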
Once you get IO working again, I'd knock the weights down on those
specific OSDs before proceeding with more pg_num and pgp_num changes.
reweight-by-utilization is a wrapper around:
ceph osd reweight <OSD> <WEIGHT>
You can get the current weights from
ceph osd dump | grep ^osd
Assuming all your OSDs are weighted 1.0, I'd drop the weight on all
OSDs > 85% full to 0.95, and see what that looks like. You might
need a couple passes, because some of the migrated PGs will be
migrated to other nearfull OSDs. I dropped weights by 0.025 every
pass, and I ended up with one OSD weighted 0.875.
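To sketch what one such pass can look like (the id/percent input
lines here are made-up sample data; generate them from your own df
output and mount-point-to-OSD mapping, and note the commands are
only echoed, not run):

```shell
#!/bin/sh
# Emit "ceph osd reweight" commands for every OSD above 85% full.
# Input lines are "<osd-id> <percent-used>"; this sample data is illustrative.
printf '9 92\n13 96\n2 51\n' |
while read id pct; do
    if [ "$pct" -gt 85 ]; then
        echo "ceph osd reweight $id 0.95"
    fi
done
```

Review the echoed commands before actually running them.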
While it's remapping, some PGs will get into the backfill_toofull
state. Assuming your reweights are enough, they will clear, but it
does slow things down. If you get to the point where all of the
backfilling PGs are backfill_toofull, make another pass on the OSD
reweights.
Once they're all below 85%, you can start doing the pg_num and
pgp_num expansions. You may need to revisit the OSD weights during
this process, and you'll probably want to reset all weights to 1.0
when you're done. Since you have OSDs crashing, you'll want to note
that marking an OSD out then in will reset the weight to 1.0. You
might want to track your weights outside of Ceph, so you can
manually reweight if that happens.
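For example, a sketch that turns osd dump output into replayable
reweight commands (the sample line below stands in for real
"ceph osd dump | grep ^osd" output, whose exact fields can vary by
version, so check the field positions against your own dump):

```shell
#!/bin/sh
# Turn "ceph osd dump | grep ^osd" output into reweight commands you
# can save to a file and replay later. The sample line is illustrative;
# on a live cluster, pipe the real dump in instead.
sample='osd.5 up in weight 0.875 up_from 100 up_thru 120'
echo "$sample" | awk '$4 == "weight" {print "ceph osd reweight", substr($1, 5), $5}'
```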
Lastly, you'll want to monitor this. If you had noticed when the
first OSD hit 86%, things would have been much less painful.
Thanks for the very useful information!
I started with deleting some data, and this helped with the
stability of the OSDs. No OSD has crashed in the past 6 hours
(before, it was multiple times an hour).
There are indeed a lot of PGs in backfill_toofull. Now that the
OSDs have stopped crashing, the cluster can actually recover and
resume the backfilling!
*Craig Lewis*
Senior Systems Engineer
Office +1.714.602.1309
Email [email protected] <mailto:[email protected]>
*Central Desktop. Work together in ways you never thought possible.*
Connect with us Website <http://www.centraldesktop.com/> | Twitter
<http://www.twitter.com/centraldesktop> | Facebook
<http://www.facebook.com/CentralDesktop> | LinkedIn
<http://www.linkedin.com/groups?gid=147417> | Blog
<http://cdblog.centraldesktop.com/>
On 4/24/14 02:19 , Kenneth Waegeman wrote:
----- Message from Craig Lewis <[email protected]> ---------
Date: Fri, 18 Apr 2014 14:59:25 -0700
From: Craig Lewis <[email protected]>
Subject: Re: [ceph-users] OSD distribution unequally
To: [email protected]
When you increase the number of PGs, don't just go to the max
value. Step into it.
You'll want to end up around 2048, so do 400 -> 512, wait for it
to finish, -> 1024, wait, -> 2048.
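In command form, that stepping looks something like this (the pool
name "data" is a placeholder, and the commands are echoed so nothing
runs by accident; wait for the cluster to settle between steps
rather than scripting them back to back):

```shell
#!/bin/sh
# Step pg_num/pgp_num up in powers of two rather than jumping
# straight to the target. Echoed only; drop the "echo" to run them.
for n in 512 1024 2048; do
    echo "ceph osd pool set data pg_num $n"
    echo "ceph osd pool set data pgp_num $n"
done
```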
Thanks, I changed it to 512, and also did the
reweight-by-utilization. While doing this, some osds crashed a few
times; I thought it was maybe because they were almost full.
When this was finished, some pgs were still in the active+remapped
state. I read in another mail from the list to try ceph osd crush
tunables optimal, so that is running now. But after a few hours,
again 17 out of 42 osds had crashed. (I don't know whether the
crashing is connected with the reweight and the pgs stuck in
active+remapped.)
Out of log file:
2014-04-24 03:46:57.442110 7f4968a11700 -1 osd/PG.cc: In function 'PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0u>::my_context)' thread 7f4968a11700 time 2014-04-24 03:46:57.366010
osd/PG.cc: 5298: FAILED assert(0 == "we got a bad state machine event")
ceph version 0.79-209-g924064f (924064f83b7fb5d4f0961ee712d410ed1855cba0)
1: (PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0x12f) [0x7a99ff]
2: (boost::statechart::detail::inner_constructor<boost::mpl::l_item<mpl_::long_<1l>, PG::RecoveryState::Crashed, boost::mpl::l_end>, boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator> >::construct(boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>* const&, boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>&)+0x26) [0x7ed146]
3: (boost::statechart::simple_state<PG::RecoveryState::Started, PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Start, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xfa) [0x7f6faa]
4: (boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, PG::RecoveryState::RepNotRecovering, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x161) [0x802461]
5: (boost::statechart::simple_state<PG::RecoveryState::RepRecovering, PG::RecoveryState::ReplicaActive, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x140) [0x7f5860]
6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x4b) [0x7f7e6b]
7: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x32f) [0x7b335f]
8: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x330) [0x65e190]
9: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x16) [0x69bed6]
10: (ThreadPool::worker(ThreadPool::WorkThread*)+0x551) [0x9d1e11]
11: (ThreadPool::WorkThread::entry()+0x10) [0x9d4e50]
12: (()+0x79d1) [0x7f49840d79d1]
13: (clone()+0x6d) [0x7f4982df8b6d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
Many thanks!
Kenneth
Also remember that you don't need a lot of PGs if you don't have
much data in the pools. My .rgw.buckets pool has 2k PGs, but the
RGW metadata pools only have a couple MB and 32 PGs each.
On 4/18/14 05:04 , Tyler Brekke wrote:
That is rather low; increasing the pg count should help with the
data distribution.
The documentation recommends starting with (100 * num_osds) /
replicas, rounded up to the nearest power of two.
https://ceph.com/docs/master/rados/operations/placement-groups/
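For the 42 OSDs and 3 replicas in this thread, a quick sketch of
that arithmetic:

```shell
#!/bin/sh
# (100 * 42 OSDs) / 3 replicas = 1400, and the next power of two
# above 1400 is 2048.
osds=42
replicas=3
target=$(( (100 * osds) / replicas ))
pgs=1
while [ "$pgs" -lt "$target" ]; do
    pgs=$(( pgs * 2 ))
done
echo "$target -> $pgs"
```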
On Fri, Apr 18, 2014 at 4:54 AM, Kenneth Waegeman
<[email protected]> wrote:
----- Message from Tyler Brekke <[email protected]> ---------
Date: Fri, 18 Apr 2014 04:37:26 -0700
From: Tyler Brekke <[email protected]>
Subject: Re: [ceph-users] OSD distribution unequally
To: Dan Van Der Ster <[email protected]>
Cc: Kenneth Waegeman <[email protected]>, ceph-users
<[email protected]>
How many placement groups do you have in your pool containing
the data, and
what is the replication level of that pool?
400 pgs per pool, replication factor is 3
Looks like you have too few placement groups which is causing
the data
distribution to be off.
-Tyler
On Fri, Apr 18, 2014 at 4:12 AM, Dan Van Der Ster
<[email protected]> wrote:
ceph osd reweight-by-utilization
Is that still in 0.79?
I'd start with reweight-by-utilization 200 and then adjust
that number
down until you get to 120 or so.
Cheers, Dan
On Apr 18, 2014 12:49 PM, Kenneth Waegeman
<[email protected]>
wrote:
Hi,
Some osds of our cluster filled up:
health HEALTH_ERR 1 full osd(s); 4 near full osd(s)
monmap e1: 3 mons at
{ceph001=
10.141.8.180:6789/0,ceph002=10.141.8.181:6789/0,ceph003=10.141.8.182:6789/0},
election epoch 96, quorum 0,1,2
ceph001,ceph002,ceph003
mdsmap e93: 1/1/1 up
{0=ceph001.cubone.os=up:active}, 1 up:standby
osdmap e1974: 42 osds: 42 up, 42 in
flags full
pgmap v286626: 1200 pgs, 3 pools, 31096 GB data,
26259 kobjects
94270 GB used, 40874 GB / 131 TB avail
1 active+clean+scrubbing+deep
1199 active+clean
I know it is never really uniform, but the differences between the
OSDs seem very big: one OSD is 96% full while another is only at
48%, which is about a 1.8 TB difference:
/dev/sdc 3.7T 1.9T 1.8T 51% /var/lib/ceph/osd/sdc
/dev/sdd 3.7T 2.5T 1.2T 68% /var/lib/ceph/osd/sdd
/dev/sde 3.7T 2.3T 1.5T 61% /var/lib/ceph/osd/sde
/dev/sdf 3.7T 2.7T 975G 74% /var/lib/ceph/osd/sdf
/dev/sdg 3.7T 3.2T 491G 87% /var/lib/ceph/osd/sdg
/dev/sdh 3.7T 2.0T 1.8T 53% /var/lib/ceph/osd/sdh
/dev/sdi 3.7T 2.3T 1.4T 63% /var/lib/ceph/osd/sdi
/dev/sdj 3.7T 3.4T 303G 92% /var/lib/ceph/osd/sdj
/dev/sdk 3.7T 2.8T 915G 76% /var/lib/ceph/osd/sdk
/dev/sdl 3.7T 1.8T 2.0T 48% /var/lib/ceph/osd/sdl
/dev/sdm 3.7T 2.8T 917G 76% /var/lib/ceph/osd/sdm
/dev/sdn 3.7T 3.5T 186G 96% /var/lib/ceph/osd/sdn
We are running 0.79 (well, precisely a patched version of it, with
an MDS fix from another thread :-) )
I remember hearing something about the hashpspool flag having an
effect on this, but I read it is already enabled by default in the
latest versions. osd_pool_default_flag_hashpspool indeed has the
value true, but I don't know how to check this for a specific pool.
Is this behaviour normal? Or what can be wrong?
Thanks!
Kind regards,
Kenneth Waegeman
_______________________________________________
ceph-users mailing list
[email protected] <mailto:[email protected]>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
----- End message from Tyler Brekke <[email protected]> -----
-- Kind regards,
Kenneth Waegeman
----- End message from Craig Lewis <[email protected]> -----
----- End message from Craig Lewis <[email protected]> -----
--
Kind regards,
Kenneth Waegeman