Philipp,

I have had clock-sync issues on machines before that I could usually
alleviate by tweaking the kernel config. Changing CONFIG_HZ from 1000 to
300 can help. If you ever reboot the machines, making sure your init
system writes the current software clock to the hardware clock on shutdown
(if you use OpenRC, /etc/conf.d/hwclock should have 'clock_systohc="YES"')
can help with that situation.
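
For reference, a sketch of the OpenRC side of that (standard Gentoo paths;
adjust if your setup differs):

```shell
# One-time: copy the current system (software) clock to the hardware clock.
hwclock --systohc

# Persistent: have OpenRC do the same automatically.
# In /etc/conf.d/hwclock:
#   clock_systohc="YES"   # save system time to the hardware clock on shutdown
#   clock_hctosys="YES"   # set system time from the hardware clock on boot

# Ensure the hwclock service runs at boot:
rc-update add hwclock boot
```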

Some more hardware details might be helpful. On very overloaded systems
I've seen the software clock drift a lot; you might just be trying to do
too much with the number of cores you have. Also, cheap or
badly-configured hardware can cause spurious interrupts, so keeping an eye
on the context-switches-per-second and interrupts-per-second values over
time might offer a clue about the clock drift as well.
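
A quick way to watch those numbers (a minimal sketch, assuming a Linux
/proc/stat; the 'ctxt' and 'intr' counters are cumulative since boot, so we
take a one-second delta):

```shell
# Print context switches and interrupts per second from /proc/stat.
read_stat() { awk -v key="$1" '$1 == key { print $2 }' /proc/stat; }

c1=$(read_stat ctxt); i1=$(read_stat intr)
sleep 1
c2=$(read_stat ctxt); i2=$(read_stat intr)

echo "context switches/s: $((c2 - c1))"
echo "interrupts/s:       $((i2 - i1))"
```

Running that in a loop (or just 'vmstat 1', whose 'cs' and 'in' columns
report the same rates) over a few hours should show whether spikes line up
with the drift.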

Glad you found my notes helpful - I didn't write the majority of that
howto, though, just the notes at the top :)

-Aaron


On Tue, Jan 28, 2014 at 2:32 PM, Philipp von Strobl-Albeg <
[email protected]> wrote:

> Hi all,
>
> thank you very much for your input.
>
> I sync the clock on all hosts via 'ntpdate pool.ntp.org' and sync this with
> the hwclock on every host.
> For some strange reason, one host is out of sync again after a few minutes.
> I can't say where this comes from...
> Perhaps this is a Gentoo-specific thing or a "cheap PC" problem.
>
> What is the worst thing I have to expect if I don't fix this?
>
>
> Anyway, I managed to fix the stuck-pgs issue.
> I redesigned the CRUSH map (mainly moving the hosts into a rack and that
> rack under default) and now the health is OK!
>
>
> Thank you again for your kind help and great work - Inktank ;-)
>
> PS: Aaron - your Howto was really helpful
>
>
> Best
> Philipp
>
>
> On 20.01.2014 05:59, Sage Weil wrote:
>
>  On Sun, 19 Jan 2014, Sherry Shahbazi wrote:
>>
>>> Hi Philipp,
>>>
>>> Installing "ntp" on each server might solve the clock skew problem.
>>>
>> At the very least a one-time 'ntpdate time.apple.com' should make that
>> issue go away for the time being.
>>
>> s
>>
>>    Best Regards
>>> Sherry
>>>
>>>
>>> On Sunday, January 19, 2014 6:34 AM, Philipp Strobl <
>>> [email protected]>
>>> wrote:
>>> HI Aaron,
>>>
>>> sorry for taking so long...
>>>
>>> After I added the OSDs and buckets to the crushmap, I get:
>>>
>>> ceph osd tree
>>> # id    weight    type name    up/down    reweight
>>> -3    1    host dp2
>>> 1    1        osd.1    up    1
>>> -2    1    host dp1
>>> 0    1        osd.0    up    1
>>> -1    0    root default
>>>
>>>
>>> Both osds are up and in
>>>
>>> ceph osd stat
>>> e25: 2 osds: 2 up, 2 in
>>>
>>> ceph health detail says:
>>>
>>> HEALTH_WARN 292 pgs stuck inactive; 292 pgs stuck unclean; clock skew
>>> detected on mon.vmsys-dp2
>>> pg 3.f is stuck inactive since forever, current state creating, last
>>> acting
>>> []
>>> pg 0.c is stuck inactive since forever, current state creating, last
>>> acting
>>> []
>>> pg 1.d is stuck inactive since forever, current state creating, last
>>> acting
>>> []
>>> pg 2.e is stuck inactive since forever, current state creating, last
>>> acting
>>> []
>>> pg 3.8 is stuck inactive since forever, current state creating, last
>>> acting
>>> []
>>> pg 0.b is stuck inactive since forever, current state creating, last
>>> acting
>>> []
>>> pg 1.a is stuck inactive since forever, current state creating, last
>>> acting
>>> []
>>> ...
>>> pg 2.c is stuck unclean since forever, current state creating, last
>>> acting
>>> []
>>> pg 1.f is stuck unclean since forever, current state creating, last
>>> acting
>>> []
>>> pg 0.e is stuck unclean since forever, current state creating, last
>>> acting
>>> []
>>> pg 3.d is stuck unclean since forever, current state creating, last
>>> acting
>>> []
>>> pg 2.f is stuck unclean since forever, current state creating, last
>>> acting
>>> []
>>> pg 1.c is stuck unclean since forever, current state creating, last
>>> acting
>>> []
>>> pg 0.d is stuck unclean since forever, current state creating, last
>>> acting
>>> []
>>> pg 3.e is stuck unclean since forever, current state creating, last
>>> acting
>>> []
>>> mon.vmsys-dp2 addr 10.0.0.22:6789/0 clock skew 16.4914s > max 0.05s
>>> (latency
>>> 0.00666228s)
>>>
>>> All pgs have the same status.
>>>
>>> Is the clock skew an important factor?
>>>
>>> I compiled ceph like this - eix ceph:
>>> ...
>>> Installed versions:  0.67{tbz2}(00:54:50 01/08/14)(fuse -debug -gtk
>>> -libatomic -radosgw -static-libs -tcmalloc)
>>>   cluster name is vmsys, servers are dp1 and dp2
>>> config:
>>>
>>> [global]
>>>      auth cluster required = none
>>>      auth service required = none
>>>      auth client required = none
>>>      auth supported = none
>>>      fsid = 265d12ac-e99d-47b9-9651-05cb2b4387a6
>>>
>>> [mon.vmsys-dp1]
>>>      host = dp1
>>>      mon addr = INTERNAL-IP1:6789
>>>      mon data = /var/lib/ceph/mon/ceph-vmsys-dp1
>>>
>>> [mon.vmsys-dp2]
>>>      host = dp2
>>>      mon addr = INTERNAL-IP2:6789
>>>      mon data = /var/lib/ceph/mon/ceph-vmsys-dp2
>>>
>>> [osd]
>>> [osd.0]
>>>      host = dp1
>>>      devs = /dev/sdb1
>>>      osd_mkfs_type = xfs
>>>      osd data = /var/lib/ceph/osd/ceph-0
>>>
>>> [osd.1]
>>>      host = dp2
>>>      devs = /dev/sdb1
>>>      osd_mkfs_type = xfs
>>>      osd data = /var/lib/ceph/osd/ceph-1
>>>
>>> [mds.vmsys-dp1]
>>>          host = dp1
>>>
>>> [mds.vmsys-dp2]
>>>          host = dp2
>>>
>>>
>>>
>>> Hope this is helpful - I really don't know at the moment what is wrong.
>>>
>>> Perhaps I'll try the manual-deploy howto from Inktank - or do you have
>>> an idea?
>>>
>>>
>>>
>>> Best Philipp
>>>
>>> http://www.pilarkto.net
>>> On 10.01.2014 20:50, Aaron Ten Clay wrote:
>>>        Hi Philipp,
>>>
>>> It sounds like perhaps you don't have any OSDs that are both "up" and
>>> "in" the cluster. Can you provide the output of "ceph health detail"
>>> and "ceph osd tree" for us?
>>>
>>> As for the "howto" you mentioned, I added some notes to the top but
>>> never really updated the body of the document... I'm not entirely sure
>>> it's straightforward or up to date any longer :) I'd be happy to make
>>> changes as needed but I haven't manually deployed a cluster in several
>>> months, and Inktank now has a manual deployment guide for Ceph at
>>> http://ceph.com/docs/master/install/manual-deployment/
>>>
>>> -Aaron
>>>
>>>
>>>
>>> On Fri, Jan 10, 2014 at 6:57 AM, Philipp Strobl <[email protected]>
>>> wrote:
>>>        Hi,
>>>
>>> After managing to deploy Ceph manually on Gentoo (the ceph-disk tools
>>> are under /usr/usr/sbin...), the daemons come up properly,
>>> but "ceph health" shows a warning: all pgs are stuck unclean.
>>> This is strange behavior for a clean new installation, I guess.
>>>
>>> So the question is, am I doing something wrong, or can I reset the
>>> PGs to get the cluster running?
>>>
>>> Also, the rbd client and mount.ceph hang with no answer.
>>>
>>> I used this howto: https://github.com/aarontc/ansible-playbooks/blob/master/roles/ceph.notes-on-deployment.rst
>>>
>>> Or our German translation/expansion:
>>> http://wiki.open-laboratory.de/Intern:IT:HowTo:Ceph
>>>
>>> With 'auth supported ... = none'
>>>
>>>
>>> Best regards
>>> And thank you in advance
>>>
>>> Philipp Strobl
>>>
>>>
>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> [email protected]
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>>
>>>
>>> --
>>> Aaron Ten Clay
>>> http://www.aarontc.com/
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>


-- 
Aaron Ten Clay
http://www.aarontc.com/