Hello Jacob, Gregory,

did you manage to start up those OSDs in the end? I came across a very
similar incident [1] (though with no flags preventing the OSDs from
being marked up in the cluster, and no hardware problems reported) and
I wonder whether you found out what the culprit was in your case.

[1] http://permalink.gmane.org/gmane.comp.file-systems.ceph.user/30432

Best regards,
Kostis

On 17 April 2015 at 02:04, Gregory Farnum <[email protected]> wrote:
> The monitor looks like it's not generating a new OSDMap including the
> booting OSDs. I could say with more certainty what's going on with the
> monitor log file, but I'm betting you've got one of the noin or noup
> family of flags set. I *think* these will be output in "ceph -w" or in
> "ceph osd dump", although I can't say for certain in Firefly.
> -Greg
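[For anyone else searching the archives: a quick way to confirm whether
any of the noup/noin family of flags Greg mentions is set. This is a
sketch that parses a `flags` line of the form `ceph osd dump` prints;
the sample line fed in at the bottom is hypothetical, on a live cluster
you would pipe in `ceph osd dump | grep '^flags'` instead.]

```shell
# Check a `ceph osd dump`-style flags line for the no{up,in,out,down}
# flags. Simulated input below; on a live cluster, feed in the real
# output of: ceph osd dump | grep '^flags'
check_flags() {
  line="$1"
  # Strip the leading "flags " word, then look for each flag name
  # in the comma-separated list that remains.
  for f in noup noin noout nodown; do
    case ",${line#flags }," in
      *",$f,"*) echo "$f is set" ;;
    esac
  done
}

check_flags "flags noup,noin"   # hypothetical example line
```

A set flag can then be cleared with e.g. `ceph osd unset noup`.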
>
> On Fri, Apr 10, 2015 at 1:57 AM, Jacob Reid <[email protected]> 
> wrote:
>> On Fri, Apr 10, 2015 at 09:55:20AM +0100, Jacob Reid wrote:
>>> On Thu, Apr 09, 2015 at 05:21:47PM +0100, Jacob Reid wrote:
>>> > On Thu, Apr 09, 2015 at 08:46:07AM -0700, Gregory Farnum wrote:
>>> > > On Thu, Apr 9, 2015 at 8:14 AM, Jacob Reid 
>>> > > <[email protected]> wrote:
>>> > > > On Thu, Apr 09, 2015 at 06:43:45AM -0700, Gregory Farnum wrote:
>>> > > >> You can turn up debugging ("debug osd = 10" and "debug filestore = 
>>> > > >> 10"
>>> > > >> are probably enough, or maybe 20 each) and see what comes out to get
>>> > > >> more information about why the threads are stuck.
>>> > > >>
>>> > > >> But just from the log my answer is the same as before, and now I 
>>> > > >> don't
>>> > > >> trust that controller (or maybe its disks), regardless of what it's
>>> > > >> admitting to. ;)
>>> > > >> -Greg
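[In case it's useful to anyone replaying this thread: the debug levels
Greg suggests go in the `[osd]` section of ceph.conf, or can be injected
at runtime without a restart.]

```ini
; ceph.conf fragment for the debug levels discussed above.
; To apply at runtime instead:
;   ceph tell osd.* injectargs '--debug-osd 20 --debug-filestore 20'
[osd]
debug osd = 20
debug filestore = 20
```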
>>> > > >>
>>> > > >
>>> > > > Ran with osd and filestore debug both at 20; still nothing jumping 
>>> > > > out at me. Logfile attached as it got huge fairly quickly, but mostly 
>>> > > > seems to be the same extra lines. I tried running some test I/O on 
>>> > > > the drives in question to try and provoke some kind of problem, but 
>>> > > > they seem fine now...
>>> > >
>>> > > Okay, this is strange. Something very wonky is happening with your
>>> > > scheduler: it looks like these threads are all idle, but their
>>> > > wakeups are firing an appreciable amount of time after they're
>>> > > supposed to. For instance:
>>> > > 2015-04-09 15:56:55.953116 7f70a7963700 20
>>> > > filestore(/var/lib/ceph/osd/osd.15) sync_entry woke after 5.416704
>>> > > 2015-04-09 15:56:55.953153 7f70a7963700 20
>>> > > filestore(/var/lib/ceph/osd/osd.15) sync_entry waiting for
>>> > > max_interval 5.000000
>>> > >
>>> > > This is the thread that syncs your backing store, and it always sets
>>> > > itself to get woken up at 5-second intervals — but here it took >5.4
>>> > > seconds, and later on in your log it takes more than 6 seconds.
>>> > > It looks like all the threads which are getting timed out are also
>>> > > idle, but are taking so much longer to wake up than they're set for
>>> > > that they get a timeout warning.
>>> > >
>>> > > There might be some bugs in here where we're expecting wakeups to be
>>> > > more precise than they can be, but these sorts of misses are
>>> > > definitely not normal. Is this server overloaded on the CPU? Have you
>>> > > done something to make the scheduler or wakeups wonky?
>>> > > -Greg
>>> >
>>> > CPU load is minimal - the host does nothing but run OSDs and has 8 cores
>>> > that are all sitting idle with a load average of 0.1. I haven't done
>>> > anything to scheduling. That was with the debug logging on, if that could
>>> > be the cause of any delays. A scheduler issue seems possible: running
>>> > `time sleep 5` a few times returns anything from 5.002 to 7.1(!) seconds,
>>> > mostly in the 5.5-6.0 region, whereas it manages fairly consistently
>>> > <5.2 on the other servers in the cluster and <5.02 on my desktop. I have
>>> > disabled CPU power saving mode, the only thing I could think of that
>>> > might be having an effect here, and running the same test again gives
>>> > more sane results... we'll see whether this is reflected in the OSD logs,
>>> > I guess. If this is the cause, it's probably something a future version
>>> > might want to detect and warn about specifically. I will keep you updated
>>> > on their behaviour...
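[Jacob's `time sleep 5` check can be scripted so the overshoot is
reported directly. A sketch, assuming GNU date's `%N` nanosecond format;
the interval is shortened to 1s just to keep the run quick. Single-digit
milliseconds of overshoot is normal; hundreds of ms or more points at
the kind of scheduler/power-management problem described above.]

```shell
# Time a short sleep and print how far past the requested interval
# the wakeup actually landed, in milliseconds.
overshoot() {
  start=$(date +%s%N)          # nanoseconds since epoch (GNU date)
  sleep 1
  end=$(date +%s%N)
  # Overshoot beyond the requested 1s, converted from ns to ms.
  echo $(( (end - start - 1000000000) / 1000000 ))
}

for i in 1 2 3; do
  echo "overshoot: $(overshoot) ms"
done
```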
>>> > _______________________________________________
>>> > ceph-users mailing list
>>> > [email protected]
>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>> Overnight, nothing changed - I am no longer seeing the timeouts in the
>>> logs, but all the OSDs in question are still happily sitting at booting
>>> and showing as down in the tree. Debug 20 logfile attached again.
>> ...and here actually *is* the logfile, which I managed to forget... must be 
>> Friday, I guess.
>>
>>