Hello Jacob, Gregory,

Did you manage to start up those OSDs in the end? I came across a very similar incident [1] (though with no flags preventing the OSDs from coming up in the cluster, and no hardware problems reported), and I wonder whether you ever found the culprit in your case.
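To rule out the flag theory quickly, the "ceph osd dump" check Greg suggests further down can be scripted; a minimal sketch, run here against a made-up dump header rather than real cluster output (the epoch and flag values below are invented for illustration):

```shell
# Sketch: look for noup/noin in the flags line of `ceph osd dump`.
# The sample text is invented; on a live cluster you would pipe the
# real command output in instead of the here-string.
sample='epoch 812
flags noup,pauserd'
echo "$sample" | awk '/^flags/ {
    if ($2 ~ /noup|noin/)
        print "OSD-blocking flag set: " $2
    else
        print "no noup/noin flags"
}'
```

If one of these flags really is set, `ceph osd unset noup` (or `noin`) clears it.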
[1] http://permalink.gmane.org/gmane.comp.file-systems.ceph.user/30432

Best regards,
Kostis

On 17 April 2015 at 02:04, Gregory Farnum <[email protected]> wrote:
> The monitor looks like it's not generating a new OSDMap including the
> booting OSDs. I could say with more certainty what's going on with the
> monitor log file, but I'm betting you've got one of the noin or noup
> family of flags set. I *think* these will be output in "ceph -w" or in
> "ceph osd dump", although I can't say for certain in Firefly.
> -Greg
>
> On Fri, Apr 10, 2015 at 1:57 AM, Jacob Reid <[email protected]> wrote:
>> On Fri, Apr 10, 2015 at 09:55:20AM +0100, Jacob Reid wrote:
>>> On Thu, Apr 09, 2015 at 05:21:47PM +0100, Jacob Reid wrote:
>>>> On Thu, Apr 09, 2015 at 08:46:07AM -0700, Gregory Farnum wrote:
>>>>> On Thu, Apr 9, 2015 at 8:14 AM, Jacob Reid <[email protected]> wrote:
>>>>>> On Thu, Apr 09, 2015 at 06:43:45AM -0700, Gregory Farnum wrote:
>>>>>>> You can turn up debugging ("debug osd = 10" and "debug filestore =
>>>>>>> 10" are probably enough, or maybe 20 each) and see what comes out
>>>>>>> to get more information about why the threads are stuck.
>>>>>>>
>>>>>>> But just from the log my answer is the same as before, and now I
>>>>>>> don't trust that controller (or maybe its disks), regardless of
>>>>>>> what it's admitting to. ;)
>>>>>>> -Greg
>>>>>>
>>>>>> Ran with osd and filestore debug both at 20; still nothing jumping
>>>>>> out at me. Logfile attached as it got huge fairly quickly, but
>>>>>> mostly seems to be the same extra lines. I tried running some test
>>>>>> I/O on the drives in question to try and provoke some kind of
>>>>>> problem, but they seem fine now...
>>>>>
>>>>> Okay, this is strange.
>>>>> Something very wonky is happening with your scheduler: it looks like
>>>>> these threads are all idle, and they're scheduling wakeups that happen
>>>>> an appreciable amount of time after they're supposed to. For instance:
>>>>>
>>>>> 2015-04-09 15:56:55.953116 7f70a7963700 20
>>>>> filestore(/var/lib/ceph/osd/osd.15) sync_entry woke after 5.416704
>>>>> 2015-04-09 15:56:55.953153 7f70a7963700 20
>>>>> filestore(/var/lib/ceph/osd/osd.15) sync_entry waiting for
>>>>> max_interval 5.000000
>>>>>
>>>>> This is the thread that syncs your backing store, and it always sets
>>>>> itself to get woken up at 5-second intervals, but here it took more
>>>>> than 5.4 seconds, and later on in your log it takes more than 6
>>>>> seconds. It looks like all the threads which are getting timed out
>>>>> are also idle, but are taking so much longer to wake up than they're
>>>>> set for that they get a timeout warning.
>>>>>
>>>>> There might be some bugs in here where we're expecting wakeups to be
>>>>> more precise than they can be, but these sorts of misses are
>>>>> definitely not normal. Is this server overloaded on the CPU? Have you
>>>>> done something to make the scheduler or wakeups wonky?
>>>>> -Greg
>>>>
>>>> CPU load is minimal - the host does nothing but run OSDs and has 8
>>>> cores that are all sitting idle with a load average of 0.1. I haven't
>>>> done anything to scheduling. That was with the debug logging on, if
>>>> that could be the cause of any delays. A scheduler issue seems
>>>> possible - I haven't done anything to it, but `time sleep 5` run a
>>>> few times returns anything from 5.002 to 7.1(!) seconds, mostly in
>>>> the 5.5-6.0 region, where it managed fairly consistently under 5.2
>>>> on the other servers in the cluster and under 5.02 on my desktop.
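The ad-hoc `time sleep 5` test above can be made a little more direct by printing the overshoot itself; a sketch (the 1-second interval and three iterations are arbitrary choices to keep it quick, and the sub-second timestamps assume GNU date):

```shell
# Print how far past its deadline each sleep wakes up. On a healthy
# host the drift should be on the order of milliseconds; hundreds of
# milliseconds or more suggests the kind of timer trouble seen here.
for i in 1 2 3; do
    start=$(date +%s.%N)
    sleep 1
    end=$(date +%s.%N)
    awk -v s="$start" -v e="$end" \
        'BEGIN { printf "wakeup drift: %+.3f s\n", (e - s) - 1.0 }'
done
```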
>>>> I have disabled the CPU power saving mode, as the only thing I could
>>>> think of that might be having an effect on this, and running the same
>>>> test again gives more sane results... we'll see whether this is
>>>> reflected in the OSD logs or not, I guess. If this is the cause, it's
>>>> probably something the next version might want to detect and warn
>>>> about specifically. I will keep you updated on their behaviour now...
>>>
>>> Overnight, nothing changed - I am no longer seeing the timeouts in the
>>> logs, but all the OSDs in question are still happily sitting at
>>> "booting" and showing as down in the tree. Debug 20 logfile attached
>>> again.
>>
>> ...and here actually *is* the logfile, which I managed to forget...
>> must be Friday, I guess.

_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
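A footnote on the fix above: the power-saving state Jacob disabled can usually be inspected through the Linux cpufreq interface; a minimal sketch, assuming sysfs at its usual mount point (VMs and containers often lack the interface entirely, hence the fallback message):

```shell
# Print the frequency-scaling governor for each CPU. A "powersave" or
# "ondemand" governor can let clocks idle low enough to delay timer
# wakeups; "performance" avoids that at the cost of power.
found=0
for g in /sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor; do
    [ -r "$g" ] || continue
    found=1
    echo "$g: $(cat "$g")"
done
[ "$found" -eq 1 ] || echo "no cpufreq interface exposed here"
```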
