Hello Mark,

Raised a tracker for the issue -- http://tracker.ceph.com/issues/20222
Jake, can you share the restart_OSD_and_log-this.sh script? (A guessed
sketch of what such a watchdog might look like follows the quoted
thread below.)

Thanks,
Jayaram

On Wed, Jun 7, 2017 at 9:40 PM, Jake Grimmett <j...@mrc-lmb.cam.ac.uk> wrote:
> Hi Mark & List,
>
> Unfortunately, even when using yesterday's master version of ceph,
> I'm still seeing OSDs go down, with the same error as before.
>
> The OSD log shows lots of entries like this:
>
> (osd38)
> 2017-06-07 16:48:46.070564 7f90b58c3700 1 heartbeat_map is_healthy
> 'tp_osd_tp thread tp_osd_tp' had timed out after 60
>
> (osd3)
> 2017-06-07 17:01:25.391075 7f62de6c3700 1 heartbeat_map is_healthy
> 'tp_osd_tp thread tp_osd_tp' had timed out after 60
> 2017-06-07 17:01:26.276881 7f62dbe86700 -1 osd.3 6165 heartbeat_check:
> no reply from 10.1.0.86:6811 osd.2 since back 2017-06-07 17:00:19.640002
> front 2017-06-07 17:01:21.950160 (cutoff 2017-06-07 17:01:06.276881)
>
> [root@ceph4 ceph]# ceph -v
> ceph version 12.0.2-2399-ge38ca14
> (e38ca14914340d65ea8001c7bd6e0ff769f3eb2e) luminous (dev)
>
> I'll continue running the cluster with my "restart_OSD_and_log-this.sh"
> workaround...
>
> thanks again for your help,
>
> Jake
>
> On 06/06/17 15:52, Jake Grimmett wrote:
> > Hi Mark,
> >
> > OK, I'll upgrade to the current master and retest...
> >
> > best,
> >
> > Jake
> >
> > On 06/06/17 15:46, Mark Nelson wrote:
> >> Hi Jake,
> >>
> >> I just happened to notice this was on 12.0.3. Would it be possible to
> >> test this out with current master and see if it is still a problem?
> >>
> >> Mark
> >>
> >> On 06/06/2017 09:10 AM, Mark Nelson wrote:
> >>> Hi Jake,
> >>>
> >>> Thanks much. I'm guessing at this point that this is probably a bug.
> >>> Would you (or nokiauser) mind creating a bug in the tracker with a
> >>> short description of what's going on, and the collectl sample showing
> >>> that this is not IOs backing up on the disk?
> >>>
> >>> If you want to try it, we have a gdb-based wallclock profiler that
> >>> might be interesting to run while it's in the process of timing out.
> >>> It tries to grab 2000 samples from the osd process, which typically
> >>> takes about 10 minutes or so. You'll need to either lower the number
> >>> of samples in the python code (maybe 50-100), or change the timeout
> >>> to something longer.
> >>>
> >>> You can find the code here:
> >>>
> >>> https://github.com/markhpc/gdbprof
> >>>
> >>> and invoke it like:
> >>>
> >>> sudo gdb -ex 'set pagination off' -ex 'attach 27962' -ex 'source
> >>> ./gdbprof.py' -ex 'profile begin' -ex 'quit'
> >>>
> >>> where 27962 in this case is the PID of the ceph-osd process. You'll
> >>> need gdb with the python bindings and the ceph debug symbols for it
> >>> to work.
> >>>
> >>> This might tell us over time whether the tp_osd_tp threads are just
> >>> sitting on pg::locks.
> >>>
> >>> Mark
> >>>
> >>> On 06/06/2017 05:34 AM, Jake Grimmett wrote:
> >>>> Hi Mark,
> >>>>
> >>>> Thanks again for looking into this problem.
> >>>>
> >>>> I ran the cluster overnight, with a script checking for dead OSDs
> >>>> every second and restarting them.
> >>>>
> >>>> 40 OSD failures occurred in 12 hours; some OSDs failed multiple
> >>>> times (there are 50 OSDs in the EC tier).
> >>>>
> >>>> Unfortunately, the output of collectl doesn't appear to show any
> >>>> increase in disk queue depth or service times before the OSDs die.
> >>>> (A matching collectl disk-detail invocation is sketched after the
> >>>> quoted thread.)
> >>>>
> >>>> I've put a couple of examples of collectl output for the disks
> >>>> associated with the OSDs here:
> >>>>
> >>>> https://hastebin.com/icuvotemot.scala
> >>>>
> >>>> please let me know if you need more info...
> >>>>
> >>>> best regards,
> >>>>
> >>>> Jake
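Jake's restart_OSD_and_log-this.sh never made it onto the list, so the
following is only a guessed sketch of what such a watchdog might look
like. It assumes systemd-managed OSDs (ceph-osd@N units) on the node
running the script, and an admin keyring so that "ceph osd dump" works;
the log path is made up:

    #!/bin/bash
    # Hypothetical reconstruction -- not Jake's actual script.
    # Poll the cluster map once per second, restart any OSD marked
    # "down", and log the event. systemctl can only restart OSDs hosted
    # on this node, so a multi-node cluster needs a copy per OSD host.
    LOG=/var/log/restart_OSD_and_log-this.log    # assumed path

    while sleep 1; do
        # "ceph osd dump" prints one line per OSD: "osd.N up|down in|out ..."
        for osd in $(ceph osd dump 2>/dev/null |
                     awk '/^osd\./ && $2 == "down" {print $1}'); do
            echo "$(date '+%F %T') restarting ${osd}" >> "$LOG"
            systemctl restart "ceph-osd@${osd#osd.}"
        done
    done

Note that this only papers over the tp_osd_tp timeouts while the bug is
investigated; it does nothing to address the underlying hang.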
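For anyone trying to reproduce Jake's measurement: collectl's disk
detail subsystem is one way to watch per-device queue depth and service
times. A minimal sketch, assuming the OSD's data disk is sdb (adjust
--dskfilt accordingly; column names can vary slightly between collectl
versions):

    # Per-disk detail once per second, with timestamps. The QLen, Wait
    # and SvcTim columns are the queue depth and service times under
    # discussion; sdb is a placeholder for the disk backing the OSD.
    collectl -sD -oT -i1 --dskfilt sdb

If the disks really do stay idle while tp_osd_tp times out, Mark's
wallclock profiler invocation quoted above is the obvious next step for
seeing where the threads are actually stuck.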