Hello Mark,

Raised a tracker for the issue: http://tracker.ceph.com/issues/20222

Jake, can you share the restart_OSD_and_log-this.sh script?
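In the meantime, I imagine it is something roughly like the sketch below
(only an illustration, not your actual script: it assumes systemd-managed
OSDs with units named ceph-osd@<id>, and the log path and 1-second check
interval are guesses):

#!/usr/bin/env bash
# Hypothetical sketch of a restart-and-log watchdog; the unit pattern,
# log path and interval are assumptions, not the real script.
LOG=/var/log/restart_osd.log

while sleep 1; do
    # find ceph-osd@<id> units that systemd currently reports as failed
    for unit in $(systemctl list-units --state=failed --plain --no-legend \
                      'ceph-osd@*' | awk '{print $1}'); do
        echo "$(date '+%Y-%m-%d %H:%M:%S') restarting $unit" >> "$LOG"
        systemctl reset-failed "$unit"
        systemctl restart "$unit"
    done
done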

Thanks
Jayaram

On Wed, Jun 7, 2017 at 9:40 PM, Jake Grimmett <j...@mrc-lmb.cam.ac.uk> wrote:

> Hi Mark & List,
>
> Unfortunately, even when using yesterday's master version of Ceph,
> I'm still seeing OSDs go down, with the same error as before:
>
> The OSD logs show lots of entries like this:
>
> (osd38)
> 2017-06-07 16:48:46.070564 7f90b58c3700  1 heartbeat_map is_healthy
> 'tp_osd_tp thread tp_osd_tp' had timed out after 60
>
> (osd3)
> 2017-06-07 17:01:25.391075 7f62de6c3700  1 heartbeat_map is_healthy
> 'tp_osd_tp thread tp_osd_tp' had timed out after 60
> 2017-06-07 17:01:26.276881 7f62dbe86700 -1 osd.3 6165 heartbeat_check:
> no reply from 10.1.0.86:6811 osd.2 since back 2017-06-07 17:00:19.640002
> front 2017-06-07 17:01:21.950160 (cutoff 2017-06-07 17:01:06.276881)
>
>
> [root@ceph4 ceph]# ceph -v
> ceph version 12.0.2-2399-ge38ca14
> (e38ca14914340d65ea8001c7bd6e0ff769f3eb2e) luminous (dev)
>
>
> I'll continue running the cluster with my "restart_OSD_and_log-this.sh"
> workaround...
>
> thanks again for your help,
>
> Jake
>
> On 06/06/17 15:52, Jake Grimmett wrote:
> > Hi Mark,
> >
> > OK, I'll upgrade to the current master and retest...
> >
> > best,
> >
> > Jake
> >
> > On 06/06/17 15:46, Mark Nelson wrote:
> >> Hi Jake,
> >>
> >> I just happened to notice this was on 12.0.3.  Would it be possible to
> >> test this with the current master and see if it is still a problem?
> >>
> >> Mark
> >>
> >> On 06/06/2017 09:10 AM, Mark Nelson wrote:
> >>> Hi Jake,
> >>>
> >>> Thanks much.  At this point I'm guessing this is probably a bug.  Would
> >>> you (or nokiauser) mind creating a bug in the tracker with a short
> >>> description of what's going on, plus the collectl sample showing that
> >>> this is not I/Os backing up on the disk?
> >>>
> >>> If you want to try it, we have a gdb-based wallclock profiler that
> >>> might be interesting to run while the OSD is in the process of timing
> >>> out.  It tries to grab 2000 samples from the osd process, which
> >>> typically takes about 10 minutes or so.  You'll need to either change
> >>> the number of samples to be lower in the python code (maybe like
> >>> 50-100), or change the timeout to be something longer.
> >>>
> >>> You can find the code here:
> >>>
> >>> https://github.com/markhpc/gdbprof
> >>>
> >>> and invoke it like:
> >>>
> >>> sudo gdb -ex 'set pagination off' -ex 'attach 27962' -ex 'source
> >>> ./gdbprof.py' -ex 'profile begin' -ex 'quit'
> >>>
> >>> where 27962 in this case is the PID of the ceph-osd process.  You'll
> >>> need gdb with the python bindings and the ceph debug symbols for it to
> >>> work.
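> >>>
> >>> (If it helps, one way to find the PID for a given OSD -- assuming
> >>> systemd-managed OSDs, and taking osd.38 as an example -- is
> >>> "systemctl show -p MainPID ceph-osd@38", or just run "pgrep -af
> >>> ceph-osd" and pick out the matching OSD id from the command lines.)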
> >>>
> >>> This might tell us over time if the tp_osd_tp processes are just
> >>> sitting on pg::locks.
> >>>
> >>> Mark
> >>>
> >>> On 06/06/2017 05:34 AM, Jake Grimmett wrote:
> >>>> Hi Mark,
> >>>>
> >>>> Thanks again for looking into this problem.
> >>>>
> >>>> I ran the cluster overnight, with a script checking for dead OSDs
> >>>> every second, and restarting them.
> >>>>
> >>>> 40 OSD failures occurred in 12 hours; some OSDs failed multiple times
> >>>> (there are 50 OSDs in the EC tier).
> >>>>
> >>>> Unfortunately, the output of collectl doesn't appear to show any
> >>>> increase in disk queue depth or service times before the OSDs die.
> >>>>
> >>>> I've put a couple of examples of collectl output for the disks
> >>>> associated with the OSDs here:
> >>>>
> >>>> https://hastebin.com/icuvotemot.scala
> >>>>
> >>>> please let me know if you need more info...
> >>>>
> >>>> best regards,
> >>>>
> >>>> Jake
> >>>>
> >>>>
> >