So, that is at least not a syntax for abort_recovery I'm familiar with. To take an example from last time I did this, I first determined which device wasn't completing the recovery, then logged in on the server (an OST in this case) and ran:
# lctl dl|grep obdfilter|grep fouo5-OST0000
  3 UP obdfilter fouo5-OST0000 fouo5-OST0000_UUID 629
# lctl --device 3 abort_recovery

Attached is a script that you can invoke as "lustre_watch_recovery
<servername>"; it will show you the status of recovery on the named
server, updated once per second. I find it useful for keeping track of
how things are working out while doing restarts.

Regards,
--
Peter Bortas, NSC

On Fri, Oct 19, 2018 at 4:42 PM Marion Hakanson <[email protected]> wrote:
>
> Thanks for the feedback. You're both confirming what we've learned so
> far: we had to unmount all the clients (which required rebooting most
> of them), then reboot all the storage servers, to get things unstuck
> until the problem recurred.
>
> I tried abort_recovery on the clients last night, before rebooting the
> MDS, but that did not help. It could well be that I'm not using it
> right:
>
> - Look up the MDT in the "lctl dl" list.
> - Run "lctl abort_recovery $mdt" on all clients.
> - Reboot the MDS.
>
> The MDS still reported recovering all 259 clients at boot time.
>
> BTW, we have a separate MGS from the MDS. Could it be that we need to
> reboot both the MDS and the MGS to clear things?
>
> Thanks and regards,
>
> Marion
>
>
> > On Oct 19, 2018, at 07:28, Peter Bortas <[email protected]> wrote:
> >
> > That should fix it, but I'd like to advocate for using abort_recovery.
> > Compared to unmounting thousands of clients, abort_recovery is a quick
> > operation that takes a few minutes. I wouldn't say it gets used a lot,
> > but I've done it on NSC's live environment six times since 2016,
> > solving the deadlock each time.
> >
> > Regards,
> > --
> > Peter Bortas
> > Swedish National Supercomputer Centre
> >
> >> On Fri, Oct 19, 2018 at 3:04 PM Patrick Farrell <[email protected]> wrote:
> >>
> >> Marion,
> >>
> >> You note the deadlock recurs on server reboot, so you’re really
> >> stuck. This is most likely due to recovery, where operations from
> >> the clients are replayed.
> >>
> >> If you’re fine with letting any pending I/O fail in order to get
> >> the system back up, I would suggest a client-side action: unmount
> >> (-f, and be patient) and/or shut down all of your clients. That
> >> will discard the things the clients are trying to replay (causing
> >> pending I/O to fail). Then shut down your servers and start them up
> >> again. With no clients, there’s (almost) nothing to replay, and you
> >> probably won’t hit the issue on startup. (There’s also the
> >> abort_recovery option covered in the manual, but I personally think
> >> this is easier.)
> >>
> >> There’s no guarantee this avoids your deadlock happening again, but
> >> it’s highly likely it’ll at least get you running.
> >>
> >> If you need to save your pending I/O, you’ll have to install
> >> patched software with a fix for this (it sounds like WC has
> >> identified the bug) and then reboot.
> >>
> >> Good luck!
> >> - Patrick
> >> ________________________________
> >> From: lustre-discuss <[email protected]> on behalf
> >> of Marion Hakanson <[email protected]>
> >> Sent: Friday, October 19, 2018 1:32:10 AM
> >> To: [email protected]
> >> Subject: [lustre-discuss] LU-11465 OSS/MDS deadlock in 2.10.5
> >>
> >> This issue is really kicking our behinds:
> >> https://jira.whamcloud.com/browse/LU-11465
> >>
> >> While we're waiting for the issue to get some attention from Lustre
> >> developers, are there suggestions on how we can recover our cluster
> >> from this kind of deadlocked, stuck-threads-on-the-MDS (or OSS)
> >> situation? Rebooting the storage servers does not clear the
> >> hang-up: upon reboot the MDS quickly ends up with the same number
> >> of D-state threads (around the same number as we have clients). It
> >> seems to me that there is some state stashed away in the filesystem
> >> which restores the deadlock as soon as the MDS comes up.
> >>
> >> Thanks and regards,
> >>
> >> Marion
> >>
> >> _______________________________________________
> >> lustre-discuss mailing list
> >> [email protected]
> >> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
lustre_watch_recovery
Description: Binary data
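[The attachment itself is archived only as binary data. A minimal sketch of a watcher along the lines Peter describes might look like the following; this is a hypothetical reconstruction, not the real script, and it assumes passwordless ssh to the server and that `lctl get_param -n *.*.recovery_status` prints the usual `status:` / `completed_clients:` / `time_remaining:` fields.]

```shell
#!/bin/sh
# lustre_watch_recovery <servername> -- show recovery status once per second.
# Hypothetical sketch; the real attached script may differ.

# Reduce recovery_status output (read from stdin) to a one-line summary
# of the fields that matter while watching a restart.
summarize_recovery() {
    awk '
        /^status:/            { status = $2 }
        /^completed_clients:/ { done = $2 }
        /^time_remaining:/    { left = $2 }
        END { printf "status=%s clients=%s time_remaining=%s\n", status, done, left }
    '
}

# Only start polling when a server name was actually given.
if [ -n "$1" ]; then
    while sleep 1; do
        ssh "$1" 'lctl get_param -n *.*.recovery_status' | summarize_recovery
    done
fi
```

During a restart this would print one summary line per second, e.g. showing `clients=12/259` climbing until recovery completes or stalls.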
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
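[Archive note: the server-side sequence Peter shows at the top of the thread -- find the target's device index in "lctl dl", then abort recovery on that device -- can be wrapped in a short helper. This is a hedged sketch; the function name is ours, "fouo5-OST0000" is just the example target from the thread, and it assumes "lctl dl" output in the format shown there.]

```shell
#!/bin/sh
# abort_target_recovery <target> -- abort Lustre recovery for one target.
# Run on the server hosting the target, e.g.:
#   abort_target_recovery fouo5-OST0000
# Sketch based on the sequence in the thread; not an official tool.

# Pull the device index (first column) for the named target out of
# "lctl dl" output, e.g. "  3 UP obdfilter fouo5-OST0000 ... 629".
device_index() {
    awk -v tgt="$1" '$4 == tgt { print $1; exit }'
}

abort_target_recovery() {
    idx=$(lctl dl | device_index "$1")
    if [ -z "$idx" ]; then
        echo "no local device for target $1" >&2
        return 1
    fi
    lctl --device "$idx" abort_recovery
}

# Only act when invoked with a target name.
if [ -n "$1" ]; then
    abort_target_recovery "$1"
fi
```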
