On Thu, 28 Mar 2013 16:38:55 -0500
Andrew Deason <[email protected]> wrote:

> What I was after is the stack trace of all of the LWPs in the buserver
> process. You cannot get at those easily, since LWP is a threading
> system that is not understood by the debugger (dbx or gdb). That's
> kinda why I was treating the 'core file' option as something where you
> give the core file to a developer. Getting that information by
> providing instructions to you makes this a bit more difficult... but
> is probably doable.

So, while I was waiting for some stuff to compile while trying this, I
realized this might be fixed by
<http://git.openafs.org/?p=openafs.git;a=patch;h=dce2d8206ecd35c96e75cc0662432c2a4f9c3d7a>.
I'm not clear on what exactly the principal is for, but that does fix a
bug that was introduced in the 1.6 series. Since there have not been
many substantial changes to budb in general, and that change impacts the
CreateDump function, that seems like a likely culprit. (To devs: the
original change doesn't make a lot of sense to me; the commit messages
suggest there are different structures in play, but the args and function
parameters are all ktc_principal.)

I'm not sure how that would cause it to hang, but if you want to try a
patch, you can try that one. If you want to look at the core you
captured before, then read on. I would still be interested in seeing a
stack trace, even if the above patch appears to fix the issue.

> is probably doable. I've only done that a couple of times before, and
> it involved hex editing the core file; it may be a bit easier with
> Solaris dbx, but I'll need a little time to look into it.

This actually isn't so bad if you rely on mdb to give you the stack
traces. I've attached a dbx script (lwpstacks.ksh) that can be used to
get some traces. This should probably live in a repo somewhere. Do
people have an opinion on where it should go?

Anyway, you can use it like this. If you compiled with LWP debug turned
on, it's more likely to work (this means running ./configure with
--enable-debug-lwp), but it's not required. Run:

$ /opt/SUNWspro/bin/dbx /path/to/buserver /path/to/core
[...]
(dbx) source lwpstacks.ksh
(dbx) lwpstacks

If you don't have LWP debug, this will fail (probably with something
like "dbx: struct "lwp_pcb" is not defined[...]"). You can try running
this without using debug symbols (we'll guess at where some data is), by
running this instead:

(dbx) lwpstacks nodebug

With the script as-is, the 'nodebug' stuff seems to work with OpenAFS
1.6.2 on Solaris 10 SPARC, but it may need fiddling to work anywhere
else.

If either of those works, you'll see something like:

(dbx) lwpstacks nodebug
!# NOT using debug symbols
!# looking for threads in blocked 
::echo stack pointer for thread 14a530: 1562d8
0x001562d8::stack 0 ! sed 's/^/  /'
::echo
::echo stack pointer for thread 180cf8: 18caa0
0x0018caa0::stack 0 ! sed 's/^/  /'
[...]

To get actual stack traces out of that, pipe the output through mdb:

(dbx) lwpstacks nodebug | mdb /path/to/buserver /path/to/core
stack pointer for thread 14a530: 1562d8
  LWP_WaitProcess+0x38()
  rxi_Sleep+4()
  rx_GetCall+0x320()
  rxi_ServerProc+0x40()
  rx_ServerProc+0x74()
  Create_Process_Part2+0x40()
  0x68388()
  ubik_ServerInitCommon+0x23c()

stack pointer for thread 180cf8: 18caa0
  LWP_WaitProcess+0x38()
[...]

This output is similar enough to mdb ::findstack output that it will
work with David Powell's "munges" script if you have that. But it's also
pretty useful just by itself.
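
If you don't have that script, grouping identical stacks by hand gets
tedious once there are a few dozen LWPs. Here is a rough Python sketch
of the idea, assuming the output looks exactly like the sample above (a
"stack pointer for thread X: Y" header followed by indented frames).
This is not David Powell's script; the function names are made up for
illustration, and dropping the +offset from each frame is my own choice
so that threads parked in the same functions group together.

```python
import re

# Header line printed by the "::echo" commands in the generated mdb input.
HEADER = re.compile(r"^stack pointer for thread ([0-9a-fA-F]+): ([0-9a-fA-F]+)\s*$")
# Trailing "+0x38" or "+4" offset right before the "()" of a frame.
OFFSET = re.compile(r"\+(?:0x)?[0-9a-fA-F]+(?=\(\)$)")


def parse_stacks(text):
    """Return a list of (thread, stack_pointer, frames) tuples."""
    stacks = []
    current = None
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            current = (match.group(1), match.group(2), [])
            stacks.append(current)
        elif current is not None and line.startswith(" ") and line.strip():
            # Frames are indented by the sed command in the mdb pipeline.
            current[2].append(line.strip())
    return stacks


def group_stacks(stacks):
    """Map each distinct stack (offsets removed) to the threads showing it."""
    groups = {}
    for thread, _sp, frames in stacks:
        key = tuple(OFFSET.sub("", frame) for frame in frames)
        groups.setdefault(key, []).append(thread)
    return groups


def format_groups(groups):
    """Render the groups as text, most common stack first."""
    out = []
    for frames, threads in sorted(groups.items(), key=lambda kv: -len(kv[1])):
        out.append("%d thread(s): %s" % (len(threads), " ".join(threads)))
        out.extend("  " + frame for frame in frames)
        out.append("")
    return "\n".join(out)


if __name__ == "__main__":
    # First thread from the sample output above, used as a smoke test.
    sample = """stack pointer for thread 14a530: 1562d8
  LWP_WaitProcess+0x38()
  rxi_Sleep+4()
  rx_GetCall+0x320()
"""
    print(format_groups(group_stacks(parse_stacks(sample))))
```

To use it on a real core, save the mdb output to a file and pass the
file's contents to parse_stacks(). Any stack that only one or two
threads show is usually the interesting one.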

Surprisingly, that doesn't require any manual core editing. I think mdb
is the only debugger I've used that lets you get stack trace information
from an arbitrary context (at least, I haven't seen an easy way for gdb
or dbx to do this), but the way state is stored on Solaris on SPARC
probably helps make that easier.

If you want to provide such stack output from the core you captured, it
may show what's going on.

-- 
Andrew Deason
[email protected]

Attachment: lwpstacks.ksh
Description: Binary data
