On 07/02/2012, at 06:34, Derek Gaston wrote:

> On Mon, Feb 6, 2012 at 10:27 PM, Jed Brown <jedbrown at mcs.anl.gov> wrote:
> > Are _all_ the processes making it here?
>
> Sigh. I knew someone was going to ask that ;-)
>
> I'll have to write a short script to grab the stack trace from every one of
> the 10,000 processes to see where they are and try to find any anomalies.
> Anyone have a script (or pieces of one) to do this that they wouldn't mind
> sharing?

Try with PADB: http://padb.pittman.org.uk/

Jose

> I did spot check quite a few and they were all in the same spot.
>
> Now here comes the weirdness: I left one of these processes attached in GDB
> for quite a while (10+ minutes) after the whole job had been hung for over an
> hour. When I noticed that I had left it attached I detached GDB and.... the
> job started right up! That is: it moved on past this problem! How is that
> for some weirdness. It might have just been coincidence... or maybe me
> stalling that process for a bit by attaching GDB nudged some communication in
> the right direction... I don't know.
>
> I know that's not terribly scientific. I'll have to wait until the next job
> hangs before I can do more inspection, but when (not if) that happens I'll
> post back with more info.
>
> Derek
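
For anyone who would rather roll their own than use PADB, here is a rough sketch of the "grab every stack trace and look for anomalies" idea. It is not PADB and not a tested tool. It assumes gdb is installed on each node, that you already know the PIDs of the local ranks, and that you are allowed to attach to them. The function names (gdb_command, stack_signature, group_by_signature, collect_local) are made up for illustration. Bear in mind that attaching gdb briefly stops the process, so the act of sampling can itself perturb a hung job.

```python
import re
import subprocess
from collections import defaultdict

# Matches a gdb backtrace frame such as
#   "#1  0x00007f1b in MPI_Waitall () from /usr/lib/libmpi.so"
#   "#2  main () at app.c:10"
# and captures the function name.
FRAME_RE = re.compile(r"^#\d+\s+(?:0x[0-9a-fA-F]+\s+in\s+)?(\S+)")


def gdb_command(pid):
    """Build a gdb invocation that attaches, prints one backtrace, and exits."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "bt"]


def stack_signature(gdb_output):
    """Reduce 'bt' output to a tuple of function names, innermost frame first."""
    names = []
    for line in gdb_output.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            names.append(match.group(1))
    return tuple(names)


def group_by_signature(outputs):
    """Map each distinct stack signature to the sorted PIDs that show it."""
    groups = defaultdict(list)
    for pid, text in outputs.items():
        groups[stack_signature(text)].append(pid)
    return {sig: sorted(pids) for sig, pids in groups.items()}


def collect_local(pids, timeout=60):
    """Run gdb once per local PID and return {pid: raw backtrace text}."""
    outputs = {}
    for pid in pids:
        result = subprocess.run(
            gdb_command(pid), capture_output=True, text=True, timeout=timeout
        )
        outputs[pid] = result.stdout
    return outputs
```

Running collect_local on each node and merging the dictionaries through group_by_signature should leave one large group for the common hang location and a few small groups for the outliers, which are the processes worth a closer look. How the per-node results get gathered (ssh, a shared file system, the batch system) is left open here.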
