Mike and Paul
Thanks for the help. I am using -j 8 to run on an eight-core node.
ulimit -a shows basically no limits.
I may try the kernel route and I will go back and look at the logs. I
may try to repeat one of the failures to get fresh log entries. I also
enabled core dumps in hopes that one of the C programs might barf up some info.
Thanks again
Mike
On Jul 23, 2009, at 2:51 PM, Mike Shal wrote:
On 7/23/09, Michael Muratet <[email protected]> wrote:
Greetings
I am using a data processing application that is implemented with make.
The application is a set of Python scripts that write out Makefiles, and
the user launches the analysis by typing make -j n target. I suspect the
authors were looking for a cheap way to get parallelization. The make
takes many hours to run in most cases, executing a variety of C programs
and scripts. My problem comes about when make tries to launch a new
process:
make.err:make[1]: vfork: Resource temporarily unavailable
I suspect that the resource it wants is swap space; I can see that it
occasionally fills up, and I am working on fixing that. But failing that,
is there a way to get make to tell me what it lacks?
I don't think there is a way to get make to give you this information,
since make doesn't get any more specific info from the kernel. The
basic gist is the kernel will return with -EAGAIN somewhere along the
way while executing sys_vfork(). This gets stuck into errno by libc (I
think), and so all make sees is a -1 return value from vfork, and
errno = EAGAIN (which corresponds to 'Resource temporarily
unavailable'). Unfortunately, since there are several spots in the kernel
that can return EAGAIN, you can't tell from the error alone which one
was actually triggered.
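To illustrate how little information survives the trip from the kernel to user space, here is a minimal Python sketch (Python only because the application's scripts are Python; make itself is C). The helper name describe_oserror is made up for this example, and the failure is simulated rather than provoked. The point is that a failed fork surfaces as nothing more than an errno value, and the text is just libc's fixed string for that value:

```python
import errno
import os


def describe_oserror(exc):
    """Return everything an OSError carries: the errno value and its text."""
    # os.strerror() asks libc for the fixed message tied to this errno.
    return exc.errno, os.strerror(exc.errno)


if __name__ == "__main__":
    # Simulated failure: build the same kind of error a failed fork would
    # raise in Python, instead of actually exhausting a resource.
    simulated = OSError(errno.EAGAIN, os.strerror(errno.EAGAIN))
    code, message = describe_oserror(simulated)
    print("errno value:", code)
    print("message:", message)
```

Nothing in that pair says which kernel check failed, which is why make cannot be more specific either.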
If you don't mind building your own kernel and adding debug to it,
that might be one way you could figure out what's going on. Depending
on your specific version/arch, you can start by looking at
kernel/fork.c:do_fork(), which calls copy_process(), which has some
-EAGAIN returns in it. Maybe someone has a method of tracing the
existing kernel?
Of course, it's possible it will fail in a different spot every time if
it's just running low on memory. Are you sure the old processes are
properly being waited on? What size '-j' are you running anyway?
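One rough way to check both the per-user process limit and whether finished children are piling up unreaped is sketched below. It assumes a Linux /proc filesystem; the function names parse_state and count_states are invented for this example, and the ownership test via os.stat() on /proc/<pid> is only an approximation of how the kernel accounts processes against the limit. Run it as the same user as make while the build is going:

```python
import os
import resource


def parse_state(stat_line):
    """Return the one-letter state field from a /proc/<pid>/stat line."""
    # The command name sits in parentheses and may itself contain spaces
    # or ')', so split on the last ')' instead of on whitespace.
    return stat_line.rsplit(")", 1)[1].split()[0]


def count_states(proc_root="/proc", uid=None):
    """Count processes owned by uid (default: current user) by state."""
    if uid is None:
        uid = os.getuid()
    counts = {}
    if not os.path.isdir(proc_root):
        return counts  # no procfs here, nothing to report
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "meminfo"
        path = os.path.join(proc_root, entry)
        try:
            if os.stat(path).st_uid != uid:
                continue
            with open(os.path.join(path, "stat"), errors="replace") as fh:
                state = parse_state(fh.read())
        except (OSError, IndexError):
            continue  # the process exited while we were looking at it
        counts[state] = counts.get(state, 0) + 1
    return counts


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    limit = "unlimited" if soft == resource.RLIM_INFINITY else soft
    counts = count_states()
    print("soft process limit:", limit)
    print("my processes by state:", counts)
    print("zombies (state Z):", counts.get("Z", 0))
```

A steadily growing count in state Z would suggest children are not being waited on; a total close to the soft limit would point at the process limit rather than memory.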
-Mike
Michael Muratet, Ph.D.
Senior Scientist
HudsonAlpha Institute for Biotechnology
[email protected]
(256) 327-0473 (p)
(256) 327-0966 (f)
Room 4005
601 Genome Way
Huntsville, Alabama 35806
_______________________________________________
Help-make mailing list
[email protected]
http://lists.gnu.org/mailman/listinfo/help-make