On Dec 4, 2008, at 11:28 AM, David Boyce wrote:
On Thu, Dec 4, 2008 at 11:05 AM, Steve Waltner
<[EMAIL PROTECTED]> wrote:
After two months, I'm finally looking into this issue again. Gotta get it
working by the end of the year since migrating builds to Linux (more
specifically the faster x86 hardware) is one of my business objectives.
Somewhat off topic: Solaris is now FOSS and runs on the same X86
hardware as Linux. Thus there may be good reasons to convert to Linux
but access to faster X86 hardware is not a sufficient one.
I presume you know this and have additional reasons for the switch but
wanted to point it out for the record/archives.
You are correct that going to Solaris x86 would be the better solution
to get the performance gains of the x86 hardware and not deal with the
compatibility issues between Linux and Solaris that I'm seeing.
Unfortunately the toolset that we are using to build (VxWorks from
WindRiver) is only available on Solaris SPARC, Linux x86, and Windows.
Obviously, going to Windows would be a monumental undertaking with all
the unix based scripts that are used during the build, so that wasn't
considered. Going to Linux seemed like the easiest way to get the
speed boost, but is proving a little bit of a problem. I had asked
WindRiver about a Solaris x86 release of their software in the past.
Maybe it's time to ping them again about this. It would have been
better to ping them 6 weeks ago before we sent them a PO for licenses
for the next four years though. "Port your software, and get the
cash..." :-)
I do remember the developer that did most of the work on the makefiles
making the comment about /bin/sh on Solaris being junk and switching to
/bin/ksh.
That reasoning made sense on Solaris but may have a problem now, given
that you're moving to Linux, because /bin/ksh on Linux is *also* junk.
[snip] Fortunately Solaris has been bundling
bash for quite a long time, so perhaps the most robust and portable
arrangement for you would be to settle on SHELL=/bin/bash.
I'll investigate using bash (as well as CentOS and Ubuntu as mentioned
by Galen) to see if it behaves any differently.
The main question that remains would be: Is there a way to debug and follow
the token check-in/check-out process that is used internally in GNU make to
try and see what's going on here? I can work on trying to track down what's
going wrong, but without a way to get visibility into the process, I'd just
be making random changes to the makefiles, which isn't going to be very
productive.
Sorry, can't help directly with your main problem since I haven't
worked much with make -j. Since you're building your own make anyway
it shouldn't be too hard to insert some debugging printfs. Or if you
want to be really aggressive you could build a Solaris 10 machine and
install Linux in a "zone" (semi-virtualization concept), then use
dtrace to track what's happening with the job server. Possibly even
strace would help on native Linux.
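
For anyone adding printfs or reading strace output, it helps to know which two descriptor numbers belong to the jobserver. Below is a minimal sketch in C, assuming make advertises them in the MAKEFLAGS environment variable as "--jobserver-fds=R,W" (the spelling used by the 3.8x releases) or "--jobserver-auth=R,W" (the later spelling). parse_jobserver_fds is a hypothetical helper name, not part of GNU make.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: extract the jobserver pipe descriptors from a
 * MAKEFLAGS string. Returns 1 and fills *rfd / *wfd when a numeric
 * "R,W" pair is found, 0 otherwise. */
int parse_jobserver_fds(const char *makeflags, int *rfd, int *wfd)
{
    /* Assumed spellings: older and newer GNU make releases. */
    static const char *const keys[] = {
        "--jobserver-fds=", "--jobserver-auth="
    };

    if (makeflags == NULL)
        return 0;

    for (size_t i = 0; i < sizeof keys / sizeof keys[0]; i++) {
        const char *p = strstr(makeflags, keys[i]);
        if (p != NULL &&
            sscanf(p + strlen(keys[i]), "%d,%d", rfd, wfd) == 2)
            return 1;
    }
    return 0;
}
```

Called with getenv("MAKEFLAGS") from a small program run as a recipe line, this would give the pair of numbers to look for in a trace. It only reports what make advertised; it says nothing about whether the descriptors are still open.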
I don't remember if this was mentioned upthread but presumably you've
read http://make.paulandlesley.org/jobserver.html for background? If
not, probably a good idea.
Hmm... as I think about it, the whole jobserver technique depends on
downstream processes to leave those file descriptors open. If anybody
messes with the FD_CLOEXEC flag or closes them explicitly, you might
see the behavior described. I've seen programs that do something like
    for (i = 3; i < maxfds; i++) close(i);
before an exec, just for the heck of it. I've already mentioned that
pdksh is crap; I wonder if it's doing something like that? Wait, no,
you said you took /bin/ksh out and it still broke ... anyway, I'd try
strace or similar to see if the jobserver pipe's file descriptors are
being closed. Note that this is all based on a memory of the jobserver
document; I have not read it closely, lately.
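
The "is the descriptor still open, and will it survive an exec?" question above can be checked directly with fcntl(F_GETFD). This is a rough sketch rather than anything taken from GNU make; fd_state and report_fds are hypothetical names, and the range of descriptors to inspect is left to the caller.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: classify one descriptor.
 * Returns -1 if it is not open, 1 if it is open with FD_CLOEXEC set
 * (it will vanish at the next exec), 0 if it is open and inheritable. */
int fd_state(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1)
        return -1;              /* typically EBADF: not an open descriptor */
    return (flags & FD_CLOEXEC) ? 1 : 0;
}

/* Print the state of descriptors lo..hi (inclusive) to stderr. */
void report_fds(int lo, int hi)
{
    for (int fd = lo; fd <= hi; fd++) {
        int st = fd_state(fd);
        fprintf(stderr, "fd %d: %s\n", fd,
                st < 0 ? "closed"
                       : st ? "open, close-on-exec"
                            : "open, inherited on exec");
    }
}
```

Something like report_fds(3, 10) at the start of a wrapper that the makefile runs in place of the suspect shell or tool would show whether the jobserver pipe made it that far down the process tree.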
I had read through the jobserver web page two years ago when we
switched our builds from using "-j --max-load=4" to "-j 4" at the same
time we moved the builds from running on the servers that everyone
uses for their interactive jobs to a cluster of dedicated build
servers. We did have several issues in the makefiles originally that
needed to be fixed with regard to how make called itself recursively to
run the build.
I'll do some testing with strace and possibly re-compiling GNU make
with some printfs in there to see if that provides any insight. Your
comment about something (possibly ksh) closing file handles may be
exactly what's going on here. I let a "gmake -j 100" job run to
completion on the Linux server. It too eventually degraded to a
single-threaded build, but it took a lot longer than the "-j 32" builds I
would normally run. This build exited with the following warning:

    gmake: INTERNAL: Exiting with 1 jobserver tokens available; should be 100!
So, something is definitely interfering with the jobserver when the
build is run on Linux and consuming tokens that should only be used by
GNU make.
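
To make the "consuming tokens" failure concrete, here is a toy model of the token accounting in C. It is only a sketch of the idea in the jobserver document (one byte in a pipe per token, read to acquire, write back to release), not GNU make's code. All function names are hypothetical and the '+' token byte is an arbitrary choice for the illustration.

```c
#include <fcntl.h>
#include <unistd.h>

/* Put n one-byte tokens into the write end of a pipe. */
int preload_tokens(int wfd, int n)
{
    for (int i = 0; i < n; i++)
        if (write(wfd, "+", 1) != 1)
            return -1;
    return 0;
}

/* Acquire a token: take one byte out of the pipe (blocks if empty). */
int acquire_token(int rfd, char *tok)
{
    return read(rfd, tok, 1) == 1;
}

/* Release a token: put the byte back for someone else to use. */
int release_token(int wfd, char tok)
{
    return write(wfd, &tok, 1) == 1;
}

/* Count the tokens left by draining the pipe without blocking,
 * roughly what a final consistency check would have to do. */
int drain_tokens(int rfd)
{
    int flags = fcntl(rfd, F_GETFL);
    if (flags == -1 || fcntl(rfd, F_SETFL, flags | O_NONBLOCK) == -1)
        return -1;

    char c;
    int n = 0;
    while (read(rfd, &c, 1) == 1)
        n++;
    return n;
}
```

In this model, any process that reads a byte and exits without writing it back leaves the final count short by one, and each lost token removes one slot of parallelism for the rest of the build. That would be consistent with a build that slowly degrades toward a single job.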
Thank you everyone for the detailed responses. I will do some digging and
let you know what I find out.
Steve