It appears that we can likely eliminate 32/64 bit issues, then.
Some more questions:
Is this 20K command job:
- a sequence of trivially parallel commands,
- an MPI job,
- a job "array",
- or is it a complicated DAG?
Can you capture the qsub command(s) associated with this job?
Are you sure that the number of systems and number of streams are
correctly specified?
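
One way to answer the job-type question from a captured command line is to look at which qsub options it carries. The following is a minimal sketch in Python, assuming the application's script ends up calling qsub with the standard -q (queue), -t (array job), and -pe (parallel environment) options. The function name and the sample command line are hypothetical and are not taken from the application's script.

```python
import shlex

def summarize_qsub(cmdline):
    """Pull the queue, array range, and parallel environment out of a qsub command line."""
    args = shlex.split(cmdline)
    summary = {"queue": None, "array": None, "pe": None}
    i = 0
    while i < len(args):
        if args[i] == "-q" and i + 1 < len(args):
            summary["queue"] = args[i + 1]      # -q takes one argument
            i += 2
        elif args[i] == "-t" and i + 1 < len(args):
            summary["array"] = args[i + 1]      # -t takes a task range
            i += 2
        elif args[i] == "-pe" and i + 2 < len(args):
            summary["pe"] = (args[i + 1], args[i + 2])  # -pe takes a name and a slot range
            i += 3
        else:
            i += 1
    return summary

# Hypothetical captured command, for illustration only.
example = "qsub -q verylong.q -t 1-200 -cwd run_batch.sh"
print(summarize_qsub(example))
```

A command with -t is an array job, one with -pe requests a parallel environment (as an MPI job would), and many separate qsub calls with neither suggest a sequence of independent commands. The scan is deliberately naive: it does not stop at the script name, so arguments passed to the job script itself could be misread as qsub options.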
-Keith.
At 1:39 PM -0800 1/11/12, Wil Irwin wrote:
Hi-
It is 64-bit on 64-bit. The exact version is from
'ge-6.2-bin-lx24-amd64.tar.gz' and 'ge-6.2-common.tar.gz'. So I can
rule out that issue.
As for the problems, I can provide more detail, but in brief (sort of):
1. The installation completes w/o incident and I have used all the
suggested defaults. Out of frustration, I've also reinstalled a
couple of dozen times, changing some of the more flexible defaults
one at a time.
2. The "simple" job runs as it should.
3. There are 3 nodes (with the master also serving as an executor).
All are talking to each other in terms of the SGE ports and NFS.
4. My inquiry was intended to be general, about a possible
incompatibility between SGE and SL 6.1. The comments that follow are,
unfortunately, complicated by the fact that I submit jobs through an
analysis application. The script this application uses is a bit
convoluted, but I have studied it pretty well and, if there is a
problem, I don't see it. I have not received any negative feedback
from other users of this application. Unfortunately, it really isn't
possible to submit the job from this application w/o using the
accompanying script. So, of course, there is a bit of a black-box
factor.
5. One particular job is very large (~20K commands). After the
commands are generated and submitted, SGE returns the rather
confusing error message of "Unable to run job: job rejected: You try
to submit a job with more than 75000 tasks. Exiting." 75000 is the
configured limit, but I can readily see the command lines being
generated, and there are exactly 16900 of them. I would say that, in
general, this is the most perplexing problem.
6. #5 is accompanied by "failure" email messages, but not 16900 of
them (I would say many hundreds). I can't explain this behavior
either. It could actually be an email server issue and not related
to SGE, per se.
7. Another example is, or will appear to be, very specific to the
analysis application I am using, as opposed to a general SGE issue.
For this application, there is an explicit user variable to set the
queue, and I have set it to 'verylong.q'. When I submit a much
smaller job (~200 commands) to try to figure out what is going
wrong, 'verylong.q' is ignored and 'short.q' is selected. More
curious, and more SGE-related: the job does run, but it runs the
commands in series and uses only 1 processor on the master node
(each node has 6 x 2 cores).
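
For the task-limit error in item 5, it may help to compare the number of tasks the submission actually requests with the number of command lines generated. The sketch below assumes the job is submitted as an array job with a -t range of the form n[-m[:s]], and that the 75000 in the error message is the array-task limit (max_aj_tasks in the global configuration, which qconf -sconf displays). The function name and the sample ranges are hypothetical.

```python
def array_task_count(spec):
    """Number of tasks implied by a -t spec of the form n[-m[:s]]."""
    range_part, _, step_part = spec.partition(":")
    first_str, _, last_str = range_part.partition("-")
    first = int(first_str)
    last = int(last_str) if last_str else first   # a bare "n" is a single task
    step = int(step_part) if step_part else 1
    if first < 1 or last < first or step < 1:
        raise ValueError(f"invalid task range: {spec!r}")
    return (last - first) // step + 1

LIMIT = 75000  # the limit quoted in the error message

# Hypothetical ranges, for illustration only.
for spec in ("1-16900", "1-169000", "1-100:10"):
    n = array_task_count(spec)
    print(spec, n, "rejected" if n > LIMIT else "ok")
```

If the range the script passes to qsub implies more tasks than the 16900 generated commands, the mismatch is in how the script builds the range and not in SGE's counting.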
That's a flavor of what is causing my sanity to slowly drift away.
Regards,
Wil
On Wed, Jan 11, 2012 at 1:00 PM, Keith Chadwick
<[email protected]> wrote:
Are you trying to run either:
1. A 32 bit version of SGE 6.2 on a 64 bit SL 6.1 system?
or
2. A 64 bit version of SGE 6.2 on a 32 bit SL 6.1 system?
In case #1, you should be able to get SGE to run once you install
the necessary 32 bit compatibility libraries, or (recommended) switch
to a 64 bit version of SGE 6.2.
In case #2, you are going to be out of luck...
-Keith.
At 12:43 PM -0800 1/11/12, Wil Irwin wrote:
Hello-
I am having unparalleled (no pun intended) problems getting SGE 6.2
to run under SL 6.1. I have consulted with others who have quite a
bit of experience using SGE on an earlier version of SL, and we
cannot determine why it won't run.
Before I list the nature of the problems, I thought I would start by
asking if anyone has had a successful experience with SGE 6.2 on SL
6.1.
I'm running kernel: 2.6.32-220.2.1.el6.x86_64 #1 SMP Thu Dec 22
11:15:52 CST 2011 x86_64
Thanks for any help.
-Wil