Hi,
What determines the limit on the number of jobs that can be submitted
through a single <job> description? We have a 10-node cluster with 2
quad-core processors per node, and when the number of jobs is greater than
160 there seems to be an increasing probability of getting the following
error:
/bin/sh:
/home/grid-bestgrid/.globus/90bbca80-2ba4-11dd-95fc-8fae74568b88/scheduler_pbs_cmd_script:
No such file or directory
This error does not happen all the time, but the probability increases
as the number of jobs increases, and I haven't been able to trigger this
error with number of processes < number of cores * 2.
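For reference, here is the arithmetic behind that threshold as a rough Python sketch. The factor of 2 is only what I observe on this cluster, not a documented limit, and the variable names are just for illustration.

```python
# Cluster figures as described above.
nodes = 10
processors_per_node = 2
cores_per_processor = 4

total_cores = nodes * processors_per_node * cores_per_processor

# Observed (not documented) threshold: errors start to appear
# once there are more than about 2 jobs per core.
observed_threshold = 2 * total_cores

print(total_cores)         # 80
print(observed_threshold)  # 160
```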
Also, the <count> tag seems to have no effect on the number of jobs executed, other
than that if it is equal to one, all jobs execute on a single node.
Example job description:
<job>
  <factoryEndpoint
      xmlns:gram="http://www.globus.org/namespaces/2004/10/gram/job"
      xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/03/addressing">
    <wsa:Address>
      https://ng2test.auckland.ac.nz:8443/wsrf/services/ManagedJobFactoryService
    </wsa:Address>
    <wsa:ReferenceProperties>
      <gram:ResourceID>PBS</gram:ResourceID>
    </wsa:ReferenceProperties>
  </factoryEndpoint>
  <executable>/bin/hostname</executable>
  <count>200</count>
  <queue>[EMAIL PROTECTED]</queue>
  <jobType>multiple</jobType>
  <extensions>
    <resourceAllocationGroup>
      <hostCount>10</hostCount>
      <cpusPerHost>8</cpusPerHost>
      <processCount>162</processCount>
    </resourceAllocationGroup>
  </extensions>
</job>
For MPI jobs the limit seems to be 20 * number of cores; for a larger
number of processes I see errors like this:
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
[compute-1.local:23438] [0,0,0] ORTE_ERROR_LOG: Timeout in file
base/pls_base_orted_cmds.c at line 275
[compute-1.local:23438] [0,0,0] ORTE_ERROR_LOG: Timeout in file
pls_rsh_module.c at line 1164
[compute-1.local:23438] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at
line 90
mpiexec noticed that job rank 8 with PID 22257 on node compute-10 exited on
signal 15 (Terminated).
[compute-1.local:23438] [0,0,0] ORTE_ERROR_LOG: Timeout in file
base/pls_base_orted_cmds.c at line 188
[compute-1.local:23438] [0,0,0] ORTE_ERROR_LOG: Timeout in file
pls_rsh_module.c at line 1196
--------------------------------------------------------------------------
Again, this does not happen all the time.
Example job description:
<job>
  <factoryEndpoint
      xmlns:gram="http://www.globus.org/namespaces/2004/10/gram/job"
      xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/03/addressing">
    <wsa:Address>
      https://ng2test.auckland.ac.nz:9443/wsrf/services/ManagedJobFactoryService
    </wsa:Address>
    <wsa:ReferenceProperties>
      <gram:ResourceID>PBS</gram:ResourceID>
    </wsa:ReferenceProperties>
  </factoryEndpoint>
  <executable>test</executable>
  <directory>/home/grid-bestgrid/MPI/</directory>
  <queue>[EMAIL PROTECTED]</queue>
  <jobType>mpi</jobType>
  <extensions>
    <resourceAllocationGroup>
      <hostCount>5</hostCount>
      <cpusPerHost>8</cpusPerHost>
      <processCount>900</processCount>
    </resourceAllocationGroup>
  </extensions>
</job>
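To put a number on how oversubscribed this MPI request is, here is a small Python sketch that reads the resourceAllocationGroup values and computes processes per allocated core. It assumes that "number of cores" means hostCount * cpusPerHost for the job, which is my reading rather than anything documented; the helper name processes_per_core is made up for illustration, and only the <extensions> part of the description is copied in.

```python
import xml.etree.ElementTree as ET

# Only the <extensions> part of the MPI job description above.
EXTENSIONS = """
<extensions>
  <resourceAllocationGroup>
    <hostCount>5</hostCount>
    <cpusPerHost>8</cpusPerHost>
    <processCount>900</processCount>
  </resourceAllocationGroup>
</extensions>
"""


def processes_per_core(extensions_xml):
    """Return processCount / (hostCount * cpusPerHost) for one allocation group."""
    group = ET.fromstring(extensions_xml).find("resourceAllocationGroup")
    hosts = int(group.findtext("hostCount"))
    cpus = int(group.findtext("cpusPerHost"))
    processes = int(group.findtext("processCount"))
    return processes / (hosts * cpus)


ratio = processes_per_core(EXTENSIONS)
print(ratio)  # 22.5
```

With 5 hosts and 8 CPUs per host the job has 40 cores, so 900 processes is 22.5 per core, just above the factor of 20 where the timeouts start to show up.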
Can anyone explain what is going on here?
Regards,
Yuriy