Hi,

What determines the limit on the number of jobs that can be submitted
through a single <job> description? We have a 10-node cluster with two
quad-core processors per node, and when the number of jobs is greater
than 160 there seems to be an increasing probability of getting the
following error:


/bin/sh:
/home/grid-bestgrid/.globus/90bbca80-2ba4-11dd-95fc-8fae74568b88/scheduler_pbs_cmd_script:
No such file or directory


This error does not happen every time, but the probability increases
as the number of jobs increases, and I haven't been able to trigger
this error when the number of processes is less than the number of
cores * 2.
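For context, the threshold we keep hitting matches a simple back-of-envelope calculation (the variable names below are our own, not anything from Globus or PBS):

```shell
# Cluster capacity as described above; names are ours, not Globus/PBS terms.
NODES=10
CORES_PER_NODE=8                          # 2 quad-core processors per node
TOTAL_CORES=$((NODES * CORES_PER_NODE))   # 80
THRESHOLD=$((TOTAL_CORES * 2))            # 160 -- where failures start to appear
echo "total cores: $TOTAL_CORES, observed threshold: $THRESHOLD"
```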

Also, the <count> tag seems to have no effect on the number of jobs
executed, other than when it is equal to one, in which case all jobs
execute on a single node.

Example job description:

<job>
    <factoryEndpoint
            xmlns:gram="http://www.globus.org/namespaces/2004/10/gram/job"
            xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/03/addressing">
        <wsa:Address>
            https://ng2test.auckland.ac.nz:8443/wsrf/services/ManagedJobFactoryService
        </wsa:Address>
        <wsa:ReferenceProperties>
            <gram:ResourceID>PBS</gram:ResourceID>
        </wsa:ReferenceProperties>
    </factoryEndpoint>
    <executable>/bin/hostname</executable>
    <count>200</count>
    <queue>[EMAIL PROTECTED]</queue>
    <jobType>multiple</jobType>
    <extensions>
        <resourceAllocationGroup>
                <hostCount>10</hostCount>
                <cpusPerHost>8</cpusPerHost>
                <processCount>162</processCount>
        </resourceAllocationGroup>
    </extensions>
</job>

For MPI jobs the limit seems to be 20 * the number of cores; for a
larger number of processes I see errors like this:

--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
[compute-1.local:23438] [0,0,0] ORTE_ERROR_LOG: Timeout in file 
base/pls_base_orted_cmds.c at line 275
[compute-1.local:23438] [0,0,0] ORTE_ERROR_LOG: Timeout in file 
pls_rsh_module.c at line 1164
[compute-1.local:23438] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at 
line 90
mpiexec noticed that job rank 8 with PID 22257 on node compute-10 exited on 
signal 15 (Terminated). 
[compute-1.local:23438] [0,0,0] ORTE_ERROR_LOG: Timeout in file 
base/pls_base_orted_cmds.c at line 188
[compute-1.local:23438] [0,0,0] ORTE_ERROR_LOG: Timeout in file 
pls_rsh_module.c at line 1196
--------------------------------------------------------------------------

Again, this does not happen all the time.
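The same back-of-envelope arithmetic fits the MPI case (again, the variable names are our own): with the allocation in the job description below, the apparent limit of 20 * cores falls well short of the requested processCount:

```shell
# MPI allocation from the job description below; our own naming, not Globus/PBS.
HOSTS=5
CORES_PER_HOST=8
CORES=$((HOSTS * CORES_PER_HOST))   # 40
MPI_LIMIT=$((CORES * 20))           # 800 -- the apparent limit of 20 * cores
PROCESS_COUNT=900
if [ "$PROCESS_COUNT" -gt "$MPI_LIMIT" ]; then
    echo "requested $PROCESS_COUNT > apparent limit $MPI_LIMIT"
fi
```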


Example job description:

<job>
    <factoryEndpoint
            xmlns:gram="http://www.globus.org/namespaces/2004/10/gram/job"
            xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/03/addressing">
        <wsa:Address>
            https://ng2test.auckland.ac.nz:9443/wsrf/services/ManagedJobFactoryService
        </wsa:Address>
        <wsa:ReferenceProperties>
            <gram:ResourceID>PBS</gram:ResourceID>
        </wsa:ReferenceProperties>
    </factoryEndpoint>

    <executable>test</executable>
    <directory>/home/grid-bestgrid/MPI/</directory>
    <queue>[EMAIL PROTECTED]</queue>
    <jobType>mpi</jobType>

    <extensions>
        <resourceAllocationGroup>
                <hostCount>5</hostCount>
                <cpusPerHost>8</cpusPerHost>
                <processCount>900</processCount>
        </resourceAllocationGroup>
    </extensions>
</job>



Can anyone explain what is going on here?


Regards,
Yuriy
