Hi Stuart,

Sorry, I forgot to include the error message from the client:

ssh_exchange_identification: Connection closed by remote host

I suspect there is a limit on simultaneous connections in sshd.
I modified pbs.pm to insert a random delay before making the next
connection, which increased the number of jobs that could be run but
did not eliminate the problem. The following script also

for i in `seq 1 $1`
do
  ssh compute-$(( i % 10 + 1 )) echo "hello world" &
done


produces equivalent errors when run from a compute node with a large
value of $1 (~150).
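For what it's worth, the random-delay workaround I tried in pbs.pm can be
sketched as a standalone script (a rough sketch, assuming bash; the actual
ssh call is commented out so the sketch runs anywhere, and the compute-N
hostnames follow the loop above):

```shell
#!/bin/bash
# Rough sketch of the random-delay workaround tried in pbs.pm:
# jitter each connection so sshd's concurrent-connection limit
# (e.g. MaxStartups) is less likely to trip.
launch_jobs() {
    local n=$1
    for i in $(seq 1 "$n"); do
        local delay=$(( RANDOM % 5 ))   # 0-4 seconds of jitter
        local node=$(( i % 10 + 1 ))    # same round-robin as the loop above
        # sleep "$delay"; ssh "compute-$node" echo "hello world" &
        echo "job $i -> compute-$node (after ${delay}s)"
    done
    # wait    # reap the backgrounded ssh processes
}

launch_jobs 5
```

With the real ssh calls enabled, the sleep spreads the connection attempts
out in time; raising sshd's MaxStartups on the target nodes might be the
more direct fix, if that is indeed the limit being hit.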

> This may not be well documented, but for PBS, when 
> resourceAllocationGroup is used, count is ignored, as you can see 
> in the if-then-else that chooses what sets the PBS nodes 
> directive.

Yeah, and looking at the pbs.pm code also explained why count has to be
bigger than 1. Thank you.
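For anyone else reading, the precedence in that if-then-else can be
sketched as a small shell function (a rough sketch with illustrative
names; the $cluster flag that pbs.pm also checks is omitted):

```shell
#!/bin/bash
# Sketch of pbs.pm's nodes-directive precedence: a nodes string from
# resourceAllocationGroup wins, then hostCount, and only then is count
# used, rounded up by the CPUs per node (what pbs.pm's myceil does).
pbs_nodes_directive() {
    local nodes=$1 host_count=$2 count=$3 cpu_per_node=$4
    if [ -n "$nodes" ]; then
        echo "#PBS -l nodes=$nodes"
    elif [ "$host_count" -ne 0 ]; then
        echo "#PBS -l nodes=$host_count"
    elif [ "$cpu_per_node" -ne 0 ]; then
        echo "#PBS -l nodes=$(( (count + cpu_per_node - 1) / cpu_per_node ))"
    fi
}

# With hostCount=10 set, as in my job description, count=200 is ignored:
pbs_nodes_directive "" 10 200 8   # prints "#PBS -l nodes=10"
```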

Regards,
Yuriy 

On Tue, May 27, 2008 at 12:21:51PM -0500, Stuart Martin wrote:
>
> On May 27, 2008, at May 27, 12:20 AM, Yuriy wrote:
>
>>
>> Hi,
>>
>>
>> What determines the limits on number of jobs that can be submitted
>> through the single <job> description?
>
> There is no limit.  Only the limit of what is available in the cluster/LRM 
> that GRAM is interfacing with.
>
>> We have a 10-node cluster with two
>> quad-core processors per node, and when the number of jobs is greater
>> than 160 there seems to be an increasing probability of the following
>> error:
>>
>>
>> /bin/sh:
>> /home/grid-bestgrid/.globus/90bbca80-2ba4-11dd-95fc-8fae74568b88/scheduler_pbs_cmd_script:
>> No such file or directory
>>
>> This error does not happen all the time, but the probability increases
>> as the number of jobs increases, and I haven't been able to trigger this
>> error with number of processors < number of cores * 2.
>
> I've seen a similar error on TeraGrid's UC/ANL cluster due to NFS scalability 
> issues between the cluster's head node and compute nodes.  What happens is 
> that GRAM creates the scheduler_pbs_cmd_script for the job.  From the main 
> (first) compute node allocated, GRAM will rsh to each of the other compute 
> nodes allocated and run that same command script.  When all compute nodes 
> in the job access that file at the same time, some fail even though the 
> file is there.  Maybe we need to add some reliability in there to retry 
> (since we know that script should be there).  Or maybe there is a better 
> way to handle this situation.  I'll have to think about this some.
>
> One place to look to improve this would be to optimize your NFS 
> configuration/setup.  I am not an expert here and cannot offer much 
> advice, but it would be good to have some helpful hints on this for 
> people who run into it.  So please share any information if you find 
> ways to improve the situation.
>
> At what scale do problems occur with this?  By that I mean, how many PBS 
> processes/nodes are trying to access that file at (nearly) the same time 
> when errors begin to occur?
>
>>
>>
>> Also, the <count> tag seems to have no effect on the number of jobs
>> executed, except that when it is set to one, all jobs execute on a single
>> node.
>
> Here is the 4.0 doc on extension handling:
>       
> http://www.globus.org/toolkit/docs/4.0/execution/wsgram/WS_GRAM_Job_Desc_Extensions.html#r-wsgram-extensions-constructs
>
> This may not be well documented, but for PBS, when 
> resourceAllocationGroup is used, count is ignored, as you can see 
> in the if-then-else that chooses what sets the PBS nodes 
> directive.
>
> From PBS.pm >>>>>>>>
>     if (defined $description->nodes())
>     {
>         # Generated by ExtensionsHandler.pm from resourceAllocationGroup elements
>         print JOB '#PBS -l nodes=', $description->nodes(), "\n";
>     }
>     elsif ($description->host_count() != 0)
>     {
>         print JOB '#PBS -l nodes=', $description->host_count(), "\n";
>     }
>     elsif ($cluster && $cpu_per_node != 0)
>     {
>         print JOB '#PBS -l nodes=',
>             myceil($description->count() / $cpu_per_node), "\n";
>     }
> <<<<<<<<
>
>>
>>
>> Example job description:
>>
>> <job>
>>    <factoryEndpoint
>>            xmlns:gram="http://www.globus.org/namespaces/2004/10/gram/job"
>>            xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/03/addressing">
>>        <wsa:Address>
>>            
>> https://ng2test.auckland.ac.nz:8443/wsrf/services/ManagedJobFactoryService
>>        </wsa:Address>
>>        <wsa:ReferenceProperties>
>>            <gram:ResourceID>PBS</gram:ResourceID>
>>        </wsa:ReferenceProperties>
>>    </factoryEndpoint>
>> <executable>/bin/hostname</executable>
>> <count>200</count>
>> <queue>[EMAIL PROTECTED]</queue>
>> <jobType>multiple</jobType>
>>    <extensions>
>>        <resourceAllocationGroup>
>>                <hostCount>10</hostCount>
>>                <cpusPerHost>8</cpusPerHost>
>>                <processCount>162</processCount>
>>        </resourceAllocationGroup>
>>    </extensions>
>> </job>
>>
>> For MPI jobs the limit seems to be 20 * the number of cores; for larger
>> numbers of processes I see errors like this:
>>
>> --------------------------------------------------------------------------
>> *** An error occurred in MPI_Init
>> *** before MPI was initialized
>> *** MPI_ERRORS_ARE_FATAL (goodbye)
>> [compute-1.local:23438] [0,0,0] ORTE_ERROR_LOG: Timeout in file 
>> base/pls_base_orted_cmds.c at line 275
>> [compute-1.local:23438] [0,0,0] ORTE_ERROR_LOG: Timeout in file 
>> pls_rsh_module.c at line 1164
>> [compute-1.local:23438] [0,0,0] ORTE_ERROR_LOG: Timeout in file 
>> errmgr_hnp.c at line 90
>> mpiexec noticed that job rank 8 with PID 22257 on node compute-10 exited 
>> on signal 15 (Terminated).
>> [compute-1.local:23438] [0,0,0] ORTE_ERROR_LOG: Timeout in file 
>> base/pls_base_orted_cmds.c at line 188
>> [compute-1.local:23438] [0,0,0] ORTE_ERROR_LOG: Timeout in file 
>> pls_rsh_module.c at line 1196
>> --------------------------------------------------------------------------
>>
>> Again, this does not happen all the time.
>>
>>
>> Example job description:
>>
>> <job>
>>    <factoryEndpoint
>>            xmlns:gram="http://www.globus.org/namespaces/2004/10/gram/job"
>>            xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/03/addressing">
>>        <wsa:Address>
>>            
>> https://ng2test.auckland.ac.nz:9443/wsrf/services/ManagedJobFactoryService
>>        </wsa:Address>
>>        <wsa:ReferenceProperties>
>>            <gram:ResourceID>PBS</gram:ResourceID>
>>        </wsa:ReferenceProperties>
>>    </factoryEndpoint>
>>
>> <executable>test</executable>
>> <directory>/home/grid-bestgrid/MPI/</directory>
>> <queue>[EMAIL PROTECTED]</queue>
>> <jobType>mpi</jobType>
>>
>>    <extensions>
>>        <resourceAllocationGroup>
>>                <hostCount>5</hostCount>
>>                <cpusPerHost>8</cpusPerHost>
>>                <processCount>900</processCount>
>>        </resourceAllocationGroup>
>>    </extensions>
>> </job>
>>
>>
>>
>> Can anyone explain what is going on here?
>>
>>
>> Regards,
>> Yuriy
>>
>
