Hi Yuriy,

Thanks for the additional details.

On May 27, 2008, at 9:32 PM, Yuriy wrote:

Hi Stuart,

Sorry, I forgot to include the error message from the client:

ssh_exchange_identification: Connection closed by remote host

Which client got this error message? Are you using globusrun-ws, or some other client?

If the issue is the remote host closing the connection because the client is unable to respond in time, maybe there is an ssh timeout value that can be increased? Note that this is unrelated to GRAM, as you say just below; the issue here is between one compute node and a set of others in the cluster.

Is rsh an option here for you? Maybe that would offer greater scalability?



I suspect that there is a limit on simultaneous connections in sshd.

Interesting. It looks like you're starting to pin down the problem and understand the limits.
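If sshd really is the bottleneck, one knob worth checking (my suggestion, not something confirmed in this thread) is MaxStartups in sshd_config, which caps concurrent unauthenticated connections per sshd; the historical OpenSSH default is 10, and exceeding it produces exactly the "ssh_exchange_identification: Connection closed by remote host" symptom. The values below are illustrative only, not a tested recommendation:

```
# /etc/ssh/sshd_config on each compute node (illustrative values).
# MaxStartups caps concurrent *unauthenticated* connections; beyond the cap
# sshd drops new connections, which the client reports as
# "ssh_exchange_identification: Connection closed by remote host".
# start:rate:full form -- begin dropping 30% of new connections at 100,
# drop all at 200.
MaxStartups 100:30:200
```

sshd needs a reload for the change to take effect, and note that raising the cap trades the ssh error for more simultaneous sshd work per node.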


I modified pbs.pm to insert a random delay before making the next connection, which increased the number of jobs that could be run but did not eliminate the problem. Also, the following script

for i in `seq 1 $1`
do
    ssh compute-$(( i % 10 + 1 )) echo "hello world" &
done


produces equivalent errors when run from a compute node with a large value of $1 (~150).
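One way to test the concurrency theory (a sketch of mine, not something from this thread): cap the number of ssh sessions in flight instead of inserting random delays. `run_throttled` is a hypothetical name, and the `echo` is a local stand-in for the ssh command in the loop above:

```shell
#!/bin/bash
# Sketch: run n tasks in the background, but never more than `limit` at once.
run_throttled() {
    local n=$1 limit=$2 i
    for i in $(seq 1 "$n"); do
        echo "hello world from task $i" &  # stand-in for: ssh compute-$(( i % 10 + 1 )) echo "hello world"
        # back off while `limit` jobs are still running, so the remote
        # sshd never sees more than `limit` simultaneous connections
        while [ "$(jobs -r | wc -l)" -ge "$limit" ]; do
            sleep 0.05
        done
    done
    wait
}
```

If the errors disappear once the cap is below sshd's concurrency limit, that would confirm the suspicion.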

I am adding a summary here for others in case they run into the same problem.

keywords: cluster gram process scalability ssh error

Cluster OS/Arch:  <?>

Cluster size: 10-node cluster with 2 quad-core processors per node (80 cores in all)

GRAM description of job execution: For jobtype multiple, after the LRM has started the job, GRAM will ssh (or rsh) from a single compute node to each allocated compute node in order to start the user application. For a job with a resourceAllocationGroup -> processCount of 160, that equates to 160 nearly simultaneous ssh commands from a single compute node.

Job/process sizes and results:

        Safe limit:
- The ssh error cannot be triggered with fewer than 160 processes started by GRAM in a single job.

        Error condition:
- When the number of processes is >= 160, there is an increasing probability of hitting the ssh error.



This is not well documented, but for PBS, when resourceAllocationGroup is used, count is ignored, as you can see in the if-then-else that chooses what sets the PBS nodes directive.

Yeah, and looking at the pbs.pm code also explained why count has to be bigger than 1. Thank you.

Regards,
Yuriy

On Tue, May 27, 2008 at 12:21:51PM -0500, Stuart Martin wrote:

On May 27, 2008, at 12:20 AM, Yuriy wrote:


Hi,


What determines the limit on the number of jobs that can be submitted through a single <job> description?

There is no limit, other than what is available in the cluster/LRM that GRAM is interfacing with.

We have a 10-node cluster with 2 quad-core processors per node, and when the number of jobs is greater than 160 there seems to be an increasing probability of getting the following error:


/bin/sh: /home/grid-bestgrid/.globus/90bbca80-2ba4-11dd-95fc-8fae74568b88/scheduler_pbs_cmd_script: No such file or directory

This error does not happen all the time, but the probability increases as the number of jobs increases, and I haven't been able to trigger this error with the number of processes < number of cores * 2.

I've seen a similar error on TeraGrid's UC/ANL cluster due to NFS scalability issues between the cluster's head node and compute nodes. What happens is that GRAM creates the scheduler_pbs_cmd_script for the job. From the main (first) compute node allocated, GRAM will rsh to each of the other allocated compute nodes and run that same command script. When all compute nodes in the job access that file at the same time, some fail even though the file is there.

Maybe we need to add some reliability in there, to retry (since we know that script should be there). Or maybe there is a better way to handle this situation. I'll have to think about this some.
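A minimal sketch of that retry idea, as a shell wrapper around the command script. `run_with_retry` is a hypothetical name of mine; the real fix would live in GRAM's Perl job manager rather than in shell:

```shell
#!/bin/bash
# Hypothetical retry wrapper (sketch, not GRAM code): re-attempt the
# NFS-hosted command script a few times, since the file is known to exist
# and the failures appear to be transient NFS cache misses under load.
run_with_retry() {
    local script=$1 tries=${2:-5} i
    for i in $(seq 1 "$tries"); do
        # re-check executability each attempt; a short sleep gives the NFS
        # client time to revalidate its attribute cache
        if [ -x "$script" ]; then
            "$script" && return 0
        fi
        sleep 1
    done
    return 1
}
```

Even a small, bounded retry like this would mask the race without hiding a genuinely missing file for long.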

One place to look to improve this would be optimizing your NFS configuration/setup. I am not an expert here and cannot offer much advice, but it would be good to have some helpful hints on this for people who run into it. So please share any information if you find ways to improve the situation.

At what scale do problems occur with this? By that I mean, how many PBS processes/nodes are trying to access that file at (nearly) the same time
when errors begin to occur?



Also, the <count> tag seems to have no effect on the number of jobs executed, other than that when it is equal to one, all jobs execute on a single node.

Here is the 4.0 doc on extension handling:
        
http://www.globus.org/toolkit/docs/4.0/execution/wsgram/WS_GRAM_Job_Desc_Extensions.html#r-wsgram-extensions-constructs

This is not well documented, but for PBS, when resourceAllocationGroup is used, count is ignored, as you can see in the if-then-else below that chooses what sets the PBS nodes directive.

From PBS.pm >>>>>>>>
    if (defined $description->nodes())
    {
        # Generated by ExtensionsHandler.pm from resourceAllocationGroup elements
        print JOB '#PBS -l nodes=', $description->nodes(), "\n";
    }
    elsif ($description->host_count() != 0)
    {
        print JOB '#PBS -l nodes=', $description->host_count(), "\n";
    }
    elsif ($cluster && $cpu_per_node != 0)
    {
        print JOB '#PBS -l nodes=',
            myceil($description->count() / $cpu_per_node), "\n";
    }
<<<<<<<<



Example job description:

<job>
  <factoryEndpoint
xmlns:gram="http://www.globus.org/namespaces/2004/10/gram/job" xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/03/addressing">
      <wsa:Address>

https://ng2test.auckland.ac.nz:8443/wsrf/services/ManagedJobFactoryService
      </wsa:Address>
      <wsa:ReferenceProperties>
          <gram:ResourceID>PBS</gram:ResourceID>
      </wsa:ReferenceProperties>
  </factoryEndpoint>
<executable>/bin/hostname</executable>
<count>200</count>
<queue>[EMAIL PROTECTED]</queue>
<jobType>multiple</jobType>
  <extensions>
      <resourceAllocationGroup>
              <hostCount>10</hostCount>
              <cpusPerHost>8</cpusPerHost>
              <processCount>162</processCount>
      </resourceAllocationGroup>
  </extensions>
</job>

For MPI jobs the limit seems to be 20 * the number of cores; for larger numbers of processes I see errors like this:

--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
[compute-1.local:23438] [0,0,0] ORTE_ERROR_LOG: Timeout in file
base/pls_base_orted_cmds.c at line 275
[compute-1.local:23438] [0,0,0] ORTE_ERROR_LOG: Timeout in file
pls_rsh_module.c at line 1164
[compute-1.local:23438] [0,0,0] ORTE_ERROR_LOG: Timeout in file
errmgr_hnp.c at line 90
mpiexec noticed that job rank 8 with PID 22257 on node compute-10 exited
on signal 15 (Terminated).
[compute-1.local:23438] [0,0,0] ORTE_ERROR_LOG: Timeout in file
base/pls_base_orted_cmds.c at line 188
[compute-1.local:23438] [0,0,0] ORTE_ERROR_LOG: Timeout in file
pls_rsh_module.c at line 1196
--------------------------------------------------------------------------

Again, this does not happen all the time.


Example job description:

<job>
  <factoryEndpoint
xmlns:gram="http://www.globus.org/namespaces/2004/10/gram/job" xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/03/addressing">
      <wsa:Address>

https://ng2test.auckland.ac.nz:9443/wsrf/services/ManagedJobFactoryService
      </wsa:Address>
      <wsa:ReferenceProperties>
          <gram:ResourceID>PBS</gram:ResourceID>
      </wsa:ReferenceProperties>
  </factoryEndpoint>

<executable>test</executable>
<directory>/home/grid-bestgrid/MPI/</directory>
<queue>[EMAIL PROTECTED]</queue>
<jobType>mpi</jobType>

  <extensions>
      <resourceAllocationGroup>
              <hostCount>5</hostCount>
              <cpusPerHost>8</cpusPerHost>
              <processCount>900</processCount>
      </resourceAllocationGroup>
  </extensions>
</job>



Can anyone explain what is going on here?


Regards,
Yuriy



