Hi Yuriy,

Thanks for the additional details.

On May 27, 2008, at 9:32 PM, Yuriy wrote:

Hi Stuart,

Sorry, I forgot to include the error message from the client:

ssh_exchange_identification: Connection closed by remote host

Which client got this error message? Are you using globusrun-ws, or some other client?

If the issue is the remote host closing the connection because the client is unable to respond in time, maybe there is an ssh timeout value that can be increased? Note that this is unrelated to GRAM, as you say just below; the issue here is between one compute node and a set of others in the cluster.

Is rsh an option here for you? Maybe that would offer greater scalability?



I suspect that there is a limit on simultaneous connections in sshd.

Interesting. It looks like you're starting to pin down the problem and understand the limits.
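If sshd really is the bottleneck, one knob worth checking (my suggestion, not something confirmed in this thread) is MaxStartups in sshd_config, which caps concurrent unauthenticated connections per sshd; the historical OpenSSH default is 10, and exceeding it produces exactly the "ssh_exchange_identification: Connection closed by remote host" symptom. The values below are illustrative only, not a tested recommendation:

```
# /etc/ssh/sshd_config on each compute node (illustrative values).
# MaxStartups caps concurrent *unauthenticated* connections; beyond the cap
# sshd drops new connections, which the client reports as
# "ssh_exchange_identification: Connection closed by remote host".
# start:rate:full form -- begin dropping 30% of new connections at 100,
# drop all at 200.
MaxStartups 100:30:200
```

sshd needs a reload for the change to take effect, and note that raising the cap trades the ssh error for more simultaneous sshd work per node.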


I modified pbs.pm to insert a random delay before making the next connection, which increased the number of jobs that could be run but did not eliminate the problem. Also, the following script

for i in `seq 1 $1`
do
    ssh compute-$(( i % 10 + 1 )) echo "hello world" &
done


produces equivalent errors when run from a compute node with a large value of $1 (~150).
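One way to test the concurrency theory (a sketch of mine, not something from this thread): cap the number of ssh sessions in flight instead of inserting random delays. `run_throttled` is a hypothetical name, and the `echo` is a local stand-in for the ssh command in the loop above:

```shell
#!/bin/bash
# Sketch: run n tasks in the background, but never more than `limit` at once.
run_throttled() {
    local n=$1 limit=$2 i
    for i in $(seq 1 "$n"); do
        echo "hello world from task $i" &  # stand-in for: ssh compute-$(( i % 10 + 1 )) echo "hello world"
        # back off while `limit` jobs are still running, so the remote
        # sshd never sees more than `limit` simultaneous connections
        while [ "$(jobs -r | wc -l)" -ge "$limit" ]; do
            sleep 0.05
        done
    done
    wait
}
```

If the errors disappear once the cap is below sshd's concurrency limit, that would confirm the suspicion.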

I am adding a summary here for others in case they run into the same problem.

keywords: cluster gram process scalability ssh error

Cluster OS/Arch:  <?>

Cluster size: 10-node cluster with 2 quad-core processors per node (80 cores in all)

GRAM description of job execution: For jobtype multiple, after the LRM has started the job, GRAM will ssh (or rsh) from a single compute node to each allocated compute node in order to start the user application. For a job with a resourceAllocationGroup -> processCount of 160, that equates to 160 nearly simultaneous ssh commands from a single compute node.

Job/process sizes and results:

        Safe limit:
- The ssh error cannot be triggered with fewer than 160 processes started by GRAM in a single job.

        Error condition:
- When the number of processes is >= 160, there is an increasing probability of hitting the ssh error.



This is not well documented, but for PBS, when resourceAllocationGroup is used, count is ignored, as you can see in the if-then-else that chooses what sets the PBS nodes directive.

Yeah, and looking at the pbs.pm code also explained why count has to be bigger than 1. Thank you.

Regards,
Yuriy

On Tue, May 27, 2008 at 12:21:51PM -0500, Stuart Martin wrote:

On May 27, 2008, at 12:20 AM, Yuriy wrote:


Hi,


What determines the limit on the number of jobs that can be submitted through a single <job> description?

There is no limit, other than what is available in the cluster/LRM that GRAM is interfacing with.

We have a 10-node cluster with 2 quad-core processors per node, and when the number of jobs is greater than 160 there seems to be an increasing probability of getting the following error:


/bin/sh: /home/grid-bestgrid/.globus/90bbca80-2ba4-11dd-95fc-8fae74568b88/scheduler_pbs_cmd_script: No such file or directory

This error does not happen all the time, but the probability increases as the number of jobs increases, and I haven't been able to trigger this error with the number of processes < number of cores * 2.

I've seen a similar error on TeraGrid's UC/ANL cluster due to NFS scalability issues between the cluster's head node and compute nodes. What happens is that GRAM creates the scheduler_pbs_cmd_script for the job. From the main (first) compute node allocated, GRAM will rsh to each of the other allocated compute nodes and run that same command script. When all compute nodes in the job access that file at the same time, some fail even though the file is there.

Maybe we need to add some reliability in there, to retry (since we know that script should be there). Or maybe there is a better way to handle this situation. I'll have to think about this some.
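A minimal sketch of that retry idea, as a shell wrapper around the command script. `run_with_retry` is a hypothetical name of mine; the real fix would live in GRAM's Perl job manager rather than in shell:

```shell
#!/bin/bash
# Hypothetical retry wrapper (sketch, not GRAM code): re-attempt the
# NFS-hosted command script a few times, since the file is known to exist
# and the failures appear to be transient NFS cache misses under load.
run_with_retry() {
    local script=$1 tries=${2:-5} i
    for i in $(seq 1 "$tries"); do
        # re-check executability each attempt; a short sleep gives the NFS
        # client time to revalidate its attribute cache
        if [ -x "$script" ]; then
            "$script" && return 0
        fi
        sleep 1
    done
    return 1
}
```

Even a small, bounded retry like this would mask the race without hiding a genuinely missing file for long.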

One place to look to improve this would be optimizing your NFS configuration/setup. I am not an expert here and cannot offer much advice, but it would be good to have some helpful hints on this for people who run into it. So please share any information if you find ways to improve the situation.

At what scale do problems occur with this? By that I mean, how many PBS processes/nodes are trying to access that file at (nearly) the same time
when errors begin to occur?



Also, the <count> tag seems to have no effect on the number of jobs executed, other than that when it is equal to one, all jobs execute on a single node.

Here is the 4.0 doc on extension handling:
        
http://www.globus.org/toolkit/docs/4.0/execution/wsgram/WS_GRAM_Job_Desc_Extensions.html#r-wsgram-extensions-constructs

This is not well documented, but for PBS, when resourceAllocationGroup is used, count is ignored, as you can see in the if-then-else below that chooses what sets the PBS nodes directive.

From PBS.pm >>>>>>>>
    if (defined $description->nodes())
    {
        # Generated by ExtensionsHandler.pm from resourceAllocationGroup elements
        print JOB '#PBS -l nodes=', $description->nodes(), "\n";
    }
    elsif ($description->host_count() != 0)
    {
        print JOB '#PBS -l nodes=', $description->host_count(), "\n";
    }
    elsif ($cluster && $cpu_per_node != 0)
    {
        print JOB '#PBS -l nodes=',
            myceil($description->count() / $cpu_per_node), "\n";
    }
<<<<<<<<



Example job description:

<job>
  <factoryEndpoint
xmlns:gram="http://www.globus.org/namespaces/2004/10/gram/job" xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/03/addressing">
      <wsa:Address>

https://ng2test.auckland.ac.nz:8443/wsrf/services/ManagedJobFactoryService
      </wsa:Address>
      <wsa:ReferenceProperties>
          <gram:ResourceID>PBS</gram:ResourceID>
      </wsa:ReferenceProperties>
  </factoryEndpoint>
<executable>/bin/hostname</executable>
<count>200</count>
<queue>[EMAIL PROTECTED]</queue>
<jobType>multiple</jobType>
  <extensions>
      <resourceAllocationGroup>
              <hostCount>10</hostCount>
              <cpusPerHost>8</cpusPerHost>
              <processCount>162</processCount>
      </resourceAllocationGroup>
  </extensions>
</job>

For MPI jobs the limit seems to be 20 * the number of cores; for larger numbers of processes I see errors like this:

--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
[compute-1.local:23438] [0,0,0] ORTE_ERROR_LOG: Timeout in file
base/pls_base_orted_cmds.c at line 275
[compute-1.local:23438] [0,0,0] ORTE_ERROR_LOG: Timeout in file
pls_rsh_module.c at line 1164
[compute-1.local:23438] [0,0,0] ORTE_ERROR_LOG: Timeout in file
errmgr_hnp.c at line 90
mpiexec noticed that job rank 8 with PID 22257 on node compute-10 exited
on signal 15 (Terminated).
[compute-1.local:23438] [0,0,0] ORTE_ERROR_LOG: Timeout in file
base/pls_base_orted_cmds.c at line 188
[compute-1.local:23438] [0,0,0] ORTE_ERROR_LOG: Timeout in file
pls_rsh_module.c at line 1196
--------------------------------------------------------------------------

Again, this does not happen all the time.


Example job description:

<job>
  <factoryEndpoint
xmlns:gram="http://www.globus.org/namespaces/2004/10/gram/job" xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/03/addressing">
      <wsa:Address>

https://ng2test.auckland.ac.nz:9443/wsrf/services/ManagedJobFactoryService
      </wsa:Address>
      <wsa:ReferenceProperties>
          <gram:ResourceID>PBS</gram:ResourceID>
      </wsa:ReferenceProperties>
  </factoryEndpoint>

<executable>test</executable>
<directory>/home/grid-bestgrid/MPI/</directory>
<queue>[EMAIL PROTECTED]</queue>
<jobType>mpi</jobType>

  <extensions>
      <resourceAllocationGroup>
              <hostCount>5</hostCount>
              <cpusPerHost>8</cpusPerHost>
              <processCount>900</processCount>
      </resourceAllocationGroup>
  </extensions>
</job>



Can anyone explain what is going on here?


Regards,
Yuriy



