Hi,

I wrote a document about how to integrate Torque/MAUI and Globus. It works
for me and also provides useful hints for overcoming problems that I and
other people have run into.

It can surely be improved, so your feedback is important to me.

Regards.

PS: The grammar needs revision. Technical suggestions are also welcome.

http://ece.uprm.edu/~s047267
http://del.icio.us/josanabr
http://blog-grid.blogspot.com
Preparing Torque/Maui for Work with GT4

John A. Sanabria - [EMAIL PROTECTED]
Last Updated: Wed Sep 5 23:35:49 2007



This document builds on my previous tutorial, Installing and Configuring Torque/MAUI. That tutorial works fine when you deal with a plain cluster; however, additional configuration steps are necessary to integrate Torque/MAUI with GT4. Readers of the previous article will therefore find similarities between the two. Likewise, a new reader does not need any knowledge of the previous article in order to install a Torque/MAUI cluster without Globus Toolkit (GT) integration. Finally, this tutorial is a work in progress, so any feedback is welcome.

Requirements

First of all, a minimal cluster can be deployed on a single machine; however, to expose the problems that arise with independent machines, two computational nodes are suggested.

My testbed consists of two Linux machines with FC7 installed.

Furthermore, the machines have the following network services installed:

  • rsh (client and server). For environments with direct access to the Internet, installing rsh/rlogin is discouraged.
  • nfs (client and server).

You must also get the Torque and Maui source code.

Users

On every machine belonging to the cluster, it is necessary to create a user with the same id. (Someone could contribute a short NIS+ tutorial.)

For this tutorial the user created is josanabr.

Setting Up the Services

Below are short steps and hacks for configuring RSH and NFS properly.

RSH

Next, I describe the steps to configure the RSH service. (For simplicity, I recommend executing these configuration steps on EVERY cluster machine; cssh helps here.)

  1. Enable the RSH service, which is disabled by default: edit the file /etc/xinetd.d/rsh, look for the variable disable, and set its value to no.
  2. Enable the RLogin service, which is disabled by default: edit the file /etc/xinetd.d/rlogin, look for the variable disable, and set its value to no.
  3. As root, restart the xinetd service: /etc/init.d/xinetd restart.
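Steps 1 and 2 can be scripted with sed. This is only a sketch, assuming the stock FC7 files where the directive reads "disable = yes"; it is demonstrated below on a throwaway copy so you can try it safely:

```shell
# enable_xinetd_service FILE: flip "disable = yes" to "disable = no".
# On a real node the files are /etc/xinetd.d/rsh and /etc/xinetd.d/rlogin;
# run it there as root, then restart xinetd.
enable_xinetd_service() {
    sed -i 's/^\([[:space:]]*disable[[:space:]]*=[[:space:]]*\)yes/\1no/' "$1"
}

# Demonstration on a minimal copy of the rsh stanza:
printf 'service rsh\n{\n\tdisable = yes\n}\n' > /tmp/rsh.conf
enable_xinetd_service /tmp/rsh.conf
grep disable /tmp/rsh.conf
```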

Now you can try to log in as the josanabr user.

  [EMAIL PROTECTED] ~]$ rsh pdclab-04
  connect to address 136.145.116.81 port 543: Connection refused
  Trying krb4 rlogin...
  connect to address 136.145.116.81 port 543: Connection refused
  trying normal rlogin (/usr/bin/rlogin)
  Password: 
  Last login: Tue Sep  4 15:46:11 from pdclab-00
  [EMAIL PROTECTED] ~]$ 

Cool? For Torque's purposes it is not: Torque does not like those Connection refused messages. The cause is disclosed when you execute the next command:

  [EMAIL PROTECTED] ~]$ which -a rsh
  /usr/kerberos/bin/rsh
  /usr/bin/rsh

The first rsh instance to be executed is the one located under the /usr/kerberos directory. To avoid the ugly message, the instance located at /usr/bin must be found first.

(You may find a more elegant solution.) To achieve that, execute the next commands as root:

  [EMAIL PROTECTED] etc]# cd /usr/kerberos/bin/
  [EMAIL PROTECTED] bin]# mv rsh rsh.krb
  [EMAIL PROTECTED] bin]# mv rlogin rlogin.krb
  [EMAIL PROTECTED] bin]# ln -sf /usr/bin/rsh .
  [EMAIL PROTECTED] bin]# ln -sf /usr/bin/rlogin .

Now, try again:

  [EMAIL PROTECTED] ~]$ rsh pdclab-04
  Password: 
  Last login: Tue Sep  4 15:46:29 from pdclab-00
  [EMAIL PROTECTED] ~]$ 

Hmmm, looks better :-), but I still need to provide my password. To avoid typing the password, e.g. when you log in from pdclab-00 to pdclab-04, log in to pdclab-04 and execute the next commands:

  [EMAIL PROTECTED] ~]$ vi .rhosts
  [EMAIL PROTECTED] ~]$ cat .rhosts 
  pdclab-00.ece.uprm.edu
  pdclab-00
  [EMAIL PROTECTED] ~]$ chmod og-r .rhosts 
  [EMAIL PROTECTED] ~]$ ls -l .rhosts 
  -rw------- 1 josanabr josanabr 33 Sep  4 16:16 .rhosts
  [EMAIL PROTECTED] ~]$ 

Now, try again:

  [EMAIL PROTECTED] ~]$ rsh pdclab-04
  Last login: Tue Sep  4 16:18:17 from pdclab-00
  [EMAIL PROTECTED] ~]$ 

Hmmm, well done! Now allow the connection from pdclab-04 to pdclab-00 in the same way.
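The .rhosts setup for the reverse direction can be scripted as well. A sketch (hostnames are this tutorial's own; run it as josanabr on pdclab-00):

```shell
# Allow password-less rsh from pdclab-04 into this account.
# rsh ignores a .rhosts that is group- or world-readable, hence chmod 600.
cat > "$HOME/.rhosts" <<'EOF'
pdclab-04.ece.uprm.edu
pdclab-04
EOF
chmod 600 "$HOME/.rhosts"
ls -l "$HOME/.rhosts"
```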

NFS

Now, for proper integration of Torque and GT4, we need to share a filesystem from the master node with the compute nodes. Remember, our master node is pdclab-00 and our compute node is pdclab-04, so our NFS server must be pdclab-00. The filesystem to be shared is the /home directory. As root, execute the next commands:

  [EMAIL PROTECTED] etc]# vi /etc/exports
  [EMAIL PROTECTED] etc]# cat /etc/exports 
  /home pdclab-04.ece.uprm.edu(rw,sync)
  [EMAIL PROTECTED] etc]# 

Then restart the NFS-related services:

  /etc/init.d/portmap restart
  /etc/init.d/nfs restart
  /etc/init.d/nfslock restart

Now you can mount pdclab-00's /home directory from pdclab-04.

  [EMAIL PROTECTED] ~]# mount -t nfs pdclab-00:/home /home
  [EMAIL PROTECTED] ~]# ls -l /home/
  total 8
  drwx------ 5 globus   globus   4096 Jul 23 17:56 globus
  drwx------ 4 josanabr josanabr 4096 Sep  5 14:19 josanabr
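The mount shown above does not survive a reboot. As an optional extra step (not part of the original procedure), a line along these lines in /etc/fstab on pdclab-04 makes it persistent; the mount options shown are common defaults, adjust to taste:

```
pdclab-00:/home  /home  nfs  rw,hard,intr  0 0
```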

Building the Cluster

To achieve the "PBS" and GT4 integration, the first things to set up are the Torque and MAUI components. We select Torque as the resource manager for our distributed environment; it can be considered a PBS clone, but open source. MAUI, on the other hand, is a robust scheduler that supports advanced mechanisms and policies for scheduling large sets of distributed computational resources.

No more words; hands on.

Setting up Torque

Torque is an OpenPBS descendant: a distributed resource manager providing control over batch jobs and distributed compute nodes. Although it has support for scheduling policies, that is not a major concern here.

Let's get to it.

Download the software

The software can be downloaded from here.

Note: this tutorial employs version 2.1.9.

Unpacking, Configuring, Compiling and Installing the Server

Go to the directory where you wish to uncompress the file:

  [EMAIL PROTECTED] ~]# cd /usr/local/src/
  [EMAIL PROTECTED] src]# tar xfz ~/torque-2.1.9.tar.gz 
  [EMAIL PROTECTED] src]# cd torque-2.1.9/

Now configure Torque, explicitly requesting the monitor and client scripts, which are used to install the proper software on the compute nodes:

  ./configure --enable-server --enable-monitor --enable-clients
  make
  make install

If there are no errors, you then need to execute the next commands from the Torque source code directory:

  ./torque.setup globus
  make packages

The first command finishes configuring the server and indicates that the globus user is the Torque manager. The second creates the packages that need to be delivered to every compute node.

Setting Up a Compute Node

With the tasks done up to this moment, you can copy the scripts:

  • torque-package-clients-linux-i686.sh
  • torque-package-mom-linux-i686.sh

from the server pdclab-00 to the compute node pdclab-04. Log in to pdclab-04 as the root user and copy the scripts located at pdclab-00:

  [EMAIL PROTECTED] ~]# scp pdclab-00:/usr/local/src/torque-2.1.9/torque-package-clients-linux-i686.sh .
  [EMAIL PROTECTED]'s password: 
  torque-package-clients-linux-i686.sh          100%  400KB 400.0KB/s   00:00    
  [EMAIL PROTECTED] ~]# scp pdclab-00:/usr/local/src/torque-2.1.9/torque-package-mom-linux-i686.sh .
  [EMAIL PROTECTED]'s password: 
  torque-package-mom-linux-i686.sh              100%  448KB 447.5KB/s   00:00    
  [EMAIL PROTECTED] ~]#

Now, install the packages:

  [EMAIL PROTECTED] ~]# ./torque-package-clients-linux-i686.sh --install
  
  Installing TORQUE archive... 
  
  Done.
  [EMAIL PROTECTED] ~]# ./torque-package-mom-linux-i686.sh --install
  
  Installing TORQUE archive... 
  
  Done.
  [EMAIL PROTECTED] ~]#

Verify that pdclab-04's master node is pdclab-00:

  [EMAIL PROTECTED] ~]# cat /var/spool/torque/server_name 
  pdclab-00.ece.uprm.edu
  [EMAIL PROTECTED] ~]#

It's ok. To finalize the client configuration, edit the file /var/spool/torque/mom_priv/config and add these lines:

  arch x86
  opsys fc6
  $logevent 255
  $usecp *:/home /mnt/home

The last line maps the directory /home on the submit host to /mnt/home on the compute node.

Now you can execute the program that receives jobs from the master node:

  [EMAIL PROTECTED] ~]# pbs_mom
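One detail worth double-checking before moving on (standard Torque practice, not shown explicitly above): the compute node must be listed in the server's nodes file on the master, /var/spool/torque/server_priv/nodes, or pbsnodes will not report it. For this testbed the file would contain a line like:

```
pdclab-04.ece.uprm.edu np=1
```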

Setting Up Maui

Maui is an advanced policy engine used to improve the manageability and efficiency of machines ranging from clusters of a few processors to multi-teraflop supercomputers.

The next steps must be executed on the master node (pdclab-00).

Download the software

To get the software, go here.

Preliminary hacks

Due to integration issues with Torque, Maui expects to find the libpbs.so and libpbs.a libraries on the master node. Execute the following:

  [EMAIL PROTECTED] ~]# cd /usr/local/lib
  [EMAIL PROTECTED] lib]# ln -sf libtorque.so libpbs.so
  [EMAIL PROTECTED] lib]# ln -sf libtorque.a libpbs.a

Unpacking, Configuring, Compiling and Installing

Execute the next commands:

  [EMAIL PROTECTED] lib]# cd /usr/local/src/
  [EMAIL PROTECTED] src]# tar xfz ~/maui-3.2.6p13.tar.gz
  [EMAIL PROTECTED] src]# cd maui-3.2.6p13/
  [EMAIL PROTECTED] maui-3.2.6p13]# export MAUIDIR=/var/spool/maui
  [EMAIL PROTECTED] maui-3.2.6p13]# ./configure --with-spooldir=${MAUIDIR}
  [EMAIL PROTECTED] maui-3.2.6p13]# make
  [EMAIL PROTECTED] maui-3.2.6p13]# make install

Note: if you get an error message related to -lnet, execute yum install libnet-devel as root.

Final Configuration Steps

Ok, we are almost done, so execute the following:

  [EMAIL PROTECTED] ~]# qmgr
  Qmgr: set server resources_default.nodect = 1
  Qmgr: set server resources_default.walltime = 00:05:00
  Qmgr: quit
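The same settings can be applied non-interactively with qmgr's -c option, which is handy in a setup script. A sketch that generates such a script (the file name is illustrative):

```shell
# Generate a one-shot script that applies the queue defaults via
# "qmgr -c" (non-interactive, single-directive mode).
cat > /tmp/qmgr-defaults.sh <<'EOF'
#!/bin/sh
qmgr -c "set server resources_default.nodect = 1"
qmgr -c "set server resources_default.walltime = 00:05:00"
EOF
chmod +x /tmp/qmgr-defaults.sh
```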

Finally,

  [EMAIL PROTECTED] ~]# qterm -t quick ; pbs_server
  [EMAIL PROTECTED] ~]# /usr/local/maui/sbin/maui
  [EMAIL PROTECTED] ~]# pbsnodes -a
  pdclab-04.ece.uprm.edu
       state = free
       np = 1
       ntype = cluster
       status = arch=x86,opsys=fc6,uname=Linux pdclab-04.ece.uprm.edu 2.6.18-1.2798.fc6xen #1 SMP Mon Oct 16 15:11:19 EDT 2006 i686,sessions=? 0,nsessions=? 0,nusers=0,idletime=185,totmem=816556kb,availmem=729324kb,physmem=262324kb,ncpus=1,loadave=0.02,netload=26127667,state=free,jobs=? 0,rectime=1189026813

Good kid.

Testing Torque/MAUI Installation

To test our cluster deployment, log in as josanabr on pdclab-00. There, create a text file with this content:

  #!/bin/bash
    
  /bin/hostname

Save it as mysub, then:

  [EMAIL PROTECTED] ~]$ qsub mysub
  0.pdclab-00.ece.uprm.edu
  [EMAIL PROTECTED] ~]$ ls -rtl
  total 8
  -rw-r--r-- 1 josanabr josanabr 29 Sep  5 17:17 mysub
  -rw------- 1 josanabr josanabr 23 Sep  5 17:17 mysub.o0
  -rw------- 1 josanabr josanabr  0 Sep  5 17:17 mysub.e0
  [EMAIL PROTECTED] ~]$ cat mysub.o0 
  pdclab-04.ece.uprm.edu
  [EMAIL PROTECTED] ~]$ 

You already have a cluster; congrats. :-D

Now our journey begins. ;-)

A Short Journey over GT Quick Start Guide

Here, instead of providing a deep description of the Globus configuration, compilation, and installation process, we just provide a checklist to follow in order to achieve the integration between Torque/MAUI and GT4. Prior experience with GT installation is therefore recommended.

Configuring, Compiling and Installing GT4

Ok, remember pdclab-00 is our master node. My globus user is named globus. I logged into the globus account at pdclab-00 and have the GT source in the globus home directory. Execute the next commands:

  [EMAIL PROTECTED] ~]$ cd gt4.0.5-all-source-installer/
  [EMAIL PROTECTED] gt4.0.5-all-source-installer]$ ./configure --prefix=/opt/gt --enable-wsgram-pbs
  [EMAIL PROTECTED] gt4.0.5-all-source-installer]$ make
  ...
  ...
  echo "Your build completed successfully.  Please run make install."
  Your build completed successfully.  Please run make install.
  [EMAIL PROTECTED] gt4.0.5-all-source-installer]$ make install

Setting up Security at your Cluster

For this step, I have a SimpleCA set up on the init.ece.uprm.edu machine. To configure the master node, follow the steps given in section 3.3 of the GT4 Quickstart Guide.

In addition, init.ece.uprm.edu has a MyProxy server running, so josanabr can request his new certificate by executing:

  [EMAIL PROTECTED] ~]$ myproxy-init -s init
  Your identity: /O=Grid/OU=GlobusTest/OU=simpleCA-init.ece.uprm.edu/OU=ece.uprm.edu/CN=John Sanabria
  Enter GRID pass phrase for this identity:
  Creating proxy ................................... Done
  Proxy Verify OK
  Your proxy is valid until: Wed Sep 12 21:06:10 2007
  Enter MyProxy pass phrase:
  Verifying - Enter MyProxy pass phrase:
  A proxy valid for 168 hours (7.0 days) for user josanabr now exists on init.
  [EMAIL PROTECTED] ~]$ myproxy-logon -s init
  Enter MyProxy pass phrase:
  A credential has been received for user josanabr in /tmp/x509up_u501.
  [EMAIL PROTECTED] ~]$

For more information, read section 4.3 of the GT Quickstart Guide.

Preparing GridFTP Service

You can follow the steps described in section 5.4.

Preparing the Globus Container

Before following the instructions given in section 5.5, you need to provide some information to configure the RFT service at pdclab-00:

  1. Edit $GLOBUS_LOCATION/etc/globus_wsrf_rft/jndi-config.xml, look for the string jdbc:postgresql, and change the host name to that of the machine where postgres and rftDatabase are installed. In my case, the database is located on the init machine.
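That edit can also be done with sed. A sketch, demonstrated on a fabricated sample line because the exact markup in jndi-config.xml may differ between installations; the target host init.ece.uprm.edu is this testbed's database machine:

```shell
# Rewrite the host part of the jdbc:postgresql URL. On a real install
# the file is $GLOBUS_LOCATION/etc/globus_wsrf_rft/jndi-config.xml;
# here we operate on a fabricated sample line.
echo '<value>jdbc:postgresql://localhost/rftDatabase</value>' > /tmp/jndi-sample
sed -i 's|jdbc:postgresql://[^/]*/|jdbc:postgresql://init.ece.uprm.edu/|' /tmp/jndi-sample
cat /tmp/jndi-sample
```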

GRAM, the moment of truth

Read the section 5.7.

Test it as follows:

  [EMAIL PROTECTED] ~]$ globusrun-ws -submit -s -c /bin/date
  Delegating user credentials...Done.
  Submitting job...Done.
  Job ID: uuid:99bef29a-5c26-11dc-b839-00163e3dc54e
  Termination time: 09/07/2007 03:09 GMT
  Current job state: Active
  Current job state: CleanUp-Hold
  Wed Sep  5 23:09:38 AST 2007
  Current job state: CleanUp
  Current job state: Done
  Destroying job...Done.
  Cleaning up any delegated credentials...Done.

Well, but you are still not utilizing your cluster. Try this:

  [EMAIL PROTECTED] ~]$ globusrun-ws -Ft PBS -submit -S -f a.rsl
  Delegating user credentials...Done.
  Submitting job...Done.
  Job ID: uuid:b504593c-5c26-11dc-8737-00163e3dc54e
  Termination time: 09/07/2007 03:10 GMT
  Current job state: StageIn
  Current job state: Pending
  Current job state: Active
  Current job state: CleanUp
  Current job state: Done
  Destroying job...Done.
  Cleaning up any delegated credentials...Done.

Does it work? Hmmm, I guess not:

  [EMAIL PROTECTED] ~]$ cat stderr 
  pdclab-04.ece.uprm.edu: Connection refused
  /var/spool/torque/mom_priv/jobs/22.pdclab-0.SC: line 55: [: too many arguments

But everything looks correct, no? More amazing is the way to resolve the problem: edit the file ${GLOBUS_LOCATION}/lib/perl/Globus/GRAM/JobManager/pbs.pm and initialize the variable $cluster to 0 instead of 1. Try again:

  [EMAIL PROTECTED] ~]$ globusrun-ws -Ft PBS -submit -S -f a.rsl
  Delegating user credentials...Done.
  Submitting job...Done.
  Job ID: uuid:2483539e-5c27-11dc-9bc2-00163e3dc54e
  Termination time: 09/07/2007 03:13 GMT
  Current job state: StageIn
  Current job state: Pending
  Current job state: Active
  Current job state: CleanUp
  Current job state: Done
  Destroying job...Done.
  Cleaning up any delegated credentials...Done.
  [EMAIL PROTECTED] ~]$ cat stderr 
  [EMAIL PROTECTED] ~]$ cat stdout
  Hello World!

All done!
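The pbs.pm fix above can be automated with sed. A sketch, shown on a fabricated sample line; check how the assignment is actually written in your pbs.pm before applying it to the real file:

```shell
# Flip the $cluster flag from 1 to 0. The real file is
# ${GLOBUS_LOCATION}/lib/perl/Globus/GRAM/JobManager/pbs.pm;
# the sample line assumes the assignment reads "$cluster = 1;".
echo 'my $cluster = 1;' > /tmp/pbs-sample
sed -i 's/\(\$cluster[[:space:]]*=[[:space:]]*\)1;/\10;/' /tmp/pbs-sample
cat /tmp/pbs-sample
```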

Final Comments

At this time, integrating a cluster with GT can be a hard task, and there are many factors that can disturb the normal integration process. Mailing list support is sometimes unavailable or of little help. This is not your fault, guys, but it is disappointing anyway.

This document fulfills my need; a newbie reader will perhaps need more details.

The main motivation for producing this document is to provide a better roadmap for integrating Torque/MAUI with Globus.

I am sure this document can be improved, so I need your feedback. Any correction (grammar, technical, whatever!) will be appreciated.

Regards.

Resources

Certainly, I did not write all of this from scratch; I used several web resources.

For more information use Google :-D, or write me at .
