I am quite confused now, and I don't know exactly what
the problem is.

Let me first describe what I have done.

I successfully installed a toy cluster (1 master + 2
slaves, OSCAR 1.2, kernel 2.4.7smp-10) two or three
weeks ago.

Now we want to build a cluster with 32 dual-CPU PCs.
The master node is a uni-CPU machine. We cannot add
all 32 machines at once; we need to start with several
and then add the others.
Because we use OSCAR 1.2, which doesn't provide a
function for adding clients, I figured out a way to
add new clients on my toy cluster.
For example:
=================================================
In my case: install 1 node first, then add the other
 node.
 First, in DEFINE OSCAR CLIENTS, define
 NUM OF HOSTS=1
 STARTING NUM=1
 STARTING IP=192.168.1.101
 When you want to add the other one, set in the same
 step
 NUM OF NODES=1
 STARTING NUM=2
 STARTING IP=192.168.1.102
===================================================
Finally, run "Complete Cluster Setup" in OSCAR_WIZARD
each time a client is added.
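
The batch-by-batch procedure above can be sketched as a small helper that works out the values to type into DEFINE OSCAR CLIENTS for each batch. This is only an illustration: the function name `client_batches` and the dictionary keys are made up here and are not part of OSCAR, and it assumes node numbers and IP addresses simply continue from where the previous batch stopped, as in the 1+1 example.

```python
import ipaddress


def client_batches(batch_sizes, first_num=1, first_ip="192.168.1.101"):
    """Yield the wizard values for each batch of clients, in order.

    Assumption: numbering and IP addresses are contiguous across batches.
    """
    num = first_num
    ip = ipaddress.IPv4Address(first_ip)
    for size in batch_sizes:
        yield {
            "num_nodes": size,       # number of clients in this batch
            "starting_num": num,     # first node number of this batch
            "starting_ip": str(ip),  # first IP address of this batch
        }
        num += size
        ip += size  # IPv4Address supports adding an integer offset


# The toy-cluster case: one client first, then one more.
for batch in client_batches([1, 1]):
    print(batch)
```

For the real cluster the list would hold the size of each batch in the order the machines are added, for example `[1, 1, 4]`.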

It worked well on my toy cluster and passed all
tests.

Why don't I use OSCAR 1.3?
Because I could never get past even the first step. I
don't know the reason, and I don't have much time to
figure out what the matter is.


Now to my real cluster.

First I updated the kernel on the master (uni-CPU)
from 2.4.7 to 2.4.18. I did this by compiling the
kernel-2.4.18 source code directly (make menuconfig,
make bzImage, and so on).

For the slave nodes, I just put
kernel-smp-2.4.18-7.95.i386/athlon/i686.rpm in
/tftpboot/rpm/ and updated sample.rpmlist manually.
I also updated some other packages to satisfy the
dependencies (this was painful work).
Finally I could build an image.

Then I wanted to test again.
So the first time I added just one client. It
installed easily and passed all tests.
(In fact, when I ran test_cluster, I got the error
below:
========================================================
 > ./test_cluster
 Enter the number of client nodes: 2
 Enter the number of processors per client: 2
 PBS TEST

 --------------------------------------------------------------------------------
 Running a simple PBS shell job...
 PBS shell test failed - output file empty
 Contents of shelltest.err:
 /usr/local/pbs/bin/pbsdsh: tm_init failed, rc = TM_ESYSTEM (17000)
 /usr/local/pbs/bin/pbsdsh: tm_init failed, rc = TM_ESYSTEM (17000)
 Since simple PBS job failed, exiting
 ==============================================
I asked about the reason on the mailing list and was
told that this is just a bug.
I can use qsub to run the mpich, lam, and pvm tests
successfully, so I don't think this is a problem.
)

OK, then I added another client. It could still pass
the mpich and pvm tests using qsub. For lam, I got an
error message.
I also used mpirun to run some MPI programs directly,
without PBS. They spent quite a long time on
initialization and got no speedup. (Because I now have
four CPUs, "mpirun -np 4 xx" should be faster than
"mpirun -np 2 xx", just as on my toy cluster.)
I also found that SSH is quite slow: when I ssh to one
of the clients, it takes about 30 seconds. mpich is
based on SSH, so I think this is probably the problem.
Our parallel application, which is based on RSH,
worked well at this point.
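
To put numbers on the slow SSH login and the missing speedup, each command can be timed the same way. The sketch below is only a measuring aid: the helper names `time_command` and `speedup` are made up here, the host name `oscarnode1` in the comments is a placeholder, and it measures wall-clock time without diagnosing the cause.

```python
import subprocess
import sys
import time


def time_command(argv):
    """Run a command and return (elapsed wall-clock seconds, exit code)."""
    start = time.perf_counter()
    result = subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return time.perf_counter() - start, result.returncode


def speedup(t_fewer_procs, t_more_procs):
    """Ratio of run times; a value above 1 means the larger run was faster."""
    return t_fewer_procs / t_more_procs


# Harmless stand-in command so the sketch runs anywhere.
elapsed, code = time_command([sys.executable, "-c", "pass"])
print(f"stand-in command: {elapsed:.3f} s, exit code {code}")

# On the cluster the same helper could time, for example:
#   time_command(["ssh", "oscarnode1", "true"])        # placeholder host name
#   time_command(["mpirun", "-np", "2", "xx"])
#   time_command(["mpirun", "-np", "4", "xx"])
# and speedup(t_np2, t_np4) would compare the two mpirun timings.
```

Timing "ssh <client> true" separately from the mpirun runs would show how much of the long initialization is spent in the login itself.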

So I continued and added another 4 nodes.
After this step finished, I could pass only the pvm
test.
The mpich test always timed out.
My own MPI program ran for an unexpectedly long time
and then reported some errors.
And this time, our RSH-based parallel application
could not run with more than 2 processes.

I am quite confused now.
What is the problem here?
mpich? SSH? The 2.4.18 kernel? Or something else?

Could you help me find out what happened here?

Thanks very much!


_______________________________________________
Oscar-users mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/oscar-users
