I am quite confused and don't know exactly what the problem is. Let me first describe what I have done.
Two or three weeks ago I successfully installed a toy cluster (1 master + 2 slaves, OSCAR 1.2, kernel 2.4.7smp-10). Now we want to build a cluster of 32 dual-CPU PCs, with a uni-CPU machine as the master node. We cannot add all 32 machines at once; we need to start with several and add the others later. Because we use OSCAR 1.2, which does not provide a function for adding clients, I worked out a way to add new clients on my toy cluster. For example:

=================================================
In my case: 1 node first, then add 1 more node.

First, in DEFINE OSCAR CLIENTS, define
    NUM OF HOSTS = 1
    STARTING NUM = 1
    STARTING IP = 192.168.1.101

When you want to add the other one, set in the same step
    NUM OF NODES = 1
    STARTING NUM = 2
    STARTING IP = 192.168.1.102
===================================================

Finally, run "Complete Cluster Setup" in OSCAR_WIZARD each time a client is added. This worked well on my toy cluster and passed all tests.

Why don't I use OSCAR-1.3? Because I could never get past even the first step. I don't know the reason, and I don't have much time to figure out what the matter is.

Now to my real cluster. First I updated the master's (uni-CPU) kernel from 2.4.7 to 2.4.18 by compiling the kernel-2.4.18 source directly (make menuconfig, make bzImage, and so on). For the slave nodes, I just put kernel-smp-2.4.18-7.95.i386/athlon/i686.rpm in /tftpboot/rpm/ and updated sample.rpmlist manually. I also updated some other packages to satisfy the dependencies (painful work). Finally I could build an image.

Now I wanted to test again. The first time I added just one client; it installed easily and passed all tests.
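
The batch-by-batch client definition described above can be sketched as a small helper that computes the values to type into DEFINE OSCAR CLIENTS for the next batch. This is only an illustration under stated assumptions: the function name next_client_batch and its parameters are hypothetical, it is not part of OSCAR, and it assumes that node numbers and IP addresses both increase by one per client starting from 192.168.1.101, as in the example.

```python
import ipaddress


def next_client_batch(already_defined, batch_size, first_ip="192.168.1.101"):
    """Return the wizard values for the next batch of clients.

    Hypothetical helper, not an OSCAR tool. Assumes client N gets
    node number N and the IP address first_ip + (N - 1).
    """
    if already_defined < 0 or batch_size < 1:
        raise ValueError("already_defined must be >= 0 and batch_size >= 1")
    base = ipaddress.IPv4Address(first_ip)
    return {
        "num_hosts": batch_size,                      # NUM OF HOSTS
        "starting_num": already_defined + 1,          # STARTING NUM
        "starting_ip": str(base + already_defined),   # STARTING IP
    }


if __name__ == "__main__":
    # First run: no clients yet, define one.
    print(next_client_batch(0, 1))
    # Second run: one client exists, add one more.
    print(next_client_batch(1, 1))
```

The helper does not check that the computed addresses stay inside the cluster subnet; that would have to be verified by hand.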
(In fact, when I ran test_cluster I got the error below:

========================================================
> ./test_cluster
Enter the number of client nodes: 2
Enter the number of processors per client: 2
PBS TEST
--------------------------------------------------------------------------------
Running a simple PBS shell job...
PBS shell test failed - output file empty
Contents of shelltest.err:
/usr/local/pbs/bin/pbsdsh: tm_init failed, rc = TM_ESYSTEM (17000)
/usr/local/pbs/bin/pbsdsh: tm_init failed, rc = TM_ESYSTEM (17000)
Since simple PBS job failed, exiting
==============================================

I asked about it on the mailing list and was told that this is just a bug. I can still use qsub to run the mpich, lam, and pvm tests successfully, so I don't think this is a problem.)

OK, then I added one more client. It still passed the mpich and pvm tests using qsub, but the lam test reported an error. I also used mpirun to run some MPI programs directly, without PBS, and found that initialization took quite a long time and there was no speedup. (I now have four CPUs, so "mpirun -np 4 xx" should be faster than "mpirun -np 2 xx", as it is on my toy cluster.)

I also found that SSH is quite slow: connecting to one of the clients takes about 30 seconds. mpich is based on SSH, so I think this could be the problem. Our own parallel application, which is based on RSH, worked well at this point.

So I went on and added another 4 nodes. After that step finished, only the pvm test passed. The mpich test always timed out. My own MPI program ran for an unexpectedly long time and then reported some errors. And this time our RSH-based parallel application could not run with more than 2 processes.

I am quite confused now. What is the problem here? Mpich? SSH? The 2.4.18 kernel? Or something else? Could you help me find out what happened here?

Thanks very much!
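
To put a number on the slow SSH logins mentioned above (about 30 seconds per connection), a small timing script could be run from the master. This is a rough sketch, not a diagnosis: the function names time_command and ssh_probe_argv are made up for illustration, and it assumes the OpenSSH client is installed and that key-based login to the clients already works (BatchMode=yes makes ssh fail instead of prompting for a password).

```python
import subprocess
import sys
import time


def time_command(argv, timeout=120.0):
    """Run argv and return (elapsed_seconds, returncode).

    returncode is None if the command did not finish within timeout.
    """
    start = time.monotonic()
    try:
        result = subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
        returncode = result.returncode
    except subprocess.TimeoutExpired:
        returncode = None
    return time.monotonic() - start, returncode


def ssh_probe_argv(host):
    """Build an ssh command that logs in, runs 'true', and exits."""
    return ["ssh", "-o", "BatchMode=yes", host, "true"]


if __name__ == "__main__":
    # Hosts are taken from the command line; nothing runs without arguments.
    for host in sys.argv[1:]:
        elapsed, rc = time_command(ssh_probe_argv(host))
        print(f"{host}: {elapsed:.1f} s, exit status {rc}")
```

For example, saving this under a name such as ssh_timing.py (hypothetical) and running it with the arguments 192.168.1.101 192.168.1.102 would print one line per client, which makes it easy to compare login times before and after adding nodes.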
_______________________________________________
Oscar-users mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/oscar-users
