Just some thoughts... Since you are physically moving the machines,
things like loose cards, processors, heat sinks/fans, memory, and
cables come to mind. I've personally had loose heat sinks cause
processors to do funky things (software crashes, corruption, etc.).
I've also heard of disk heads hitting the platters while machines
were moved, which led to data loss. Have you tried running a full
file system check? I think most modern disks park the heads and lock
the armatures in place automatically, but the disks/RAID device might
have software to do this for you as well. Other problem sources might
include weird environmental ones, like excessive heat or magnetic
fields playing havoc with the hardware during the move.
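If you do run the file system check, something along these lines is
what I'd try; the device name and ReiserFS layout here are just
guesses on my part, so adjust for your actual setup:

    # Run from rescue media or single-user mode so the filesystem is
    # NOT mounted; checking a mounted filesystem can make things worse.
    umount /dev/sda2                      # example device, use yours

    # Read-only pass first: reports problems without attempting repairs
    reiserfsck --check /dev/sda2

    # Only if --check tells you to, run the repair passes:
    # reiserfsck --fix-fixable /dev/sda2
    # reiserfsck --rebuild-tree /dev/sda2   # last resort, back up first

    # While you're at it, ask the drives how they're feeling
    smartctl -H /dev/sda                  # needs smartmontools installed

That won't tell you anything about the RAID controller itself, but it
should at least confirm whether the filesystem took damage in transit.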
Good luck figuring it out.
Bart
Steve Herborn wrote:
I have a small test cluster built on Novell SUSE Linux Enterprise
Server 10.2 that is giving me fits. It seems that every time the
hardware is physically moved (I keep getting kicked out of the space
I'm using), I end up with any number of different problems.
Personally I suspect some type of hardware issue (this equipment is
about 5 years old), but one of my co-workers isn't so sure hardware is
in play. After one earlier move I had problems with the RAID
initializing, which I eventually resolved by reseating the RAID
controller card.
This time it appears that the file system and configuration databases
became corrupted after moving the equipment. Several services aren't
starting up (LDAP, DHCP, and PBS, to name a few), and YaST2 hangs any
time an attempt is made to use it, for example to add a printer or a
software package. My co-worker feels the issue may be related to using
the ReiserFS file system with AMD processors. ReiserFS was the default
presented when I initially installed SLES, so I went with it.
Do you know of any issues with using the ReiserFS file system on
AMD-based systems, or have any other ideas about what I may be facing?
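In case it helps, the next round of checks I'm planning looks roughly
like this (the init script names and log paths are just examples from
memory, not verified against this box):

    # Which services are supposed to come up, and which actually did
    chkconfig --list
    /etc/init.d/ldap status        # script names vary: ldap, dhcpd, pbs_server, ...
    /etc/init.d/dhcpd status

    # Start one failing service by hand and watch for the real error
    /etc/init.d/ldap start
    tail -f /var/log/messages

    # YaST2 keeps its own log, which may show where it hangs
    less /var/log/YaST2/y2log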
*Steven A. Herborn*
*U.S. Naval Academy*
*Advanced Research Computing*
*410-293-6480 (Desk)*
*757-418-0505 (Cell)*
------------------------------------------------------------------------
*From:* [email protected]
[mailto:[email protected]] *On Behalf Of *gossips J
*Sent:* Monday, March 09, 2009 5:08 AM
*To:* [email protected]
*Subject:* [Beowulf] HPCC "intel_mpi" error
Hi,
We are running Intel Cluster Ready (ICR) validation and are facing the
following problem when running the command below:
cluster-check --debug --include_only intel_mpi /root/sample.xml
The problem is: the output of Intel Cluster Checker shows that
"intel_mpi" FAILED, whereas the debug.out file shows that "Hello World"
is returned from all nodes.
We have a 16-node configuration and are running 8 processes per node.
The same behavior is observed with 1, 2, and 4 processes per node as
well. I also tried "rdma" and "rdssm" as the DEVICE in the XML file,
but no luck.
If anyone can shed some light on this issue, it would be a great help.
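For what it's worth, a by-hand cross-check outside the checker would
look roughly like this (hello.c and the hosts file here are
placeholders rather than our actual files, and the MPD commands assume
an Intel MPI version that still uses the MPD launcher):

    # Build and launch a trivial MPI job by hand to confirm the fabric works
    mpicc hello.c -o hello
    mpdboot -n 16 -f hosts          # one MPD per node
    mpiexec -n 128 ./hello          # 16 nodes x 8 processes per node
    mpdallexit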
Another thing I would like to know: is there a way to specify the
"-env RDMA_TRANSLATION_CACHE" option with Intel Cluster Checker?
Awaiting your kind response,
Thanks in advance,
Polk.
------------------------------------------------------------------------
_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf