Art Poon wrote:
Dear colleagues,

[...]

What's got me and the IT guys stumped is that while the compute nodes
boot via PXE from the head node without trouble on the NetGear, they
barf with the SMC.  To be specific, after the initial boot with a
minimal Linux kernel, there is a "fatal error" with "timeout waiting
for getfile" when the compute node attempts to download the
provisioning image from head.  However, when they were running Rocks
before I arrived, the cluster worked fine with the SMC switch.

Is it the switch or the dhcp/bootp/tftp setup that's the problem? Are you sure the tftp daemon is up and that bootp is configured correctly?
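One quick way to answer the "is the tftp daemon up" question from another machine on the same switch is to send a single TFTP read request and see whether anything comes back. The following is a minimal sketch, not a full TFTP client: it assumes the standard TFTP port 69 and the RFC 1350 packet layout, and the host name "head" and file name "pxelinux.0" in the usage comment are placeholders, not values taken from the cluster above.

```python
import socket
import struct

TFTP_PORT = 69                       # standard TFTP port
OP_RRQ, OP_DATA, OP_ERROR = 1, 3, 5  # opcodes from RFC 1350


def build_rrq(filename, mode="octet"):
    """Build a TFTP read request: opcode, filename, NUL, mode, NUL."""
    return (struct.pack("!H", OP_RRQ)
            + filename.encode("ascii") + b"\0"
            + mode.encode("ascii") + b"\0")


def describe_reply(packet):
    """Turn the first reply packet into a short human-readable string."""
    if len(packet) < 4:
        return "malformed reply"
    opcode, arg = struct.unpack("!HH", packet[:4])
    if opcode == OP_DATA:
        return f"DATA block {arg} ({len(packet) - 4} bytes)"
    if opcode == OP_ERROR:
        msg = packet[4:].split(b"\0", 1)[0].decode("ascii", "replace")
        return f"ERROR {arg}: {msg}"
    return f"unexpected opcode {opcode}"


def probe_tftp(host, filename, timeout=5.0, port=TFTP_PORT):
    """Send one read request and report the first reply, if any.

    The transfer is not acknowledged, so the server will retransmit
    a few times and then give up; that is harmless for a probe.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_rrq(filename), (host, port))
            packet, _addr = sock.recvfrom(1024)
        except socket.timeout:
            return f"no reply within {timeout} s"
        except OSError as exc:
            return f"socket error: {exc}"
    return describe_reply(packet)


# Usage (placeholder host and file name, adjust to the real setup):
#   print(probe_tftp("head", "pxelinux.0"))
```

A DATA or ERROR reply both show that the daemon is reachable through the switch; a timeout through the SMC but not through the NetGear would point back at the switch.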

Switches sometimes have broadcast storm suppression turned on or, worse, spanning tree turned on. You want the switch to be as dumb as you can possibly make it for most Linux clusters. Fast, but dumb.

I've tried resetting the SMC switch to factory defaults (with
auto-negotiate on).  I've checked the /etc/beowulf/modprobe.conf and
it doesn't seem to be demanding anything exotic.  We've tried
swapping out to another SMC switch but that didn't change anything.

This sounds more like a problem in the server software stack than in the switch. Could you describe your setup? Are you using Scyld/Rocks for that?

Rocks is quite sensitive to configuration issues, and really doesn't like altered configurations (altering them is possible, though non-trivial).

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: [email protected]
web  : http://scalableinformatics.com
       http://scalableinformatics.com/jackrabbit
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
_______________________________________________
Beowulf mailing list, [email protected] sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
