Hi all,

Thanks for your responses! I finally fixed this yesterday afternoon but neglected to update my post; my apologies. After discussing our problem with the Penguin Computing service rep, I reconfigured the switch to enable fast spanning-tree mode on the compute node ports. That apparently fixed the problem, and thanks to your feedback I am starting to understand why.
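For anyone who finds this thread later, here is a rough sketch of why the fast spanning-tree setting probably matters. It assumes the standard IEEE 802.1D default forward delay of 15 seconds, which a port spends once in the listening state and once in the learning state before it forwards traffic. It also assumes that the port sees a fresh link-up when the booted kernel re-initialises the NIC, which is a guess about this cluster and not something confirmed in the thread. The timeout value and the function names are hypothetical, chosen only for illustration.

```python
# Default IEEE 802.1D forward delay, in seconds. A port waits this long
# in "listening" and again in "learning" before it forwards frames.
FORWARD_DELAY = 15


def seconds_until_forwarding(fast_mode: bool, forward_delay: int = FORWARD_DELAY) -> int:
    """Seconds between link-up and the port forwarding traffic."""
    if fast_mode:
        # Fast (edge) ports skip listening and learning entirely.
        return 0
    return 2 * forward_delay


def download_survives(client_timeout: int, fast_mode: bool) -> bool:
    """True if the port is forwarding before the client gives up."""
    return seconds_until_forwarding(fast_mode) < client_timeout


# HYPOTHETICAL value: the real "getfile" timeout is not stated in this thread.
ASSUMED_GETFILE_TIMEOUT = 20

if __name__ == "__main__":
    for fast in (False, True):
        delay = seconds_until_forwarding(fast)
        ok = download_survives(ASSUMED_GETFILE_TIMEOUT, fast)
        print(f"fast_mode={fast}: port forwards after {delay}s, download ok={ok}")
```

Under these assumptions, a port in normal spanning-tree mode stays silent for 30 seconds after link-up, which is longer than any client timeout shorter than that; a port in fast mode forwards immediately.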
Thanks again,
- Art.

On Dec 2, 2009, at 10:30 AM, Joe Landman wrote:

> Art Poon wrote:
>> Dear colleagues,
>
> [...]
>
>> What's got me and the IT guys stumped is that while the compute nodes
>> boot via PXE from the head node without trouble on the NetGear, they
>> barf with the SMC. To be specific, after the initial boot with a
>> minimal Linux kernel, there is a "fatal error" with "timeout waiting
>> for getfile" when the compute node attempts to download the
>> provisioning image from head. However, when they were running Rocks
>> before I arrived, the cluster worked fine with the SMC switch.
>
> Is it the switch or the dhcp/bootp/tftp setup that's the problem? Are you
> sure the tftp daemon is up, or bootp is configured correctly?
>
> Switches sometimes have broadcast storm suppression turned on, or worse,
> sometimes they have spanning tree turned on. You want the switch to be as
> dumb as you can possibly make it for most linux clusters. Fast, but dumb.
>
>> I've tried resetting the SMC switch to factory defaults (with
>> auto-negotiate on). I've checked the /etc/beowulf/modprobe.conf and
>> it doesn't seem to be demanding anything exotic. We've tried
>> swapping out to another SMC switch but that didn't change anything.
>
> This sounds more on the server software stack than the switch. Could you
> describe this? Are you using Scyld/Rocks for that?
>
> Rocks is quite sensitive to configuration issues, and really doesn't like
> altered configurations (it is possible to do, though non-trivial).
>
> --
> Joseph Landman, Ph.D
> Founder and CEO
> Scalable Informatics Inc.
> email: [email protected]
> web  : http://scalableinformatics.com
>        http://scalableinformatics.com/jackrabbit
> phone: +1 734 786 8423 x121
> fax  : +1 866 888 3112
> cell : +1 734 612 4615

_______________________________________________
Beowulf mailing list, [email protected] sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
