A couple of small 10-node clusters we had set up used to routinely drop off the network, and the switch would have to be hard reset for them to return. Granted, we didn't do any deep analysis (we just replaced them with Cisco), and it could be attributed to some bad switches, but I've also seen this at home with some 1 GbE switches I bought.
Over the years I've been using Netgear enterprise and home products. They are wonderful in light use (80-85% of max throughput), but once you hit the 90%+ range they seem to start to degrade, either through packet loss or overheating. We still buy them for our management network; they're cheaper than HP and we just need them for kickstarts, SNMP, etc. As Joe said, it's just our opinion; your mileage may vary.

On Mon, Apr 5, 2010 at 3:40 PM, David Mathog <mat...@caltech.edu> wrote:
> Michael Di Domenico
>> I would have to agree. I have Netgears in my lab now and for light
>> use they seem to be okay, but once you run a communications-heavy MPI
>> job over them they seem to fall down
>
> Please define "fall down".
>
> One test I have applied to a switch (only 100baseT) to see if it could
> handle "full traffic" was running the script below on all nodes:
>
> #!/bin/bash
> TINFO=`topology_info`
> NEXT=`echo $TINFO | extract -mt -cols [3]`
> if [ $NEXT != "none" ]
> then
>     TIME=`accudate -t0`
>     dd if=/dev/zero bs=4096 count=1000000 | rsh $NEXT 'cat - >/dev/null'
>     accudate -ds $TIME >/tmp/elapsed_${HOSTNAME}.txt
> fi
>
> Here topology_info defines a linear chain through all nodes, and what
> ends up in the elapsed_HOSTNAME.txt files is the transmission time from this
> node to the next node. extract and accudate are mine; the former is like
> "cut" and the latter is just used here to calculate an elapsed time.
>
> This is slightly apples and oranges, because in the two-node (reference)
> test the target node is only accepting packets, whereas when all nodes are
> running it is also sending packets, and those compete with the ACKs
> going back to the first node. The D-Link switch held up quite well, I
> thought. One pair of nodes tested this way completed in 350 seconds
> (+/-), whereas it and the others took 370-380 seconds when they were all
> running at once (20 compute nodes, first only sends, last only
> receives). That is, 11.7 MB/sec for the pair, 10.8 MB/sec for all
> pairs. For GigE it should come out at 117 and 108 (or so), if the
> switch can keep up.
>
> I'm curious what the Netgears and HPs do in a test like this. If anybody
> would like to try it, all the pieces for this simple test (if you can
> run binaries for a 32-bit x86 environment) are here:
>
> http://saf.bio.caltech.edu/pub/software/linux_or_unix_tools/testswitch.tar.gz
>
> (For other platforms, obtain source for accudate and extract from here:
>
> http://saf.bio.caltech.edu/pub/software/linux_or_unix_tools/drm_tools.tar.gz )
>
> Start the jobs simultaneously on all nodes using whichever queue system
> you have installed. Be sure to run it once first with a small count
> number to force anything coming over NFS into cache before doing the big
> test. (Or one could run NetPIPE on each pair of nodes, or anything else
> really that loads the network.)
>
> Regards,
>
> David Mathog
> mat...@caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
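
For anyone wanting to sanity-check the throughput figures above, here is a minimal sketch of the arithmetic. It assumes the same dd parameters David used (bs=4096, count=1000000, i.e. about 4.1e9 bytes per node) and decimal megabytes; it is just a restatement of his numbers, not part of his test kit.

#!/bin/bash
# Back-of-the-envelope check of the per-pair throughput quoted above.
BYTES=$((4096 * 1000000))    # bytes pushed from each node to the next
for SECS in 350 380; do
    # bash has no floating point, so do the division in awk
    awk -v b="$BYTES" -v s="$SECS" 'BEGIN { printf "%d s -> %.1f MB/s\n", s, b/s/1e6 }'
done
# 350 s -> 11.7 MB/s (single reference pair), 380 s -> 10.8 MB/s (all pairs at once).
# A GigE switch that keeps up should scale these to roughly 117 and 108 MB/s.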