I've been trying to figure out why a Lustre filesystem on one of my
clusters performs so poorly. I have two clusters: a 20-node
dual-Opteron cluster with channel-bonded gigabit Ethernet, and a
10-node i386 cluster with 100 Mbit Ethernet. Both run the same version
of Lustre, 1.5.95, and were set up the same way: the frontend is the
MDT/MGS, each compute node is an OST, and the /lustre filesystem is
mounted at /lustre on every compute node and on the frontend.
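For reference, the setup on both clusters looked roughly like this
(the device and host names below are placeholders, not the exact ones
I used):

  # on the frontend: combined MGS/MDT
  mkfs.lustre --fsname=lustre --mgs --mdt /dev/sda2
  mount -t lustre /dev/sda2 /mnt/mdt

  # on each compute node: one OST pointing at the frontend
  mkfs.lustre --fsname=lustre --ost --mgsnode=frontend@tcp0 /dev/sdb1
  mount -t lustre /dev/sdb1 /mnt/ost

  # on every node and the frontend: mount the client
  mount -t lustre frontend@tcp0:/lustre /lustre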

On the Opteron cluster, I can start a copy of the NCBI databases into
/lustre and it takes forever, if it finishes at all. The load climbs
to 8 or 9 on the frontend and eventually I have to Ctrl-C the copy. I
have also tried rsync, with the same results. About two weeks ago I
tried to copy about 8 GB of data into /lustre and the Lustre
filesystem crashed completely. The copy timed out about four times,
and when I went back to check on it, I couldn't see anything in
/lustre. The logs say that a few nodes "died", but I can ssh to them
and they are fine.

If I do the same thing on the 10-node i386 cluster, I get exactly what
I expect: excellent speed and almost no load. I am quite happy with
it.

Why does Lustre perform so badly on the Opteron cluster but not on the
i386 cluster? My hunch is that the channel bonding is the culprit.

Has channel bonding been a problem with Lustre before? If so, are
there any bonding tweaks I can apply on the Opteron cluster? Is
anybody else using channel bonding with a Lustre filesystem?
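For what it's worth, the bonding setup on the Opteron nodes looks
roughly like this (I'm quoting from memory; the mode shown is the
Linux default, balance-rr, and may not be exactly what's configured):

  # /etc/modprobe.conf
  alias bond0 bonding
  options bond0 mode=balance-rr miimon=100

  # the live state of the bond can be checked with:
  cat /proc/net/bonding/bond0

I gather balance-rr can deliver TCP segments out of order across the
slaves, which is one reason I suspect the bonding rather than Lustre
itself.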

Thanks for any help or tips you can provide.


-- 
Jeremy Mann
[EMAIL PROTECTED]

University of Texas Health Science Center
Bioinformatics Core Facility
http://www.bioinformatics.uthscsa.edu
Phone: (210) 567-2672
