Hi Andreas,

Thanks for your informative reply. In general terms, what you've written confirms my suspicions as to the underlying factors limiting the filesystem's performance in my application. I've interspersed a few comments below.
Andreas Dilger wrote:
> Note that using single SCSI disks means you have no redundancy of your
> data. If any disk is lost, and you are striping your files over all
> of the OSTs (as it seems from below), then all of your files will also
> lose data. That might be fine if Lustre is just used as a scratch
> filesystem, but it might also not be what you are expecting.

The Lustre filesystem in this application is, in fact, a scratch filesystem. Once the files have been written, they are copied to an archive area. Although I might be interested in availability/reliability for this filesystem to some degree in the future, at present it's performance that I'm after.

> Writing small file chunks from many clients to a single file is definitely
> one way to have very bad IO performance with Lustre.
>
> Some ways to improve this:
> - have the application aggregate writes some amount before submitting
>   them to Lustre. Lustre by default enforces POSIX coherency semantics,
>   so it will result in lock ping-pong between client nodes if they are
>   all writing to the same file at one time.

That's a possibility, but it is limited to a degree by the instrument streaming the raw data into the cluster, and by the output file format. I'm already in discussion with others on the project about this approach.

> - have the application do 4kB O_DIRECT sized IOs to the file and disable
>   locking on the output file. That will avoid partial-page IO submissions,
>   and by disabling locking you will at least avoid the contention between
>   the clients.

I'll try this out. Luckily, no application-level locking is being done at this time.

> - I thought there was also an option to have clients do lockless/uncached
>   IO without changing the app, but I can't recall the details on how to
>   activate it. Possibly another of the Lustre engineers will recall.

I'd be interested in finding out how to do that.

> - add more disks, or use SSD disks for the OSTs. This will improve your
>   IOPS rate dramatically.
> It probably makes sense to create larger OSTs
> rather than many smaller OSTs due to less overhead (journal, connections,
> etc).

I have been wondering about the effect SSD disks might have. Unfortunately, for now, I need to show that it's worth my time to keep working on a Lustre solution.

> - using MPI-IO might also help

MPI-IO is already on my list of things to try.

Thanks again.
--
Martin
_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
