Hi Andreas,

Thanks for your informative reply. In general terms, what you've written 
confirms my suspicions as to the underlying factors limiting the 
filesystem's performance in my application. I've interspersed a few 
comments below.

Andreas Dilger wrote:
> Note that using single SCSI disks means you have no redundancy of your
> data.  If any disk is lost, and you are striping your files over all
> of the OSTs (as it seems from below) then all of your files will also
> lose data.  That might be fine if Lustre is just used as a scratch
> filesystem, but it might also not be what you are expecting.

The Lustre filesystem in this application is, in fact, a scratch 
filesystem. Once the files have been written, they are copied to an 
archive area. Although I might be interested in availability/reliability 
for this filesystem to some degree in the future, presently it's 
performance that I'm after.

> Writing small file chunks from many clients to a single file is definitely
> one way to have very bad IO performance with Lustre.
> 
> Some ways to improve this:
> - have the application aggregate writes some amount before submitting
>   them to Lustre.  Lustre by default enforces POSIX coherency semantics,
>   so it will result in lock ping-pong between client nodes if they are
>   all writing to the same file at one time

That's a possibility, though it's constrained to some degree by the 
instrument streaming the raw data into the cluster and by the output file 
format. I'm already in discussion with others on the project about this 
approach.
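
For what it's worth, the kind of aggregation I have in mind looks roughly 
like the sketch below. The 1 MiB flush threshold and the record handling 
are my own assumptions for illustration, not what the application does 
today.

/*
 * Minimal sketch of client-side write aggregation: small records are
 * accumulated in a local buffer and flushed to the shared file only in
 * large chunks, so far fewer (and larger) writes hit Lustre.  The 1 MiB
 * flush threshold is an assumption for illustration.
 */
#include <string.h>
#include <unistd.h>

#define FLUSH_THRESHOLD (1 << 20)   /* flush once ~1 MiB has accumulated */

struct agg_buf {
    char   *data;    /* FLUSH_THRESHOLD bytes, allocated elsewhere */
    size_t  used;
    int     fd;      /* descriptor for the shared output file */
};

/* Queue one small record (assumed len <= FLUSH_THRESHOLD); write the
 * whole buffer out when it would overflow. */
static int agg_write(struct agg_buf *b, const void *rec, size_t len)
{
    if (b->used + len > FLUSH_THRESHOLD) {
        if (write(b->fd, b->data, b->used) != (ssize_t)b->used)
            return -1;
        b->used = 0;
    }
    memcpy(b->data + b->used, rec, len);
    b->used += len;
    return 0;
}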

> - have the application do 4kB O_DIRECT sized IOs to the file and disable
>   locking on the output file.  That will avoid partial-page IO submissions,
>   and by disabling locking you will at least avoid the contention between
>   the clients.

I'll try this out. Luckily, no application-level locking is being done 
at this time.
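
In case it helps anyone else reading, here's a minimal sketch of what I 
understand by 4 kB-aligned O_DIRECT writes; the path and the zero-filled 
buffer are just placeholders.

/*
 * Minimal sketch of 4 kB-aligned O_DIRECT writes.  O_DIRECT requires the
 * buffer address, file offset, and transfer size to be suitably aligned,
 * so the buffer comes from posix_memalign and every write is exactly one
 * 4 kB block.
 */
#define _GNU_SOURCE             /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK 4096

int main(void)
{
    void *buf;
    int fd = open("/mnt/lustre/scratch/output.dat",
                  O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0)
        return 1;

    if (posix_memalign(&buf, CHUNK, CHUNK) != 0)
        return 1;
    memset(buf, 0, CHUNK);      /* stand-in for real instrument data */

    /* One aligned 4 kB block per write call. */
    if (write(fd, buf, CHUNK) != CHUNK)
        return 1;

    free(buf);
    close(fd);
    return 0;
}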

> - I thought there was also an option to have clients do lockless/uncached
>   IO without changing the app, but I can't recall the details on how to
>   activate it.  Possibly another of the Lustre engineers will recall.

I'd be interested in finding out how to do that.

> - add more disks, or use SSD disks for the OSTs.  This will improve your
>   IOPS rate dramatically.  It probably makes sense to create larger OSTs
>   rather than many smaller OSTs due to less overhead (journal, connections,
>   etc).

I have been wondering about the effect SSD disks might have. 
Unfortunately, for now, I need to show that it's worth my time to keep 
working on a Lustre solution.

> - using MPI-IO might also help

MPI-IO is already on my list of things to try.
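
If it's useful as a point of reference, this is roughly the shape of the 
collective-write test I'm planning; the path and per-rank chunk size are 
placeholders.

/*
 * Minimal MPI-IO sketch: each rank writes its own region of a shared file
 * with a collective call, so the MPI library can aggregate the per-rank
 * pieces into large, well-aligned requests.
 */
#include <mpi.h>
#include <string.h>

#define CHUNK (1 << 20)         /* 1 MiB per rank per write */

static char buf[CHUNK];

int main(int argc, char **argv)
{
    int rank;
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    memset(buf, 0, CHUNK);      /* stand-in for real data */

    MPI_File_open(MPI_COMM_WORLD, "/mnt/lustre/scratch/output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Collective write: rank i fills the i-th CHUNK of the file. */
    MPI_File_write_at_all(fh, (MPI_Offset)rank * CHUNK, buf, CHUNK,
                          MPI_BYTE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}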

Thanks again.

-- 
Martin
_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
