Thanks Stephen, we are looking at the possible peak performance of a single OSS with an IB outlet. I understand that at the client level the tradeoffs may be visible, and the 750 MB/sec aggregate that you observe is not bad at all. But we want to ensure that our OSS is able to unleash around 1 GB/sec into the clients' IB network...
The performance of a single OSS depends on the performance of the local ext3 backend file system, and we were unable to push it over 750 MB/sec. The advice of Andreas from Clusterfs is to use 3 OSTs inside one OSS and stripe files over all three of them. Some time ago we considered and discarded this solution, as we wanted to ensure that every file is confined to one and only one OSS capable of delivering 0.9-1 GB/sec. Setting the filesystem default stripe count to 3 may lead to a situation where a file ends up on different OSS machines, and that is exactly what we want to avoid. (I have asked Andreas to comment on the configuration; if it were possible to migrate to striping over 3 OSTs per OSS and still guarantee the OSS confinement, we would certainly follow the 3-OST solution.)

Greetings - Andrei.
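For concreteness, the kind of per-file layout this would require looks roughly as follows. This is a sketch only: it assumes the positional lfs setstripe syntax of 1.4/1.6-era Lustre, that the three OSTs of the target OSS have consecutive indices 0-2, and that the allocator honours the requested start index; the file name and mount point are invented:

    # Stripe one file over 3 OSTs with 1 MB stripes, starting at OST index 0.
    # Positional form: lfs setstripe <file> <stripe_size_bytes> <start_ost_index> <stripe_count>
    lfs setstripe /mnt/lustre/bigfile 1048576 0 3

    # Check which OSTs the file actually landed on:
    lfs getstripe /mnt/lustre/bigfile

The confinement holds only if indices 0-2 all belong to the same OSS; nothing in the layout itself enforces that, which is exactly the concern above.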
On 4/13/07, Stephen Simms <[EMAIL PROTECTED]> wrote:

Hi Andrei-

750 MB/s or so is about the max that we have seen from a single client to multiple OSSs across TCP. However, we discovered that you can use both front side buses if you perform two simultaneous writes (turning ksocklnd's IRQ affinity off on the server side). This got us over 1 GB/s aggregate writes with multiple OSSs on the back end. Reads have been lower - roughly 400 MB/s and 600 MB/s for the single- and dual-stream cases respectively.

These numbers were using Myri-10G cards in Ethernet mode with DDN 9550 controllers on the back end. So I believe that front side bus speed and internal memory copies have prevented us from better single-file performance (reads are worse than writes because you can't use zero-copy for reads). My suspicion is that this is the case for you as well.

Our network performance (measured with netperf) has been 9.1 Gb/s or better using the Myricom cards in Ethernet mode, so we know that is not the limiting factor. Likewise, we see better than 350 MB/s per port on the DDN side (using sgpdd), so that's not the limiting factor either.

I hope this helps,
simms

On Fri, 13 Apr 2007, Andrei Maslennikov wrote:

> We are currently evaluating possible commodity hardware candidates
> suitable for a single OSS with a single OST served to the clients via
> IB/RDMA. The goal is to provide peak performance of around 1 GB/sec
> for large streaming I/O on a single file at the client level, *without*
> striping. In other words, we want to see if we could build a
> high-performance standalone box acting as a Lustre head for a couple
> of clients (obviously, we will also have to run the metadata service
> on it).
>
> Economically, the most attractive scenario is to use a "storage-in-a-box"
> element, as it allows us to save on FC/SCSI cards and external disk
> enclosures. One such candidate box that we tried had three RAID-6
> controllers, with 8 disk modules per controller. The machine is an
> Intel dual-core 3 GHz, with 8 GB of RAM. We are able to get aggregate
> disk write performance of 300+, 600+, and 900+ MB/sec when running
> 1, 2, or 3 processes against 1, 2, or 3 distinct logical drives.
>
> Now comes the interesting point: if we run a single write process
> against a striped logical volume built upon the three available drives,
> we are only able to obtain 750 MB/sec. The writer process eats 100% of
> CPU, and there is no way to improve this. This behaviour, of course, is
> perfectly normal, but for us it means that if we based our OST on this
> combination of CPU + striped volume, we will probably never be able to
> spit out more than 750 MB/sec of peak I/O to the clients. Unless the
> OST backend service itself is multithreaded!
>
> As we do not have a running Lustre/IB environment at the moment to
> check this, I would appreciate it if someone could comment on how OST
> processes are organized internally. If only one thread is doing I/O
> towards the backend ext3 partition, we won't be able to go over
> 750 MB/sec on such a machine. Otherwise, we could probably grow up to
> 900 MB/sec.
>
> Thanks ahead for any comment - Andrei.
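On the threading question at the end of the quoted mail: OST backend I/O is serviced by a pool of kernel threads rather than a single writer, so a single-process 750 MB/sec limit on the striped volume need not be a hard ceiling for the OSS itself. On 1.4/1.6-era Lustre both the OST thread count and the ksocklnd IRQ affinity that Stephen mentions are module parameters. A minimal sketch of the relevant /etc/modprobe.conf lines follows; the parameter names assume a 1.6-era build (verify against your modules with modinfo), and 256 is an illustrative value, not a tuned one:

    # Disable ksocklnd's IRQ affinity so that simultaneous streams can
    # spread across both front side buses (per Stephen's note above):
    options ksocklnd enable_irq_affinity=0

    # Run more OST I/O service threads, so that several writers can keep
    # the striped ext3 backend busy at once:
    options ost ost_num_threads=256

Since the single writer in the test above saturates one core of the dual-core CPU, multiple service threads could in principle put the second core to work, which is where the hoped-for 900 MB/sec would have to come from.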
