On Jul 23, 2009, at 12:59 PM, Milo wrote:

Hi, guys. I'm getting surprisingly poor performance on my PVFS2 cluster. Here's the setup:

*) 13 nodes running the PVFS2 2.8.1 server on Linux kernel 2.6.28-13, each with a 15-drive RAID-5 array.

*) The RAID-5 array gets 60 MB/s local write speed with XFS according to iozone (writing in 4 MB records).

Doing a component test of the device with the local file system is a good idea, but you should see what you get with 1 MB records, since that's what PVFS will be doing (it uses a 1 MB flow buffer size). I wouldn't expect a difference, but just for consistency... Also, iozone doesn't normally include syncs in the timing, so you may be seeing caching effects in that number. You would see them with PVFS too since you disabled TroveSyncData, but if the I/O is larger in the PVFS case you'll start to see actual disk I/O once you run out of memory. How big are your iozone runs?
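For comparison, an iozone run along these lines takes caching out of the picture (the path and sizes are just examples; the file size should exceed the server's RAM):

        # write-only test, 1 MB records, 8 GB file
        # -e includes flush/fsync time so the page cache doesn't inflate the number
        iozone -i 0 -e -r 1m -s 8g -f /mnt/xfs/iozone.tmp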

I actually tend to use sgdd instead of iozone for a local file system test like this, because it lets me specify exactly when/if to sync, whether to use O_DIRECT, the block size, etc. That tells me which mode works best for my server/hardware setup, so that I can choose the right parameters for PVFS.
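If sgdd isn't handy, plain GNU dd gets at roughly the same information (mount point and sizes are just placeholders):

        # direct I/O write, 1 MB blocks, bypasses the page cache entirely
        dd if=/dev/zero of=/mnt/xfs/ddtest bs=1M count=4096 oflag=direct
        # or buffered, but with the final fsync folded into the elapsed time
        dd if=/dev/zero of=/mnt/xfs/ddtest bs=1M count=4096 conv=fsync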

Also, a component test of the network with iperf or netpipe or something similar would show you what to expect there. PVFS includes a BMI pingpong test in the test suite that shows point-to-point performance too.
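For example, with iperf (host name is just an example):

        # on the server node
        iperf -s
        # on the client node; GigE should come in somewhere near 940 Mbits/sec
        iperf -c ss239 -t 30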


I'd like to get at least 50 MB/s per server out of the cluster. I've been testing with a single PVFS2 server and client, with the client running either on the same node or on a node on the same switch (it doesn't seem to make much difference). The server is configured with Trove syncing off, a simple_stripe distribution with a 4 MB strip size, and a 1 MB FlowBufferSizeBytes. Results have been as follows:
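For reference, the relevant parts of my fs.conf look roughly like this (names are elided and option placement may vary with the pvfs2-genconfig version):

        <FileSystem>
            Name pvfs2-fs
            FlowBufferSizeBytes 1048576
            <StorageHints>
                TroveSyncData no
                TroveMethod alt-aio
            </StorageHints>
            <Distribution>
                Name simple_stripe
                Param strip_size
                Value 4194304
            </Distribution>
        </FileSystem>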

With TroveMethod alt-aio or default, I'm getting around 15 MB/s when transferring a 3 GB file through pvfs2-cp:

        r...@ss239:~# pvfs2-cp -t ./file.3g /mnt/pvfs2/out
        Wrote 2867200000 bytes in 192.811478 seconds. 14.181599 MB/seconds

dd'ing a similar file through pvfs2fuse gets about a third of that performance, 5 MB/s:

        r...@ss239:~# dd if=/dev/zero of=/mnt/pvfs2fuse/out bs=1024K count=1024
        1024+0 records in
        1024+0 records out
        1073741824 bytes (1.1 GB) copied, 206.964 s, 5.2 MB/s

I get similar results using iozone writing through the fuse client.

If I switch the method to null-aio, things speed up a lot, but it's still suspiciously slow:

        r...@ss239:~# pvfs2-cp -t ./file.out /mnt/pvfs2fuse/out7-nullaio
        Wrote 2867200000 bytes in 60.815127 seconds. 44.962086 MB/seconds

        r...@ss239:~# dd if=/dev/zero of=/mnt/pvfs2fuse/out-nullaio bs=1024K count=1024
        1024+0 records in
        1024+0 records out
        1073741824 bytes (1.1 GB) copied, 21.0201 s, 51.1 MB/s

You're only going from one client, so you're limited to single-link bandwidth, which I'd guess isn't going to be more than ~120 MB/s (GigE, right?). So you're getting a little over a third of that, which still isn't good. This isn't incast, because you're doing writes, but one thing you can try is writing to fewer servers. You can specify a stripe width on a per-file basis, so you can compare performance with 2, 4, and 8 servers. You can use setfattr:
setfattr -n "user.pvfs2.num_dfiles" -v "8" dir
Or just use pvfs2-xattr if you don't have the kernel module.
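For example, to sweep a few widths (directory paths are just examples; set the attribute on a directory before creating files in it):

        # through the kernel module mount
        setfattr -n "user.pvfs2.num_dfiles" -v "2" /mnt/pvfs2/w2
        setfattr -n "user.pvfs2.num_dfiles" -v "4" /mnt/pvfs2/w4
        setfattr -n "user.pvfs2.num_dfiles" -v "8" /mnt/pvfs2/w8
        # then time the same copy into each directory
        pvfs2-cp -t ./file.3g /mnt/pvfs2/w4/out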
-sam


I suspect there's a network bottleneck somewhere. I'm going to try adjusting the MTU as Jim just did. But are there any other configuration options I should look into?
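For the MTU I was planning something like this on each node (assuming the NICs and switch handle jumbo frames; the interface name is a guess):

        # check the current MTU
        ip link show eth0
        # bump to jumbo frames on every node and on the switch ports
        ip link set dev eth0 mtu 9000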

Thanks.

~Milo
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
