Ok, I ran NetPIPE on a node and found that the PVFS2 traffic travelled over IPoIB, whereas writing to the underlying filesystem (ext4) travelled over native InfiniBand. Hence, the performance with PVFS2 looked bad. Secondly, we are using OpenMPI, and it was set up such that it tried to use both the Ethernet and InfiniBand interfaces on the machine when writing to PVFS2, which slowed things down even more. When I configured it to use IPoIB only, the performance was as expected.

Thank you Kyle, and everyone else for your help.

- Kshitij
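For reference, a rough sketch of the two pieces involved. NPtcp is NetPIPE's TCP test; the 192.168.2.x addresses are the server addresses from the fs config quoted further down, ib0 is an assumed IPoIB interface name, and the ior invocation and Open MPI MCA parameter spellings are illustrative and should be checked against the versions in use:

    # Measure the raw IPoIB path: run a NetPIPE receiver on one server,
    # then point the transmitter at that server's IPoIB address.
    NPtcp                          # on crillio-02
    NPtcp -h 192.168.2.95          # on crillio-01

    # Keep Open MPI's TCP transport off the Ethernet link by restricting
    # it to the IPoIB interface when running the IOR test (2 tasks x 4 GB
    # blocks = 8 GB total, 32 MB transfers, MPI-IO API).
    mpirun -np 2 --mca btl tcp,self --mca btl_tcp_if_include ib0 \
        ./ior -a MPIIO -t 32m -b 4g -o /pvfs2-ssd/ior.out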
-----Original Message-----
From: Kyle Schochenmaier [mailto:[email protected]]
Sent: Monday, December 19, 2011 10:03 PM
To: Kshitij Mehta
Cc: Troy Benjegerdes
Subject: Re: [Pvfs2-users] Tracing pvfs2 internals

Hi Kshitij -

CC'ing a colleague of mine who worked on the same machines with me. A couple of things here.

In the past (4 years ago or more) the IPoIB tests that I ran used to underperform quite a bit; there's something about the way that you have to get Ethernet packets moving fast enough to actually use up the IB bandwidth that I could never quite figure out with IPoIB and pvfs2. This may have changed, but it's at least worth noting.

That being said, I did a significant amount of work on native IB with pvfs2 and it works quite well; I was able to saturate links doing raw IO to disk on many occasions. So this might be something that you can try if you're able to set up native IB for pvfs2 (it's an option, if you have the drivers for it already).

How is the actual bandwidth of your IPoIB link? Have you run something like NetPIPE to test the bandwidth of the IPoIB link? If so, what are the results (can you send them along?)

Cheers,
Kyle Schochenmaier

On Mon, Dec 19, 2011 at 6:47 PM, Kshitij Mehta <[email protected]> wrote:
> I have IPoIB, with a theoretical peak of 1 GByte/s.
>
> - Kshitij
>
> On Dec 19, 2011, at 5:05 PM, Kyle Schochenmaier <[email protected]> wrote:
>
>> Hi Kshitij -
>>
>> What type of network are you using? GigE or 10GigE?
>>
>> ~Kyle
>> Kyle Schochenmaier
>>
>> On Mon, Dec 19, 2011 at 4:52 PM, Kshitij Mehta <[email protected]> wrote:
>>> --------------------------------------------------------------------
>>> <Defaults>
>>>     UnexpectedRequests 50
>>>     EventLogging none
>>>     EnableTracing no
>>>     LogStamp datetime
>>>     BMIModules bmi_tcp
>>>     FlowModules flowproto_multiqueue
>>>     PerfUpdateInterval 1000
>>>     ServerJobBMITimeoutSecs 30
>>>     ServerJobFlowTimeoutSecs 30
>>>     ClientJobBMITimeoutSecs 300
>>>     ClientJobFlowTimeoutSecs 300
>>>     ClientRetryLimit 5
>>>     ClientRetryDelayMilliSecs 2000
>>>     PrecreateBatchSize 512
>>>     PrecreateLowThreshold 256
>>>
>>>     StorageSpace /ramsanLU
>>>     LogFile /tmp/pvfs2-server.log
>>> </Defaults>
>>>
>>> <Aliases>
>>>     Alias crillio-01 tcp://192.168.2.94:3334
>>>     Alias crillio-02 tcp://192.168.2.95:3334
>>> </Aliases>
>>>
>>> <Filesystem>
>>>     Name pvfs2-ssd-fs
>>>     ID 1330429891
>>>     RootHandle 1048576
>>>     FileStuffing yes
>>>     <MetaHandleRanges>
>>>         Range crillio-01 3-2305843009213693953
>>>         Range crillio-02 2305843009213693954-4611686018427387904
>>>     </MetaHandleRanges>
>>>     <DataHandleRanges>
>>>         Range crillio-01 4611686018427387905-6917529027641081855
>>>         Range crillio-02 6917529027641081856-9223372036854775806
>>>     </DataHandleRanges>
>>>     <StorageHints>
>>>         TroveSyncMeta yes
>>>         TroveSyncData no
>>>         TroveMethod alt-aio
>>>     </StorageHints>
>>> </Filesystem>
>>> --------------------------------------------------------------------
>>>
>>> -----Original Message-----
>>> From: Kyle Schochenmaier [mailto:[email protected]]
>>> Sent: Monday, December 19, 2011 4:28 PM
>>> To: Kshitij Mehta
>>> Subject: Re: [Pvfs2-users] Tracing pvfs2 internals
>>>
>>> Can you send me your fs config file offline?
>>>
>>> Kyle Schochenmaier
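As an aside on the native-IB option Kyle mentions above: relative to the config just posted, the change would roughly be to build PVFS2 with its InfiniBand BMI method and advertise ib:// endpoints instead of tcp://. The configure flag, module name, and address form below are recalled from the PVFS2 documentation rather than taken from this thread, so treat them as a sketch to verify against the 2.8.x docs:

    ./configure --with-openib=/usr ...      (build servers and clients with IB support)

    <Defaults>
        BMIModules bmi_ib
        ...
    </Defaults>
    <Aliases>
        Alias crillio-01 ib://192.168.2.94:3334
        Alias crillio-02 ib://192.168.2.95:3334
    </Aliases>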
>>> On Mon, Dec 19, 2011 at 4:24 PM, Kshitij Mehta <[email protected]> wrote:
>>>> I set the stripe size to the default 64k only when I pvfs2-cp to individual servers. Otherwise, it was set to 4MB. Confirming through this:
>>>>
>>>> time /opt/pvfs-2.8.2/bin/pvfs2-cp -b 4194304 ior /pvfs2-ssd/iori.out
>>>>
>>>> real    0m6.787s
>>>> user    0m0.216s
>>>> sys     0m3.006s
>>>>
>>>> ls -lh /pvfs2-ssd/iori.out
>>>> -rw-r--r-- 1 kmehta users 1.0G 2011-12-19 16:21 /pvfs2-ssd/iori.out
>>>>
>>>> Again, I see ~150 MB/s.
>>>>
>>>> - Kshitij
>>>>
>>>> From: Kyle Schochenmaier [mailto:[email protected]]
>>>> Sent: Monday, December 19, 2011 4:18 PM
>>>> To: Kshitij Mehta
>>>> Cc: Michael Moore; [email protected]
>>>> Subject: RE: [Pvfs2-users] Tracing pvfs2 internals
>>>>
>>>> Hi Kshitij -
>>>>
>>>> I still see a 64k strip size (via pvfs2-viewdist), so I'm assuming that was the size the fs was using over the network for its transfers when you did the test below. I think you might see better performance with:
>>>>
>>>> `pvfs2-cp -b 262144 /tmp/ior.out /pvfs2-ssd/ss_mb/ior.out`
>>>>
>>>> This will set the blocksize to 256k instead of 64k.
>>>>
>>>> Does this make a difference? If so, this may just be a performance tweaking exercise. If not, something might be wrong with your network links?
>>>>
>>>> ~Kyle
>>>>
>>>> On Dec 19, 2011 3:48 PM, "Kshitij Mehta" <[email protected]> wrote:
>>>>
>>>> Apologies for my late reply.
>>>>
>>>> Regarding Kyle's suggestion, 42 MB/s certainly seems to be the local hard drive speed; I have 2G RAM on my machine.
>>>>
>>>> I performed a pvfs2-cp on a 1G file from the local hard drive to /pvfs2 (I did this on an I/O server), and it took 6.2 seconds (165 MB/s).
>>>>
>>>> time /opt/pvfs-2.8.2/bin/pvfs2-cp /tmp/ior.out /pvfs2-ssd/ss_4mb/ior.out
>>>>
>>>> real    0m6.230s
>>>> user    0m0.162s
>>>> sys     0m3.968s
>>>>
>>>> Also, I did what Michael suggested, where I create a directory that uses a single datafile, and pvfs2-cp files to this directory. Again, I see similar performance (6.x seconds) to copy 1G files.
>>>>
>>>> So whether I pvfs2-cp files to the file system or to individual servers, I see similar performance. I believe when I cp files to pvfs2, the performance should be nearly double the performance seen when I cp to individual servers. Correct me if I am wrong.
>>>>
>>>> (In the snippet below, I cp two files, ior.out and ior2.out, to the directory that uses a single datafile. I then verify that they are using separate I/O servers.)
>>>> $> time /opt/pvfs-2.8.2/bin/pvfs2-cp /tmp/ior.out /pvfs2-ssd/kmehta/1_dir/ior.out
>>>>
>>>> real    0m6.433s
>>>> user    0m0.101s
>>>> sys     0m3.341s
>>>>
>>>> $> time /opt/pvfs-2.8.2/bin/pvfs2-cp /tmp/ior.out /pvfs2-ssd/kmehta/1_dir/ior2.out
>>>>
>>>> real    0m6.349s
>>>> user    0m0.105s
>>>> sys     0m2.618s
>>>>
>>>> $> /opt/pvfs-2.8.2/bin/pvfs2-viewdist -f /pvfs2-ssd/kmehta/1_dir/ior.out
>>>> dist_name = simple_stripe
>>>> dist_params:
>>>> strip_size:65536
>>>>
>>>> Metadataserver: tcp://192.168.2.95:3334
>>>> Number of datafiles/servers = 1
>>>> Datafile 0 - tcp://192.168.2.95:3334, handle: 9223372036854774228 (7ffffffffffff9d4.bstream)
>>>>
>>>> $> /opt/pvfs-2.8.2/bin/pvfs2-viewdist -f /pvfs2-ssd/kmehta/1_dir/ior2.out
>>>> dist_name = simple_stripe
>>>> dist_params:
>>>> strip_size:65536
>>>>
>>>> Metadataserver: tcp://192.168.2.94:3334
>>>> Number of datafiles/servers = 1
>>>> Datafile 0 - tcp://192.168.2.94:3334, handle: 6917529027641068981 (5fffffffffffcdb5.bstream)
>>>>
>>>> - Kshitij
>>>>
>>>> From: [email protected] [mailto:[email protected]] On Behalf Of Michael Moore
>>>> Sent: Wednesday, December 14, 2011 2:16 PM
>>>> To: Kyle Schochenmaier
>>>> Cc: Kshitij Mehta; [email protected]
>>>> Subject: Re: [Pvfs2-users] Tracing pvfs2 internals
>>>>
>>>> Another diagnostic step would be to create a directory that uses a single datafile (e.g. setfattr -n user.pvfs2.num_dfiles -v "1" /mnt/1_dir), touch two files in that directory, and confirm that each one uses a different server (e.g. pvfs2-viewdist -f /mnt/1_dir/1.out). Then perform the same pvfs2-cp test on each file and see if there is a difference in performance.
>>>>
>>>> Michael
>>>>
>>>> On Wed, Dec 14, 2011 at 3:03 PM, Kyle Schochenmaier <[email protected]> wrote:
>>>>
>>>> Hi Kshitij -
>>>>
>>>> That looks extremely low. Do you actually have 27GB of RAM? Because that looks like the speed of a local hard drive. Can you try it with a 1GB file instead?
>>>>
>>>> ~Kyle
>>>> Kyle Schochenmaier
>>>>
>>>> On Wed, Dec 14, 2011 at 1:56 PM, Kshitij Mehta <[email protected]> wrote:
>>>>
>>>> Ok, these are the results of performing a pvfs2-cp on a 27G file from /tmp to a directory on /pvfs2-ssd with a stripe size of 4MB. I see a bandwidth of ~42 MB/s. Is this expected?
>>>>
>>>> $> time /opt/pvfs-2.8.2/bin/pvfs2-cp /tmp/ior.out.00000000 /pvfs2-ssd/ss_4mb/ior.out
>>>>
>>>> real    10m55.393s
>>>> user    0m3.075s
>>>> sys     2m6.047s
>>>>
>>>> $> ls -lh /pvfs2-ssd/ss_4mb/ior.out
>>>> -rw-r--r-- 1 root root 27G 2011-12-14 13:37 /pvfs2-ssd/ss_4mb/ior.out
>>>>
>>>> - Kshitij
>>>>
>>>> From: Kyle Schochenmaier [mailto:[email protected]]
>>>> Sent: Wednesday, December 14, 2011 1:19 PM
>>>> To: Kshitij Mehta
>>>> Cc: Michael Moore; [email protected]
>>>> Subject: Re: [Pvfs2-users] Tracing pvfs2 internals
>>>>
>>>> Hi Kshitij -
>>>>
>>>> What kind of performance do you get with pvfs2-cp? If you set the block size for a pvfs2-cp of some large file (1GB+) from /tmp/ on your client to the pvfs2-fs to 1MB or more, do you get decent performance? -- we should be testing the performance of in-memory pvfs2 at this point.
>>>>
>>>> Kyle Schochenmaier
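A minimal sketch of the test Kyle is describing here, using the same pvfs2-cp -b flag that appears elsewhere in this thread; the file name and the 4 MB buffer size are illustrative:

    # Stage a 1 GB file in /tmp so the source read comes (mostly) from the
    # page cache, then copy it into PVFS2 with a large client-side buffer
    # and time it.
    dd if=/dev/zero of=/tmp/ior.test bs=1M count=1024
    time /opt/pvfs-2.8.2/bin/pvfs2-cp -b 4194304 /tmp/ior.test /pvfs2-ssd/ior.test

At the ~150-165 MB/s reported elsewhere in the thread, copying 1 GB should take roughly 6-7 seconds.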
>>>> On Wed, Dec 14, 2011 at 1:09 PM, Kshitij Mehta <[email protected]> wrote:
>>>>
>>>> 1) what interface are you using with IOR, MPIIO or POSIX?
>>>>
>>>> MPIIO
>>>>
>>>> 2) what protocol are you using (tcp, ib), and what is the link speed?
>>>>
>>>> IB SDR, with a theoretical peak of 1 GB/s
>>>>
>>>> 3) is the PVFS2 file system you're comparing to ext4 just the single host, or is it both hosts attached to the SSD?
>>>>
>>>> Both hosts.
>>>>
>>>> 4) With 32MB transfer size (from IOR, right?) does that match the stripe size you're using in the PVFS2 file system?
>>>>
>>>> Yes, we ran the test from IOR. The stripe size on PVFS2 was set to 1 MB. I am seeing similar results when using varying transfer sizes from 1MB through 1GB, doubling the transfer size in every run.
>>>>
>>>> 5) are you using directio or alt-aio?
>>>>
>>>> Alt-aio
>>>>
>>>> Thanks,
>>>> Kshitij
>>>>
>>>> From: Michael Moore [mailto:[email protected]]
>>>> Sent: Wednesday, December 14, 2011 5:21 AM
>>>> To: Kshitij Mehta
>>>> Cc: Kyle Schochenmaier; [email protected]
>>>> Subject: Re: [Pvfs2-users] Tracing pvfs2 internals
>>>>
>>>> Hi Kshitij,
>>>>
>>>> A couple of other questions and things to look at:
>>>>
>>>> 1) what interface are you using with IOR, MPIIO or POSIX?
>>>> 2) what protocol are you using (tcp, ib), and what is the link speed?
>>>> 3) is the PVFS2 file system you're comparing to ext4 just the single host, or is it both hosts attached to the SSD?
>>>> 4) With 32MB transfer size (from IOR, right?) does that match the stripe size you're using in the PVFS2 file system?
>>>> 5) are you using directio or alt-aio?
>>>>
>>>> Beyond that, if you could watch top for something CPU bound or swapping during testing, that may show what's going on. Also, if you could watch iostat to see what's happening with the disks while running the test on PVFS2.
>>>>
>>>> Michael
>>>>
>>>> On Wed, Dec 14, 2011 at 2:43 AM, Kshitij Mehta <[email protected]> wrote:
>>>>
>>>> I am using a transfer size of 32 MB, which should have shown much better performance (my apologies for not mentioning this before). The total file size being written is 8GB.
>>>>
>>>> - Kshitij
>>>>
>>>> On Dec 14, 2011, at 1:34 AM, Kyle Schochenmaier <[email protected]> wrote:
>>>>
>>>> Hi Kshitij -
>>>>
>>>> This is the expected behaviour; PVFS2 is not highly optimized for small writes/reads, which is what IOR is typically performing. So you will always see degraded performance here compared to the underlying filesystem's base performance.
>>>>
>>>> There are ways to tune to help optimize for this type of access.
>>>>
>>>> If you set your IOR block accesses to something larger, such as 64K instead of the default (4K?), I think you would see performance which is closer.
>>>>
>>>> This used to be pretty well documented in the FAQ documents for PVFS; I'm not sure where the links are now.
>>>>
>>>> Cheers,
>>>> Kyle Schochenmaier
>>>>
>>>> On Wed, Dec 14, 2011 at 1:09 AM, Kshitij Mehta <[email protected]> wrote:
>>>>
>>>> Well, here's why I wanted to trace in the first place.
>>>> I have a test configuration where we have configured PVFS2 over an SSD storage. There are two I/O servers that talk to the SSD storage through InfiniBand (there are 2 IB channels going into the SSD, and each storage server can 'see' one half of the SSD).
>>>>
>>>> Now I used the IOR benchmark to test the write bandwidth. I first spawn a process on the I/O server such that it writes data to the underlying ext4 file system on the SSD instead of PVFS2. I see a bandwidth of ~350 MB/s. Now I spawn a process on the same I/O server and write data to the PVFS2 file system configured over the SSD, and I see a write bandwidth of ~180 MB/s.
>>>>
>>>> This seems to represent some kind of overhead with PVFS2, but it seems too large. Has anybody else seen similar results? Is the overhead of pvfs2 documented?
>>>>
>>>> Do let me know if something is not clear or if you have additional questions about the above setup.
>>>>
>>>> Here are some other details:
>>>> I/O servers: dual core with 2G main memory each.
>>>> PVFS 2.8.2
>>>>
>>>> Thanks,
>>>> Kshitij
>>>>
>>>> -----Original Message-----
>>>> From: Julian Kunkel [mailto:[email protected]]
>>>> Sent: Tuesday, December 13, 2011 3:10 AM
>>>> To: Kshitij Mehta
>>>> Cc: [email protected]
>>>> Subject: Re: [Pvfs2-users] Tracing pvfs2 internals
>>>>
>>>> Dear Kshitij,
>>>> we have a version of OrangeFS which is instrumented with HDTrace; there you can record detailed information about the activity of state machines and I/O. For a description see the thesis:
>>>> http://wr.informatik.uni-hamburg.de/_media/research:theses:Tien%20Duc%20Tien_Tracing%20Internal%20Behavior%20in%20PVFS.pdf
>>>>
>>>> The code is available in our redmine (here is a link to the wiki):
>>>> http://redmine.wr.informatik.uni-hamburg.de/projects/piosimhd/wiki
>>>>
>>>> I consider the tracing implemented in PVFS as rather robust, since it is our second implementation with PVFS_hints. However, you might encounter some issues with the build system. If you want to try it and you need help, just ask.
>>>>
>>>> Regards,
>>>> Julian Kunkel
>>>>
>>>> 2011/12/13 Kshitij Mehta <[email protected]>:
>>>>> Hello,
>>>>>
>>>>> Is there a way I can trace/measure the internal behavior of pvfs2?
>>>>> Suppose I have a simple I/O code that writes to pvfs2; I would like to find out exactly how much time various internal operations of PVFS2 take (metadata lookup, creating iovecs, etc.) before data is finally pushed to disk.
>>>>>
>>>>> Is there a configure option (what does `EnableTracing` do in the config file)? Or is there any other way to determine this?
>>>>>
>>>>> Thanks,
>>>>> Kshitij
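On the EnableTracing question: the config posted earlier in this thread already carries the relevant keys (EventLogging, EnableTracing, LogStamp, LogFile). A hedged sketch of turning them on follows; the mask names are assumptions to check against the sample config shipped with your PVFS2 version, and what EnableTracing actually records depends on how the server was built. For per-operation timing (metadata lookup, flow setup, and so on), the HDTrace-instrumented OrangeFS Julian describes is the more direct route; the stock options mainly produce event and debug messages in the server log.

    <Defaults>
        EventLogging flow,network,server
        EnableTracing yes
        LogStamp datetime
        LogFile /tmp/pvfs2-server.log
        ...
    </Defaults>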
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
