What file sizes and segment sizes are you using for your tests?

Evan
-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of [email protected]
Sent: Thursday, June 02, 2011 5:07 PM
To: [email protected]
Cc: [email protected]; Lustre discuss
Subject: Re: [Lustre-discuss] Poor multithreaded I/O performance

Hello,
I was wondering if anyone could replicate the performance of the
multithreaded application using the C file that I posted in my previous
email.

Thanks,
Kshitij

> OK, I ran the following tests:
>
> [1]
> The application spawns 8 threads and writes to a Lustre file striped
> across 8 OSTs. Each thread writes data in blocks of 1 MByte in a
> round-robin fashion, i.e.
>
> T0 writes to offsets 0, 8MB, 16MB, etc.
> T1 writes to offsets 1MB, 9MB, 17MB, etc.
>
> The stripe size being 1 MByte, every thread ends up writing to only
> one OST.
>
> I see a bandwidth of 280 MBytes/sec, similar to the single-thread
> performance.
>
> [2]
> I also ran the same test such that every thread writes data in blocks
> of 8 MBytes for the same stripe size (thus, every thread writes to
> every OST). I still get similar performance, ~280 MBytes/sec, so
> essentially I see no difference between each thread writing to a
> single OST and each thread writing to all OSTs.
>
> And as I said before, if all threads write to their own separate
> files, the resulting bandwidth is ~700 MBytes/sec.
>
> I have attached my C file (simple_io_test.c) herewith. Maybe you
> could run it and see where the bottleneck is. Comments and
> instructions for compilation are included in the file. Do let me know
> if you need any clarification on that.
>
> Your help is appreciated,
> Kshitij
>
>> This is what my application does:
>>
>> Each thread has its own file descriptor to the file.
>> I use pwrite to ensure non-overlapping regions, as follows:
>>
>> Thread 0, data_size: 1MB, offset: 0
>> Thread 1, data_size: 1MB, offset: 1MB
>> Thread 2, data_size: 1MB, offset: 2MB
>> Thread 3, data_size: 1MB, offset: 3MB
>>
>> <repeat cycle>
>> Thread 0, data_size: 1MB, offset: 4MB
>> and so on. (This happens in parallel; I don't wait for one cycle to
>> end before the next one begins.)
>>
>> I am going to try the following:
>>
>> a)
>> Instead of a round-robin distribution of offsets, test with
>> sequential offsets:
>> Thread 0, data_size: 1MB, offset: 0
>> Thread 0, data_size: 1MB, offset: 1MB
>> Thread 0, data_size: 1MB, offset: 2MB
>> Thread 0, data_size: 1MB, offset: 3MB
>>
>> Thread 1, data_size: 1MB, offset: 4MB
>> and so on. (I am going to keep these as separate pwrite I/O requests
>> instead of merging them or using writev.)
>>
>> b)
>> Map the threads to the number of OSTs using some modulo, as
>> suggested in the email below.
>>
>> c)
>> Experiment with a smaller number of OSTs (I currently have 48).
>>
>> I shall report back with my findings.
>>
>> Thanks,
>> Kshitij
>>
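(For anyone who wants to reproduce this without digging the attachment
out of the archives: the round-robin pattern from test [1] boils down
to the sketch below. It is not the attached simple_io_test.c; the file
path, total size, and the bare-bones error handling are illustrative
assumptions. Build with: gcc -O2 -pthread rr_pwrite.c -o rr_pwrite)

/* Minimal sketch of the round-robin shared-file test described above. */
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NTHREADS  8
#define BLOCK     (1L << 20)                 /* 1 MByte per pwrite */
#define FILE_SIZE (1LL << 34)                /* 16 GBytes in total */

static const char *path = "/mnt/lustre/shared_file";  /* illustrative */

static void *writer(void *arg)
{
    long tid = (long)arg;
    char *buf;
    off_t off;
    int fd;

    /* Each thread opens its own descriptor, as in the test. */
    fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return NULL; }

    buf = malloc(BLOCK);
    memset(buf, 'x', BLOCK);

    /* Thread t writes offsets t*1MB, (t+8)*1MB, (t+16)*1MB, ...; with
     * a 1 MByte stripe over 8 OSTs, each thread keeps hitting one OST. */
    for (off = (off_t)tid * BLOCK; off < FILE_SIZE; off += NTHREADS * BLOCK)
        if (pwrite(fd, buf, BLOCK, off) != BLOCK)
            perror("pwrite");

    free(buf);
    close(fd);
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    long i;

    for (i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, writer, (void *)i);
    for (i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}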
>>> [Moved to Lustre-discuss]
>>>
>>> "However, if I spawn 8 threads such that all of them write to the
>>> same file (non-overlapping locations), without explicitly
>>> synchronizing the writes (i.e. I don't lock the file handle)"
>>>
>>> How exactly does your multi-threaded application write the data?
>>> Are you using pwrite to ensure non-overlapping regions, or are they
>>> all just doing unlocked write() operations on the same fd (each
>>> just transferring size/8)? If it divides the file into N pieces,
>>> and each thread does pwrite on its piece, then what each OST sees
>>> is multiple streams at wide offsets to the same object, which could
>>> impact performance.
>>>
>>> If, on the other hand, the file is written sequentially, where each
>>> thread grabs the next piece to be written (locking is normally used
>>> for the current_offset value, so you know where each chunk is
>>> actually going), then you get a more sequential pattern at the OST.
>>>
>>> If the number of threads maps to the number of OSTs (or some
>>> modulo, like in your case 6 OSTs per thread), and each thread
>>> "owns" the piece of the file that belongs to an OST (i.e.
>>> for (offset = thread_num * 6MB; offset < size; offset += 48MB)
>>> pwrite(fd, buf, 6MB, offset); ), then you've eliminated the need
>>> for application locks (assuming the use of pwrite) and ensured each
>>> OST object is written sequentially.
>>>
>>> It's quite possible there is some bottleneck on the shared fd. So
>>> perhaps the question is not why you aren't scaling with more
>>> threads, but why the single file is not able to saturate the
>>> client, or why the file bandwidth is not scaling with more OSTs. It
>>> is somewhat common for multiple processes (on different nodes) to
>>> write non-overlapping regions of the same file; does performance
>>> improve if each thread opens its own file descriptor?
>>>
>>> Kevin
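(Kevin's "grab the next piece" pattern would look roughly like the
sketch below. It reuses path, BLOCK, and FILE_SIZE from the sketch
earlier in the thread and drops in as a replacement for its writer();
the mutex-protected current_offset is the only shared state.)

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static off_t current_offset = 0;

static void *seq_writer(void *arg)
{
    char *buf;
    off_t off;
    int fd;

    (void)arg;
    fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return NULL; }

    buf = malloc(BLOCK);
    memset(buf, 'x', BLOCK);

    for (;;) {
        /* Lock only the offset bookkeeping, never the I/O itself. */
        pthread_mutex_lock(&lock);
        off = current_offset;
        current_offset += BLOCK;
        pthread_mutex_unlock(&lock);

        if (off >= FILE_SIZE)
            break;
        if (pwrite(fd, buf, BLOCK, off) != BLOCK)
            perror("pwrite");
    }

    free(buf);
    close(fd);
    return NULL;
}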
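(And his OST-aligned loop, fleshed out under the same illustrative
assumptions: 48 OSTs, 8 threads, 1 MByte stripe, hence 6 MByte pieces
per thread per cycle. Again a drop-in replacement for writer() in the
first sketch; it assumes FILE_SIZE is a multiple of the 48 MByte cycle.)

#define CHUNK (6L * BLOCK)    /* 6 MBytes = 6 stripes = 6 OSTs per piece */
#define CYCLE (48L * BLOCK)   /* full stripe cycle: 48 OSTs x 1 MByte */

static void *ost_writer(void *arg)
{
    long tid = (long)arg;
    char *buf;
    off_t off;
    int fd;

    fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return NULL; }

    buf = malloc(CHUNK);
    memset(buf, 'x', CHUNK);

    /* Kevin's loop:
     *   for (offset = thread_num * 6MB; offset < size; offset += 48MB)
     *       pwrite(fd, buf, 6MB, offset);
     * Each thread owns the file pieces living on "its" 6 OSTs, so every
     * OST object is written sequentially and no application lock is
     * needed. */
    for (off = (off_t)tid * CHUNK; off < FILE_SIZE; off += CYCLE)
        if (pwrite(fd, buf, CHUNK, off) != CHUNK)
            perror("pwrite");

    free(buf);
    close(fd);
    return NULL;
}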
>>> Wojciech Turek wrote:
>>>> Ok, so it looks like you have 64 OSTs in total and your output
>>>> file is striped across 48 of them. May I suggest that you limit
>>>> the number of stripes; a good number to start with would be 8.
>>>> Also, for best results, use the OST pools feature to arrange that
>>>> each stripe goes to an OST owned by a different OSS.
>>>>
>>>> regards,
>>>>
>>>> Wojciech
>>>>
>>>> On 23 May 2011 23:09, <[email protected]> wrote:
>>>>
>>>> Actually, 'lfs check servers' returns 64 entries as well, so I
>>>> presume the system documentation is out of date.
>>>>
>>>> Again, I am sorry the basic information had been incorrect.
>>>>
>>>> - Kshitij
>>>>
>>>> > Run lfs getstripe <your_output_file> and paste the output of
>>>> > that command to the mailing list.
>>>> > A stripe count of 48 is not possible if you have at most 11 OSTs
>>>> > (the max stripe count would be 11).
>>>> > If your striping is correct, the bottleneck can be your client
>>>> > network.
>>>> >
>>>> > regards,
>>>> >
>>>> > Wojciech
>>>> >
>>>> > On 23 May 2011 22:35, <[email protected]> wrote:
>>>> >
>>>> >> The stripe count is 48.
>>>> >>
>>>> >> Just FYI, this is what my application does:
>>>> >> A simple I/O test where threads continually write blocks of
>>>> >> size 64 KBytes or 1 MByte (decided at compile time) until a
>>>> >> large file of, say, 16 GBytes is created.
>>>> >>
>>>> >> Thanks,
>>>> >> Kshitij
>>>> >>
>>>> >> > What is your stripe count on the file? If your default is 1,
>>>> >> > you are only writing to one of the OSTs. You can check with
>>>> >> > the lfs getstripe command, and you can set the stripe count
>>>> >> > bigger; hopefully your wide-striped file with threaded writes
>>>> >> > will be faster.
>>>> >> >
>>>> >> > Evan
>>>> >> >
>>>> >> > -----Original Message-----
>>>> >> > From: [email protected]
>>>> >> > [mailto:[email protected]] On Behalf Of
>>>> >> > [email protected]
>>>> >> > Sent: Monday, May 23, 2011 2:28 PM
>>>> >> > To: [email protected]
>>>> >> > Subject: [Lustre-community] Poor multithreaded I/O performance
>>>> >> >
>>>> >> > Hello,
>>>> >> > I am running a multithreaded application that writes to a
>>>> >> > common shared file on a Lustre fs, and this is what I see:
>>>> >> >
>>>> >> > If I have a single thread in my application, I get a
>>>> >> > bandwidth of approx. 250 MBytes/sec (11 OSTs, 1 MByte stripe
>>>> >> > size). However, if I spawn 8 threads such that all of them
>>>> >> > write to the same file (non-overlapping locations), without
>>>> >> > explicitly synchronizing the writes (i.e. I don't lock the
>>>> >> > file handle), I still get the same bandwidth.
>>>> >> >
>>>> >> > Now, instead of writing to a shared file, if these threads
>>>> >> > write to separate files, the bandwidth obtained is approx.
>>>> >> > 700 MBytes/sec.
>>>> >> >
>>>> >> > I would ideally like my multithreaded application to see
>>>> >> > similar scaling. Any ideas why the performance is limited,
>>>> >> > and any workarounds?
>>>> >> >
>>>> >> > Thank you,
>>>> >> > Kshitij

_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
