> are the separate files being striped 8 ways?
> Because that would allow them to hit possibly all 64 OSTs, while the
> shared file case will only hit 8.
Yes, I found out that the files are getting striped 8 ways, so we end
up hitting 64 OSTs. This is what I tried next:

1. Ran a test case where 6 threads write separate files, each of size
   6 GB, to a directory configured over 8 OSTs. The application thus
   writes 36 GB of data in total, over 48 OSTs.
2. Ran a test case where 8 threads write a common file of size 36 GB
   to a directory configured over 48 OSTs.

Both tests therefore ultimately write 36 GB of data over 48 OSTs. I
still see a bandwidth of 240 MB/s for test 2 (common file), and
740 MB/s for test 1 (separate files).

Thanks,
Kshitij

> I've been trying to test this, but not finding an obvious error... so
> more questions:
>
> How much RAM do you have on your client, and how much on the OSTs?
> Some of my smaller tests go much faster, but I believe those are
> cache-based effects. My larger test at 32 GB gives pretty consistent
> results.
>
> The other thing to consider: are the separate files being striped 8
> ways? Because that would allow them to hit possibly all 64 OSTs,
> while the shared file case will only hit 8.
>
> Evan
>
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Felix, Evan J
> Sent: Friday, June 03, 2011 9:09 AM
> To: [email protected]
> Cc: Lustre discuss
> Subject: Re: [Lustre-discuss] Poor multithreaded I/O performance
>
> What file sizes and segment sizes are you using for your tests?
>
> Evan
>
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of
> [email protected]
> Sent: Thursday, June 02, 2011 5:07 PM
> To: [email protected]
> Cc: [email protected]; Lustre discuss
> Subject: Re: [Lustre-discuss] Poor multithreaded I/O performance
>
> Hello,
> I was wondering if anyone could replicate the performance of the
> multithreaded application using the C file that I posted in my
> previous email.
>
> Thanks,
> Kshitij
>
>> Ok, I ran the following tests:
>>
>> [1]
>> Application spawns 8 threads.
>> These threads write to a Lustre file system with 8 OSTs.
>> Each thread writes data in blocks of 1 MB in a round-robin fashion,
>> i.e.
>>
>> T0 writes to offsets 0, 8MB, 16MB, etc.
>> T1 writes to offsets 1MB, 9MB, 17MB, etc.
>>
>> The stripe size being 1 MB, every thread ends up writing to only one
>> OST.
>>
>> I see a bandwidth of 280 MB/s, similar to the single-thread
>> performance.
>>
>> [2]
>> I also ran the same test such that every thread writes data in
>> blocks of 8 MB for the same stripe size (thus, every thread will
>> write to every OST). I still get similar performance, ~280 MB/s, so
>> essentially I see no difference between each thread writing to a
>> single OST vs. each thread writing to all OSTs.
>>
>> And as I said before, if all threads write to their own separate
>> file, the resulting bandwidth is ~700 MB/s.
>>
>> I have attached my C file (simple_io_test.c) herewith. Maybe you
>> could run it and see where the bottleneck is. Comments and
>> instructions for compilation have been included in the file. Do let
>> me know if you need any clarification on that.
>>
>> Your help is appreciated,
>> Kshitij
>>
>>> This is what my application does:
>>>
>>> Each thread has its own file descriptor to the file.
>>> I use pwrite to ensure non-overlapping regions, as follows:
>>>
>>> Thread 0, data_size: 1MB, offset: 0
>>> Thread 1, data_size: 1MB, offset: 1MB
>>> Thread 2, data_size: 1MB, offset: 2MB
>>> Thread 3, data_size: 1MB, offset: 3MB
>>>
>>> <repeat cycle>
>>> Thread 0, data_size: 1MB, offset: 4MB
>>> and so on. (This happens in parallel; I don't wait for one cycle to
>>> end before the next one begins.)
>>>
>>> I am gonna try the following:
>>>
>>> a)
>>> Instead of a round-robin distribution of offsets, test with
>>> sequential offsets:
>>> Thread 0, data_size: 1MB, offset: 0
>>> Thread 0, data_size: 1MB, offset: 1MB
>>> Thread 0, data_size: 1MB, offset: 2MB
>>> Thread 0, data_size: 1MB, offset: 3MB
>>>
>>> Thread 1, data_size: 1MB, offset: 4MB
>>> and so on.
>>> (I am gonna keep these as separate pwrite I/O requests instead of
>>> merging them or using writev.)
>>>
>>> b)
>>> Map the threads to the no. of OSTs using some modulo, as suggested
>>> in the email below.
>>>
>>> c)
>>> Experiment with a smaller no. of OSTs (I currently have 48).
>>>
>>> I shall report back with my findings.
>>>
>>> Thanks,
>>> Kshitij
>>>
>>>> [Moved to Lustre-discuss]
>>>>
>>>> "However, if I spawn 8 threads such that all of them write to the
>>>> same file (non-overlapping locations), without explicitly
>>>> synchronizing the writes (i.e. I don't lock the file handle)"
>>>>
>>>> How exactly does your multi-threaded application write the data?
>>>> Are you using pwrite to ensure non-overlapping regions, or are
>>>> they all just doing unlocked write() operations on the same fd for
>>>> each write (each just transferring size/8)? If it divides the file
>>>> into N pieces, and each thread does pwrite on its piece, then what
>>>> each OST sees are multiple streams at wide offsets to the same
>>>> object, which could impact performance.
>>>>
>>>> If on the other hand the file is written sequentially, where each
>>>> thread grabs the next piece to be written (with locking normally
>>>> used for the current_offset value, so you know where each chunk is
>>>> actually going), then you get a more sequential pattern at the
>>>> OST.
>>>>
>>>> If the number of threads maps to the number of OSTs (or some
>>>> modulo, like in your case 6 OSTs per thread), and each thread
>>>> "owns" the piece of the file that belongs to an OST (i.e.
>>>>   for (offset = thread_num * 6MB; offset < size; offset += 48MB)
>>>>       pwrite(fd, buf, 6MB, offset);
>>>> ), then you've eliminated the need for application locks (assuming
>>>> the use of pwrite) and ensured each OST object is being written
>>>> sequentially.
>>>>
>>>> It's quite possible there is some bottleneck on the shared fd.
>>>> So perhaps the question is not why you aren't scaling with more
>>>> threads, but why the single file is not able to saturate the
>>>> client, or why the file bandwidth is not scaling with more OSTs.
>>>> It is somewhat common for multiple processes (on different nodes)
>>>> to write non-overlapping regions of the same file; does
>>>> performance improve if each thread opens its own file descriptor?
>>>>
>>>> Kevin
>>>>
>>>> Wojciech Turek wrote:
>>>>> Ok, so it looks like you have 64 OSTs in total and your output
>>>>> file is striped across 48 of them. May I suggest that you limit
>>>>> the number of stripes? A good number to start with would be 8
>>>>> stripes, and for best results use the OST pools feature to
>>>>> arrange that each stripe goes to an OST owned by a different OSS.
>>>>>
>>>>> regards,
>>>>>
>>>>> Wojciech
>>>>>
>>>>> On 23 May 2011 23:09, <[email protected]> wrote:
>>>>>
>>>>> Actually, 'lfs check servers' returns 64 entries as well, so I
>>>>> presume the system documentation is out of date.
>>>>>
>>>>> Again, I am sorry the basic information had been incorrect.
>>>>>
>>>>> - Kshitij
>>>>>
>>>>> > Run lfs getstripe <your_output_file> and paste the output of
>>>>> > that command to the mailing list.
>>>>> > A stripe count of 48 is not possible if you have at most 11
>>>>> > OSTs (the max stripe count would be 11).
>>>>> > If your striping is correct, the bottleneck can be your client
>>>>> > network.
>>>>> >
>>>>> > regards,
>>>>> >
>>>>> > Wojciech
>>>>> >
>>>>> > On 23 May 2011 22:35, <[email protected]> wrote:
>>>>> >
>>>>> >> The stripe count is 48.
>>>>> >>
>>>>> >> Just FYI, this is what my application does:
>>>>> >> A simple I/O test where threads continually write blocks of
>>>>> >> size 64 KB or 1 MB (decided at compile time) till a large
>>>>> >> file of, say, 16 GB is created.
>>>>> >>
>>>>> >> Thanks,
>>>>> >> Kshitij
>>>>> >>
>>>>> >> > What is your stripe count on the file? If your default is 1,
>>>>> >> > you are only writing to one of the OSTs. You can check with
>>>>> >> > the lfs getstripe command; you can set the stripe bigger,
>>>>> >> > and hopefully your wide-striped file with threaded writes
>>>>> >> > will be faster.
>>>>> >> >
>>>>> >> > Evan
>>>>> >> >
>>>>> >> > -----Original Message-----
>>>>> >> > From: [email protected]
>>>>> >> > [mailto:[email protected]] On Behalf Of
>>>>> >> > [email protected]
>>>>> >> > Sent: Monday, May 23, 2011 2:28 PM
>>>>> >> > To: [email protected]
>>>>> >> > Subject: [Lustre-community] Poor multithreaded I/O
>>>>> >> > performance
>>>>> >> >
>>>>> >> > Hello,
>>>>> >> > I am running a multithreaded application that writes to a
>>>>> >> > common shared file on a Lustre fs, and this is what I see:
>>>>> >> >
>>>>> >> > If I have a single thread in my application, I get a
>>>>> >> > bandwidth of approx. 250 MB/s (11 OSTs, 1 MB stripe size).
>>>>> >> > However, if I spawn 8 threads such that all of them write to
>>>>> >> > the same file (non-overlapping locations), without
>>>>> >> > explicitly synchronizing the writes (i.e. I don't lock the
>>>>> >> > file handle), I still get the same bandwidth.
>>>>> >> >
>>>>> >> > Now, instead of writing to a shared file, if these threads
>>>>> >> > write to separate files, the bandwidth obtained is approx.
>>>>> >> > 700 MB/s.
>>>>> >> >
>>>>> >> > I would ideally like my multithreaded application to see
>>>>> >> > similar scaling. Any ideas why the performance is limited,
>>>>> >> > and any workarounds?
>>>>> >> > Thank you,
>>>>> >> > Kshitij
>>>>> >> >
>>>>> >> > _______________________________________________
>>>>> >> > Lustre-community mailing list
>>>>> >> > [email protected]
>>>>> >> > http://lists.lustre.org/mailman/listinfo/lustre-community

_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
