It's part of the lfs Lustre tool; I have not used it myself. Try 'lfs help join'.
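If memory serves, the usage is of the form

    lfs join <file_A> <file_B>

(joining the second file onto the end of the first, in place), but since I haven't tried it, trust the output of 'lfs help join' over me.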
Evan

-----Original Message-----
From: Kshitij Mehta [mailto:[email protected]]
Sent: Thursday, June 09, 2011 10:58 AM
To: [email protected]
Cc: Felix, Evan J; Lustre discuss
Subject: Re: [Lustre-discuss] Poor multithreaded I/O performance

I read in a research paper (http://ft.ornl.gov/pubs-archive/2007-CCGrid-file-joining.pdf) about Lustre's ability to join files in place. Can someone point me to sample code and documentation on this? I couldn't find information in the manual. Being able to join files in place could be a potential solution to the issue I have.

Thanks,
Kshitij

On 06/06/2011 01:20 PM, [email protected] wrote:
>> Are the separate files being striped 8 ways?
>> Because that would allow them to hit possibly all 64 OSTs, while
>> the shared-file case will only hit 8.
> Yes, I found out that the files are getting striped 8 ways, so we end
> up hitting 64 OSTs. This is what I tried next:
>
> 1. Ran a test case where 6 threads write separate files, each of size
> 6 GB, to a directory configured over 8 OSTs. Thus the application
> writes 36 GB of data in total, over 48 OSTs.
>
> 2. Ran a test case where 8 threads write a common file of size 36 GB
> to a directory configured over 48 OSTs.
>
> Thus both tests ultimately write 36 GB of data over 48 OSTs. I still
> see a bandwidth of 240 MB/s for test 2 (common file), and 740 MB/s
> for test 1 (separate files).
>
> Thanks,
> Kshitij
>
>> I've been trying to test this, but not finding an obvious error,
>> so more questions:
>>
>> How much RAM do you have on your client, and how much on the OSTs?
>> Some of my smaller tests go much faster, but I believe that is due
>> to cache effects. My larger test at 32 GB gives pretty consistent
>> results.
>>
>> The other thing to consider: are the separate files being striped 8
>> ways? Because that would allow them to hit possibly all 64 OSTs,
>> while the shared-file case will only hit 8.
>>
>> Evan
>>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of Felix, Evan J
>> Sent: Friday, June 03, 2011 9:09 AM
>> To: [email protected]
>> Cc: Lustre discuss
>> Subject: Re: [Lustre-discuss] Poor multithreaded I/O performance
>>
>> What file sizes and segment sizes are you using for your tests?
>>
>> Evan
>>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of
>> [email protected]
>> Sent: Thursday, June 02, 2011 5:07 PM
>> To: [email protected]
>> Cc: [email protected]; Lustre discuss
>> Subject: Re: [Lustre-discuss] Poor multithreaded I/O performance
>>
>> Hello,
>> I was wondering if anyone could replicate the performance of the
>> multithreaded application using the C file that I posted in my
>> previous email.
>>
>> Thanks,
>> Kshitij
>>
>>> Ok, I ran the following tests:
>>>
>>> [1]
>>> The application spawns 8 threads and writes to a Lustre directory
>>> with 8 OSTs. Each thread writes data in blocks of 1 MB in a
>>> round-robin fashion, i.e.:
>>>
>>> T0 writes to offsets 0, 8 MB, 16 MB, etc.
>>> T1 writes to offsets 1 MB, 9 MB, 17 MB, etc.
>>>
>>> The stripe size being 1 MB, every thread ends up writing to only
>>> one OST.
>>>
>>> I see a bandwidth of 280 MB/s, similar to the single-thread
>>> performance.
>>>
>>> [2]
>>> I also ran the same test such that every thread writes data in
>>> blocks of 8 MB for the same stripe size (thus, every thread writes
>>> to every OST). I still get similar performance, ~280 MB/s, so
>>> essentially I see no difference between each thread writing to a
>>> single OST vs. each thread writing to all OSTs.
>>>
>>> And as I said before, if all threads write to their own separate
>>> file, the resulting bandwidth is ~700 MB/s.
>>>
>>> I have attached my C file (simple_io_test.c) herewith. Maybe you
>>> could run it and see where the bottleneck is. Comments and
>>> instructions for compilation are included in the file. Do let me
>>> know if you need any clarification.
>>>
>>> Your help is appreciated,
>>> Kshitij
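For anyone who wants the shape of test [1] without opening the attachment, it boils down to roughly the following. This is a minimal sketch, not the attached simple_io_test.c; the file path and sizes are illustrative.

    /* Minimal sketch of test [1]: 8 threads, 1 MiB blocks, round-robin
     * offsets, one shared fd. Not the attached simple_io_test.c; the
     * path and sizes are illustrative. Build: gcc -O2 -pthread sketch.c */
    #include <sys/types.h>
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define NTHREADS 8
    #define BLOCK    (1UL << 20)   /* 1 MiB, matches the 1 MiB stripe size */
    #define FILESIZE (1UL << 30)   /* 1 GiB total, kept small for illustration */

    static int fd;                 /* one descriptor shared by all threads */

    static void *writer(void *arg)
    {
        long id = (long)arg;
        char *buf = malloc(BLOCK); /* contents don't matter for a bandwidth test */

        /* Thread id writes blocks id, id+8, id+16, ... so with a 1 MiB
         * stripe over 8 OSTs each thread keeps landing on the same OST. */
        for (off_t off = (off_t)id * BLOCK; off < FILESIZE;
             off += (off_t)NTHREADS * BLOCK)
            pwrite(fd, buf, BLOCK, off);

        free(buf);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];

        fd = open("/mnt/lustre/shared_file", O_CREAT | O_WRONLY, 0644);
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, writer, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        close(fd);
        return 0;
    }

Test [2] is the same loop with BLOCK raised to 8 MiB, so every single write spans all eight OSTs.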
>>>> This is what my application does:
>>>>
>>>> Each thread has its own file descriptor to the file.
>>>> I use pwrite to ensure non-overlapping regions, as follows:
>>>>
>>>> Thread 0, data_size: 1 MB, offset: 0
>>>> Thread 1, data_size: 1 MB, offset: 1 MB
>>>> Thread 2, data_size: 1 MB, offset: 2 MB
>>>> Thread 3, data_size: 1 MB, offset: 3 MB
>>>>
>>>> <repeat cycle>
>>>> Thread 0, data_size: 1 MB, offset: 4 MB
>>>> and so on. (This happens in parallel; I don't wait for one cycle
>>>> to end before the next one begins.)
>>>>
>>>> I am going to try the following:
>>>> a)
>>>> Instead of a round-robin distribution of offsets, test with
>>>> sequential offsets:
>>>> Thread 0, data_size: 1 MB, offset: 0
>>>> Thread 0, data_size: 1 MB, offset: 1 MB
>>>> Thread 0, data_size: 1 MB, offset: 2 MB
>>>> Thread 0, data_size: 1 MB, offset: 3 MB
>>>>
>>>> Thread 1, data_size: 1 MB, offset: 4 MB
>>>> and so on. (I am going to keep these as separate pwrite I/O
>>>> requests instead of merging them or using writev.)
>>>>
>>>> b)
>>>> Map the threads to the number of OSTs using some modulo, as
>>>> suggested in the email below.
>>>>
>>>> c)
>>>> Experiment with a smaller number of OSTs (I currently have 48).
>>>>
>>>> I shall report back with my findings.
>>>>
>>>> Thanks,
>>>> Kshitij
>>>>
>>>>> [Moved to Lustre-discuss]
>>>>>
>>>>> "However, if I spawn 8 threads such that all of them write to the
>>>>> same file (non-overlapping locations), without explicitly
>>>>> synchronizing the writes (i.e. I don't lock the file handle)"
>>>>>
>>>>> How exactly does your multi-threaded application write the data?
>>>>> Are you using pwrite to ensure non-overlapping regions, or are
>>>>> they all just doing unlocked write() operations on the same fd
>>>>> (each just transferring size/8)? If it divides the file into N
>>>>> pieces, and each thread does pwrite on its piece, then what each
>>>>> OST sees are multiple streams at wide offsets to the same object,
>>>>> which could impact performance.
>>>>>
>>>>> If, on the other hand, the file is written sequentially, where
>>>>> each thread grabs the next piece to be written (with locking
>>>>> normally used for the current_offset value, so you know where
>>>>> each chunk is actually going), then you get a more sequential
>>>>> pattern at the OST.
>>>>>
>>>>> If the number of threads maps to the number of OSTs (or some
>>>>> modulo, like in your case 6 OSTs per thread), and each thread
>>>>> "owns" the piece of the file that belongs to an OST, i.e.:
>>>>>
>>>>>     for (offset = thread_num * 6MB; offset < size; offset += 48MB)
>>>>>         pwrite(fd, buf, 6MB, offset);
>>>>>
>>>>> then you've eliminated the need for application locks (assuming
>>>>> the use of pwrite) and ensured each OST object is being written
>>>>> sequentially.
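Spelled out as compilable C, that mapping looks roughly like this. It is a sketch under the assumptions stated above (8 threads, 48 OSTs, 1 MiB stripe size); the function and constant names are illustrative, not taken from simple_io_test.c.

    /* Sketch of the thread-to-OST mapping described above. Assumes 8
     * threads, 48 OSTs, and a 1 MiB stripe size, so each thread "owns"
     * a contiguous 6 MiB slice (6 OSTs) of every 48 MiB stripe cycle. */
    #include <sys/types.h>
    #include <unistd.h>

    #define MiB      (1UL << 20)
    #define NTHREADS 8
    #define NOSTS    48
    #define STRIPE   (1 * MiB)
    #define SLICE    ((NOSTS / NTHREADS) * STRIPE) /* 6 MiB owned per cycle */
    #define CYCLE    ((off_t)NOSTS * STRIPE)       /* 48 MiB, one full cycle */

    /* Each thread calls this with its own thread_num and a SLICE-sized
     * buffer. No application locking is needed: pwrite carries its own
     * offset, the regions never overlap, and each OST object is written
     * sequentially. */
    static void write_slice(int fd, int thread_num, const char *buf, off_t size)
    {
        for (off_t off = (off_t)thread_num * SLICE; off < size; off += CYCLE)
            pwrite(fd, buf, SLICE, off);
    }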
>>>>> It's quite possible there is some bottleneck on the shared fd.
>>>>> So perhaps the question is not why you aren't scaling with more
>>>>> threads, but why the single file is not able to saturate the
>>>>> client, or why the file bandwidth is not scaling with more OSTs.
>>>>> It is somewhat common for multiple processes (on different nodes)
>>>>> to write non-overlapping regions of the same file; does
>>>>> performance improve if each thread opens its own file descriptor?
>>>>>
>>>>> Kevin
>>>>>
>>>>> Wojciech Turek wrote:
>>>>>> Ok, so it looks like you have 64 OSTs in total and your output
>>>>>> file is striped across 48 of them. May I suggest that you limit
>>>>>> the number of stripes? A good number to start with would be 8
>>>>>> stripes; for best results, also use the OST pools feature to
>>>>>> arrange that each stripe goes to an OST owned by a different OSS.
>>>>>>
>>>>>> regards,
>>>>>>
>>>>>> Wojciech
>>>>>>
>>>>>> On 23 May 2011 23:09, [email protected] wrote:
>>>>>>
>>>>>> Actually, 'lfs check servers' returns 64 entries as well, so I
>>>>>> presume the system documentation is out of date.
>>>>>>
>>>>>> Again, I am sorry the basic information was incorrect.
>>>>>>
>>>>>> - Kshitij
>>>>>>
>>>>>> > Run lfs getstripe <your_output_file> and paste the output of
>>>>>> > that command to the mailing list.
>>>>>> > A stripe count of 48 is not possible if you have at most 11
>>>>>> > OSTs (the max stripe count would be 11).
>>>>>> > If your striping is correct, the bottleneck can be your client
>>>>>> > network.
>>>>>> >
>>>>>> > regards,
>>>>>> >
>>>>>> > Wojciech
>>>>>> >
>>>>>> > On 23 May 2011 22:35, [email protected] wrote:
>>>>>> >
>>>>>> >> The stripe count is 48.
>>>>>> >>
>>>>>> >> Just FYI, this is what my application does:
>>>>>> >> A simple I/O test where threads continually write blocks of
>>>>>> >> size 64 KB or 1 MB (decided at compile time) till a large
>>>>>> >> file of, say, 16 GB is created.
>>>>>> >>
>>>>>> >> Thanks,
>>>>>> >> Kshitij
>>>>>> >>
>>>>>> >> > What is your stripe count on the file? If your default is
>>>>>> >> > 1, you are only writing to one of the OSTs. You can check
>>>>>> >> > with the lfs getstripe command; you can set the stripe
>>>>>> >> > bigger, and hopefully your wide-striped file with threaded
>>>>>> >> > writes will be faster.
>>>>>> >> >
>>>>>> >> > Evan
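Concretely, that check-and-widen step looks roughly like the following; the /mnt/lustre paths are placeholders, and -c is the stripe-count flag:

    lfs getstripe /mnt/lustre/output_file       # show the file's current layout
    lfs setstripe -c 8 /mnt/lustre/output_dir   # new files created here get 8 stripes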
>>>>>> >> > -----Original Message-----
>>>>>> >> > From: [email protected]
>>>>>> >> > [mailto:[email protected]] On Behalf
>>>>>> >> > Of [email protected]
>>>>>> >> > Sent: Monday, May 23, 2011 2:28 PM
>>>>>> >> > To: [email protected]
>>>>>> >> > Subject: [Lustre-community] Poor multithreaded I/O
>>>>>> >> > performance
>>>>>> >> >
>>>>>> >> > Hello,
>>>>>> >> > I am running a multithreaded application that writes to a
>>>>>> >> > common shared file on a Lustre fs, and this is what I see:
>>>>>> >> >
>>>>>> >> > If I have a single thread in my application, I get a
>>>>>> >> > bandwidth of approx. 250 MB/s (11 OSTs, 1 MB stripe size).
>>>>>> >> > However, if I spawn 8 threads such that all of them write
>>>>>> >> > to the same file (non-overlapping locations), without
>>>>>> >> > explicitly synchronizing the writes (i.e. I don't lock the
>>>>>> >> > file handle), I still get the same bandwidth.
>>>>>> >> >
>>>>>> >> > Now, instead of writing to a shared file, if these threads
>>>>>> >> > write to separate files, the bandwidth obtained is approx.
>>>>>> >> > 700 MB/s.
>>>>>> >> >
>>>>>> >> > I would ideally like my multithreaded application to see
>>>>>> >> > similar scaling. Any ideas why the performance is limited,
>>>>>> >> > and any workarounds?
>>>>>> >> >
>>>>>> >> > Thank you,
>>>>>> >> > Kshitij

_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
