On 2011-06-09, at 11:57 AM, Kshitij Mehta wrote:
> I read in a research paper
> (http://ft.ornl.gov/pubs-archive/2007-CCGrid-file-joining.pdf) about
> Lustre's ability to join files in place. Can someone point me to sample
> code and documentation on this? I couldn't find information in the
> manual. Being able to join files in place could be a potential solution
> to the issue I have.
That feature was mostly experimental, and has been disabled in newer
versions of Lustre.

> On 06/06/2011 01:20 PM, [email protected] wrote:
>>> are the separate files being striped 8 ways?
>>> Because that would allow them to hit possibly all 64 OSTs, while the
>>> shared file case will only hit 8.
>>
>> Yes, I found out that the files are getting striped 8 ways, so we end
>> up hitting 64 OSTs. This is what I tried next:
>>
>> 1. Ran a test case where 6 threads write separate files, each of size
>> 6 GB, to a directory configured over 8 OSTs. Thus the application
>> writes 36 GB of data in total, over 48 OSTs.
>>
>> 2. Ran a test case where 8 threads write a common file of size 36 GB
>> to a directory configured over 48 OSTs.
>>
>> Thus both tests ultimately write 36 GB of data over 48 OSTs. I still
>> see a bandwidth of 240 MB/s for test 2 (common file), and a bandwidth
>> of 740 MB/s for test 1 (separate files).
>>
>> Thanks,
>> Kshitij
>>
>>> I've been trying to test this, but not finding an obvious error... so
>>> more questions:
>>>
>>> How much RAM do you have on your client, and how much on the OSTs?
>>> Some of my smaller tests go much faster, but I believe that is due to
>>> cache effects. My larger test at 32 GB gives pretty consistent
>>> results.
>>>
>>> The other thing to consider: are the separate files being striped 8
>>> ways? Because that would allow them to hit possibly all 64 OSTs,
>>> while the shared file case will only hit 8.
>>>
>>> Evan
>>>
>>> -----Original Message-----
>>> From: [email protected]
>>> [mailto:[email protected]] On Behalf Of Felix,
>>> Evan J
>>> Sent: Friday, June 03, 2011 9:09 AM
>>> To: [email protected]
>>> Cc: Lustre discuss
>>> Subject: Re: [Lustre-discuss] Poor multithreaded I/O performance
>>>
>>> What file sizes and segment sizes are you using for your tests?
>>>
>>> Evan
>>>
>>> -----Original Message-----
>>> From: [email protected]
>>> [mailto:[email protected]] On Behalf Of
>>> [email protected]
>>> Sent: Thursday, June 02, 2011 5:07 PM
>>> To: [email protected]
>>> Cc: [email protected]; Lustre discuss
>>> Subject: Re: [Lustre-discuss] Poor multithreaded I/O performance
>>>
>>> Hello,
>>> I was wondering if anyone could replicate the performance of the
>>> multithreaded application using the C file that I posted in my
>>> previous email.
>>>
>>> Thanks,
>>> Kshitij
>>>
>>>> OK, I ran the following tests:
>>>>
>>>> [1]
>>>> The application spawns 8 threads. I write to a Lustre fs with 8
>>>> OSTs. Each thread writes data in blocks of 1 MB in a round-robin
>>>> fashion, i.e.:
>>>>
>>>> T0 writes to offsets 0, 8 MB, 16 MB, etc.
>>>> T1 writes to offsets 1 MB, 9 MB, 17 MB, etc.
>>>>
>>>> The stripe size being 1 MB, every thread ends up writing to only
>>>> 1 OST.
>>>>
>>>> I see a bandwidth of 280 MB/s, similar to the single-thread
>>>> performance.
>>>>
>>>> [2]
>>>> I also ran the same test such that every thread writes data in
>>>> blocks of 8 MB for the same stripe size. (Thus, every thread will
>>>> write to every OST.) I still get similar performance, ~280 MB/s, so
>>>> essentially I see no difference between each thread writing to a
>>>> single OST vs. each thread writing to all OSTs.
>>>>
>>>> And as I said before, if all threads write to their own separate
>>>> file, the resulting bandwidth is ~700 MB/s.
>>>>
>>>> I have attached my C file (simple_io_test.c) herewith. Maybe you
>>>> could run it and see where the bottleneck is. Comments and
>>>> instructions for compilation have been included in the file. Do let
>>>> me know if you need any clarification on that.
>>>>
>>>> Your help is appreciated,
>>>> Kshitij
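[For readers without the attachment: the round-robin pattern from test
[1] boils down to something like the sketch below. This is a minimal
illustration assuming POSIX threads and pwrite; the file path, sizes,
and names are placeholders, not the actual simple_io_test.c.]

    /* Minimal sketch of the round-robin shared-file write pattern.
     * Build: gcc -O2 -pthread rr_write.c -o rr_write
     * Illustrative only; the real simple_io_test.c may differ. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define NTHREADS 8
    #define BLOCK    (1UL << 20)  /* 1 MB blocks, matching the 1 MB stripe */
    #define FILESIZE (1UL << 30)  /* 1 GB total, kept small for illustration */

    static int fd;                /* file descriptor shared by all threads */

    static void *writer(void *arg)
    {
        long tid = (long)arg;
        char *buf = malloc(BLOCK);
        memset(buf, 'x', BLOCK);
        /* Thread t writes blocks t, t+8, t+16, ..., so with a 1 MB
         * stripe over 8 OSTs each thread keeps hitting the same OST. */
        for (off_t off = tid * BLOCK; off < (off_t)FILESIZE;
             off += NTHREADS * BLOCK)
            if (pwrite(fd, buf, BLOCK, off) != (ssize_t)BLOCK)
                perror("pwrite");
        free(buf);
        return NULL;
    }

    int main(void)
    {
        pthread_t tids[NTHREADS];
        fd = open("/mnt/lustre/testfile", O_WRONLY | O_CREAT, 0644);
        if (fd < 0) { perror("open"); return 1; }
        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&tids[t], NULL, writer, (void *)t);
        for (long t = 0; t < NTHREADS; t++)
            pthread_join(tids[t], NULL);
        close(fd);
        return 0;
    }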
>>>>> This is what my application does:
>>>>>
>>>>> Each thread has its own file descriptor to the file.
>>>>> I use pwrite to ensure non-overlapping regions, as follows:
>>>>>
>>>>> Thread 0, data_size: 1MB, offset: 0
>>>>> Thread 1, data_size: 1MB, offset: 1MB
>>>>> Thread 2, data_size: 1MB, offset: 2MB
>>>>> Thread 3, data_size: 1MB, offset: 3MB
>>>>>
>>>>> <repeat cycle>
>>>>> Thread 0, data_size: 1MB, offset: 4MB, and so on. (This happens in
>>>>> parallel; I don't wait for one cycle to end before the next one
>>>>> begins.)
>>>>>
>>>>> I am going to try the following:
>>>>> a)
>>>>> Instead of a round-robin distribution of offsets, test with
>>>>> sequential offsets:
>>>>> Thread 0, data_size: 1MB, offset: 0
>>>>> Thread 0, data_size: 1MB, offset: 1MB
>>>>> Thread 0, data_size: 1MB, offset: 2MB
>>>>> Thread 0, data_size: 1MB, offset: 3MB
>>>>>
>>>>> Thread 1, data_size: 1MB, offset: 4MB
>>>>> and so on. (I am going to keep these as separate pwrite I/O
>>>>> requests instead of merging them or using writev.)
>>>>>
>>>>> b)
>>>>> Map the threads to the number of OSTs using some modulo, as
>>>>> suggested in the email below.
>>>>>
>>>>> c)
>>>>> Experiment with a smaller number of OSTs (I currently have 48).
>>>>>
>>>>> I shall report back with my findings.
>>>>>
>>>>> Thanks,
>>>>> Kshitij
>>>>>
>>>>>> [Moved to Lustre-discuss]
>>>>>>
>>>>>> "However, if I spawn 8 threads such that all of them write to the
>>>>>> same file (non-overlapping locations), without explicitly
>>>>>> synchronizing the writes (i.e. I don't lock the file handle)"
>>>>>>
>>>>>> How exactly does your multi-threaded application write the data?
>>>>>> Are you using pwrite to ensure non-overlapping regions, or are
>>>>>> they all just doing unlocked write() operations on the same fd
>>>>>> (each just transferring size/8)? If it divides the file into N
>>>>>> pieces, and each thread does pwrite on its piece, then what each
>>>>>> OST sees are multiple streams at wide offsets to the same object,
>>>>>> which could impact performance.
>>>>>>
>>>>>> If, on the other hand, the file is written sequentially, where
>>>>>> each thread grabs the next piece to be written (locking normally
>>>>>> used for the current_offset value, so you know where each chunk is
>>>>>> actually going), then you get a more sequential pattern at the
>>>>>> OST.
>>>>>>
>>>>>> If the number of threads maps to the number of OSTs (or some
>>>>>> modulo, like in your case 6 OSTs per thread), and each thread
>>>>>> "owns" the piece of the file that belongs to an OST (i.e.:
>>>>>> for (offset = thread_num * 6MB; offset < size; offset += 48MB)
>>>>>>     pwrite(fd, buf, 6MB, offset);
>>>>>> ), then you've eliminated the need for application locks (assuming
>>>>>> the use of pwrite) and ensured each OST object is being written
>>>>>> sequentially.
>>>>>>
>>>>>> It's quite possible there is some bottleneck on the shared fd. So
>>>>>> perhaps the question is not why you aren't scaling with more
>>>>>> threads, but why the single file is not able to saturate the
>>>>>> client, or why the file bandwidth is not scaling with more OSTs.
>>>>>> It is somewhat common for multiple processes (on different nodes)
>>>>>> to write non-overlapping regions of the same file; does
>>>>>> performance improve if each thread opens its own file descriptor?
>>>>>>
>>>>>> Kevin
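[Kevin's loop, spelled out: with 8 threads, 1 MB stripes, and 48 OSTs,
each thread "owns" a contiguous 6 MB slice of every 48 MB cycle, so it
writes to 6 OSTs and each OST object is written sequentially. Below is
a minimal drop-in variant of the writer sketch earlier in the thread,
reusing its fd, NTHREADS, and FILESIZE; the sizes are taken from
Kevin's example and are illustrative assumptions, not measured values.]

    #define CHUNK (6UL << 20)   /* 6 MB owned by each thread per cycle */
    #define CYCLE (48UL << 20)  /* 48 MB = NTHREADS * CHUNK full cycle */

    static void *writer_aligned(void *arg)
    {
        long tid = (long)arg;
        char *buf = malloc(CHUNK);
        memset(buf, 'x', CHUNK);
        /* pwrite carries its own offset, so no shared file pointer is
         * updated and no application-level locking is needed. */
        for (off_t off = tid * CHUNK; off < (off_t)FILESIZE; off += CYCLE)
            if (pwrite(fd, buf, CHUNK, off) != (ssize_t)CHUNK)
                perror("pwrite");
        free(buf);
        return NULL;
    }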
>>>>>>
>>>>>> Wojciech Turek wrote:
>>>>>>> Ok, so it looks like you have 64 OSTs in total and your output
>>>>>>> file is striped across 48 of them. May I suggest that you limit
>>>>>>> the number of stripes? A good number to start with would be 8
>>>>>>> stripes, and for best results use the OST pools feature to
>>>>>>> arrange that each stripe goes to an OST owned by a different OSS.
>>>>>>>
>>>>>>> regards,
>>>>>>>
>>>>>>> Wojciech
>>>>>>>
>>>>>>> On 23 May 2011 23:09, <[email protected]> wrote:
>>>>>>>
>>>>>>> Actually, 'lfs check servers' returns 64 entries as well, so I
>>>>>>> presume the system documentation is out of date.
>>>>>>>
>>>>>>> Again, I am sorry the basic information had been incorrect.
>>>>>>>
>>>>>>> - Kshitij
>>>>>>>
>>>>>>>> Run 'lfs getstripe <your_output_file>' and paste the output of
>>>>>>>> that command to the mailing list.
>>>>>>>> A stripe count of 48 is not possible if you have at most 11 OSTs
>>>>>>>> (the max stripe count will be 11).
>>>>>>>> If your striping is correct, the bottleneck can be your client
>>>>>>>> network.
>>>>>>>>
>>>>>>>> regards,
>>>>>>>>
>>>>>>>> Wojciech
>>>>>>>>
>>>>>>>> On 23 May 2011 22:35, <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> The stripe count is 48.
>>>>>>>>>
>>>>>>>>> Just FYI, this is what my application does:
>>>>>>>>> A simple I/O test where threads continually write blocks of
>>>>>>>>> size 64 KB or 1 MB (decided at compile time) until a large
>>>>>>>>> file of, say, 16 GB is created.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Kshitij
>>>>>>>>>
>>>>>>>>>> What is your stripe count on the file? If your default is 1,
>>>>>>>>>> you are only writing to one of the OSTs. You can check with
>>>>>>>>>> the 'lfs getstripe' command; you can set the stripe count
>>>>>>>>>> bigger, and hopefully your wide-striped file with threaded
>>>>>>>>>> writes will be faster.
>>>>>>>>>>
>>>>>>>>>> Evan
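[The striping advice above can be applied ahead of time with
'lfs setstripe -c 8 <dir>' on the output directory, or from the
application itself via liblustreapi. Below is a hedged sketch of the
latter; the header name and exact API availability vary across Lustre
versions, and this is an untested illustration of the advice, not code
from the thread.]

    /* Sketch: create the output file with 8 stripes before the threads
     * open it. Link with -llustreapi. */
    #include <lustre/liblustreapi.h>
    #include <stdio.h>

    static int create_striped(const char *path)
    {
        /* 1 MB stripe size, stripe offset -1 (let the MDS pick the
         * starting OST), stripe count 8, pattern 0 (RAID0 default). */
        int rc = llapi_file_create(path, 1 << 20, -1, 8, 0);
        if (rc)
            fprintf(stderr, "llapi_file_create(%s) failed: %d\n",
                    path, rc);
        return rc;
    }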
>>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: [email protected]
>>>>>>>>>> [mailto:[email protected]] On Behalf Of
>>>>>>>>>> [email protected]
>>>>>>>>>> Sent: Monday, May 23, 2011 2:28 PM
>>>>>>>>>> To: [email protected]
>>>>>>>>>> Subject: [Lustre-community] Poor multithreaded I/O performance
>>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>> I am running a multithreaded application that writes to a
>>>>>>>>>> common shared file on a Lustre fs, and this is what I see:
>>>>>>>>>>
>>>>>>>>>> If I have a single thread in my application, I get a bandwidth
>>>>>>>>>> of approx. 250 MB/s (11 OSTs, 1 MB stripe size). However, if I
>>>>>>>>>> spawn 8 threads such that all of them write to the same file
>>>>>>>>>> (non-overlapping locations), without explicitly synchronizing
>>>>>>>>>> the writes (i.e. I don't lock the file handle), I still get
>>>>>>>>>> the same bandwidth.
>>>>>>>>>>
>>>>>>>>>> Now, instead of writing to a shared file, if these threads
>>>>>>>>>> write to separate files, the bandwidth obtained is approx.
>>>>>>>>>> 700 MB/s.
>>>>>>>>>>
>>>>>>>>>> I would ideally like my multithreaded application to see
>>>>>>>>>> similar scaling. Any ideas why the performance is limited, and
>>>>>>>>>> any workarounds?
>>>>>>>>>>
>>>>>>>>>> Thank you,
>>>>>>>>>> Kshitij

Cheers, Andreas
--
Andreas Dilger
Principal Engineer
Whamcloud, Inc.

_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
