What file sizes and segment sizes are you using for your tests?

Evan
-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of [email protected]
Sent: Thursday, June 02, 2011 5:07 PM
To: [email protected]
Cc: [email protected]; Lustre discuss
Subject: Re: [Lustre-discuss] Poor multithreaded I/O performance

Hello,
I was wondering if anyone could replicate the performance of the
multithreaded application using the C file that I posted in my previous
email.

Thanks,
Kshitij

> OK, I ran the following tests:
>
> [1]
> The application spawns 8 threads and writes to a Lustre file striped
> across 8 OSTs. Each thread writes data in blocks of 1 MByte in a
> round-robin fashion, i.e.
>
> T0 writes to offsets 0, 8MB, 16MB, etc.
> T1 writes to offsets 1MB, 9MB, 17MB, etc.
>
> The stripe size being 1 MByte, every thread ends up writing to only
> one OST.
>
> I see a bandwidth of 280 MBytes/sec, similar to the single-thread
> performance.
>
> [2]
> I also ran the same test such that every thread writes data in blocks
> of 8 MBytes for the same stripe size (thus, every thread writes to
> every OST). I still get similar performance, ~280 MBytes/sec, so
> essentially I see no difference between each thread writing to a
> single OST and each thread writing to all OSTs.
>
> And as I said before, if all threads write to their own separate
> files, the resulting bandwidth is ~700 MBytes/sec.
>
> I have attached my C file (simple_io_test.c) herewith. Maybe you
> could run it and see where the bottleneck is. Comments and
> instructions for compilation are included in the file. Do let me know
> if you need any clarification on that.
>
> Your help is appreciated,
> Kshitij
>
>> This is what my application does:
>>
>> Each thread has its own file descriptor to the file.
>> I use pwrite to ensure non-overlapping regions, as follows:
>>
>> Thread 0, data_size: 1MB, offset: 0
>> Thread 1, data_size: 1MB, offset: 1MB
>> Thread 2, data_size: 1MB, offset: 2MB
>> Thread 3, data_size: 1MB, offset: 3MB
>>
>> <repeat cycle>
>> Thread 0, data_size: 1MB, offset: 4MB
>> and so on. (This happens in parallel; I don't wait for one cycle to
>> end before the next one begins.)
>>
>> I am going to try the following:
>>
>> a)
>> Instead of a round-robin distribution of offsets, test with
>> sequential offsets:
>> Thread 0, data_size: 1MB, offset: 0
>> Thread 0, data_size: 1MB, offset: 1MB
>> Thread 0, data_size: 1MB, offset: 2MB
>> Thread 0, data_size: 1MB, offset: 3MB
>>
>> Thread 1, data_size: 1MB, offset: 4MB
>> and so on. (I am going to keep these as separate pwrite I/O requests
>> instead of merging them or using writev.)
>>
>> b)
>> Map the threads to the number of OSTs using some modulo, as
>> suggested in the email below.
>>
>> c)
>> Experiment with a smaller number of OSTs (I currently have 48).
>>
>> I shall report back with my findings.
>>
>> Thanks,
>> Kshitij
>>
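(For anyone who wants to reproduce this without digging the attachment
out of the archives: the round-robin pattern from test [1] boils down
to the sketch below. It is not the attached simple_io_test.c; the file
path, total size, and the bare-bones error handling are illustrative
assumptions. Build with: gcc -O2 -pthread rr_pwrite.c -o rr_pwrite)

/* Minimal sketch of the round-robin shared-file test described above. */
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NTHREADS  8
#define BLOCK     (1L << 20)                 /* 1 MByte per pwrite */
#define FILE_SIZE (1LL << 34)                /* 16 GBytes in total */

static const char *path = "/mnt/lustre/shared_file";  /* illustrative */

static void *writer(void *arg)
{
    long tid = (long)arg;
    char *buf;
    off_t off;
    int fd;

    /* Each thread opens its own descriptor, as in the test. */
    fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return NULL; }

    buf = malloc(BLOCK);
    memset(buf, 'x', BLOCK);

    /* Thread t writes offsets t*1MB, (t+8)*1MB, (t+16)*1MB, ...; with
     * a 1 MByte stripe over 8 OSTs, each thread keeps hitting one OST. */
    for (off = (off_t)tid * BLOCK; off < FILE_SIZE; off += NTHREADS * BLOCK)
        if (pwrite(fd, buf, BLOCK, off) != BLOCK)
            perror("pwrite");

    free(buf);
    close(fd);
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    long i;

    for (i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, writer, (void *)i);
    for (i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}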
>>> [Moved to Lustre-discuss]
>>>
>>> "However, if I spawn 8 threads such that all of them write to the
>>> same file (non-overlapping locations), without explicitly
>>> synchronizing the writes (i.e. I don't lock the file handle)"
>>>
>>> How exactly does your multi-threaded application write the data?
>>> Are you using pwrite to ensure non-overlapping regions, or are they
>>> all just doing unlocked write() operations on the same fd (each
>>> just transferring size/8)? If it divides the file into N pieces,
>>> and each thread does pwrite on its piece, then what each OST sees
>>> is multiple streams at wide offsets to the same object, which could
>>> impact performance.
>>>
>>> If, on the other hand, the file is written sequentially, where each
>>> thread grabs the next piece to be written (locking is normally used
>>> for the current_offset value, so you know where each chunk is
>>> actually going), then you get a more sequential pattern at the OST.
>>>
>>> If the number of threads maps to the number of OSTs (or some
>>> modulo, like in your case 6 OSTs per thread), and each thread
>>> "owns" the piece of the file that belongs to an OST (i.e.
>>> for (offset = thread_num * 6MB; offset < size; offset += 48MB)
>>> pwrite(fd, buf, 6MB, offset); ), then you've eliminated the need
>>> for application locks (assuming the use of pwrite) and ensured each
>>> OST object is written sequentially.
>>>
>>> It's quite possible there is some bottleneck on the shared fd. So
>>> perhaps the question is not why you aren't scaling with more
>>> threads, but why the single file is not able to saturate the
>>> client, or why the file bandwidth is not scaling with more OSTs. It
>>> is somewhat common for multiple processes (on different nodes) to
>>> write non-overlapping regions of the same file; does performance
>>> improve if each thread opens its own file descriptor?
>>>
>>> Kevin
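(Kevin's "grab the next piece" pattern would look roughly like the
sketch below. It reuses path, BLOCK, and FILE_SIZE from the sketch
earlier in the thread and drops in as a replacement for its writer();
the mutex-protected current_offset is the only shared state.)

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static off_t current_offset = 0;

static void *seq_writer(void *arg)
{
    char *buf;
    off_t off;
    int fd;

    (void)arg;
    fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return NULL; }

    buf = malloc(BLOCK);
    memset(buf, 'x', BLOCK);

    for (;;) {
        /* Lock only the offset bookkeeping, never the I/O itself. */
        pthread_mutex_lock(&lock);
        off = current_offset;
        current_offset += BLOCK;
        pthread_mutex_unlock(&lock);

        if (off >= FILE_SIZE)
            break;
        if (pwrite(fd, buf, BLOCK, off) != BLOCK)
            perror("pwrite");
    }

    free(buf);
    close(fd);
    return NULL;
}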
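(And his OST-aligned loop, fleshed out under the same illustrative
assumptions: 48 OSTs, 8 threads, 1 MByte stripe, hence 6 MByte pieces
per thread per cycle. Again a drop-in replacement for writer() in the
first sketch; it assumes FILE_SIZE is a multiple of the 48 MByte cycle.)

#define CHUNK (6L * BLOCK)    /* 6 MBytes = 6 stripes = 6 OSTs per piece */
#define CYCLE (48L * BLOCK)   /* full stripe cycle: 48 OSTs x 1 MByte */

static void *ost_writer(void *arg)
{
    long tid = (long)arg;
    char *buf;
    off_t off;
    int fd;

    fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return NULL; }

    buf = malloc(CHUNK);
    memset(buf, 'x', CHUNK);

    /* Kevin's loop:
     *   for (offset = thread_num * 6MB; offset < size; offset += 48MB)
     *       pwrite(fd, buf, 6MB, offset);
     * Each thread owns the file pieces living on "its" 6 OSTs, so every
     * OST object is written sequentially and no application lock is
     * needed. */
    for (off = (off_t)tid * CHUNK; off < FILE_SIZE; off += CYCLE)
        if (pwrite(fd, buf, CHUNK, off) != CHUNK)
            perror("pwrite");

    free(buf);
    close(fd);
    return NULL;
}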
>>> Wojciech Turek wrote:
>>>> Ok, so it looks like you have 64 OSTs in total and your output
>>>> file is striped across 48 of them. May I suggest that you limit
>>>> the number of stripes; a good number to start with would be 8.
>>>> Also, for best results, use the OST pools feature to arrange that
>>>> each stripe goes to an OST owned by a different OSS.
>>>>
>>>> regards,
>>>>
>>>> Wojciech
>>>>
>>>> On 23 May 2011 23:09, <[email protected]> wrote:
>>>>
>>>> Actually, 'lfs check servers' returns 64 entries as well, so I
>>>> presume the system documentation is out of date.
>>>>
>>>> Again, I am sorry the basic information had been incorrect.
>>>>
>>>> - Kshitij
>>>>
>>>> > Run lfs getstripe <your_output_file> and paste the output of
>>>> > that command to the mailing list.
>>>> > A stripe count of 48 is not possible if you have at most 11 OSTs
>>>> > (the max stripe count would be 11).
>>>> > If your striping is correct, the bottleneck can be your client
>>>> > network.
>>>> >
>>>> > regards,
>>>> >
>>>> > Wojciech
>>>> >
>>>> > On 23 May 2011 22:35, <[email protected]> wrote:
>>>> >
>>>> >> The stripe count is 48.
>>>> >>
>>>> >> Just FYI, this is what my application does:
>>>> >> A simple I/O test where threads continually write blocks of
>>>> >> size 64 KBytes or 1 MByte (decided at compile time) until a
>>>> >> large file of, say, 16 GBytes is created.
>>>> >>
>>>> >> Thanks,
>>>> >> Kshitij
>>>> >>
>>>> >> > What is your stripe count on the file? If your default is 1,
>>>> >> > you are only writing to one of the OSTs. You can check with
>>>> >> > the lfs getstripe command, and you can set the stripe count
>>>> >> > bigger; hopefully your wide-striped file with threaded writes
>>>> >> > will be faster.
>>>> >> >
>>>> >> > Evan
>>>> >> >
>>>> >> > -----Original Message-----
>>>> >> > From: [email protected]
>>>> >> > [mailto:[email protected]] On Behalf Of
>>>> >> > [email protected]
>>>> >> > Sent: Monday, May 23, 2011 2:28 PM
>>>> >> > To: [email protected]
>>>> >> > Subject: [Lustre-community] Poor multithreaded I/O performance
>>>> >> >
>>>> >> > Hello,
>>>> >> > I am running a multithreaded application that writes to a
>>>> >> > common shared file on a Lustre fs, and this is what I see:
>>>> >> >
>>>> >> > If I have a single thread in my application, I get a
>>>> >> > bandwidth of approx. 250 MBytes/sec (11 OSTs, 1 MByte stripe
>>>> >> > size). However, if I spawn 8 threads such that all of them
>>>> >> > write to the same file (non-overlapping locations), without
>>>> >> > explicitly synchronizing the writes (i.e. I don't lock the
>>>> >> > file handle), I still get the same bandwidth.
>>>> >> >
>>>> >> > Now, instead of writing to a shared file, if these threads
>>>> >> > write to separate files, the bandwidth obtained is approx.
>>>> >> > 700 MBytes/sec.
>>>> >> >
>>>> >> > I would ideally like my multithreaded application to see
>>>> >> > similar scaling. Any ideas why the performance is limited,
>>>> >> > and any workarounds?
>>>> >> >
>>>> >> > Thank you,
>>>> >> > Kshitij

_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
