On 2011-06-09, at 11:57 AM, Kshitij Mehta wrote:
> I read in a research paper
> (http://ft.ornl.gov/pubs-archive/2007-CCGrid-file-joining.pdf) about
> Lustre's ability to join files in place. Can someone point me to sample
> code and documentation on this? I couldn't find information in the
> manual. Being able to join files in place could be a potential solution
> to the issue I have.
That feature was mostly experimental, and has been disabled in newer
versions of Lustre.

> On 06/06/2011 01:20 PM, [email protected] wrote:
>>> are the separate files being striped 8 ways?
>>> Because that would allow them to hit possibly all 64 OSTs, while the
>>> shared file case will only hit 8.
>>
>> Yes, I found out that the files are getting striped 8 ways, so we end
>> up hitting 64 OSTs. This is what I tried next:
>>
>> 1. Ran a test case where 6 threads write separate files, each of size
>> 6 GB, to a directory configured over 8 OSTs. Thus the application
>> writes 36 GB of data in total, over 48 OSTs.
>>
>> 2. Ran a test case where 8 threads write a common file of size 36 GB
>> to a directory configured over 48 OSTs.
>>
>> Thus both tests ultimately write 36 GB of data over 48 OSTs. I still
>> see a bandwidth of 240 MB/s for test 2 (common file), and a bandwidth
>> of 740 MB/s for test 1 (separate files).
>>
>> Thanks,
>> Kshitij
>>
>>> I've been trying to test this, but not finding an obvious error... so
>>> more questions:
>>>
>>> How much RAM do you have on your client, and how much on the OSTs?
>>> Some of my smaller tests go much faster, but I believe that is due to
>>> cache effects. My larger test at 32 GB gives pretty consistent
>>> results.
>>>
>>> The other thing to consider: are the separate files being striped 8
>>> ways? Because that would allow them to hit possibly all 64 OSTs,
>>> while the shared file case will only hit 8.
>>>
>>> Evan
>>>
>>> -----Original Message-----
>>> From: [email protected]
>>> [mailto:[email protected]] On Behalf Of Felix,
>>> Evan J
>>> Sent: Friday, June 03, 2011 9:09 AM
>>> To: [email protected]
>>> Cc: Lustre discuss
>>> Subject: Re: [Lustre-discuss] Poor multithreaded I/O performance
>>>
>>> What file sizes and segment sizes are you using for your tests?
>>>
>>> Evan
>>>
>>> -----Original Message-----
>>> From: [email protected]
>>> [mailto:[email protected]] On Behalf Of
>>> [email protected]
>>> Sent: Thursday, June 02, 2011 5:07 PM
>>> To: [email protected]
>>> Cc: [email protected]; Lustre discuss
>>> Subject: Re: [Lustre-discuss] Poor multithreaded I/O performance
>>>
>>> Hello,
>>> I was wondering if anyone could replicate the performance of the
>>> multithreaded application using the C file that I posted in my
>>> previous email.
>>>
>>> Thanks,
>>> Kshitij
>>>
>>>> OK, I ran the following tests:
>>>>
>>>> [1]
>>>> The application spawns 8 threads. I write to a Lustre fs with 8
>>>> OSTs. Each thread writes data in blocks of 1 MB in a round-robin
>>>> fashion, i.e.:
>>>>
>>>> T0 writes to offsets 0, 8 MB, 16 MB, etc.
>>>> T1 writes to offsets 1 MB, 9 MB, 17 MB, etc.
>>>>
>>>> The stripe size being 1 MB, every thread ends up writing to only
>>>> 1 OST.
>>>>
>>>> I see a bandwidth of 280 MB/s, similar to the single-thread
>>>> performance.
>>>>
>>>> [2]
>>>> I also ran the same test such that every thread writes data in
>>>> blocks of 8 MB for the same stripe size. (Thus, every thread will
>>>> write to every OST.) I still get similar performance, ~280 MB/s, so
>>>> essentially I see no difference between each thread writing to a
>>>> single OST vs. each thread writing to all OSTs.
>>>>
>>>> And as I said before, if all threads write to their own separate
>>>> file, the resulting bandwidth is ~700 MB/s.
>>>>
>>>> I have attached my C file (simple_io_test.c) herewith. Maybe you
>>>> could run it and see where the bottleneck is. Comments and
>>>> instructions for compilation have been included in the file. Do let
>>>> me know if you need any clarification on that.
>>>>
>>>> Your help is appreciated,
>>>> Kshitij
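[For readers without the attachment: the round-robin pattern from test
[1] boils down to something like the sketch below. This is a minimal
illustration assuming POSIX threads and pwrite; the file path, sizes,
and names are placeholders, not the actual simple_io_test.c.]

    /* Minimal sketch of the round-robin shared-file write pattern.
     * Build: gcc -O2 -pthread rr_write.c -o rr_write
     * Illustrative only; the real simple_io_test.c may differ. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define NTHREADS 8
    #define BLOCK    (1UL << 20)  /* 1 MB blocks, matching the 1 MB stripe */
    #define FILESIZE (1UL << 30)  /* 1 GB total, kept small for illustration */

    static int fd;                /* file descriptor shared by all threads */

    static void *writer(void *arg)
    {
        long tid = (long)arg;
        char *buf = malloc(BLOCK);
        memset(buf, 'x', BLOCK);
        /* Thread t writes blocks t, t+8, t+16, ..., so with a 1 MB
         * stripe over 8 OSTs each thread keeps hitting the same OST. */
        for (off_t off = tid * BLOCK; off < (off_t)FILESIZE;
             off += NTHREADS * BLOCK)
            if (pwrite(fd, buf, BLOCK, off) != (ssize_t)BLOCK)
                perror("pwrite");
        free(buf);
        return NULL;
    }

    int main(void)
    {
        pthread_t tids[NTHREADS];
        fd = open("/mnt/lustre/testfile", O_WRONLY | O_CREAT, 0644);
        if (fd < 0) { perror("open"); return 1; }
        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&tids[t], NULL, writer, (void *)t);
        for (long t = 0; t < NTHREADS; t++)
            pthread_join(tids[t], NULL);
        close(fd);
        return 0;
    }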
>>>>> This is what my application does:
>>>>>
>>>>> Each thread has its own file descriptor to the file.
>>>>> I use pwrite to ensure non-overlapping regions, as follows:
>>>>>
>>>>> Thread 0, data_size: 1MB, offset: 0
>>>>> Thread 1, data_size: 1MB, offset: 1MB
>>>>> Thread 2, data_size: 1MB, offset: 2MB
>>>>> Thread 3, data_size: 1MB, offset: 3MB
>>>>>
>>>>> <repeat cycle>
>>>>> Thread 0, data_size: 1MB, offset: 4MB, and so on. (This happens in
>>>>> parallel; I don't wait for one cycle to end before the next one
>>>>> begins.)
>>>>>
>>>>> I am going to try the following:
>>>>> a)
>>>>> Instead of a round-robin distribution of offsets, test with
>>>>> sequential offsets:
>>>>> Thread 0, data_size: 1MB, offset: 0
>>>>> Thread 0, data_size: 1MB, offset: 1MB
>>>>> Thread 0, data_size: 1MB, offset: 2MB
>>>>> Thread 0, data_size: 1MB, offset: 3MB
>>>>>
>>>>> Thread 1, data_size: 1MB, offset: 4MB
>>>>> and so on. (I am going to keep these as separate pwrite I/O
>>>>> requests instead of merging them or using writev.)
>>>>>
>>>>> b)
>>>>> Map the threads to the number of OSTs using some modulo, as
>>>>> suggested in the email below.
>>>>>
>>>>> c)
>>>>> Experiment with a smaller number of OSTs (I currently have 48).
>>>>>
>>>>> I shall report back with my findings.
>>>>>
>>>>> Thanks,
>>>>> Kshitij
>>>>>
>>>>>> [Moved to Lustre-discuss]
>>>>>>
>>>>>> "However, if I spawn 8 threads such that all of them write to the
>>>>>> same file (non-overlapping locations), without explicitly
>>>>>> synchronizing the writes (i.e. I don't lock the file handle)"
>>>>>>
>>>>>> How exactly does your multi-threaded application write the data?
>>>>>> Are you using pwrite to ensure non-overlapping regions, or are
>>>>>> they all just doing unlocked write() operations on the same fd
>>>>>> (each just transferring size/8)? If it divides the file into N
>>>>>> pieces, and each thread does pwrite on its piece, then what each
>>>>>> OST sees are multiple streams at wide offsets to the same object,
>>>>>> which could impact performance.
>>>>>>
>>>>>> If, on the other hand, the file is written sequentially, where
>>>>>> each thread grabs the next piece to be written (locking normally
>>>>>> used for the current_offset value, so you know where each chunk is
>>>>>> actually going), then you get a more sequential pattern at the
>>>>>> OST.
>>>>>>
>>>>>> If the number of threads maps to the number of OSTs (or some
>>>>>> modulo, like in your case 6 OSTs per thread), and each thread
>>>>>> "owns" the piece of the file that belongs to an OST (i.e.:
>>>>>> for (offset = thread_num * 6MB; offset < size; offset += 48MB)
>>>>>>     pwrite(fd, buf, 6MB, offset);
>>>>>> ), then you've eliminated the need for application locks (assuming
>>>>>> the use of pwrite) and ensured each OST object is being written
>>>>>> sequentially.
>>>>>>
>>>>>> It's quite possible there is some bottleneck on the shared fd. So
>>>>>> perhaps the question is not why you aren't scaling with more
>>>>>> threads, but why the single file is not able to saturate the
>>>>>> client, or why the file bandwidth is not scaling with more OSTs.
>>>>>> It is somewhat common for multiple processes (on different nodes)
>>>>>> to write non-overlapping regions of the same file; does
>>>>>> performance improve if each thread opens its own file descriptor?
>>>>>>
>>>>>> Kevin
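[Kevin's loop, spelled out: with 8 threads, 1 MB stripes, and 48 OSTs,
each thread "owns" a contiguous 6 MB slice of every 48 MB cycle, so it
writes to 6 OSTs and each OST object is written sequentially. Below is
a minimal drop-in variant of the writer sketch earlier in the thread,
reusing its fd, NTHREADS, and FILESIZE; the sizes are taken from
Kevin's example and are illustrative assumptions, not measured values.]

    #define CHUNK (6UL << 20)   /* 6 MB owned by each thread per cycle */
    #define CYCLE (48UL << 20)  /* 48 MB = NTHREADS * CHUNK full cycle */

    static void *writer_aligned(void *arg)
    {
        long tid = (long)arg;
        char *buf = malloc(CHUNK);
        memset(buf, 'x', CHUNK);
        /* pwrite carries its own offset, so no shared file pointer is
         * updated and no application-level locking is needed. */
        for (off_t off = tid * CHUNK; off < (off_t)FILESIZE; off += CYCLE)
            if (pwrite(fd, buf, CHUNK, off) != (ssize_t)CHUNK)
                perror("pwrite");
        free(buf);
        return NULL;
    }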
>>>>>>
>>>>>> Wojciech Turek wrote:
>>>>>>> Ok, so it looks like you have 64 OSTs in total and your output
>>>>>>> file is striped across 48 of them. May I suggest that you limit
>>>>>>> the number of stripes? A good number to start with would be 8
>>>>>>> stripes, and for best results use the OST pools feature to
>>>>>>> arrange that each stripe goes to an OST owned by a different OSS.
>>>>>>>
>>>>>>> regards,
>>>>>>>
>>>>>>> Wojciech
>>>>>>>
>>>>>>> On 23 May 2011 23:09, <[email protected]> wrote:
>>>>>>>
>>>>>>> Actually, 'lfs check servers' returns 64 entries as well, so I
>>>>>>> presume the system documentation is out of date.
>>>>>>>
>>>>>>> Again, I am sorry the basic information had been incorrect.
>>>>>>>
>>>>>>> - Kshitij
>>>>>>>
>>>>>>>> Run 'lfs getstripe <your_output_file>' and paste the output of
>>>>>>>> that command to the mailing list.
>>>>>>>> A stripe count of 48 is not possible if you have at most 11 OSTs
>>>>>>>> (the max stripe count will be 11).
>>>>>>>> If your striping is correct, the bottleneck can be your client
>>>>>>>> network.
>>>>>>>>
>>>>>>>> regards,
>>>>>>>>
>>>>>>>> Wojciech
>>>>>>>>
>>>>>>>> On 23 May 2011 22:35, <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> The stripe count is 48.
>>>>>>>>>
>>>>>>>>> Just FYI, this is what my application does:
>>>>>>>>> A simple I/O test where threads continually write blocks of
>>>>>>>>> size 64 KB or 1 MB (decided at compile time) until a large
>>>>>>>>> file of, say, 16 GB is created.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Kshitij
>>>>>>>>>
>>>>>>>>>> What is your stripe count on the file? If your default is 1,
>>>>>>>>>> you are only writing to one of the OSTs. You can check with
>>>>>>>>>> the 'lfs getstripe' command; you can set the stripe count
>>>>>>>>>> bigger, and hopefully your wide-striped file with threaded
>>>>>>>>>> writes will be faster.
>>>>>>>>>>
>>>>>>>>>> Evan
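[The striping advice above can be applied ahead of time with
'lfs setstripe -c 8 <dir>' on the output directory, or from the
application itself via liblustreapi. Below is a hedged sketch of the
latter; the header name and exact API availability vary across Lustre
versions, and this is an untested illustration of the advice, not code
from the thread.]

    /* Sketch: create the output file with 8 stripes before the threads
     * open it. Link with -llustreapi. */
    #include <lustre/liblustreapi.h>
    #include <stdio.h>

    static int create_striped(const char *path)
    {
        /* 1 MB stripe size, stripe offset -1 (let the MDS pick the
         * starting OST), stripe count 8, pattern 0 (RAID0 default). */
        int rc = llapi_file_create(path, 1 << 20, -1, 8, 0);
        if (rc)
            fprintf(stderr, "llapi_file_create(%s) failed: %d\n",
                    path, rc);
        return rc;
    }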
>>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: [email protected]
>>>>>>>>>> [mailto:[email protected]] On Behalf Of
>>>>>>>>>> [email protected]
>>>>>>>>>> Sent: Monday, May 23, 2011 2:28 PM
>>>>>>>>>> To: [email protected]
>>>>>>>>>> Subject: [Lustre-community] Poor multithreaded I/O performance
>>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>> I am running a multithreaded application that writes to a
>>>>>>>>>> common shared file on a Lustre fs, and this is what I see:
>>>>>>>>>>
>>>>>>>>>> If I have a single thread in my application, I get a bandwidth
>>>>>>>>>> of approx. 250 MB/s (11 OSTs, 1 MB stripe size). However, if I
>>>>>>>>>> spawn 8 threads such that all of them write to the same file
>>>>>>>>>> (non-overlapping locations), without explicitly synchronizing
>>>>>>>>>> the writes (i.e. I don't lock the file handle), I still get
>>>>>>>>>> the same bandwidth.
>>>>>>>>>>
>>>>>>>>>> Now, instead of writing to a shared file, if these threads
>>>>>>>>>> write to separate files, the bandwidth obtained is approx.
>>>>>>>>>> 700 MB/s.
>>>>>>>>>>
>>>>>>>>>> I would ideally like my multithreaded application to see
>>>>>>>>>> similar scaling. Any ideas why the performance is limited, and
>>>>>>>>>> any workarounds?
>>>>>>>>>>
>>>>>>>>>> Thank you,
>>>>>>>>>> Kshitij

Cheers, Andreas
--
Andreas Dilger
Principal Engineer
Whamcloud, Inc.

_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
