It's part of the lfs Lustre tool; I have not used it myself. Try 'lfs help join'.
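If memory serves, the usage is of the form

    lfs join <file_A> <file_B>

(joining the second file onto the end of the first, in place), but since I haven't tried it, trust the output of 'lfs help join' over me.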
Evan

-----Original Message-----
From: Kshitij Mehta [mailto:[email protected]]
Sent: Thursday, June 09, 2011 10:58 AM
To: [email protected]
Cc: Felix, Evan J; Lustre discuss
Subject: Re: [Lustre-discuss] Poor multithreaded I/O performance

I read in a research paper (http://ft.ornl.gov/pubs-archive/2007-CCGrid-file-joining.pdf) about Lustre's ability to join files in place. Can someone point me to sample code and documentation on this? I couldn't find information in the manual. Being able to join files in place could be a potential solution to the issue I have.

Thanks,
Kshitij

On 06/06/2011 01:20 PM, [email protected] wrote:
>> Are the separate files being striped 8 ways?
>> Because that would allow them to hit possibly all 64 OSTs, while
>> the shared-file case will only hit 8.
> Yes, I found out that the files are getting striped 8 ways, so we end
> up hitting 64 OSTs. This is what I tried next:
>
> 1. Ran a test case where 6 threads write separate files, each of size
> 6 GB, to a directory configured over 8 OSTs. Thus the application
> writes 36 GB of data in total, over 48 OSTs.
>
> 2. Ran a test case where 8 threads write a common file of size 36 GB
> to a directory configured over 48 OSTs.
>
> Thus both tests ultimately write 36 GB of data over 48 OSTs. I still
> see a bandwidth of 240 MB/s for test 2 (common file), and 740 MB/s
> for test 1 (separate files).
>
> Thanks,
> Kshitij
>
>> I've been trying to test this, but not finding an obvious error,
>> so more questions:
>>
>> How much RAM do you have on your client, and how much on the OSTs?
>> Some of my smaller tests go much faster, but I believe that is due
>> to cache effects. My larger test at 32 GB gives pretty consistent
>> results.
>>
>> The other thing to consider: are the separate files being striped 8
>> ways? Because that would allow them to hit possibly all 64 OSTs,
>> while the shared-file case will only hit 8.
>>
>> Evan
>>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of Felix, Evan J
>> Sent: Friday, June 03, 2011 9:09 AM
>> To: [email protected]
>> Cc: Lustre discuss
>> Subject: Re: [Lustre-discuss] Poor multithreaded I/O performance
>>
>> What file sizes and segment sizes are you using for your tests?
>>
>> Evan
>>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of
>> [email protected]
>> Sent: Thursday, June 02, 2011 5:07 PM
>> To: [email protected]
>> Cc: [email protected]; Lustre discuss
>> Subject: Re: [Lustre-discuss] Poor multithreaded I/O performance
>>
>> Hello,
>> I was wondering if anyone could replicate the performance of the
>> multithreaded application using the C file that I posted in my
>> previous email.
>>
>> Thanks,
>> Kshitij
>>
>>> Ok, I ran the following tests:
>>>
>>> [1]
>>> The application spawns 8 threads and writes to a Lustre directory
>>> with 8 OSTs. Each thread writes data in blocks of 1 MB in a
>>> round-robin fashion, i.e.:
>>>
>>> T0 writes to offsets 0, 8 MB, 16 MB, etc.
>>> T1 writes to offsets 1 MB, 9 MB, 17 MB, etc.
>>>
>>> The stripe size being 1 MB, every thread ends up writing to only
>>> one OST.
>>>
>>> I see a bandwidth of 280 MB/s, similar to the single-thread
>>> performance.
>>>
>>> [2]
>>> I also ran the same test such that every thread writes data in
>>> blocks of 8 MB for the same stripe size (thus, every thread writes
>>> to every OST). I still get similar performance, ~280 MB/s, so
>>> essentially I see no difference between each thread writing to a
>>> single OST vs. each thread writing to all OSTs.
>>>
>>> And as I said before, if all threads write to their own separate
>>> file, the resulting bandwidth is ~700 MB/s.
>>>
>>> I have attached my C file (simple_io_test.c) herewith. Maybe you
>>> could run it and see where the bottleneck is. Comments and
>>> instructions for compilation are included in the file. Do let me
>>> know if you need any clarification.
>>>
>>> Your help is appreciated,
>>> Kshitij
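For anyone who wants the shape of test [1] without opening the attachment, it boils down to roughly the following. This is a minimal sketch, not the attached simple_io_test.c; the file path and sizes are illustrative.

    /* Minimal sketch of test [1]: 8 threads, 1 MiB blocks, round-robin
     * offsets, one shared fd. Not the attached simple_io_test.c; the
     * path and sizes are illustrative. Build: gcc -O2 -pthread sketch.c */
    #include <sys/types.h>
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define NTHREADS 8
    #define BLOCK    (1UL << 20)   /* 1 MiB, matches the 1 MiB stripe size */
    #define FILESIZE (1UL << 30)   /* 1 GiB total, kept small for illustration */

    static int fd;                 /* one descriptor shared by all threads */

    static void *writer(void *arg)
    {
        long id = (long)arg;
        char *buf = malloc(BLOCK); /* contents don't matter for a bandwidth test */

        /* Thread id writes blocks id, id+8, id+16, ... so with a 1 MiB
         * stripe over 8 OSTs each thread keeps landing on the same OST. */
        for (off_t off = (off_t)id * BLOCK; off < FILESIZE;
             off += (off_t)NTHREADS * BLOCK)
            pwrite(fd, buf, BLOCK, off);

        free(buf);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];

        fd = open("/mnt/lustre/shared_file", O_CREAT | O_WRONLY, 0644);
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, writer, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        close(fd);
        return 0;
    }

Test [2] is the same loop with BLOCK raised to 8 MiB, so every single write spans all eight OSTs.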
>>>> This is what my application does:
>>>>
>>>> Each thread has its own file descriptor to the file.
>>>> I use pwrite to ensure non-overlapping regions, as follows:
>>>>
>>>> Thread 0, data_size: 1 MB, offset: 0
>>>> Thread 1, data_size: 1 MB, offset: 1 MB
>>>> Thread 2, data_size: 1 MB, offset: 2 MB
>>>> Thread 3, data_size: 1 MB, offset: 3 MB
>>>>
>>>> <repeat cycle>
>>>> Thread 0, data_size: 1 MB, offset: 4 MB
>>>> and so on. (This happens in parallel; I don't wait for one cycle
>>>> to end before the next one begins.)
>>>>
>>>> I am going to try the following:
>>>> a)
>>>> Instead of a round-robin distribution of offsets, test with
>>>> sequential offsets:
>>>> Thread 0, data_size: 1 MB, offset: 0
>>>> Thread 0, data_size: 1 MB, offset: 1 MB
>>>> Thread 0, data_size: 1 MB, offset: 2 MB
>>>> Thread 0, data_size: 1 MB, offset: 3 MB
>>>>
>>>> Thread 1, data_size: 1 MB, offset: 4 MB
>>>> and so on. (I am going to keep these as separate pwrite I/O
>>>> requests instead of merging them or using writev.)
>>>>
>>>> b)
>>>> Map the threads to the number of OSTs using some modulo, as
>>>> suggested in the email below.
>>>>
>>>> c)
>>>> Experiment with a smaller number of OSTs (I currently have 48).
>>>>
>>>> I shall report back with my findings.
>>>>
>>>> Thanks,
>>>> Kshitij
>>>>
>>>>> [Moved to Lustre-discuss]
>>>>>
>>>>> "However, if I spawn 8 threads such that all of them write to the
>>>>> same file (non-overlapping locations), without explicitly
>>>>> synchronizing the writes (i.e. I don't lock the file handle)"
>>>>>
>>>>> How exactly does your multi-threaded application write the data?
>>>>> Are you using pwrite to ensure non-overlapping regions, or are
>>>>> they all just doing unlocked write() operations on the same fd
>>>>> (each just transferring size/8)? If it divides the file into N
>>>>> pieces, and each thread does pwrite on its piece, then what each
>>>>> OST sees are multiple streams at wide offsets to the same object,
>>>>> which could impact performance.
>>>>>
>>>>> If, on the other hand, the file is written sequentially, where
>>>>> each thread grabs the next piece to be written (with locking
>>>>> normally used for the current_offset value, so you know where
>>>>> each chunk is actually going), then you get a more sequential
>>>>> pattern at the OST.
>>>>>
>>>>> If the number of threads maps to the number of OSTs (or some
>>>>> modulo, like in your case 6 OSTs per thread), and each thread
>>>>> "owns" the piece of the file that belongs to an OST, i.e.:
>>>>>
>>>>>     for (offset = thread_num * 6MB; offset < size; offset += 48MB)
>>>>>         pwrite(fd, buf, 6MB, offset);
>>>>>
>>>>> then you've eliminated the need for application locks (assuming
>>>>> the use of pwrite) and ensured each OST object is being written
>>>>> sequentially.
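Spelled out as compilable C, that mapping looks roughly like this. It is a sketch under the assumptions stated above (8 threads, 48 OSTs, 1 MiB stripe size); the function and constant names are illustrative, not taken from simple_io_test.c.

    /* Sketch of the thread-to-OST mapping described above. Assumes 8
     * threads, 48 OSTs, and a 1 MiB stripe size, so each thread "owns"
     * a contiguous 6 MiB slice (6 OSTs) of every 48 MiB stripe cycle. */
    #include <sys/types.h>
    #include <unistd.h>

    #define MiB      (1UL << 20)
    #define NTHREADS 8
    #define NOSTS    48
    #define STRIPE   (1 * MiB)
    #define SLICE    ((NOSTS / NTHREADS) * STRIPE) /* 6 MiB owned per cycle */
    #define CYCLE    ((off_t)NOSTS * STRIPE)       /* 48 MiB, one full cycle */

    /* Each thread calls this with its own thread_num and a SLICE-sized
     * buffer. No application locking is needed: pwrite carries its own
     * offset, the regions never overlap, and each OST object is written
     * sequentially. */
    static void write_slice(int fd, int thread_num, const char *buf, off_t size)
    {
        for (off_t off = (off_t)thread_num * SLICE; off < size; off += CYCLE)
            pwrite(fd, buf, SLICE, off);
    }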
>>>>> It's quite possible there is some bottleneck on the shared fd.
>>>>> So perhaps the question is not why you aren't scaling with more
>>>>> threads, but why the single file is not able to saturate the
>>>>> client, or why the file bandwidth is not scaling with more OSTs.
>>>>> It is somewhat common for multiple processes (on different nodes)
>>>>> to write non-overlapping regions of the same file; does
>>>>> performance improve if each thread opens its own file descriptor?
>>>>>
>>>>> Kevin
>>>>>
>>>>> Wojciech Turek wrote:
>>>>>> Ok, so it looks like you have 64 OSTs in total and your output
>>>>>> file is striped across 48 of them. May I suggest that you limit
>>>>>> the number of stripes? A good number to start with would be 8
>>>>>> stripes; for best results, also use the OST pools feature to
>>>>>> arrange that each stripe goes to an OST owned by a different OSS.
>>>>>>
>>>>>> regards,
>>>>>>
>>>>>> Wojciech
>>>>>>
>>>>>> On 23 May 2011 23:09, [email protected] wrote:
>>>>>>
>>>>>> Actually, 'lfs check servers' returns 64 entries as well, so I
>>>>>> presume the system documentation is out of date.
>>>>>>
>>>>>> Again, I am sorry the basic information was incorrect.
>>>>>>
>>>>>> - Kshitij
>>>>>>
>>>>>> > Run lfs getstripe <your_output_file> and paste the output of
>>>>>> > that command to the mailing list.
>>>>>> > A stripe count of 48 is not possible if you have at most 11
>>>>>> > OSTs (the max stripe count would be 11).
>>>>>> > If your striping is correct, the bottleneck can be your client
>>>>>> > network.
>>>>>> >
>>>>>> > regards,
>>>>>> >
>>>>>> > Wojciech
>>>>>> >
>>>>>> > On 23 May 2011 22:35, [email protected] wrote:
>>>>>> >
>>>>>> >> The stripe count is 48.
>>>>>> >>
>>>>>> >> Just FYI, this is what my application does:
>>>>>> >> A simple I/O test where threads continually write blocks of
>>>>>> >> size 64 KB or 1 MB (decided at compile time) till a large
>>>>>> >> file of, say, 16 GB is created.
>>>>>> >>
>>>>>> >> Thanks,
>>>>>> >> Kshitij
>>>>>> >>
>>>>>> >> > What is your stripe count on the file? If your default is
>>>>>> >> > 1, you are only writing to one of the OSTs. You can check
>>>>>> >> > with the lfs getstripe command; you can set the stripe
>>>>>> >> > bigger, and hopefully your wide-striped file with threaded
>>>>>> >> > writes will be faster.
>>>>>> >> >
>>>>>> >> > Evan
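Concretely, that check-and-widen step looks roughly like the following; the /mnt/lustre paths are placeholders, and -c is the stripe-count flag:

    lfs getstripe /mnt/lustre/output_file       # show the file's current layout
    lfs setstripe -c 8 /mnt/lustre/output_dir   # new files created here get 8 stripes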
>>>>>> >> > -----Original Message-----
>>>>>> >> > From: [email protected]
>>>>>> >> > [mailto:[email protected]] On Behalf
>>>>>> >> > Of [email protected]
>>>>>> >> > Sent: Monday, May 23, 2011 2:28 PM
>>>>>> >> > To: [email protected]
>>>>>> >> > Subject: [Lustre-community] Poor multithreaded I/O
>>>>>> >> > performance
>>>>>> >> >
>>>>>> >> > Hello,
>>>>>> >> > I am running a multithreaded application that writes to a
>>>>>> >> > common shared file on a Lustre fs, and this is what I see:
>>>>>> >> >
>>>>>> >> > If I have a single thread in my application, I get a
>>>>>> >> > bandwidth of approx. 250 MB/s (11 OSTs, 1 MB stripe size).
>>>>>> >> > However, if I spawn 8 threads such that all of them write
>>>>>> >> > to the same file (non-overlapping locations), without
>>>>>> >> > explicitly synchronizing the writes (i.e. I don't lock the
>>>>>> >> > file handle), I still get the same bandwidth.
>>>>>> >> >
>>>>>> >> > Now, instead of writing to a shared file, if these threads
>>>>>> >> > write to separate files, the bandwidth obtained is approx.
>>>>>> >> > 700 MB/s.
>>>>>> >> >
>>>>>> >> > I would ideally like my multithreaded application to see
>>>>>> >> > similar scaling. Any ideas why the performance is limited,
>>>>>> >> > and any workarounds?
>>>>>> >> >
>>>>>> >> > Thank you,
>>>>>> >> > Kshitij

_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
