On Fri, Oct 21, 2011 at 1:34 PM, Kshitij Mehta <[email protected]> wrote:
> Apologies for opening an old thread,
>
> By default, PVFS uses eight 256k buffers to transfer data to a server. Once
> the connection is made, PVFS transmits data to the server using these
> 256k-sized buffers as fast as it can. You can think of the 8 buffers as the
> PVFS window size (if you are familiar with TCP terminology). With 20 I/O
> servers, you have 20 of these windows pushing out data over the network
> just as fast as possible.
>
> How does pvfs2 write non-contiguous data chunks to a single server? Using a
> list I/O interface like writev? Or does it issue separate write calls for
> every 64K chunk of data to be written to a server?
>

The application writer doesn't have to do anything other than issue a single
write. The PVFS client takes any given write request and distributes it
across a set of servers in a simple, round-robin way. The client breaks the
data into chunks and issues writes to each server simultaneously. For
example, if we have 1MB of data coming in, the stripe size is 64k, and there
are 4 I/O servers, then each server gets four 64k chunks. You can think of
the 1MB of data as an array of 64k chunks numbered 0 to 15. Server 1 would
get chunks 0,4,8,12, server 2 gets chunks 1,5,9,13, server 3 gets chunks
2,6,10,14, and server 4 gets chunks 3,7,11,15. The client sends all 4 chunks
destined for a given server at the same time. All of this is done
transparently. The client waits for each of the 4 servers to respond with a
status and THEN returns back to the application. (A small sketch of this
chunk-to-server mapping follows below.)

What I just described is the default behavior, which is a simple stripe
distribution. However, you can define your own distribution, and PVFS will
chunk the data as prescribed by that distribution. At this point, I don't
know of anyone using a different distribution.

> Also, is this documented somewhere, or do you generally look at the source
> code to figure such things out?
>

I have learned all of this by looking at the code because I am a PVFS
developer; however, there is a doc directory that you can "make" that will
also describe some of the major functionality.

Hope this helps!
Becky
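To make the round-robin mapping above concrete, here is a minimal C sketch
(illustration only, not PVFS source; the 64k stripe, 1MB write, and 4 servers
are just the values from the example) that prints which chunks land on which
server under the default simple-stripe distribution:

/* Minimal illustration of the simple-stripe layout described above.
 * Not PVFS code -- just the arithmetic: chunk i of a write goes to
 * server (i % NSERVERS), so with 4 servers and 16 x 64k chunks,
 * server 1 gets chunks 0,4,8,12, server 2 gets 1,5,9,13, and so on.
 */
#include <stdio.h>

#define STRIPE_SIZE (64 * 1024)   /* 64k stripe unit */
#define NSERVERS    4             /* number of I/O servers */
#define WRITE_SIZE  (1024 * 1024) /* 1MB application write */

int main(void)
{
    int nchunks = WRITE_SIZE / STRIPE_SIZE;   /* 16 chunks */

    for (int s = 0; s < NSERVERS; s++) {
        printf("server %d gets chunks:", s + 1);
        for (int c = s; c < nchunks; c += NSERVERS)
            printf(" %d", c);
        printf("\n");
    }
    return 0;
}

Running it prints exactly the assignment in the example: server 1 gets chunks
0, 4, 8, 12, server 2 gets 1, 5, 9, 13, and so on.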
> Thanks,
> Kshitij
>
> On 10/09/2011 03:03 PM, Becky Ligon wrote:
>
> The dd block size determines how much data is given to PVFS2 in any one
> write request. Thus, if the write request is given 2MB of data, that data
> is divided up and sent to the 20 I/O servers all at the same time (see note
> below). If the write request is given only 64K of data, then a request is
> sent to the one server where the next 64k is to be written. So larger
> requests generally achieve better throughput than small requests, depending
> on your network delay, how busy your servers are, and the number of I/O
> servers in your filesystem. There is also some overhead associated with
> moving the data from user space to kernel space, so you incur more OS
> overhead using 64k blocks than you would with 2MB blocks.
>
> For example, if you use the linux command "cp" and compare its performance
> with "pvfs2-cp" when copying a large amount of data from a unix filesystem
> into a PVFS filesystem, you will immediately notice that pvfs2-cp is faster
> than cp. pvfs2-cp performs better because it uses a default buffer size of
> 10MB, while cp uses the stripe size, in your case 64k. So it takes cp
> longer than pvfs2-cp to transfer the same amount of data.
>
> NOTE: By default, PVFS uses eight 256k buffers to transfer data to a
> server. Once the connection is made, PVFS transmits data to the server
> using these 256k-sized buffers as fast as it can. You can think of the 8
> buffers as the PVFS window size (if you are familiar with TCP terminology).
> With 20 I/O servers, you have 20 of these windows pushing out data over the
> network just as fast as possible.
>
> Hope this helps!
> Becky
>
> On Sun, Oct 9, 2011 at 5:34 AM, belcampo <[email protected]> wrote:
>
>> On 10/06/2011 10:36 PM, Kshitij Mehta wrote:
>>
>>> Hello,
>>> I have a pvfs2 file system configured over 20 I/O servers with a default
>>> stripe size of 64Kbytes.
>>> I am running a simple test program where I write a matrix to a file.
>>>
>>> This is what I see:
>>> If the 1GByte matrix is written in block sizes of 2Mbytes, the
>>> performance is much better than writing the matrix in blocks of 64Kbytes.
>>> I am not sure I understand why. Since the stripe size is 64KB, every
>>> block of 2MB eventually gets broken into 64KB blocks which are written to
>>> the I/O servers, so the performance should be nearly equal. I would
>>> understand why writing with a block size smaller than the stripe size
>>> should perform badly, but once the block size exceeds the stripe size, I
>>> expect the performance to level off.
>>>
>>> Can someone explain what happens here? Your help is appreciated.
>>>
>> I can't explain, only confirm. I also did some tests, with the following
>> results:
>>
>> with pvfs2-cp
>>   18.18
>>
>> over pvfs2fuse
>>   dd blocksize   MB/s
>>   4k              4.4
>>   8k              6.3
>>   16k             7.3
>>   32k             8.8
>>   64k             9.9
>>   128k           18.7
>>   256k           18.7
>>   512k           18.8
>>   1024k          18.8
>>   2048k          18.8
>>
>> over pvfs2fuse
>>   cp              8.2
>>   rsync          14.8
>>
>> over nfs
>>   cp             10.6
>>   rsync          11.0
>>
>> Further, it was mentioned earlier that ongoing effort is being put into
>> optimizing pvfs2/orangefs for small file sizes. So AFAIK it is by design,
>> though I don't know the reasoning behind it.
>>
>>> Best,
>>> Kshitij Mehta
>
> --
> Becky Ligon
> OrangeFS Support and Development
> Omnibond Systems
> Anderson, South Carolina

--
Becky Ligon
OrangeFS Support and Development
Omnibond Systems
Anderson, South Carolina
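To tie the dd block-size numbers in the quoted thread back to the striping
arithmetic, here is another small C sketch (illustration only, not PVFS
internals; the 64k stripe and 20 servers are the values from this thread)
showing why a 2MB request can keep all 20 servers busy at once while a 64k
request reaches only one server per call:

/* Rough sketch of why the dd block size matters with a 64k stripe and
 * 20 I/O servers: a single write request only fans out across as many
 * servers as it has stripe units.  Illustration only, not PVFS code.
 */
#include <stdio.h>

static void report(long request_bytes, long stripe, long nservers)
{
    long chunks = (request_bytes + stripe - 1) / stripe;
    long servers_hit = chunks < nservers ? chunks : nservers;

    printf("%7ldk request -> %3ld stripe units, %2ld server(s) busy in parallel\n",
           request_bytes / 1024, chunks, servers_hit);
}

int main(void)
{
    long stripe = 64 * 1024;

    report(64 * 1024, stripe, 20);        /* one server at a time   */
    report(2 * 1024 * 1024, stripe, 20);  /* all 20 servers at once */
    return 0;
}

This matches the shape of belcampo's dd table: throughput climbs as the block
size grows past the stripe size and more servers work in parallel, while the
plateau around 18.8 MB/s at large block sizes is presumably set by some other
limit (the fuse path or the client's network link) rather than by the servers.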
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
