The client sends the data to the server through the communications layer; so, if your environment is using TCP/IP, sockets are used. On the server side, once a chunk of data is received, the server writes that chunk to the system-level data file using pwrite.
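To make that concrete, here is a minimal standalone sketch of the server-side idea only. It is not the actual PVFS2 code path (the real server goes through its own request-processing and storage layers), and the offset/length header format, read_full(), and handle_chunk() are names invented just for this illustration:

/*
 * Illustrative sketch, NOT the PVFS2/OrangeFS implementation:
 * receive one chunk over a connected TCP socket and store it in the
 * server's local data file with pwrite(), which writes at an explicit
 * offset without moving the file pointer, so many chunks can land in
 * the same file without lseek() or locking.  The 8-byte offset +
 * 4-byte length header is a hypothetical wire format; byte-order
 * conversion is ignored for brevity.
 */
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/* Keep calling recv() until 'len' bytes have arrived (or an error). */
static int read_full(int sock, void *buf, size_t len)
{
    size_t got = 0;
    while (got < len) {
        ssize_t n = recv(sock, (char *)buf + got, len - got, 0);
        if (n <= 0)
            return -1;              /* peer closed or error */
        got += (size_t)n;
    }
    return 0;
}

/* Receive one chunk from 'sock' and write it at its offset in 'datafd'. */
int handle_chunk(int sock, int datafd)
{
    uint64_t offset;
    uint32_t length;
    char *buf;
    ssize_t written;

    if (read_full(sock, &offset, sizeof(offset)) < 0 ||
        read_full(sock, &length, sizeof(length)) < 0)
        return -1;

    buf = malloc(length);
    if (buf == NULL || read_full(sock, buf, length) < 0) {
        free(buf);
        return -1;
    }

    written = pwrite(datafd, buf, length, (off_t)offset);
    free(buf);
    return (written == (ssize_t)length) ? 0 : -1;
}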
Becky

On Tue, Nov 1, 2011 at 4:51 PM, Kshitij Mehta <[email protected]> wrote:
> Hi,
> Thanks for your reply. But here's what I am trying to understand:
>
> Server 1 would get chunks 0,4,8,12, server 2 gets chunks 1,5,9,13, server
> 3 gets chunks 2,6,10,14, and server 4 gets 3,7,11,15. The client sends all
> 4 chunks to a single server all at the same time. All of this is done
> transparently.
>
> Using what low-level I/O interface does the client send the 4 chunks to a
> server? Does it use write()/pwrite() in succession 4 times, or does it use
> a list I/O interface like pwritev() to send all 4 chunks in a *single I/O
> call*? (I understand that all this is kept transparent from the
> application writer; I am trying to understand the low-level details of
> pvfs2 because I am trying to model its I/O operations as part of a project.)
>
> Thanks again,
>
> Kshitij Mehta
> PhD candidate
> Parallel Software Technologies Lab
> Dept. of Computer Science
> University of Houston
> Houston, Texas, USA
>
> On 10/21/2011 01:30 PM, Becky Ligon wrote:
>
> On Fri, Oct 21, 2011 at 1:34 PM, Kshitij Mehta <[email protected]> wrote:
>
>> Apologies for opening an old thread.
>>
>> By default, PVFS uses eight 256k buffers to transfer data to a server.
>> Once the connection is made, PVFS transmits data to the server using
>> these 256k-sized buffers as fast as it can. You can think of the 8
>> buffers as the PVFS window size (if you are familiar with TCP
>> terminology). With 20 I/O servers, you have 20 of these windows pushing
>> out data over the network just as fast as possible.
>>
>> How does pvfs2 write non-contiguous data chunks to a single server?
>> Using a list I/O interface like writev? Or does it issue separate write
>> calls for every 64K chunk of data to be written to a server?
>
> The application writer doesn't have to do anything, other than issue a
> single write. The PVFS client takes any given write request and
> distributes it across a set of servers in a simple, round-robin way. The
> client breaks the data into chunks and issues writes to each server
> simultaneously. For example, if we have 1MB of data coming in, the stripe
> size is 64k, and there are 4 I/O servers, then each server gets four 64k
> chunks. You can think of the 1MB of data as an array of 64k chunks
> numbered 0 to 15. Server 1 would get chunks 0,4,8,12, server 2 gets
> chunks 1,5,9,13, server 3 gets chunks 2,6,10,14, and server 4 gets
> 3,7,11,15. The client sends all 4 chunks to a single server all at the
> same time. All of this is done transparently. The client waits for each
> of the 4 servers to respond with a status and THEN returns back to the
> application.
>
> What I just described is the default behavior, which is a simple stripe
> distribution. However, you can define your own distribution, and PVFS
> will chunk the data as prescribed by that distribution. At this point, I
> don't know of anyone using a different distribution.
>
>> Also, is this documented somewhere, or do you generally look at the
>> source code to figure such things out?
>
> I have learned all of this by looking at the code because I am a PVFS
> developer; however, there is a doc directory that you can "make" that
> will also describe some of the major functionality.
>
> Hope this helps!
>
> Becky
>
>> Thanks,
>> Kshitij
>>
>> On 10/09/2011 03:03 PM, Becky Ligon wrote:
>>
>> The dd block size determines how much data is given to PVFS2 in any one
>> write request. Thus, if the write request is given 2MB of data, that
>> data is divided up and sent to the 20 I/O servers all at the same time
>> (see note below). If the write request is given only 64K of data, then a
>> request is sent to the one server where the next 64k is to be written.
>> So, throughput for larger requests is generally better than for small
>> requests, depending on your network delay, how busy your servers are,
>> and the number of I/O servers in your filesystem. There is also some
>> overhead associated with moving the data from user space to kernel
>> space; so, you incur more OS overhead using 64k blocks than you would
>> with 2MB blocks.
>>
>> For example, if you use the Linux command "cp" and compare its
>> performance with "pvfs2-cp" to copy a large amount of data from a Unix
>> filesystem into a PVFS filesystem, you will immediately notice that
>> pvfs2-cp is faster than cp. pvfs2-cp performs better than cp because it
>> uses a default buffer size of 10MB, while cp uses the stripe size, in
>> your case 64k. So, it will take cp longer to transfer the same amount of
>> data than it will take pvfs2-cp.
>>
>> NOTE: By default, PVFS uses eight 256k buffers to transfer data to a
>> server. Once the connection is made, PVFS transmits data to the server
>> using these 256k-sized buffers as fast as it can. You can think of the 8
>> buffers as the PVFS window size (if you are familiar with TCP
>> terminology). With 20 I/O servers, you have 20 of these windows pushing
>> out data over the network just as fast as possible.
>>
>> Hope this helps!
>> Becky
>>
>> On Sun, Oct 9, 2011 at 5:34 AM, belcampo <[email protected]> wrote:
>>
>>> On 10/06/2011 10:36 PM, Kshitij Mehta wrote:
>>>
>>>> Hello,
>>>> I have a pvfs2 file system configured over 20 I/O servers with a
>>>> default stripe size of 64 Kbytes.
>>>> I am running a simple test program where I write a matrix to a file.
>>>>
>>>> This is what I see: if the 1 GByte matrix is written in block sizes of
>>>> 2 Mbytes, the performance is much better than writing the matrix in
>>>> blocks of 64 Kbytes. I am not sure I understand why. Since the stripe
>>>> size is 64KB, every block of 2MB eventually gets broken into 64KB
>>>> blocks which are written to the I/O servers, so the performance should
>>>> be nearly equal. I would understand why writing with block size <
>>>> stripe_size should perform badly, but when the block size exceeds the
>>>> stripe size, I expect the performance to level off.
>>>>
>>>> Can someone explain what happens here? Your help is appreciated.
>>>
>>> I can't explain, only confirm. I also did some tests, with the
>>> following results.
>>>
>>> with pvfs2-cp:     18.18 MB/s
>>>
>>> over pvfs2fuse, with dd:
>>> dd blocksize   MB/s
>>> 4k              4.4
>>> 8k              6.3
>>> 16k             7.3
>>> 32k             8.8
>>> 64k             9.9
>>> 128k           18.7
>>> 256k           18.7
>>> 512k           18.8
>>> 1024k          18.8
>>> 2048k          18.8
>>>
>>> over pvfs2fuse:
>>> cp              8.2 MB/s
>>> rsync          14.8 MB/s
>>>
>>> over nfs:
>>> cp             10.6 MB/s
>>> rsync          11.0 MB/s
>>>
>>> Also, it was mentioned earlier that ongoing effort is being put into
>>> optimizing pvfs2/orangefs for small file sizes. So AFAIK it is by
>>> design, but I don't know the reasoning behind it.
>>>> Best,
>>>> Kshitij Mehta

--
Becky Ligon
OrangeFS Support and Development
Omnibond Systems
Anderson, South Carolina
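P.S. For your modeling work, a small illustrative sketch (invented names, not the PVFS2 API) of the default simple-stripe mapping described in the quoted thread above: it reproduces the chunks 0,4,8,12 / 1,5,9,13 / 2,6,10,14 / 3,7,11,15 assignment for a 1MB write with a 64k stripe over 4 servers, and also shows where each chunk lands in that server's local data file.

/*
 * Illustrative only -- not the PVFS2/OrangeFS API.  Maps each
 * stripe-sized chunk of a logical write to (server index, server-local
 * offset) under the default simple-stripe (round-robin) distribution.
 */
#include <stdio.h>

#define STRIPE_SIZE  (64 * 1024)   /* 64k stripe, as in the example    */
#define NUM_SERVERS  4             /* 4 I/O servers, as in the example */

int main(void)
{
    size_t write_size = 1024 * 1024;            /* 1MB request      */
    size_t nchunks = write_size / STRIPE_SIZE;  /* 16 chunks, 0..15 */

    for (size_t chunk = 0; chunk < nchunks; chunk++) {
        size_t server    = chunk % NUM_SERVERS;               /* round robin */
        size_t local_off = (chunk / NUM_SERVERS) * STRIPE_SIZE;
        printf("chunk %2zu -> server %zu, local offset %7zu\n",
               chunk, server, local_off);
    }
    /* Output shows server 0 getting chunks 0,4,8,12; server 1 getting
     * 1,5,9,13; and so on -- the thread numbers servers from 1, this
     * sketch from 0. */
    return 0;
}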
