After looking further into the client-side code, I found where the TCP interface calls writev(), passing it a socket, a vector, and a count.
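In case it helps with your model, here is a rough sketch of what a gathered send over a TCP socket looks like with writev(). This is illustrative only, not the actual PVFS client code; the function and variable names are made up.

    #include <sys/types.h>    /* ssize_t, size_t */
    #include <sys/uio.h>      /* writev(), struct iovec */

    /* Illustrative only: gather two non-contiguous memory regions into a
     * single send on an already-connected stream socket 'sock'. */
    static ssize_t send_two_chunks(int sock,
                                   void *buf0, size_t len0,
                                   void *buf1, size_t len1)
    {
        struct iovec iov[2];

        iov[0].iov_base = buf0;   /* first non-contiguous region  */
        iov[0].iov_len  = len0;
        iov[1].iov_base = buf1;   /* second non-contiguous region */
        iov[1].iov_len  = len1;

        /* socket, vector, count -- the three arguments mentioned above */
        return writev(sock, iov, 2);
    }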
Sorry that I didn't post this earlier.

Becky

On Wed, Nov 2, 2011 at 10:10 AM, Becky Ligon <[email protected]> wrote:

> The client sends the data to the server using the communications
> layer. So, if your environment is using TCP/IP, then we are using
> sockets. On the server side, once a chunk of data is received, the
> server writes it to the system-level data file using pwrite.
>
> Becky
>
> On Tue, Nov 1, 2011 at 4:51 PM, Kshitij Mehta <[email protected]> wrote:
>
>> Hi,
>> Thanks for your reply. But here's what I am trying to understand:
>>
>>   Server 1 would get chunks 0,4,8,12, server 2 gets chunks 1,5,9,13,
>>   server 3 gets chunks 2,6,10,14, and server 4 gets 3,7,11,15. The
>>   client sends all 4 chunks to a single server all at the same time.
>>   All of this is done transparently.
>>
>> Using what low-level I/O interface does the client send the 4 chunks
>> to a server? Does it use write()/pwrite() in succession 4 times, or
>> does it use a list I/O interface like pwritev() to send all 4 chunks
>> in a *single I/O call*? (I understand that all of this is kept
>> transparent from the application writer; I am trying to understand
>> the low-level details of pvfs2 because I am trying to model its I/O
>> operations as part of a project.)
>>
>> Thanks again,
>>
>> Kshitij Mehta
>> PhD candidate
>> Parallel Software Technologies Lab
>> Dept. of Computer Science
>> University of Houston
>> Houston, Texas, USA
>>
>> On 10/21/2011 01:30 PM, Becky Ligon wrote:
>>
>> On Fri, Oct 21, 2011 at 1:34 PM, Kshitij Mehta <[email protected]> wrote:
>>
>>> Apologies for opening an old thread.
>>>
>>>   By default, PVFS uses eight 256k buffers to transfer data to a
>>>   server. Once the connection is made, PVFS transmits data to the
>>>   server using these 256k-sized buffers as fast as it can. You can
>>>   think of the 8 buffers as the PVFS window size (if you are
>>>   familiar with TCP terminology). With 20 I/O servers, you have 20
>>>   of these windows pushing out data over the network just as fast
>>>   as possible.
>>>
>>> How does pvfs2 write non-contiguous data chunks to a single server?
>>> Using a list I/O interface like writev? Or does it issue separate
>>> write calls for every 64K chunk of data to be written to a server?
>>>
>> The application writer doesn't have to do anything other than issue a
>> single write. The PVFS client takes any given write request and
>> distributes it across a set of servers in a simple, round-robin way.
>> The client breaks the data into chunks and issues writes to each
>> server simultaneously. For example, if we have 1MB of data coming in,
>> the stripe size is 64k, and there are 4 I/O servers, then each server
>> gets four 64k chunks. You can think of the 1MB of data as an array of
>> 64k chunks numbered 0 to 15. Server 1 would get chunks 0,4,8,12,
>> server 2 gets chunks 1,5,9,13, server 3 gets chunks 2,6,10,14, and
>> server 4 gets 3,7,11,15. The client sends all 4 chunks to a single
>> server at the same time. All of this is done transparently. The
>> client waits for each of the 4 servers to respond with a status and
>> THEN returns to the application.
>>
>> What I just described is the default behavior, which is a simple
>> stripe distribution. However, you can define your own distribution,
>> and PVFS will chunk the data as prescribed by that distribution. At
>> this point, I don't know of anyone using a different distribution.
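To make the example just above concrete for your model: below is a toy, self-contained sketch of the simple-stripe layout plus the server-side pwrite it feeds. This is not PVFS code; the per-server datafile names, the 0-based server numbering, and the server-local offset formula are illustrative assumptions only.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define TOTAL    (1024 * 1024)   /* 1MB incoming write request */
    #define STRIPE   (64 * 1024)     /* 64k stripe size            */
    #define NSERVERS 4               /* number of I/O servers      */

    /* Toy model only: chunk i of the request goes to server i % NSERVERS
     * (so server 0 gets chunks 0,4,8,12, server 1 gets 1,5,9,13, ...),
     * and each "server" stores its chunk with pwrite() at an assumed
     * offset of (i / NSERVERS) * STRIPE within its own datafile. */
    int main(void)
    {
        static char data[TOTAL];              /* the client's 1MB buffer */
        memset(data, 'x', sizeof(data));

        int fds[NSERVERS];
        for (int s = 0; s < NSERVERS; s++) {
            char name[32];
            snprintf(name, sizeof(name), "server%d.dat", s);  /* made-up name */
            fds[s] = open(name, O_CREAT | O_WRONLY | O_TRUNC, 0644);
            if (fds[s] < 0) { perror("open"); return 1; }
        }

        for (size_t i = 0; i < TOTAL / STRIPE; i++) {         /* 16 chunks   */
            int   server = (int)(i % NSERVERS);               /* round robin */
            off_t local  = (off_t)(i / NSERVERS) * STRIPE;    /* assumed
                                                                 server-local
                                                                 offset      */
            if (pwrite(fds[server], data + i * STRIPE, STRIPE, local) != STRIPE) {
                perror("pwrite");
                return 1;
            }
            printf("chunk %2zu -> server %d, local offset %lldK\n",
                   i, server, (long long)(local / 1024));
        }

        for (int s = 0; s < NSERVERS; s++)
            close(fds[s]);
        return 0;
    }

Compiling and running it writes four 256k files (server0.dat through server3.dat) and prints the same chunk-to-server assignment as in the example above.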
>>
>>> Also, is this documented somewhere, or do you generally look at the
>>> source code to figure such things out?
>>
>> I have learned all of this by looking at the code because I am a PVFS
>> developer; however, there is a doc directory that you can "make" that
>> will also describe some of the major functionality.
>>
>> Hope this helps!
>>
>> Becky
>>
>>> Thanks,
>>> Kshitij
>>>
>>> On 10/09/2011 03:03 PM, Becky Ligon wrote:
>>>
>>> The dd block size determines how much data is given to PVFS2 in any
>>> one write request. Thus, if the write request is given 2MB of data,
>>> that data is divided up and sent to the 20 I/O servers all at the
>>> same time (see note below). If the write request is given only 64k
>>> of data, then a request is sent to the one server where the next 64k
>>> is to be written. So throughput for larger requests is generally
>>> better than for small requests, depending on your network delay, how
>>> busy your servers are, and the number of I/O servers in your
>>> filesystem. There is also some overhead associated with moving the
>>> data from user space to kernel space, so you incur more OS overhead
>>> using 64k blocks than you would with 2MB blocks.
>>>
>>> For example, if you use the Linux command "cp" and compare its
>>> performance with "pvfs2-cp" when copying a large amount of data from
>>> a unix filesystem into a PVFS filesystem, you will immediately
>>> notice that pvfs2-cp is faster than cp. pvfs2-cp performs better
>>> because it uses a default buffer size of 10MB, while cp uses the
>>> stripe size, in your case 64k. So it will take cp longer than
>>> pvfs2-cp to transfer the same amount of data.
>>>
>>> NOTE: By default, PVFS uses eight 256k buffers to transfer data to a
>>> server. Once the connection is made, PVFS transmits data to the
>>> server using these 256k-sized buffers as fast as it can. You can
>>> think of the 8 buffers as the PVFS window size (if you are familiar
>>> with TCP terminology). With 20 I/O servers, you have 20 of these
>>> windows pushing out data over the network just as fast as possible.
>>>
>>> Hope this helps!
>>> Becky
>>>
>>> On Sun, Oct 9, 2011 at 5:34 AM, belcampo <[email protected]> wrote:
>>>
>>>> On 10/06/2011 10:36 PM, Kshitij Mehta wrote:
>>>>
>>>>> Hello,
>>>>> I have a pvfs2 filesystem configured over 20 I/O servers with a
>>>>> default stripe size of 64 Kbytes. I am running a simple test
>>>>> program in which I write a matrix to a file.
>>>>>
>>>>> This is what I see: if the 1GByte matrix is written in block sizes
>>>>> of 2MB, the performance is much better than writing the matrix in
>>>>> blocks of 64KB. I am not sure I understand why. Since the stripe
>>>>> size is 64KB, every 2MB block eventually gets broken into 64KB
>>>>> blocks which are written to the I/O servers, so the performance
>>>>> should be nearly equal. I would understand why writing with a
>>>>> block size smaller than the stripe size would perform badly, but
>>>>> when the block size exceeds the stripe size, I would expect the
>>>>> performance to level off.
>>>>>
>>>>> Can someone explain what happens here? Your help is appreciated.
>>>>
>>>> I can't explain it, only confirm it. I also did some tests, with
>>>> the following results:
>>>>
>>>> with pvfs2-cp: 18.18 MB/s
>>>>
>>>> over pvfs2fuse:
>>>>
>>>>   dd blocksize   MB/s
>>>>   4k              4.4
>>>>   8k              6.3
>>>>   16k             7.3
>>>>   32k             8.8
>>>>   64k             9.9
>>>>   128k           18.7
>>>>   256k           18.7
>>>>   512k           18.8
>>>>   1024k          18.8
>>>>   2048k          18.8
>>>>
>>>> over pvfs2fuse:
>>>>
>>>>   cp              8.2
>>>>   rsync          14.8
>>>>
>>>> over nfs:
>>>>
>>>>   cp             10.6
>>>>   rsync          11.0
>>>>
>>>> Further, it was mentioned earlier that ongoing effort is being put
>>>> into optimizing pvfs2/orangefs for small file sizes. So AFAIK this
>>>> is by design, but I don't know the reasoning behind it.
>>>>
>>>>> Best,
>>>>> Kshitij Mehta
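To put rough numbers on the block-size effect in the quoted discussion above: the difference in the sheer number of write requests (and user-to-kernel copies of the data) explains a lot by itself. A small back-of-the-envelope calculation, using the 1GB transfer and the buffer sizes quoted above; this is plain arithmetic, not PVFS code.

    #include <stdio.h>

    /* Back-of-the-envelope only: how many write requests it takes to move
     * 1GB of data at the block sizes discussed in this thread.  Each
     * request also implies a user-space to kernel-space copy of its data. */
    int main(void)
    {
        const double total = 1024.0 * 1024.0 * 1024.0;      /* 1GB of data */
        const struct { const char *label; double bytes; } sizes[] = {
            { "64k (the stripe size)",     64.0 * 1024.0 },
            { "2MB (larger dd blocks)",    2.0 * 1024.0 * 1024.0 },
            { "10MB (pvfs2-cp's buffer)",  10.0 * 1024.0 * 1024.0 },
        };

        for (int i = 0; i < 3; i++)
            printf("%-26s -> about %.0f write requests\n",
                   sizes[i].label, total / sizes[i].bytes);
        return 0;
    }

That works out to roughly 16384 requests at 64k, versus 512 at 2MB and about 100 at pvfs2-cp's 10MB.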
--
Becky Ligon
OrangeFS Support and Development
Omnibond Systems
Anderson, South Carolina

_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
