After looking further into the client-side code, I found where the TCP interface calls writev(), passing it a socket, a vector, and a count.
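In case it helps with your model, here is a rough sketch of what a gathered send over a TCP socket looks like with writev(). This is illustrative only, not the actual PVFS client code; the function and variable names are made up.

    #include <sys/types.h>    /* ssize_t, size_t */
    #include <sys/uio.h>      /* writev(), struct iovec */

    /* Illustrative only: gather two non-contiguous memory regions into a
     * single send on an already-connected stream socket 'sock'. */
    static ssize_t send_two_chunks(int sock,
                                   void *buf0, size_t len0,
                                   void *buf1, size_t len1)
    {
        struct iovec iov[2];

        iov[0].iov_base = buf0;   /* first non-contiguous region  */
        iov[0].iov_len  = len0;
        iov[1].iov_base = buf1;   /* second non-contiguous region */
        iov[1].iov_len  = len1;

        /* socket, vector, count -- the three arguments mentioned above */
        return writev(sock, iov, 2);
    }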
Sorry that I didn't post this earlier.

Becky

On Wed, Nov 2, 2011 at 10:10 AM, Becky Ligon <[email protected]> wrote:

> The client sends the data to the server using the communications
> layer. So, if your environment is using TCP/IP, then we are using
> sockets. On the server side, once a chunk of data is received, the
> server writes it to the system-level data file using pwrite.
>
> Becky
>
> On Tue, Nov 1, 2011 at 4:51 PM, Kshitij Mehta <[email protected]> wrote:
>
>> Hi,
>> Thanks for your reply. But here's what I am trying to understand:
>>
>>   Server 1 would get chunks 0,4,8,12, server 2 gets chunks 1,5,9,13,
>>   server 3 gets chunks 2,6,10,14, and server 4 gets 3,7,11,15. The
>>   client sends all 4 chunks to a single server all at the same time.
>>   All of this is done transparently.
>>
>> Using what low-level I/O interface does the client send the 4 chunks
>> to a server? Does it use write()/pwrite() in succession 4 times, or
>> does it use a list I/O interface like pwritev() to send all 4 chunks
>> in a *single I/O call*? (I understand that all of this is kept
>> transparent from the application writer; I am trying to understand
>> the low-level details of pvfs2 because I am trying to model its I/O
>> operations as part of a project.)
>>
>> Thanks again,
>>
>> Kshitij Mehta
>> PhD candidate
>> Parallel Software Technologies Lab
>> Dept. of Computer Science
>> University of Houston
>> Houston, Texas, USA
>>
>> On 10/21/2011 01:30 PM, Becky Ligon wrote:
>>
>> On Fri, Oct 21, 2011 at 1:34 PM, Kshitij Mehta <[email protected]> wrote:
>>
>>> Apologies for opening an old thread.
>>>
>>>   By default, PVFS uses eight 256k buffers to transfer data to a
>>>   server. Once the connection is made, PVFS transmits data to the
>>>   server using these 256k-sized buffers as fast as it can. You can
>>>   think of the 8 buffers as the PVFS window size (if you are
>>>   familiar with TCP terminology). With 20 I/O servers, you have 20
>>>   of these windows pushing out data over the network just as fast
>>>   as possible.
>>>
>>> How does pvfs2 write non-contiguous data chunks to a single server?
>>> Using a list I/O interface like writev? Or does it issue separate
>>> write calls for every 64K chunk of data to be written to a server?
>>>
>> The application writer doesn't have to do anything other than issue a
>> single write. The PVFS client takes any given write request and
>> distributes it across a set of servers in a simple, round-robin way.
>> The client breaks the data into chunks and issues writes to each
>> server simultaneously. For example, if we have 1MB of data coming in,
>> the stripe size is 64k, and there are 4 I/O servers, then each server
>> gets four 64k chunks. You can think of the 1MB of data as an array of
>> 64k chunks numbered 0 to 15. Server 1 would get chunks 0,4,8,12,
>> server 2 gets chunks 1,5,9,13, server 3 gets chunks 2,6,10,14, and
>> server 4 gets 3,7,11,15. The client sends all 4 chunks to a single
>> server at the same time. All of this is done transparently. The
>> client waits for each of the 4 servers to respond with a status and
>> THEN returns to the application.
>>
>> What I just described is the default behavior, which is a simple
>> stripe distribution. However, you can define your own distribution,
>> and PVFS will chunk the data as prescribed by that distribution. At
>> this point, I don't know of anyone using a different distribution.
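To make the example just above concrete for your model: below is a toy, self-contained sketch of the simple-stripe layout plus the server-side pwrite it feeds. This is not PVFS code; the per-server datafile names, the 0-based server numbering, and the server-local offset formula are illustrative assumptions only.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define TOTAL    (1024 * 1024)   /* 1MB incoming write request */
    #define STRIPE   (64 * 1024)     /* 64k stripe size            */
    #define NSERVERS 4               /* number of I/O servers      */

    /* Toy model only: chunk i of the request goes to server i % NSERVERS
     * (so server 0 gets chunks 0,4,8,12, server 1 gets 1,5,9,13, ...),
     * and each "server" stores its chunk with pwrite() at an assumed
     * offset of (i / NSERVERS) * STRIPE within its own datafile. */
    int main(void)
    {
        static char data[TOTAL];              /* the client's 1MB buffer */
        memset(data, 'x', sizeof(data));

        int fds[NSERVERS];
        for (int s = 0; s < NSERVERS; s++) {
            char name[32];
            snprintf(name, sizeof(name), "server%d.dat", s);  /* made-up name */
            fds[s] = open(name, O_CREAT | O_WRONLY | O_TRUNC, 0644);
            if (fds[s] < 0) { perror("open"); return 1; }
        }

        for (size_t i = 0; i < TOTAL / STRIPE; i++) {         /* 16 chunks   */
            int   server = (int)(i % NSERVERS);               /* round robin */
            off_t local  = (off_t)(i / NSERVERS) * STRIPE;    /* assumed
                                                                 server-local
                                                                 offset      */
            if (pwrite(fds[server], data + i * STRIPE, STRIPE, local) != STRIPE) {
                perror("pwrite");
                return 1;
            }
            printf("chunk %2zu -> server %d, local offset %lldK\n",
                   i, server, (long long)(local / 1024));
        }

        for (int s = 0; s < NSERVERS; s++)
            close(fds[s]);
        return 0;
    }

Compiling and running it writes four 256k files (server0.dat through server3.dat) and prints the same chunk-to-server assignment as in the example above.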
>>
>>> Also, is this documented somewhere, or do you generally look at the
>>> source code to figure such things out?
>>
>> I have learned all of this by looking at the code because I am a PVFS
>> developer; however, there is a doc directory that you can "make" that
>> will also describe some of the major functionality.
>>
>> Hope this helps!
>>
>> Becky
>>
>>> Thanks,
>>> Kshitij
>>>
>>> On 10/09/2011 03:03 PM, Becky Ligon wrote:
>>>
>>> The dd block size determines how much data is given to PVFS2 in any
>>> one write request. Thus, if the write request is given 2MB of data,
>>> that data is divided up and sent to the 20 I/O servers all at the
>>> same time (see note below). If the write request is given only 64k
>>> of data, then a request is sent to the one server where the next 64k
>>> is to be written. So throughput for larger requests is generally
>>> better than for small requests, depending on your network delay, how
>>> busy your servers are, and the number of I/O servers in your
>>> filesystem. There is also some overhead associated with moving the
>>> data from user space to kernel space, so you incur more OS overhead
>>> using 64k blocks than you would with 2MB blocks.
>>>
>>> For example, if you use the Linux command "cp" and compare its
>>> performance with "pvfs2-cp" when copying a large amount of data from
>>> a unix filesystem into a PVFS filesystem, you will immediately
>>> notice that pvfs2-cp is faster than cp. pvfs2-cp performs better
>>> because it uses a default buffer size of 10MB, while cp uses the
>>> stripe size, in your case 64k. So it will take cp longer than
>>> pvfs2-cp to transfer the same amount of data.
>>>
>>> NOTE: By default, PVFS uses eight 256k buffers to transfer data to a
>>> server. Once the connection is made, PVFS transmits data to the
>>> server using these 256k-sized buffers as fast as it can. You can
>>> think of the 8 buffers as the PVFS window size (if you are familiar
>>> with TCP terminology). With 20 I/O servers, you have 20 of these
>>> windows pushing out data over the network just as fast as possible.
>>>
>>> Hope this helps!
>>> Becky
>>>
>>> On Sun, Oct 9, 2011 at 5:34 AM, belcampo <[email protected]> wrote:
>>>
>>>> On 10/06/2011 10:36 PM, Kshitij Mehta wrote:
>>>>
>>>>> Hello,
>>>>> I have a pvfs2 filesystem configured over 20 I/O servers with a
>>>>> default stripe size of 64 Kbytes. I am running a simple test
>>>>> program in which I write a matrix to a file.
>>>>>
>>>>> This is what I see: if the 1GByte matrix is written in block sizes
>>>>> of 2MB, the performance is much better than writing the matrix in
>>>>> blocks of 64KB. I am not sure I understand why. Since the stripe
>>>>> size is 64KB, every 2MB block eventually gets broken into 64KB
>>>>> blocks which are written to the I/O servers, so the performance
>>>>> should be nearly equal. I would understand why writing with a
>>>>> block size smaller than the stripe size would perform badly, but
>>>>> when the block size exceeds the stripe size, I would expect the
>>>>> performance to level off.
>>>>>
>>>>> Can someone explain what happens here? Your help is appreciated.
>>>>
>>>> I can't explain it, only confirm it. I also did some tests, with
>>>> the following results:
>>>>
>>>> with pvfs2-cp: 18.18 MB/s
>>>>
>>>> over pvfs2fuse:
>>>>
>>>>   dd blocksize   MB/s
>>>>   4k              4.4
>>>>   8k              6.3
>>>>   16k             7.3
>>>>   32k             8.8
>>>>   64k             9.9
>>>>   128k           18.7
>>>>   256k           18.7
>>>>   512k           18.8
>>>>   1024k          18.8
>>>>   2048k          18.8
>>>>
>>>> over pvfs2fuse:
>>>>
>>>>   cp              8.2
>>>>   rsync          14.8
>>>>
>>>> over nfs:
>>>>
>>>>   cp             10.6
>>>>   rsync          11.0
>>>>
>>>> Further, it was mentioned earlier that ongoing effort is being put
>>>> into optimizing pvfs2/orangefs for small file sizes. So AFAIK this
>>>> is by design, but I don't know the reasoning behind it.
>>>>
>>>>> Best,
>>>>> Kshitij Mehta
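To put rough numbers on the block-size effect in the quoted discussion above: the difference in the sheer number of write requests (and user-to-kernel copies of the data) explains a lot by itself. A small back-of-the-envelope calculation, using the 1GB transfer and the buffer sizes quoted above; this is plain arithmetic, not PVFS code.

    #include <stdio.h>

    /* Back-of-the-envelope only: how many write requests it takes to move
     * 1GB of data at the block sizes discussed in this thread.  Each
     * request also implies a user-space to kernel-space copy of its data. */
    int main(void)
    {
        const double total = 1024.0 * 1024.0 * 1024.0;      /* 1GB of data */
        const struct { const char *label; double bytes; } sizes[] = {
            { "64k (the stripe size)",     64.0 * 1024.0 },
            { "2MB (larger dd blocks)",    2.0 * 1024.0 * 1024.0 },
            { "10MB (pvfs2-cp's buffer)",  10.0 * 1024.0 * 1024.0 },
        };

        for (int i = 0; i < 3; i++)
            printf("%-26s -> about %.0f write requests\n",
                   sizes[i].label, total / sizes[i].bytes);
        return 0;
    }

That works out to roughly 16384 requests at 64k, versus 512 at 2MB and about 100 at pvfs2-cp's 10MB.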
--
Becky Ligon
OrangeFS Support and Development
Omnibond Systems
Anderson, South Carolina

_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
