The client sends the data to the server through the communications layer; so, if your environment is using TCP/IP, sockets are used. On the server side, once a chunk of data is received, the server writes that chunk to the system-level data file using pwrite.
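To make that concrete, here is a minimal standalone sketch of the server-side idea only. It is not the actual PVFS2 code path (the real server goes through its own request-processing and storage layers), and the offset/length header format, read_full(), and handle_chunk() are names invented just for this illustration:

/*
 * Illustrative sketch, NOT the PVFS2/OrangeFS implementation:
 * receive one chunk over a connected TCP socket and store it in the
 * server's local data file with pwrite(), which writes at an explicit
 * offset without moving the file pointer, so many chunks can land in
 * the same file without lseek() or locking.  The 8-byte offset +
 * 4-byte length header is a hypothetical wire format; byte-order
 * conversion is ignored for brevity.
 */
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/* Keep calling recv() until 'len' bytes have arrived (or an error). */
static int read_full(int sock, void *buf, size_t len)
{
    size_t got = 0;
    while (got < len) {
        ssize_t n = recv(sock, (char *)buf + got, len - got, 0);
        if (n <= 0)
            return -1;              /* peer closed or error */
        got += (size_t)n;
    }
    return 0;
}

/* Receive one chunk from 'sock' and write it at its offset in 'datafd'. */
int handle_chunk(int sock, int datafd)
{
    uint64_t offset;
    uint32_t length;
    char *buf;
    ssize_t written;

    if (read_full(sock, &offset, sizeof(offset)) < 0 ||
        read_full(sock, &length, sizeof(length)) < 0)
        return -1;

    buf = malloc(length);
    if (buf == NULL || read_full(sock, buf, length) < 0) {
        free(buf);
        return -1;
    }

    written = pwrite(datafd, buf, length, (off_t)offset);
    free(buf);
    return (written == (ssize_t)length) ? 0 : -1;
}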
Becky

On Tue, Nov 1, 2011 at 4:51 PM, Kshitij Mehta <[email protected]> wrote:
> Hi,
> Thanks for your reply. But here's what I am trying to understand:
>
> Server 1 would get chunks 0,4,8,12, server 2 gets chunks 1,5,9,13, server
> 3 gets chunks 2,6,10,14, and server 4 gets 3,7,11,15. The client sends all
> 4 chunks to a single server all at the same time. All of this is done
> transparently.
>
> Using what low-level I/O interface does the client send the 4 chunks to a
> server? Does it use write()/pwrite() in succession 4 times, or does it use
> a list I/O interface like pwritev() to send all 4 chunks in a *single I/O
> call*? (I understand that all this is kept transparent from the
> application writer; I am trying to understand the low-level details of
> pvfs2 because I am trying to model its I/O operations as part of a project.)
>
> Thanks again,
>
> Kshitij Mehta
> PhD candidate
> Parallel Software Technologies Lab
> Dept. of Computer Science
> University of Houston
> Houston, Texas, USA
>
> On 10/21/2011 01:30 PM, Becky Ligon wrote:
>
> On Fri, Oct 21, 2011 at 1:34 PM, Kshitij Mehta <[email protected]> wrote:
>
>> Apologies for opening an old thread.
>>
>> By default, PVFS uses eight 256k buffers to transfer data to a server.
>> Once the connection is made, PVFS transmits data to the server using
>> these 256k-sized buffers as fast as it can. You can think of the 8
>> buffers as the PVFS window size (if you are familiar with TCP
>> terminology). With 20 I/O servers, you have 20 of these windows pushing
>> out data over the network just as fast as possible.
>>
>> How does pvfs2 write non-contiguous data chunks to a single server?
>> Using a list I/O interface like writev? Or does it issue separate write
>> calls for every 64K chunk of data to be written to a server?
>
> The application writer doesn't have to do anything, other than issue a
> single write. The PVFS client takes any given write request and
> distributes it across a set of servers in a simple, round-robin way. The
> client breaks the data into chunks and issues writes to each server
> simultaneously. For example, if we have 1MB of data coming in, the stripe
> size is 64k, and there are 4 I/O servers, then each server gets four 64k
> chunks. You can think of the 1MB of data as an array of 64k chunks
> numbered 0 to 15. Server 1 would get chunks 0,4,8,12, server 2 gets
> chunks 1,5,9,13, server 3 gets chunks 2,6,10,14, and server 4 gets
> 3,7,11,15. The client sends all 4 chunks to a single server all at the
> same time. All of this is done transparently. The client waits for each
> of the 4 servers to respond with a status and THEN returns back to the
> application.
>
> What I just described is the default behavior, which is a simple stripe
> distribution. However, you can define your own distribution, and PVFS
> will chunk the data as prescribed by that distribution. At this point, I
> don't know of anyone using a different distribution.
>
>> Also, is this documented somewhere, or do you generally look at the
>> source code to figure such things out?
>
> I have learned all of this by looking at the code because I am a PVFS
> developer; however, there is a doc directory that you can "make" that
> will also describe some of the major functionality.
>
> Hope this helps!
>
> Becky
>
>> Thanks,
>> Kshitij
>>
>> On 10/09/2011 03:03 PM, Becky Ligon wrote:
>>
>> The dd block size determines how much data is given to PVFS2 in any one
>> write request. Thus, if the write request is given 2MB of data, that
>> data is divided up and sent to the 20 I/O servers all at the same time
>> (see note below). If the write request is given only 64K of data, then a
>> request is sent to the one server where the next 64k is to be written.
>> So, throughput for larger requests is generally better than for small
>> requests, depending on your network delay, how busy your servers are,
>> and the number of I/O servers in your filesystem. There is also some
>> overhead associated with moving the data from user space to kernel
>> space; so, you incur more OS overhead using 64k blocks than you would
>> with 2MB blocks.
>>
>> For example, if you use the Linux command "cp" and compare its
>> performance with "pvfs2-cp" to copy a large amount of data from a Unix
>> filesystem into a PVFS filesystem, you will immediately notice that
>> pvfs2-cp is faster than cp. pvfs2-cp performs better than cp because it
>> uses a default buffer size of 10MB, while cp uses the stripe size, in
>> your case 64k. So, it will take cp longer to transfer the same amount of
>> data than it will take pvfs2-cp.
>>
>> NOTE: By default, PVFS uses eight 256k buffers to transfer data to a
>> server. Once the connection is made, PVFS transmits data to the server
>> using these 256k-sized buffers as fast as it can. You can think of the 8
>> buffers as the PVFS window size (if you are familiar with TCP
>> terminology). With 20 I/O servers, you have 20 of these windows pushing
>> out data over the network just as fast as possible.
>>
>> Hope this helps!
>> Becky
>>
>> On Sun, Oct 9, 2011 at 5:34 AM, belcampo <[email protected]> wrote:
>>
>>> On 10/06/2011 10:36 PM, Kshitij Mehta wrote:
>>>
>>>> Hello,
>>>> I have a pvfs2 file system configured over 20 I/O servers with a
>>>> default stripe size of 64 Kbytes.
>>>> I am running a simple test program where I write a matrix to a file.
>>>>
>>>> This is what I see: if the 1 GByte matrix is written in block sizes of
>>>> 2 Mbytes, the performance is much better than writing the matrix in
>>>> blocks of 64 Kbytes. I am not sure I understand why. Since the stripe
>>>> size is 64KB, every block of 2MB eventually gets broken into 64KB
>>>> blocks which are written to the I/O servers, so the performance should
>>>> be nearly equal. I would understand why writing with block size <
>>>> stripe_size should perform badly, but when the block size exceeds the
>>>> stripe size, I expect the performance to level off.
>>>>
>>>> Can someone explain what happens here? Your help is appreciated.
>>>
>>> I can't explain, only confirm. I also did some tests, with the
>>> following results.
>>>
>>> with pvfs2-cp:     18.18 MB/s
>>>
>>> over pvfs2fuse, with dd:
>>> dd blocksize   MB/s
>>> 4k              4.4
>>> 8k              6.3
>>> 16k             7.3
>>> 32k             8.8
>>> 64k             9.9
>>> 128k           18.7
>>> 256k           18.7
>>> 512k           18.8
>>> 1024k          18.8
>>> 2048k          18.8
>>>
>>> over pvfs2fuse:
>>> cp              8.2 MB/s
>>> rsync          14.8 MB/s
>>>
>>> over nfs:
>>> cp             10.6 MB/s
>>> rsync          11.0 MB/s
>>>
>>> Also, it was mentioned earlier that ongoing effort is being put into
>>> optimizing pvfs2/orangefs for small file sizes. So AFAIK it is by
>>> design, but I don't know the reasoning behind it.
>>>> Best,
>>>> Kshitij Mehta

--
Becky Ligon
OrangeFS Support and Development
Omnibond Systems
Anderson, South Carolina
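P.S. For your modeling work, a small illustrative sketch (invented names, not the PVFS2 API) of the default simple-stripe mapping described in the quoted thread above: it reproduces the chunks 0,4,8,12 / 1,5,9,13 / 2,6,10,14 / 3,7,11,15 assignment for a 1MB write with a 64k stripe over 4 servers, and also shows where each chunk lands in that server's local data file.

/*
 * Illustrative only -- not the PVFS2/OrangeFS API.  Maps each
 * stripe-sized chunk of a logical write to (server index, server-local
 * offset) under the default simple-stripe (round-robin) distribution.
 */
#include <stdio.h>

#define STRIPE_SIZE  (64 * 1024)   /* 64k stripe, as in the example    */
#define NUM_SERVERS  4             /* 4 I/O servers, as in the example */

int main(void)
{
    size_t write_size = 1024 * 1024;            /* 1MB request      */
    size_t nchunks = write_size / STRIPE_SIZE;  /* 16 chunks, 0..15 */

    for (size_t chunk = 0; chunk < nchunks; chunk++) {
        size_t server    = chunk % NUM_SERVERS;               /* round robin */
        size_t local_off = (chunk / NUM_SERVERS) * STRIPE_SIZE;
        printf("chunk %2zu -> server %zu, local offset %7zu\n",
               chunk, server, local_off);
    }
    /* Output shows server 0 getting chunks 0,4,8,12; server 1 getting
     * 1,5,9,13; and so on -- the thread numbers servers from 1, this
     * sketch from 0. */
    return 0;
}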
