Hi, here is an email I sent yesterday that did not make it to the list because I changed my subscription mail address :-)
From your description I assume you use Gigabit Ethernet (GigE); correct me if I'm wrong. Latency and throughput can be used to estimate performance, and that is what you will find in this email.

First, some description of the I/O path in PVFS: depending on the amount of data accessed, PVFS/OrangeFS uses either an eager or a rendezvous communication protocol. The eager protocol can be used if the data fits into the initial request. In the source code you will find:

src/io/bmi/bmi_tcp/bmi-tcp.c: TCP_MODE_EAGER_LIMIT = 16384, /* 16K */

If the data does not fit into the initial request, then the write path requires an additional round-trip: first the server ACKs the request, then the client starts to transmit the data. At the end of the write call the server notifies the client that the data has been written successfully.

Under the assumption that you access data in 64 KiB chunks (which is exactly the stripe size), you will hit one server per 64 KiB, provided the offsets are aligned to the stripe boundaries. In this case you need two round-trips to transmit 64 KiB of data. GigE has a maximum throughput of roughly 117 MiB/s and a round-trip time of maybe 0.1 ms. In the write path all your data can be absorbed by the server-side cache; I assume it fits (which is very likely).

If you use the equation size / throughput you get:

64 KiB / 1024 / 117 MiB/s ≈ 0.000534 s => 0.53 ms to transmit the data.

But you also have 0.2 ms of round-trip time on top of that (two round-trips at 0.1 ms each)! So one 64 KiB block does not suffice to achieve optimal performance (if the access is synchronous and not cached on the client side, which is true for PVFS). In fact, the best performance you should see is about 85 MiB/s (64 KiB / 1024 / 0.000734 s, where 0.000734 s = 0.000534 s + 0.0002 s). When you hit more servers, the round-trip time stays the same but you transfer more data; e.g. with 128 KiB a maximum of 98.55 MiB/s is possible.

Your observed performance of 18.X MiB/s is in fact far below the limit of 85 MiB/s. For one thing, FUSE has a maximum access granularity; depending on the version it might be about 256 KiB, which explains why you don't see any improvement above this granularity. To see why your performance is so low compared to the rough theoretical bound, you could put your file system on top of /dev/shm (or another tmpfs) and rerun the test. Then block I/O does not matter at all and the theoretical limits discussed above are more accurate.

You might also use iperf (or another network performance tool) to measure the raw throughput and latency of your interconnect, both to check that it works properly and to plug the measured values into the equations above. You might additionally use wireshark to look at the client-server communication and to check whether the number of exchanges between client and server matches your expectations (mine would be two round-trips, with just one server being hit).

If you have further interest in this kind of performance modelling and analysis, I can point you to some literature.

regards,
Julian Kunkel
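PS: In case it is useful, here is a minimal sketch of the throughput model above as a small C program. The constants (117 MiB/s link throughput, 0.1 ms round-trip time, two round-trips per synchronous write) are the assumed values from this email, not measurements, so replace them with whatever iperf reports for your setup.

#include <stdio.h>

int main(void)
{
    /* Assumed values, not measurements: adjust for your network. */
    const double link_mib_s  = 117.0;   /* raw GigE payload throughput in MiB/s */
    const double rtt_s       = 0.0001;  /* one round-trip, 0.1 ms */
    const int    round_trips = 2;       /* request/ACK plus completion notification */

    const double chunks_kib[] = { 64.0, 128.0, 256.0 };
    const int n = sizeof(chunks_kib) / sizeof(chunks_kib[0]);

    for (int i = 0; i < n; i++) {
        double size_mib = chunks_kib[i] / 1024.0;   /* chunk size in MiB */
        double time_s   = size_mib / link_mib_s     /* time on the wire */
                        + round_trips * rtt_s;      /* protocol latency */
        printf("%6.0f KiB: %.2f MiB/s effective\n",
               chunks_kib[i], size_mib / time_s);
    }
    return 0;
}

Compiled with e.g. "gcc model.c -o model", it prints roughly 85 MiB/s for 64 KiB and 98.55 MiB/s for 128 KiB, i.e. the numbers used above.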
