Hi, here is an email I sent yesterday that did not make it to the list because I changed my subscription mail address :-)
From your description I assume you use Gigabit Ethernet (GigE); correct me if I'm wrong. Latency and throughput can be used to estimate performance, and that is what you will find in this email.

First, some description of the I/O path in PVFS: depending on the amount of data accessed, PVFS/OrangeFS uses either an eager or a rendezvous communication protocol. The eager protocol can be used if the data fits into the initial request. In the source code you will find:

src/io/bmi/bmi_tcp/bmi-tcp.c: TCP_MODE_EAGER_LIMIT = 16384, /* 16K */

If the data does not fit into the initial request, then the write path requires an additional round-trip: first the server ACKs the request, then the client starts to transmit the data. At the end of the write call the server notifies the client that the data has been written successfully.

Under the assumption that you access data in 64 KiB chunks (which is exactly the stripe size), you will hit one server per 64 KiB, provided the offsets are aligned to the stripe boundaries. In this case you need two round-trips to transmit 64 KiB of data. GigE has a maximum throughput of roughly 117 MiB/s and a round-trip time of maybe 0.1 ms. In the write path all your data can be absorbed by the server-side cache; I assume it fits (which is very likely).

If you use the equation size / throughput you get:

64 KiB / 1024 / 117 MiB/s ≈ 0.000534 s => 0.53 ms to transmit the data.

But you also have 0.2 ms of round-trip time on top of that (two round-trips at 0.1 ms each)! So one 64 KiB block does not suffice to achieve optimal performance (if the access is synchronous and not cached on the client side, which is true for PVFS). In fact, the best performance you should see is about 85 MiB/s (64 KiB / 1024 / 0.000734 s, where 0.000734 s = 0.000534 s + 0.0002 s). When you hit more servers, the round-trip time stays the same but you transfer more data; e.g. with 128 KiB a maximum of 98.55 MiB/s is possible.

Your observed performance of 18.X MiB/s is in fact far below the limit of 85 MiB/s. For one thing, FUSE has a maximum access granularity; depending on the version it might be about 256 KiB, which explains why you don't see any improvement above this granularity. To see why your performance is so low compared to the rough theoretical bound, you could put your file system on top of /dev/shm (or another tmpfs) and rerun the test. Then block I/O does not matter at all and the theoretical limits discussed above are more accurate.

You might also use iperf (or another network performance tool) to measure the raw throughput and latency of your interconnect, both to check that it works properly and to plug the measured values into the equations above. You might additionally use wireshark to look at the client-server communication and to check whether the number of exchanges between client and server matches your expectations (mine would be two round-trips, with just one server being hit).

If you have further interest in this kind of performance modelling and analysis, I can point you to some literature.

regards,
Julian Kunkel
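PS: In case it is useful, here is a minimal sketch of the throughput model above as a small C program. The constants (117 MiB/s link throughput, 0.1 ms round-trip time, two round-trips per synchronous write) are the assumed values from this email, not measurements, so replace them with whatever iperf reports for your setup.

#include <stdio.h>

int main(void)
{
    /* Assumed values, not measurements: adjust for your network. */
    const double link_mib_s  = 117.0;   /* raw GigE payload throughput in MiB/s */
    const double rtt_s       = 0.0001;  /* one round-trip, 0.1 ms */
    const int    round_trips = 2;       /* request/ACK plus completion notification */

    const double chunks_kib[] = { 64.0, 128.0, 256.0 };
    const int n = sizeof(chunks_kib) / sizeof(chunks_kib[0]);

    for (int i = 0; i < n; i++) {
        double size_mib = chunks_kib[i] / 1024.0;   /* chunk size in MiB */
        double time_s   = size_mib / link_mib_s     /* time on the wire */
                        + round_trips * rtt_s;      /* protocol latency */
        printf("%6.0f KiB: %.2f MiB/s effective\n",
               chunks_kib[i], size_mib / time_s);
    }
    return 0;
}

Compiled with e.g. "gcc model.c -o model", it prints roughly 85 MiB/s for 64 KiB and 98.55 MiB/s for 128 KiB, i.e. the numbers used above.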
