On 05/19/2012 06:34 PM, Paulo Ricardo Motta Gomes wrote:
Hello,

I'm doing some experiments in a Swift cluster testbed of 9 nodes/devices
and 3 zones (3 nodes on each zone).

In one of my tests, I noticed that PUTs of very small objects are
extremely inefficient.

- 5000 PUTs of objects with an average size of 40K - total of 195MB -
took 67s (avg time per request: 0.0135s)
- 5000 PUTs of objects with an average size of 190 bytes - total of
930KB - took 60s (avg time per request: 0.0123s)

I plotted object size vs request time and found that there is a
significant difference in request times only above 200KB. For objects
smaller than that, PUT requests have a minimum execution time of
0.01s, no matter the object size.
I suppose Swift is not optimized for such small objects, but I wonder
what the main cause of this is - HTTP overhead or disk writing? I
checked the logs of the object servers and requests are taking an
average of 0.006s, whether objects are 40K or 190 bytes, which
indicates part of the bottleneck could be at the disk. Currently I'm
using a loopback device for storage.

I thought that maybe this could be improved a bit if the proxy server
maintained persistent connections to the storage nodes instead of
opening a new one for each request?

Persistent TCP connections do nothing to improve disc performance, so they would be unlikely to have much of an effect if indeed the bulk of the response time is spent getting things to disc on the server.

If you take a tcpdump trace, you will be able to see how much of the time to service a request is taken up by the TCP three-way handshake: take the time from the client's send of the SYNchronize segment to the time the SYN|ACKnowledgement segment arrives from the server. You can also then look at the time between when the last byte of the request is sent and when the first/last byte of the response arrives.
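If you would rather script that than eyeball the trace, here is a minimal sketch of the idea in Python with scapy - the capture file name is just a placeholder, and it assumes the trace was taken on the client and holds a single connection:

from scapy.all import rdpcap, TCP

SYN, ACK = 0x02, 0x10
syn_t = synack_t = None

for pkt in rdpcap("client.pcap"):          # placeholder capture file name
    if TCP not in pkt:
        continue
    flags = int(pkt[TCP].flags)
    if (flags & SYN) and not (flags & ACK) and syn_t is None:
        syn_t = pkt.time                   # client sends the SYNchronize segment
    elif (flags & SYN) and (flags & ACK) and synack_t is None:
        synack_t = pkt.time                # the SYN|ACKnowledgement arrives

if syn_t is not None and synack_t is not None:
    print("three-way handshake (SYN -> SYN|ACK): %.6f s" % float(synack_t - syn_t))

The same sort of walk over the data-bearing segments, keyed on direction, would give you the request-to-response interval.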

If you wanted, you could probably compare that with the likes of, say, a netperf TCP_CRR test and see it with something that is only TCP. The netperf TCP_CRR test opens a TCP connection, sends a request of a given size, receives a response of a given size, and then closes the connection.

For an example, here is a TCP_CRR test between my laptop and my workstation (100 Mbit/s), with netperf configured to include histogram support via configure --enable-histogram. I just picked a number out of the ether for the sizes:

raj@tardy:~/netperf2_trunk$ src/netperf -H raj-8510w.americas.hpqcorp.net -t TCP_CRR -l 30 -v 2 -- -r `expr 190 + 128`,128
MIGRATED TCP Connect/Request/Response TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to internal-host.americas.hpqcorp.net () port 0 AF_INET : histogram
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  318      128     30.00    2340.51
16384  87380

Alignment      Offset
Local  Remote  Local  Remote
Send   Recv    Send   Recv
    8      8       0      0

Histogram of request/response times
UNIT_USEC     :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
TEN_USEC      :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
HUNDRED_USEC  :    0:    0:    0: 18238: 47290: 3945:  486:  128:   34:   20
UNIT_MSEC     :    0:   47:   14:    1:    6:    2:    2:    0:    1:    1
TEN_MSEC      :    0:    0:    1:    0:    0:    0:    0:    0:    0:    0
HUNDRED_MSEC  :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
UNIT_SEC      :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
TEN_SEC       :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
>100_SECS: 0
HIST_TOTAL:      70216

So, on average it was 1/2340, or about 0.00043 seconds per complete transaction, just doing things at the TCP level, without any OpenStack code or disc I/O etc. There was a tail, but it was too short to be TCP retransmissions; my bet would be task scheduling, as my workstation was not otherwise idle.

In the midst of all that I did:

raj@tardy:~/netperf2_trunk$ sudo tcpdump -i eth0 -c 20 -w /tmp/example.pcap

and post-processing that with:

sudo tcpdump -i eth0 -r /tmp/example.pcap -ttt

here is one of those transactions:

raj@tardy:~/netperf2_trunk$ sudo tcpdump -i eth0 -r /tmp/example.pcap -ttt port 63978
reading from file /tmp/example.pcap, link-type EN10MB (Ethernet)
00:00:00.000000 IP tardy.63978 > internal-host.americas.hpqcorp.net.35594: Flags [SEW], seq 1709857287, win 14600, options [mss 1460,sackOK,TS val 111970417 ecr 0,nop,wscale 7], length 0
00:00:00.000097 IP internal-host.americas.hpqcorp.net.35594 > tardy.63978: Flags [S.E], seq 3540302987, ack 1709857288, win 14480, options [mss 1460,sackOK,TS val 428732549 ecr 111970417,nop,wscale 5], length 0
00:00:00.000007 IP tardy.63978 > internal-host.americas.hpqcorp.net.35594: Flags [.], ack 1, win 115, options [nop,nop,TS val 111970417 ecr 428732549], length 0
00:00:00.000010 IP tardy.63978 > internal-host.americas.hpqcorp.net.35594: Flags [P.], seq 1:319, ack 1, win 115, options [nop,nop,TS val 111970417 ecr 428732549], length 318
00:00:00.000158 IP internal-host.americas.hpqcorp.net.35594 > tardy.63978: Flags [.], ack 319, win 486, options [nop,nop,TS val 428732549 ecr 111970417], length 0
00:00:00.000062 IP internal-host.americas.hpqcorp.net.35594 > tardy.63978: Flags [P.], seq 1:129, ack 319, win 486, options [nop,nop,TS val 428732549 ecr 111970417], length 128
00:00:00.000004 IP internal-host.americas.hpqcorp.net.35594 > tardy.63978: Flags [F.], seq 129, ack 319, win 486, options [nop,nop,TS val 428732549 ecr 111970417], length 0
00:00:00.000006 IP tardy.63978 > internal-host.americas.hpqcorp.net.35594: Flags [.], ack 129, win 123, options [nop,nop,TS val 111970417 ecr 428732549], length 0
00:00:00.000008 IP tardy.63978 > internal-host.americas.hpqcorp.net.35594: Flags [F.], seq 319, ack 130, win 123, options [nop,nop,TS val 111970417 ecr 428732549], length 0

It was, in this LAN-based case, a very short time between the SYN (the first line, with "SEW" in it - the E and W relate to Explicit Congestion Notification, which is enabled on my systems) and the second line, which is the SYN|ACK (and acceptance of ECN for this connection). I used "port 63978" as a packet filter to keep the relative timestamps helpful - otherwise, a last ACK from a previous connection would have been in there.

Anyway, it was only 97+7 microseconds (0.000104 s) before the connection was established from the point of view of the client, and only another 10 microseconds after that before the request was on its way. (Strictly speaking, past the tracepoint in the stack running on the client, but I rather doubt it was much longer before it was actually through the NIC and on the wire.)

It would be great if you could share your thoughts on this and on how
the performance of this special case could be improved.

You should repeat the above (at least the tcpdump of your actual PUTs, if not also the pure netperf test) on your setup. I am as much a fan of persistent connections as anyone, but I think you will find that TCP connection setup overhead, in wall-clock time, is not the biggest component of the response-time floor you have found.

That was suggesting tcpdump on the client. If you also take a tcpdump trace at the server, you can see the length of time inside the server between when the request arrived off the wire, and when the response was sent (queued to the driver at least).
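As a rough sketch of that server-side calculation (scapy again; the capture file name and server address are placeholders, and it assumes a single request/response exchange in the trace):

from scapy.all import rdpcap, IP, TCP

SERVER_IP = "10.0.0.2"                 # placeholder: address of the object server
last_req_t = first_resp_t = None

for pkt in rdpcap("server.pcap"):      # placeholder capture file name
    if IP not in pkt or TCP not in pkt:
        continue
    # bytes of TCP payload = IP total length - IP header - TCP header
    payload = pkt[IP].len - 4 * pkt[IP].ihl - 4 * pkt[TCP].dataofs
    if payload == 0:
        continue                       # ignore the handshake and bare ACKs
    if pkt[IP].dst == SERVER_IP and first_resp_t is None:
        last_req_t = pkt.time          # last request segment to arrive so far
    elif pkt[IP].src == SERVER_IP and first_resp_t is None:
        first_resp_t = pkt.time        # first byte of the response heading out

if last_req_t is not None and first_resp_t is not None:
    print("time spent inside the server: %.6f s" % float(first_resp_t - last_req_t))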

Caveat - you should not try to do math between absolute timestamps on the client and on the server unless you know that the two systems have really, Really, REALLY well synchronized clocks...

rick jones
http://www.netperf.org/

If you are still reading, for grins here is the TCP_RR test for the same sizes. Just no connection establishment overhead:

raj@tardy:~/netperf2_trunk$ src/netperf -H raj-8510w.americas.hpqcorp.net -t TCP_RR -l 30 -v 2 -- -r `expr 190 + 128`,128
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to internal-host.americas.hpqcorp.net () port 0 AF_INET : histogram : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  318      128     30.00    4114.35
16384  87380
Alignment      Offset         RoundTrip  Trans    Throughput
Local  Remote  Local  Remote  Latency    Rate     10^6bits/s
Send   Recv    Send   Recv    usec/Tran  per sec  Outbound   Inbound
    8      0       0      0   243.052   4114.347 10.467    4.213

Histogram of request/response times
UNIT_USEC     :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
TEN_USEC      :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
HUNDRED_USEC  :    0:  774: 117564: 4923:  102:   25:    9:    4:    4:    7
UNIT_MSEC     :    0:   14:    2:    1:    0:    0:    0:    1:    1:    1
TEN_MSEC      :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
HUNDRED_MSEC  :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
UNIT_SEC      :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
TEN_SEC       :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
>100_SECS: 0
HIST_TOTAL:      123432

Certainly, when/if there is very little service time in the server, a persistent connection will be "faster" - it will send rather fewer segments back and forth per transaction. But if the service times are rather larger than the network round-trip times involved, the gains from a persistent connection will be minimized. Here, for instance, the difference between the TCP_CRR and TCP_RR results is roughly 1/2340 - 1/4114, or about 0.2 ms per transaction - small next to the ~6 ms your object servers are already reporting per request.
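If you want a quick client-side sanity check of whether connection reuse moves your per-PUT floor at all, a rough sketch along these lines would do. The proxy URL, container name and token are placeholders for whatever your setup uses (the container is assumed to exist already), and it only exercises the client-to-proxy leg, not the proxy-to-storage-node connections you were asking about:

import time
import requests   # assumes the python-requests module is available

PROXY = "http://your-proxy:8080/v1/AUTH_test"   # placeholder account URL
TOKEN = "your-auth-token"                       # placeholder auth token
PAYLOAD = b"x" * 190                            # roughly the small-object size from your test
HEADERS = {"X-Auth-Token": TOKEN}

session = requests.Session()   # reuses one TCP connection to the proxy (keep-alive)

for label, put in (("new connection per PUT", requests.put),
                   ("persistent connection ", session.put)):
    start = time.time()
    for i in range(100):
        put("%s/bench/obj-%d" % (PROXY, i), data=PAYLOAD,
            headers=HEADERS).raise_for_status()
    print("%s: %.4f s per PUT" % (label, (time.time() - start) / 100))

If the per-PUT time barely changes between the two loops, that is one more hint that the floor is somewhere other than connection setup.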
