Good Day to All,

I apologize in advance for the length of this note; it is a big topic.
Background
----------

I've been giving the matter of HTTP pipeline deployment some consideration. The barriers to deployment notwithstanding, I think it is something we want to be doing. Pipelining and concurrency are the two primary mechanisms HTTP provides for operating efficiently in high-latency environments. In many cases the Internet as seen by Firefox users has ever-increasing latency, which makes this a more pressing problem. Wireless links are part of the reason, but so are the increasing global reach of the network and the distance between users and resources.

Parallelism has served us well and will continue to do so, but it has its limits. Servers struggle with high connection rates, which in turn lead to strange standards-imposed requirements about levels of concurrency (and subsequent subversions of both the letter and the spirit of those requirements). I like the following blog post about the difficulties of handling high connection rates (as opposed to high transaction rates), because it is from someone very skilled at scaling server-side implementations: http://mailinator.blogspot.com/2008/08/benchmarking-talkinator.html

Beyond the connection setup issues, large levels of TCP parallelism have unpleasant interactions with TCP congestion control. To quickly summarize: a new connection doesn't know how much data is safe to send, so it selects a relatively conservative constant value and probes up from there. With a large enough amount of concurrency, that conservative constant gets multiplied into very non-conservative and unsafe values; for example, eight new connections that each start with a typical 3-4 segment initial window together inject 24-32 segments into a path none of them has probed. When the aggregate is too large, the result is packet loss. There are different levels of pain involved in packet loss depending on the state of the TCP connection and exactly which packets are lost, but without question the most painful kinds of loss are a connection setup packet or the last data packet of a short reply - and those are exactly the kinds of packets that get lost when a herd of connections all try to start at the same time.

Using fewer connections both sidesteps that congestion control problem and results in fewer handshakes, and both are good things. Each handshake requires an extra RTT (or more under TLS), so the fewer the better. The resulting TCP streams carry more data per stream, and a busy stream deals better with loss events, using TCP fast retransmit and friends, than a stream carrying just a small response does. A busy stream also has a more optimally auto-tuned congestion window. All in all, a better fit for the Internet and better network performance. Slides 25+ in this Google slide deck are relevant and interesting: http://dev.chromium.org/spdy/An_Argument_For_Changing_TCP_Slow_Start.pdf

We should do some measurements, of course. HTTP lacks multiplexing, so concurrency is an important tool for minimizing head-of-line blocking problems; concurrency and pipelining need to be balanced. TLS is another variable which might tilt the optimal balance.

Barriers
--------

Most of the Internet works just fine with pipelining. However, there have been enough failures in the past to justify investing in some mitigation techniques. These are some of the concerns:

#1 - Head-of-line blocking. A pipelined request is delayed until the requests in front of it are complete. The most egregious example of this is a comet-style hanging GET. We could work around that by not pipelining behind XHR; in other cases markup could be used to extend to the browser extra confidence that some resources will not block the line. (A sketch of what that classification might look like follows this list.)

#2 - Client-side proxies, possibly invisible ones, which break pipelining. Sometimes they just drop pipelined requests silently, and sometimes they reorder responses on the response channel, which is of course a corruption. Because this is, at least in most cases, an attribute of the client's network, Firefox could conceivably run a pipelining sanity test to determine whether the path is clean.

#3 - Server-side reverse proxies, which have failure modes similar to #2. In this case, however, a single sanity test does not provide useful information. I've read that Opera, which has pipelining on, maintains both a static blacklist (which they say has only a single-digit number of domains on it) and a dynamic one based on the browser's experience with different domains. I think we should do something similar; see below.

#4 - Connection fate sharing. In the case of packet losses that are repaired via timeouts, all transactions in the pipeline share that delay. If they were on independent parallel connections, the pain would be more localized. It is worth noting that losses on a busy connection are more likely to be repaired with some variant of fast retransmit than with a timeout. In those cases the overall throughput of the transaction is not meaningfully impacted, but a visible stall can be observed while out-of-order data continues to transfer in the background and we wait for the lost packet to be resent. For the last week or so I have been collecting data from my desktop to determine how often out-of-order packets arrive on the web and how long they are delayed in the reorder queue when that happens.
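To make #1 concrete, here is a minimal sketch of the kind of request classification I have in mind - in Python for brevity rather than as real Necko code, and with every name (Request, Pipeline, is_xhr, marked_nonblocking) invented for illustration. It encodes just the two rules above: nothing pipelines behind an outstanding XHR, and a markup hint can vouch that a resource will not block the line.

    # Hypothetical sketch only: decide whether a request may join a pipeline.
    from dataclasses import dataclass, field

    @dataclass
    class Request:
        url: str
        is_xhr: bool = False              # initiated via XMLHttpRequest
        marked_nonblocking: bool = False  # markup vouched it won't block

    @dataclass
    class Pipeline:
        outstanding: list = field(default_factory=list)

        def may_append(self, req: Request) -> bool:
            # An XHR may be a comet-style hanging GET; without a markup
            # hint, keep it (and anything behind it) off the pipeline.
            blocks = lambda r: r.is_xhr and not r.marked_nonblocking
            if blocks(req):
                return False
            return not any(blocks(r) for r in self.outstanding)

Under those rules, Pipeline(outstanding=[Request("/updates", is_xhr=True)]).may_append(Request("/logo.png")) comes back False and the image would go to a parallel connection instead.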
A Mish Mash of Experience
-------------------------

* Mark Nottingham recently summed up some of these same problems, along with some suggestions, in an Internet draft: http://www.ietf.org/internet-drafts/draft-nottingham-http-pipeline-00.txt

* There is a lot of mixed deployment experience but precious little data on the topic. Firefox has had its share of problems in the past, though I am not aware of a significant number of sites we can point at that show a problem. Here is a typical bugzilla that obviously harbors a server-side problem but doesn't reproduce the issue for me: https://bugzilla.mozilla.org/show_bug.cgi?id=465814 - any pointers to sites that consistently show problems would be greatly appreciated.

* Firefox Mobile 1.0 shipped with pipelining turned on. I've inquired enough to know that there hasn't been a torrent of complaints about it breaking things, at least ;)

* Opera reports that things generally work, but some workarounds are needed.

* A few years ago the Safari blog reported, in general terms, that there were too many problems and they turned pipelining off. I couldn't find any specific incidents; those would be useful for testing workarounds against.

Proposed Tasks
--------------

We're not faced with a simple bug fix. Any reliable answer is going to consist of a series of workarounds and heuristics.

#1 - Figure out how to measure. Pick some pages from the mumble-alexa-top-mumble and develop waterfall charts for them across a range of variables: pipelining depth (including 0), concurrency, TLS, RTT, etc. (A bare-bones sketch of the kind of probe such a harness needs follows.)
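To be clear about what a single cell of that test matrix would do, here is a bare-bones sketch, assuming a host of your choosing and plain HTTP. It writes a pipelined batch in one shot and records when each response's headers arrive. HEAD is used so responses carry no bodies and can be split on the blank line; a real harness would need full GET parsing plus the TLS, RTT, and concurrency knobs.

    # Minimal pipelining probe: seconds until each response's headers arrive.
    import socket, time

    def probe(host, paths, port=80):
        s = socket.create_connection((host, port))
        batch = "".join("HEAD %s HTTP/1.1\r\nHost: %s\r\n\r\n" % (p, host)
                        for p in paths).encode()
        start = time.time()
        s.sendall(batch)                  # all requests written back to back
        buf, arrivals = b"", []
        while len(arrivals) < len(paths):
            data = s.recv(4096)
            if not data:                  # early close: a pipelining failure
                break
            buf += data
            while b"\r\n\r\n" in buf:     # HEAD: one header block per response
                _, buf = buf.split(b"\r\n\r\n", 1)
                arrivals.append(time.time() - start)
        s.close()
        return arrivals                   # e.g. probe("example.com", ["/"] * 5)

Run against a server we control (the pretest idea in #3 below), arrivals spaced a full RTT apart would also expose a proxy that has quietly serialized the pipeline into lock-step requests.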
#2 - There are some internal Firefox algorithms that should probably get patches and be part of an iteration of those measurements. These are changes to how we implement pipelines, not changes to how we decide whether or not to pipeline. Bug 447866 is one of my favorites, just because I wrote the patch. Bug 329977 has suggestions for a number of different algorithms with implications for the head-of-line blocking problem (XHR separation among others). It should all be part of the test matrix, along with the information on reordered-packet wait times mentioned above.

#3 - In most circumstances, before using pipelining at the current location, Firefox can run a pretest to determine the properties of any client-side proxies. It would essentially send a deep pipeline to a moz server and confirm the validity of the responses received; they are easier to confirm in this scenario than with a random Internet server because the content can be known out of band. The pretest also gives the server an opportunity to identify pipelines that have been transformed into lock-step request streams by proxies (which many proxies do). Such transformations don't really break interop in any way, but they do suck the value right out of pipelining, so we might as well disable it when we know this is happening. We might also notice proxies spreading different parts of a pipeline onto parallel connections. IMO that is a rare event and can probably be correlated with the bugs in failure scenario #2, so we should disable pipelining in this case too; if it turns out to impact a significant number of cases, the decision can be revisited as overly conservative. Other bits of sanity testing can be tossed in as well: include a 1xx response, and check that an HTTP/1.1 request reaches the server and an HTTP/1.1 response is returned to the user agent. The pretest should probably be linked to going online, and IMO its result can be cached for a long time using the current set of local addresses as a cache key - we don't need to retest on every startup from the same place. The blacklist data (i.e. hosts known to be pipeline-incapable) can be piggybacked onto the pretest response.

#4 - Next, I think it makes sense to implement two other suggestions from draft-nottingham: verifying Content-MD5 checksums, and verifying a new assoc-req response header. Content-MD5 is specified by HTTP, while the assoc-req header exists only in the referenced draft. But implementation is easy, costs us nothing if the server does not send the headers (and next to nothing even if it does), and implementing them now gives servers an incentive to start sending them. Both mechanisms give the browser a way to identify a pipelining corruption, and if we can identify it we can work around it. Content-MD5 has a bugzilla (232030) with a patch attached right now.

#5 - A blacklist of hosts known not to play nice is an obvious thing to include.

#6 - In addition to using Content-MD5 and assoc-req to identify pipelining failures, we should treat an anticipated pipelined response that does not follow closely upon the last byte of the previous response on the same channel as a failure too. I favor a definition of "closely" along the lines of max(2 x latency of the first transaction, 300ms). At that point the request should be retried on a different channel; when the first byte finally does arrive, the channel not bearing the response should be closed, and if the channel that is closed is the original pipeline channel, that counts as a failure. Dynamic failures add the host to the dynamic blacklist for varying lengths of time depending on the severity of the error (i.e. a corruption gets you put there for months, while a timeout is the kind of thing to re-experiment with on an exponential-penalty basis, as it could be attributed to other things).

#7 - Again inspired by draft-nottingham: support some kind of meta information in the XHR space that would allow pipelining of XHR requests where we would normally disallow it over hanging-GET concerns. The syntax is something to be determined down the line.

Actually using Pipelining
-------------------------

Thanks for reading this far! Armed with the features from the above list, we have the tools to roll out pipeline support safely. If a client-wide pretest has been passed, a request may be pipelined according to the pipelining state (PS) of the host it is associated with: Red, Yellow, or Green.

Red: No pipelining. The host is on a blacklist. The block may be static, or it may be the result of a pipelining failure, in which case it will time out; hosts transition from Red to Yellow as failures expire.

Yellow: Hosts start in Yellow and are limited to a single pipelined request in this state. Success transitions the host to Green, failure to Red. I suggest using something contrived as the test transaction - favicon.ico would be absolutely ideal. In its absence, a pipelined "OPTIONS *" would add very little overhead to either client or server without subjecting any resource the page actually cares about to interference.

Green: Go go go. Fully mix pipelining with a little bit of parallelism. Failures move the host to Red for a variable amount of time depending on the nature of the failure. (A minimal sketch of this per-host state machine, with the corruption checks from #4 folded in, follows.)
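Here is that sketch, once more in Python with invented names rather than anything resembling the real implementation. body_is_corrupt shows the Content-MD5 comparison from #4 (assoc-req verification would slot in the same way), and on_failure implements the severity tiers from #6: corruption earns months in the blacklist, while timeouts back off exponentially. The penalty constants are placeholders.

    # Hypothetical per-host pipelining state machine: Red / Yellow / Green.
    import base64, hashlib, time

    RED, YELLOW, GREEN = "red", "yellow", "green"
    CORRUPTION_PENALTY = 90 * 24 * 3600   # months in the blacklist (#6)
    TIMEOUT_PENALTY    = 5 * 60           # base; doubled per repeat (#6)

    def body_is_corrupt(headers, body):
        # Content-MD5 check from #4; an absent header proves nothing.
        sent = headers.get("content-md5")
        if sent is None:
            return False
        return base64.b64encode(hashlib.md5(body).digest()).decode() != sent

    class HostState:
        def __init__(self):
            self.state = YELLOW           # hosts start in Yellow
            self.red_until = 0
            self.timeout_strikes = 0

        def may_pipeline(self, depth):
            if self.state == RED and time.time() >= self.red_until:
                self.state = YELLOW       # the failure has timed out
            if self.state == YELLOW:
                return depth <= 1         # single test transaction only
            return self.state == GREEN

        def on_success(self):
            if self.state == YELLOW:
                self.state = GREEN        # Go go go

        def on_failure(self, corruption):
            if corruption:
                penalty = CORRUPTION_PENALTY
            else:
                self.timeout_strikes += 1
                penalty = TIMEOUT_PENALTY * (2 ** self.timeout_strikes)
            self.state = RED
            self.red_until = time.time() + penalty

The Yellow-state test transaction would be the contrived one described above (favicon.ico or a pipelined "OPTIONS *"); a Content-MD5 or assoc-req mismatch feeds on_failure(corruption=True), while the response-gap timeout from #6 feeds on_failure(corruption=False).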
It sounds like a ton of stuff, but truthfully we can run full-blast pipelines with some confidence if a startup-time test succeeds and favicon.ico can come back piggybacked on GET /index.html (for example).

I vacillate on adding a fourth state between Yellow and Green, in which one connection would be allowed a full pipeline but any parallel connections would not. The theory is that the Yellow test as defined may not be robust enough to detect a high rate of failures, and this state would be easier to recover from quickly than a full Green one. But part of me says that in practice there won't be much difference between Green and this new state, so for the moment I lean against it.

To whatever extent we do telemetry, host-name failures as well as failure rates are very interesting things to record. Ideally this information is persistent, but its loading can be deferred at startup. As it is keyed on hostname, it could be coupled with a persistent DNS store.

Conclusion
----------

This is just a sketch, and it will take a while to do in terms of both code and deployment, but feedback is greatly sought.

Thanks,
Patrick
