[jira] [Created] (TS-3549) configurable option to avoid thundering herd due to concurrent requests for the same object

Sudheer Vinukonda (JIRA) Wed, 22 Apr 2015 18:53:09 -0700

Sudheer Vinukonda created TS-3549:
-------------------------------------

             Summary: configurable option to avoid thundering herd due to 
concurrent requests for the same object
                 Key: TS-3549
                 URL: https://issues.apache.org/jira/browse/TS-3549
             Project: Traffic Server
          Issue Type: New Feature
          Components: HTTP
            Reporter: Sudheer Vinukonda

When ATS is used as a delivery server for a live video streaming event, there
can be a huge number of concurrent requests for the same object. Depending on
the type of object requested, the cache lookup can return either a stale copy
of the object (e.g. manifest files) or a complete cache miss (e.g. segment
files). ATS currently supports different types of connection collapsing (e.g.
the *read-while-writer* functionality -
*https://docs.trafficserver.apache.org/en/latest/admin/http-proxy-caching.en.html#read-while-writer*),
but for this to kick in, ATS requires that the complete response headers for
the object be received and validated. In other words, until that happens, every
incoming request for the same object that results in a cache miss or a stale
cache hit is forwarded to the origin. For a scenario such as a live event, this
leaves a significant window during which hundreds of requests can be forwarded
to the origin for the same object. It has been observed in production that this
results in a significant increase in latency for the requests waiting in the
read-while-writer state.

Note that there are also a couple of settings,
*proxy.config.http.cache.open_read_retry_time* and
*proxy.config.http.cache.max_open_read_retries*
(*https://docs.trafficserver.apache.org/en/latest/admin/http-proxy-caching.en.html#open-read-retry-timeout*),
that can alleviate the thundering herd to some extent by retrying to get the
read lock for the object as configured. With these configured, ATS keeps
retrying to get the read lock for the configured duration; if the lock is still
not available because the write lock is held by the first request that was
forwarded to the origin (e.g. the response headers have not been received
yet), then all the waiting requests are simply forwarded to the origin (with
caching disabled for each of them).
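
The retry behaviour described above can be pictured with a small simulation.
This is only an illustrative Python sketch, not ATS code: try_read_lock is a
hypothetical callable standing in for the cache read-lock attempt, the two
parameters merely mirror the names of the settings, and the "one initial
attempt plus N retries" loop shape is an assumption.

```python
import time
from typing import Callable


def open_read_with_retries(
    try_read_lock: Callable[[], bool],   # hypothetical stand-in for the cache read-lock attempt
    max_open_read_retries: int,          # mirrors proxy.config.http.cache.max_open_read_retries
    open_read_retry_time_ms: int,        # mirrors proxy.config.http.cache.open_read_retry_time
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return "serve_from_cache" if the read lock is obtained in time,
    otherwise "forward_to_origin" (the thundering-herd case)."""
    # Assumption: one initial attempt, then up to max_open_read_retries retries.
    for attempt in range(max_open_read_retries + 1):
        if try_read_lock():
            return "serve_from_cache"
        if attempt < max_open_read_retries:
            # Wait before the next retry; the writer may get its headers meanwhile.
            sleep(open_read_retry_time_ms / 1000.0)
    # Retries exhausted while the writer still holds the lock.
    return "forward_to_origin"


def max_wait_ms(max_open_read_retries: int, open_read_retry_time_ms: int) -> int:
    """Upper bound on how long a request waits before going to the origin."""
    return max_open_read_retries * open_read_retry_time_ms
```

Under these assumptions, a request is only protected from going to the origin
if the first response's headers arrive within max_wait_ms(...) milliseconds,
which is why the settings are hard to tune for every origin latency.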

It is almost impossible to tune the above settings to help in all possible
situations (traffic, concurrent connections, network conditions, etc.). For
this reason, a configurable workaround is proposed below that avoids the
thundering herd completely.

Basically, when configured, on failing to obtain a write lock for an object
(which means there is another ongoing parallel request for the same object that
was forwarded to the origin), if it's a cache refresh miss, a stale copy of the
object is served, while if it's a complete cache miss, a *502* error is
returned to let the client (e.g. a player) retry. The *502* error also
includes a special internal ATS header named {{@ats-internal-messages}} with
an appropriate value to allow for custom logging or for plugins to take any
appropriate action (e.g. prevent a fail-over if there is a plugin that
fails over on a regular 502 error).
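
The proposed decision on write-lock failure could be sketched roughly as
follows. This is a hedged Python illustration of the logic only, not the ATS
implementation: the Response type and on_write_lock_failure function are
hypothetical, the header value is a made-up placeholder (the issue does not
specify it), and serving the stale copy with status 200 is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Header name as given in the issue description.
INTERNAL_HEADER = "@ats-internal-messages"
# Hypothetical placeholder; the actual value is not specified in the issue.
HYPOTHETICAL_HEADER_VALUE = "write-lock-unavailable"


@dataclass
class Response:
    """Minimal hypothetical response type for this sketch."""
    status: int
    body: bytes = b""
    headers: Dict[str, str] = field(default_factory=dict)


def on_write_lock_failure(stale_copy: Optional[bytes]) -> Response:
    """Decide what to return when the write lock is held by another request.

    stale_copy is the stale cached object for a cache refresh miss,
    or None for a complete cache miss.
    """
    if stale_copy is not None:
        # Cache refresh miss: serve the stale copy instead of hitting the origin.
        # Assumption: the stale object is returned as a normal 200 response.
        return Response(status=200, body=stale_copy)
    # Complete cache miss: ask the client to retry, and tag the response so
    # that logging or plugins can tell it apart from a regular 502.
    return Response(
        status=502,
        headers={INTERNAL_HEADER: HYPOTHETICAL_HEADER_VALUE},
    )
```

In this sketch no waiting request is ever forwarded to the origin, so only the
request holding the write lock reaches it.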

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)