[
https://issues.apache.org/jira/browse/TS-3549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Bryan Call updated TS-3549:
---------------------------
Summary: Configurable option to avoid thundering herd due to concurrent
requests for the same object (was: configurable option to avoid thundering
herd due to concurrent requests for the same object)
> Configurable option to avoid thundering herd due to concurrent requests for
> the same object
> -------------------------------------------------------------------------------------------
>
> Key: TS-3549
> URL: https://issues.apache.org/jira/browse/TS-3549
> Project: Traffic Server
> Issue Type: New Feature
> Components: HTTP
> Affects Versions: 5.3.0
> Reporter: Sudheer Vinukonda
> Assignee: Sudheer Vinukonda
> Fix For: 6.0.0
>
> Attachments: TS-3549.diff
>
>
> When ATS is used as a delivery server for a video live streaming event, it's
> possible that there are a huge number of concurrent requests for the same
> object. Depending on the type of the object being requested, the cache lookup
> for those objects can result in either a stale copy of the object (e.g.
> manifest files) or a complete cache miss (e.g. segment files). ATS currently
> supports different forms of connection collapsing (e.g. the *read-while-writer*
> functionality -
> *https://docs.trafficserver.apache.org/en/latest/admin/http-proxy-caching.en.html#read-while-writer*,
> swr etc.), but, in order for *rww* to kick in, ATS requires the complete
> response headers for the object to be received and validated. In other words,
> until this happens, any number of incoming requests for the same object that
> result in a cache miss or a cache stale are forwarded to the origin. For a
> scenario such as a live event, this leaves a significant window during which
> hundreds of requests may be forwarded to the origin for the same object. It
> has been observed in production that this results in a significant increase
> in latency for the requests waiting in the read-while-writer state.
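> For reference, *rww* is controlled via {{records.config}}; a minimal sketch
> (the value here is illustrative only, and the linked documentation also lists
> related background-fill settings that affect whether *rww* applies):
> {code}
> # Enable read-while-writer so that waiting readers can be served from the
> # object currently being written by the first request (illustrative value).
> CONFIG proxy.config.cache.enable_read_while_writer INT 1
> {code}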
> Note that there are also a couple of settings,
> *proxy.config.http.cache.open_read_retry_time* and
> *proxy.config.http.cache.max_open_read_retries*
> (*https://docs.trafficserver.apache.org/en/latest/admin/http-proxy-caching.en.html#open-read-retry-timeout*),
> that can alleviate the thundering herd to some extent by retrying to get the
> read lock on the object as configured. With these configured, ATS retries to
> acquire the read lock for the configured duration; if the lock is still not
> available because the write lock is held by the first request that was
> forwarded to the origin (e.g. the response headers have not been received
> yet), then all the waiting requests are simply forwarded to the origin (with
> the cache disabled for each of them).
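> As a hedged {{records.config}} sketch of these retry settings (the values
> below are purely illustrative, not a recommendation):
> {code}
> # Retry the cache read lock up to 5 times, 10 ms apart (roughly a 50 ms
> # maximum wait); the right numbers depend entirely on the deployment.
> CONFIG proxy.config.http.cache.max_open_read_retries INT 5
> CONFIG proxy.config.http.cache.open_read_retry_time INT 10
> {code}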
> It is almost impossible to tune the above settings so that they help in all
> possible situations (traffic, concurrent connections, network conditions,
> etc.). For this reason, a configurable workaround is proposed below that
> avoids the thundering herd completely. The attached patch is mainly from
> [~jlaue] and [~psudaemon], with some additional cleanup, configuration
> control, and debug headers.
> Basically, when configured, on failing to obtain the write lock for an object
> (which means there is another in-flight request for the same object that has
> already been forwarded to the origin), ATS serves a stale copy of the object
> if it's a cache refresh miss, and returns a *502* error if it's a complete
> cache miss, letting the client (e.g. the player) reattempt. The *502* response
> also includes a special internal ATS header named {{@ats-internal-messages}}
> with an appropriate value, to allow custom logging or to let plugins take
> appropriate action (e.g. suppress a fail-over in a plugin that would otherwise
> fail over on a regular 502 error).
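> To make the proposed behavior concrete, here is a small, self-contained C++
> sketch of the decision logic only; it is not the code in TS-3549.diff, and the
> type/function names and header value below are hypothetical:
> {code:cpp}
> #include <iostream>
> #include <string>
>
> // Illustrative sketch of the proposed behavior only -- not the actual patch;
> // all types and names here are hypothetical stand-ins.
> struct Transaction {
>   bool has_stale_copy = false;  // true => refresh miss, stale copy available
> };
>
> void serve_stale_copy(Transaction &) { std::cout << "serving stale copy\n"; }
>
> void send_error(Transaction &, int status, const std::string &internal_hdr_value)
> {
>   // The 502 carries the internal @ats-internal-messages header so logging or
>   // plugins can distinguish it from a regular origin 502.
>   std::cout << "HTTP " << status
>             << " @ats-internal-messages: " << internal_hdr_value << "\n";
> }
>
> void forward_to_origin(Transaction &) { std::cout << "forwarding to origin\n"; }
>
> // Called when the cache write lock could not be obtained, i.e. another
> // request for the same object is already talking to the origin.
> void on_open_write_fail(Transaction &txn, bool feature_enabled)
> {
>   if (!feature_enabled) {
>     forward_to_origin(txn);      // current behavior: every waiter hits the origin
>   } else if (txn.has_stale_copy) {
>     serve_stale_copy(txn);       // refresh miss: answer with the stale object
>   } else {
>     send_error(txn, 502, "cache-open-write-fail");  // complete miss: client retries
>   }
> }
>
> int main()
> {
>   Transaction refresh_miss{true}, complete_miss{false};
>   on_open_write_fail(refresh_miss, true);   // -> serving stale copy
>   on_open_write_fail(complete_miss, true);  // -> HTTP 502 with internal header
> }
> {code}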
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)