On 8/27/2014 3:22 PM, Leif Hedstrom wrote:
On Aug 27, 2014, at 1:51 PM, Nick Kew <n...@apache.org> wrote:
On Wed, 27 Aug 2014 16:17:17 +0000
Rasim Saltuk Alakuş <rala...@turksat.com.tr> wrote:
Hi All,
ATS uses a hash of the URL as the cache storage key, and the CacheUrl plugin
adds some more flexibility to the URL hashing strategy.
We are thinking of creating a hash based on the packet content and using it as
the key when storing to and retrieving from the cache. This looks like a better
solution, because URI changes would no longer hurt the caching system. One
immediate benefit: if you cache YouTube, each request for the same video can
have a different URL, and the CacheUrl plugin does not always provide a good
solution. Maintaining site-based hash filters does not look like an elegant
solution either.
Is there any previous or active work on implementing content-based hashing?
What kinds of problems and constraints do you foresee? Is there any volunteer
to implement this feature together with us?
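
To make the proposal concrete, here is a minimal sketch of the difference
between a URL-derived key and a content-derived key. It is not ATS code: the
function names, the example URLs, and the choice of SHA-256 are all assumptions
made for illustration.

```python
import hashlib


def url_key(url: str) -> str:
    """Cache key derived from the request URL (hypothetical helper)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def content_key(body: bytes) -> str:
    """Cache key derived from the response body (hypothetical helper)."""
    return hashlib.sha256(body).hexdigest()


# The same object served under two different URLs (made-up examples).
video = b"fake-video-bytes" * 100
url_a = "http://cdn1.example.com/videoplayback?id=abc&sig=111"
url_b = "http://cdn7.example.com/videoplayback?id=abc&sig=222"

# URL-based keys differ, so the object would be stored twice.
print(url_key(url_a) == url_key(url_b))

# Content-based keys are identical regardless of the URL used to fetch it.
print(content_key(video) == content_key(video))
```

Note that a content-derived key can only be computed once response bytes are
available, so a lookup made at request time still needs something else to go
on.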
Indeed, the whole scheme is BAD (Broken As Designed).
Using different URLs for common content breaks caching on
the Web at large, and hacking one agent (such as Trafficserver)
to work around it will gain you only a tiny fraction of what
you've thrown away. Indeed, if every agent on the Web -
from origin servers to desktop browsers - implemented this
caching scheme, you'd still lose MOST of the benefits of
caching, as the same content passes through different paths.
I thought some more about this during a boring meeting, and two more thoughts
come to mind:
1) Cache poisoning. This could be a serious problem; at a minimum, some defenses
such as using the Host: portion of the request for the cache key would be
required. But I’m guessing it would still be possible to abuse this to poison
the HTTP caches (since the client request + origin response headers no longer
dictate the cache lookup).
Good point on the cache poisoning. If the attacker knew your hash
generation strategy (e.g. hash the first 1000 bytes of the file) and had
access to a legitimate copy of that data, he could indeed inject bogus
data into the portion of the file that is not hashed.
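
The attack is easy to illustrate. The sketch below assumes a hypothetical
scheme that hashes only the first 1000 bytes of the body; the names
PREFIX_LEN and prefix_key are invented for this example and are not part
of ATS.

```python
import hashlib

PREFIX_LEN = 1000  # assumed: only this many leading bytes are hashed


def prefix_key(body: bytes, n: int = PREFIX_LEN) -> str:
    """Content key computed over the first n bytes only (hypothetical)."""
    return hashlib.sha256(body[:n]).hexdigest()


# A legitimate object: a 1000-byte header followed by the real payload.
legit = b"A" * PREFIX_LEN + b"legitimate payload"

# The attacker copies the legitimate prefix and swaps in a bogus tail.
forged = legit[:PREFIX_LEN] + b"attacker-controlled payload"

# The two bodies differ, yet they map to the same cache key.
print(legit != forged)
print(prefix_key(legit) == prefix_key(forged))
```

Hashing the entire body removes this particular collision, at the cost of
reading the whole object before the key is known.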
Given the large number of potential hosts for a CDN, I think you want to
generalize the host name before you add it to the lookup key. If the
host name matches your expectations for a CDN, you can use a fixed name
as part of the key. Otherwise, you use the host name as is.
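
A rough sketch of that idea follows. The pattern table, the example.com host
names, and the helper names generalize_host and lookup_key are all
hypothetical; a real deployment would need its own rules for what counts as a
CDN host.

```python
import hashlib
import re

# Assumed mapping from CDN host patterns to one fixed name (illustrative only).
CDN_HOST_PATTERNS = [
    (re.compile(r"[a-z0-9-]+\.cdn\.example\.com"), "cdn.example.com"),
]


def generalize_host(host: str) -> str:
    """Return a fixed name for recognized CDN hosts, else the host unchanged."""
    host = host.lower().rstrip(".")
    for pattern, fixed_name in CDN_HOST_PATTERNS:
        if pattern.fullmatch(host):
            return fixed_name
    return host


def lookup_key(host: str, content_digest: str) -> str:
    """Combine the generalized host with a content digest into one key."""
    material = f"{generalize_host(host)}\n{content_digest}"
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


# Two edge nodes of the same (made-up) CDN share a key for the same content.
print(lookup_key("r1.cdn.example.com", "abc") == lookup_key("r9.cdn.example.com", "abc"))

# An unrelated host gets a different key even for the same digest.
print(lookup_key("www.example.org", "abc") == lookup_key("r1.cdn.example.com", "abc"))
```

Keeping the host in the key means content from an unrecognized host cannot
replace an entry that belongs to the CDN, though it does nothing against a
forged response from within the matched host group.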