On Fri, 27 Oct 2006, Graham Leggett wrote:
Niklas Edmundsson wrote:
Different VHosts meaning different URLs/directories, pointing to the same
files...
Hmm... Two thoughts come into my head over this one.
One way to approach this is to treat this as a general problem of how do we
stop people who download the same file from multiple places (say different
mirrors via proxy, or different URLs to the backend like you have) from
downloading multiple copies of the same file hosted at different URLs.
Here you might have some kind of regex-like expression, like *.iso, that says
"all files whose names match this regex, are considered the same file". A
mechanism might have a small cache of filenames that have matched the regex
in the past, and that link to actual cached entries in the cache.
This would need to be abstracted out into an existing hook (or new one if
necessary).
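A minimal sketch of that first idea, in Python rather than httpd's C,
just to make the lookup concrete. The class name, the glob patterns and
the example hostnames are all hypothetical; nothing like this exists in
mod_cache as far as this thread says. It treats every URL whose
basename matches a pattern as the same object and points later URLs at
the first one seen:

```python
import fnmatch
import posixpath
from urllib.parse import urlsplit


class NameAliasIndex:
    """Small index of basenames that matched a pattern (hypothetical)."""

    def __init__(self, patterns):
        self.patterns = list(patterns)
        self.by_name = {}  # basename -> cache key of the first URL seen

    def cache_key(self, url):
        name = posixpath.basename(urlsplit(url).path)
        if not any(fnmatch.fnmatchcase(name, p) for p in self.patterns):
            return url  # not covered by any pattern: cache under own URL
        # First URL with this basename becomes the shared cache key.
        return self.by_name.setdefault(name, url)
```

Matching on the name alone assumes that equal names really mean equal
content, which is the weak point of this approach.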
A second approach could involve the use of the ETags associated with file
responses, which in the case of files served off disk (as I understand it)
are generated based on inode number and various other file-specific
information.
Therefore in theory two responses with the same ETag are actually the same
file, and if you've already cached a file with that ETag, then a small
ETag-to-entry cache like the one described above could provide a shortcut
to the same file cached at a different URL.
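To illustrate the ETag shortcut, here is a rough Python sketch. The
class and method names are made up for the example, and it assumes all
URLs are served by the same backend, since ETags from different servers
are not comparable. Bodies are stored once per ETag and each URL only
records which ETag it maps to:

```python
class EtagIndex:
    """Stores one body per ETag and maps URLs onto ETags (hypothetical)."""

    def __init__(self):
        self.body_by_etag = {}
        self.etag_by_url = {}

    def store(self, url, etag, body):
        self.etag_by_url[url] = etag
        # Keep the first body seen for this ETag; later URLs share it.
        self.body_by_etag.setdefault(etag, body)

    def lookup(self, url):
        etag = self.etag_by_url.get(url)
        if etag is None:
            return None
        return self.body_by_etag.get(etag)

    def stored_copies(self):
        return len(self.body_by_etag)
```

Note that the cache still has to see the response headers for a new URL
before it knows the ETag, so this saves cache space rather than the
first request to the backend.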
For our use, the following solves the "multiple URLs pointing to the
same file" problem: when caching a file larger than $threshold (we use
64k), write an "alias-header" entry for the URL that only says "this
URL equals r->filename", then hash on r->filename and cache the file
under that key. Reading follows the "alias-header" and opens the
cached file.
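The scheme can be sketched roughly as follows. This is Python with an
in-memory dict standing in for the disk cache, not our actual code;
AliasCache, the "url:"/"file:" key prefixes and the use of SHA-1 for
hashing are assumptions made for the illustration.

```python
import hashlib

THRESHOLD = 64 * 1024  # bytes; the 64k threshold mentioned above


def _key(kind, text):
    # Prefix keeps URL keys and filename keys in separate namespaces.
    return hashlib.sha1(f"{kind}:{text}".encode("utf-8")).hexdigest()


class AliasCache:
    def __init__(self):
        self.entries = {}  # key -> ("alias", filename) or ("body", bytes)

    def store(self, url, filename, body):
        if len(body) > THRESHOLD:
            # Large file: the URL entry only points at the filename entry.
            self.entries[_key("url", url)] = ("alias", filename)
            self.entries.setdefault(_key("file", filename), ("body", body))
        else:
            self.entries[_key("url", url)] = ("body", body)

    def fetch(self, url):
        entry = self.entries.get(_key("url", url))
        if entry is not None and entry[0] == "alias":
            entry = self.entries.get(_key("file", entry[1]))
        return None if entry is None else entry[1]
```

Small files are cached per URL as before, since the extra lookup is not
worth it for them.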
This only works with a filesystem backend, and it does not solve the
real problem of multiple symlinks pointing to the same file. The
symlink problem is a significant source of data duplication in the
cache for us, but I suspect that there must be a relatively clean
solution to this. I'm not particularly fond of the "stat each
component of the path" solution though, even though caching would
reduce the stat-hammering on the backend.
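One possible alternative, offered only as a sketch and not something
we run: key the cached body on the (device, inode) pair of the target
file instead of on the path. A single stat() on the full path follows
symlinks, so every symlink to a file yields the same pair. The function
name is hypothetical, and whether one extra stat per request is
acceptable on the backend is an open question.

```python
import os


def file_identity(path):
    """Return (device, inode) of the file that path finally resolves to."""
    st = os.stat(path)  # os.stat follows symlinks by default
    return (st.st_dev, st.st_ino)
```

Two paths with equal file_identity() values would then share one cache
entry, regardless of how many symlinks lead there.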
After reading Henrik's post, I suspect that the only way to do this
for a non-file backend is to use Content-MD5, and that sounds way too
expensive to be really usable...
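To show where the cost comes from, a small Python sketch (the function
name and chunk size are arbitrary choices for the example): the digest
can only be computed by reading every byte of the body, so the whole
response has to pass through the hash before two entries can be
recognised as the same.

```python
import base64
import hashlib


def content_md5(stream, chunk_size=64 * 1024):
    """Base64 MD5 digest of a stream, in the Content-MD5 header format."""
    digest = hashlib.md5()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)  # every byte of the body is hashed
    return base64.b64encode(digest.digest()).decode("ascii")
```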
/Nikke
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Niklas Edmundsson, Admin @ {acc,hpc2n}.umu.se | [EMAIL PROTECTED]
---------------------------------------------------------------------------
Confucious say too damn much!
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=