On Fri, 27 Oct 2006, Graham Leggett wrote:

Niklas Edmundsson wrote:

Different VHosts meaning different URLs/directories, pointing to the same files...

Hmm... Two thoughts come into my head over this one.

One way to approach this is to treat it as a general problem: how do we stop people who request the same file from multiple places (say, different mirrors via a proxy, or different URLs to the backend, as in your case) from downloading multiple copies of the same file hosted at different URLs?

Here you might have some kind of regex-like expression, like *.iso, that says "all files whose names match this regex are considered the same file". A mechanism might keep a small cache of filenames that have matched the regex in the past, each linking to the actual cached entry in the cache.
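A rough sketch of such a filename index, in Python for brevity. The class name, the glob-style patterns, and the choice to treat two URLs as "the same file" when their basenames are equal are all assumptions made for illustration; nothing like this exists in mod_cache today.

```python
import fnmatch
import posixpath
from urllib.parse import urlsplit


class PatternAliasIndex:
    """Hypothetical index: basename of a matching URL -> one shared cache key."""

    def __init__(self, patterns):
        self.patterns = list(patterns)  # e.g. ["*.iso"]
        self.by_name = {}               # basename -> cache key of first URL seen

    def cache_key(self, url):
        name = posixpath.basename(urlsplit(url).path)
        if not any(fnmatch.fnmatchcase(name, p) for p in self.patterns):
            return url                  # not covered: cache under its own URL
        # The first URL seen for this filename becomes the shared entry.
        return self.by_name.setdefault(name, url)
```

With this, http://mirror1.example/pub/foo.iso and http://mirror2.example/other/foo.iso would resolve to the same cache key, while URLs that do not match *.iso keep their own. It trusts the filename completely, so two different files that happen to share a name would collide.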

This would need to be abstracted out into an existing hook (or new one if necessary).

A second approach could involve the use of the ETags associated with file responses, which in the case of files served off disk (as I understand it) are generated based on inode number and various other file-specific information.

Therefore in theory two responses with the same ETag are actually the same file, and if you've already cached a file with that ETag, then the same quick-cache scenario described above, keyed on ETag, could provide a shortcut to the same file cached at a different URL.
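As a sketch of the ETag shortcut (Python, with hypothetical names throughout): keep a second index from ETag to the cached body, and consult it once the backend has reported an ETag for a URL that is not cached yet. It assumes ETags from the backend are comparable across URLs, which HTTP itself does not promise.

```python
class ETagIndex:
    """Hypothetical cache with a secondary ETag -> body index."""

    def __init__(self):
        self.by_url = {}   # url -> etag
        self.by_etag = {}  # etag -> cached body (one copy per ETag)

    def store(self, url, etag, body):
        self.by_url[url] = etag
        self.by_etag.setdefault(etag, body)  # never store a second copy

    def lookup_by_etag(self, url, etag):
        """Call once the backend has reported `etag` for `url`."""
        body = self.by_etag.get(etag)
        if body is not None:
            self.by_url[url] = etag          # remember the new URL as an alias
        return body
```

The shortcut only helps after response headers (or a conditional request) have revealed the ETag, so it saves the body transfer and the duplicate storage, not the round trip to the backend.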

For our use, the following solves the "multiple URLs point to the same file" problem: When caching the file, if the file is larger than $threshold (we use 64k), write an "alias-header" that only says "this URL equals r->filename". Then hash on r->filename and cache the file. Reading the file follows the "alias-header" and opens the cached file.
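In rough Python terms the scheme looks something like this. It is an in-memory illustration only: the real thing is C inside the disk cache, and the class name, the SHA-1 hash, and the tuple layout are made up for the example.

```python
import hashlib

THRESHOLD = 64 * 1024  # the 64k threshold mentioned above


def _key(s):
    # Stand-in for however the cache hashes its keys.
    return hashlib.sha1(s.encode("utf-8")).hexdigest()


class AliasCache:
    def __init__(self):
        # hash -> ("alias", filename) or ("body", bytes)
        self.entries = {}

    def store(self, url, filename, body):
        if len(body) > THRESHOLD:
            # Alias-header: "this URL equals r->filename".
            self.entries[_key(url)] = ("alias", filename)
            # Body is cached once, hashed on the filename.
            self.entries.setdefault(_key(filename), ("body", body))
        else:
            self.entries[_key(url)] = ("body", body)

    def fetch(self, url):
        entry = self.entries.get(_key(url))
        if entry is None:
            return None
        kind, value = entry
        if kind == "alias":
            # Follow the alias-header to the entry hashed on the filename.
            target = self.entries.get(_key(value))
            return target[1] if target else None
        return value
```

Two URLs that map to the same r->filename then cost two small alias entries and one body, instead of two bodies. Small files are stored directly under the URL, since the extra lookup is not worth it for them.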

This only works with a filesystem backend, and it does not solve the real problem of multiple symlinks pointing to the same file. The symlink problem is a significant source of data duplication in the cache for us, but I suspect that there must be a relatively clean solution to this. I'm not particularly fond of the "stat each component of the path" solution though, even though caching would reduce the stat-hammering on the backend.
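For reference, the "stat each component of the path" approach with caching amounts to roughly the following. This is a sketch in Python, not what we run; the function name and the cache size are arbitrary.

```python
import functools
import os


@functools.lru_cache(maxsize=4096)
def canonical_filename(filename):
    # realpath() resolves symlinks one path component at a time, which is
    # the per-component lookup cost in question. The lru_cache keeps
    # repeated lookups away from the backend, at the price of stale
    # answers if a symlink is re-pointed while its entry is cached.
    return os.path.realpath(filename)
```

Hashing on canonical_filename(r->filename) instead of r->filename would make all symlinked paths share one cached body, but every cache miss in the lookup table still walks the whole path.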

After reading Henrik's post, I suspect that the only way to do this for a non-file backend is to use Content-MD5, and that sounds way too expensive to be really usable...

/Nikke
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
 Niklas Edmundsson, Admin @ {acc,hpc2n}.umu.se      |     [EMAIL PROTECTED]
---------------------------------------------------------------------------
 Confucious say too damn much!
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
