Hello Per, this writeup is really well done, thank you for it!
The approach so far taken by the Apache Traffic Server plugin is to
examine "Link: <...>; rel=duplicate" response headers. For example here
are response headers from download.services.openoffice.org, which also
uses MirrorBrain:
$ curl -D - -o /dev/null -s \
    http://download.services.openoffice.org/files/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz
HTTP/1.1 302 Found
Date: Sat, 02 Jun 2012 06:24:15 GMT
Server: Apache/2.2.22 (Linux/SUSE)
X-Prefix: 41.197.0.0/16
X-AS: 36934
X-MirrorBrain-Mirror: halifax.rwth-aachen.de
X-MirrorBrain-Realm: other_country
Link:
<http://download.services.openoffice.org/files/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz.meta4>;
rel=describedby; type="application/metalink4+xml"
Link:
<http://download.services.openoffice.org/files/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz.torrent>;
rel=describedby; type="application/x-bittorrent"
Link:
<http://ftp.halifax.rwth-aachen.de/openoffice/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz>;
rel=duplicate; pri=1; geo=de
Link:
<http://ftp5.gwdg.de/pub/openoffice/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz>;
rel=duplicate; pri=2; geo=de
Link:
<http://ftp3.gwdg.de/pub/openoffice/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz>;
rel=duplicate; pri=3; geo=de
Link:
<http://ftp.cc.uoc.gr/openoffice.org/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz>;
rel=duplicate; pri=4; geo=gr
Link:
<http://ftp.ntua.gr/pub/OpenOffice/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz>;
rel=duplicate; pri=5; geo=gr
Digest: MD5=chZROzRjy791zYb5mUhk3A==
Digest: SHA=nRgEtguiGxDlu8PKSxyBSc7TlGw=
Digest: SHA-256=VO2S9pgCq1lqgTFTKssVj6amn0npNdagtjI8ziDtiRQ=
Location:
http://ftp.halifax.rwth-aachen.de/openoffice/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz
Content-Length: 395
Connection: close
Content-Type: text/html; charset=iso-8859-1
$
If a response has a "Location: ..." header and a "Link: <...>;
rel=duplicate" header, then the Traffic Server plugin checks whether the
URLs in these headers are already cached. If the "Location: ..." URL is
not already cached but a "Link: <...>; rel=duplicate" URL is, then the
plugin rewrites the "Location: ..." header with the cached URL.
This should redirect clients that are not Metalink aware to a mirror
that is already cached. I would love any feedback on this approach.
The code so far is up on GitHub [1].
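Roughly, the rewrite logic looks like this (a Python sketch only; the real plugin is written against the Traffic Server C API, and cache_contains here stands in for the cache-lookup call):

```python
import re

def rewrite_location(headers, cache_contains):
    """Rewrite the Location header to point at a cached mirror, if any.

    headers: list of (name, value) response header tuples.
    cache_contains: callable returning True if a URL is already cached
    (a stand-in for the Traffic Server cache-lookup API).
    """
    location = next((v for n, v in headers if n.lower() == "location"), None)
    if location is None or cache_contains(location):
        return headers  # nothing to do: no redirect, or already cached

    # Collect rel=duplicate alternatives from the Link headers.
    duplicates = []
    for name, value in headers:
        if name.lower() != "link":
            continue
        match = re.match(r'\s*<([^>]+)>(.*)', value)
        if match and "rel=duplicate" in match.group(2):
            duplicates.append(match.group(1))

    # MirrorBrain emits the Link headers in priority (pri=) order,
    # so the first cached duplicate is the preferred one.
    for url in duplicates:
        if cache_contains(url):
            return [(n, url if n.lower() == "location" else v)
                    for n, v in headers]
    return headers
```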
We are also thinking of examining "Digest: ..." headers. If a response
has a "Location: ..." header whose URL is not already cached, as well as
a "Digest: ..." header, then the plugin would check the cache for a
matching digest. If one is found, it would rewrite the "Location: ..."
header with the cached URL.
This plugin is motivated by a similar problem to the one in your
writeup. We run a caching proxy here in a rural village in Rwanda to
improve our slow internet access. But many web sites don't predictably
redirect users to the same download mirror, which defeats our cache.
> When you say "we're using Metalink as the mirror list", what do you
> mean? One annoying item in my setup is the parsing of the HTML mirror
> page - you wouldn't happen to know of a way of retrieving the mirror
> list in XML format?
You can retrieve a Metalink/XML resource that includes information about
where a file is mirrored, in XML format. I think the correct way to
*discover* this resource is through a 'Link: <...>; rel=describedby;
type="application/metalink4+xml"' header. Can anyone (Anthony?) confirm
that this is the correct way?
So for example, in the above download.services.openoffice.org example:
http://download.services.openoffice.org/files/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz.meta4
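If that discovery mechanism is indeed the right one, picking the Metalink/XML URL out of the Link headers is straightforward; a minimal sketch in Python:

```python
def find_metalink(link_headers):
    """Pick the Metalink/XML description URL out of Link header values.

    link_headers: list of Link header values, each of the form
    '<URL>; rel=...; type="..."'.
    """
    for value in link_headers:
        url, _, params = value.partition(">")
        if ("rel=describedby" in params
                and 'type="application/metalink4+xml"' in params):
            return url.lstrip().lstrip("<")
    return None
```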
However, I can't seem to get these same headers from
download.opensuse.org. Both download.services.openoffice.org and
download.opensuse.org seem to use MirrorBrain. Does anyone know why
download.services.openoffice.org responses include a 'Link: <...>;
rel=describedby; type="application/metalink4+xml"' header but
download.opensuse.org responses don't?
$ curl -D - -o /dev/null -s \
    http://download.opensuse.org/factory/repo/oss/suse/x86_64/CharLS-devel-1.0-2.3.x86_64.rpm
HTTP/1.1 302 Found
Date: Sat, 02 Jun 2012 07:22:30 GMT
Server: Apache/2.2.12 (Linux/SUSE)
X-Prefix: 41.197.0.0/16
X-AS: 36934
X-MirrorBrain-Mirror: ftp5.gwdg.de
X-MirrorBrain-Realm: other_country
Location:
http://ftp5.gwdg.de/pub/opensuse/factory/repo/oss/suse/x86_64/CharLS-devel-1.0-2.3.x86_64.rpm
Content-Length: 368
Content-Type: text/html; charset=iso-8859-1
$
More information on the "Link: <...>; rel=duplicate" and 'Link: <...>;
rel=describedby; type="application/metalink4+xml"' headers is in RFC
6249, Metalink/HTTP: Mirrors and Hashes [2]. More information on the XML
format that describes where a file is mirrored is in RFC 5854, The
Metalink Download Description Format [3].
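For what it's worth, pulling the mirror list out of an RFC 5854 document is simple; a sketch in Python, assuming the standard Metalink 4 XML namespace:

```python
import xml.etree.ElementTree as ET

# The XML namespace defined by RFC 5854 for Metalink 4 documents.
META4_NS = "{urn:ietf:params:xml:ns:metalink}"

def mirror_urls(meta4_xml):
    """Return the mirror URLs from a Metalink/XML document, sorted by
    priority (lowest number first, as in RFC 5854)."""
    root = ET.fromstring(meta4_xml)
    urls = []
    for url in root.iter(META4_NS + "url"):
        # Unprioritized URLs sort last.
        urls.append((int(url.get("priority", "999999")), url.text))
    return [u for _, u in sorted(urls)]
```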
> Switching off segmented downloading is interesting too, but I wanted an
> environment where the regular openSUSE install process would work with
> zero modifications. For instance, imagine a student wanting to install
> a PC in the lab - grab the NET-install ISO, copy it to a USB stick and
> boot. No need to know the proxy, no need to know about a switch for
> segmented downloading, just pop in the USB stick and go with the
> defaults. Same goes for later updates and additional software - that
> Squid is helping out in the background should be 100% transparent.
I've only considered complete downloads so far, although I can see that
segmented downloads will be an issue for our cache as well. I'm not sure
what the current status of support for partial responses in Traffic
Server is. I know it's an issue; it comes up on the mailing list fairly
regularly, and some improvements to handling partial responses have
recently been made.
It would be neat if, once the cache is aware of requests for the same
content from different mirrors, and once it is able to cache segmented
downloads, it could also be made aware of requests for the same segment
from different mirrors. Then, after one client assembles a complete
download from segments fetched from possibly many different mirrors, the
cache would contain the complete content, and could respond to requests
from subsequent clients for any segment from any mirror.
Your solution of logging partial downloads and then downloading them
completely sounds like a good workaround.
[1] https://github.com/jablko/dedup
[2] http://tools.ietf.org/html/rfc6249
[3] http://tools.ietf.org/html/rfc5854
_______________________________________________
mirrorbrain mailing list
Archive: http://mirrorbrain.org/archive/mirrorbrain/