Hello Per, this writeup is really well done, thank you for it!

The approach taken so far by the Apache Traffic Server plugin is to examine "Link: <...>; rel=duplicate" response headers. For example, here are the response headers from download.services.openoffice.org, which also uses MirrorBrain:

$ curl -D - -o /dev/null -s \
    http://download.services.openoffice.org/files/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz
HTTP/1.1 302 Found
Date: Sat, 02 Jun 2012 06:24:15 GMT
Server: Apache/2.2.22 (Linux/SUSE)
X-Prefix: 41.197.0.0/16
X-AS: 36934
X-MirrorBrain-Mirror: halifax.rwth-aachen.de
X-MirrorBrain-Realm: other_country
Link: 
<http://download.services.openoffice.org/files/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz.meta4>;
 rel=describedby; type="application/metalink4+xml"
Link: 
<http://download.services.openoffice.org/files/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz.torrent>;
 rel=describedby; type="application/x-bittorrent"
Link: 
<http://ftp.halifax.rwth-aachen.de/openoffice/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz>;
 rel=duplicate; pri=1; geo=de
Link: 
<http://ftp5.gwdg.de/pub/openoffice/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz>;
 rel=duplicate; pri=2; geo=de
Link: 
<http://ftp3.gwdg.de/pub/openoffice/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz>;
 rel=duplicate; pri=3; geo=de
Link: 
<http://ftp.cc.uoc.gr/openoffice.org/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz>;
 rel=duplicate; pri=4; geo=gr
Link: 
<http://ftp.ntua.gr/pub/OpenOffice/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz>;
 rel=duplicate; pri=5; geo=gr
Digest: MD5=chZROzRjy791zYb5mUhk3A==
Digest: SHA=nRgEtguiGxDlu8PKSxyBSc7TlGw=
Digest: SHA-256=VO2S9pgCq1lqgTFTKssVj6amn0npNdagtjI8ziDtiRQ=
Location: 
http://ftp.halifax.rwth-aachen.de/openoffice/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz
Content-Length: 395
Connection: close
Content-Type: text/html; charset=iso-8859-1

$

If a response has a "Location: ..." header and a "Link: <...>; rel=duplicate" header, then the Traffic Server plugin will check whether the URLs in these headers are already cached. If the "Location: ..." URL is not already cached but one of the "Link: <...>; rel=duplicate" URLs is, then the plugin will rewrite the "Location: ..." header to the cached URL.
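The rewrite decision can be sketched roughly as follows (this is an illustration, not the actual plugin code; is_cached is a hypothetical stand-in for the cache lookup):

```python
# Sketch of the Location-rewrite logic described above. is_cached is a
# hypothetical callable standing in for the proxy's cache check.
import re

def pick_location(location, link_headers, is_cached):
    """Return the URL the client should be redirected to.

    location     -- the URL from the "Location: ..." header
    link_headers -- list of raw "Link: ..." header values
    is_cached    -- callable(url) -> bool, a stand-in for the cache check
    """
    if is_cached(location):
        return location  # already cached, leave the redirect alone
    for value in link_headers:
        m = re.match(r'\s*<([^>]+)>\s*;\s*rel=duplicate', value)
        if m and is_cached(m.group(1)):
            return m.group(1)  # rewrite the redirect to the cached mirror
    return location  # nothing cached, pass the original redirect through

# Example: the second mirror is already cached, so the redirect is rewritten.
cached = {"http://ftp5.gwdg.de/pub/openoffice/example.tar.gz"}
print(pick_location(
    "http://ftp.halifax.rwth-aachen.de/openoffice/example.tar.gz",
    ["<http://ftp5.gwdg.de/pub/openoffice/example.tar.gz>; "
     "rel=duplicate; pri=2; geo=de"],
    cached.__contains__))
```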

This should redirect clients that are not Metalink aware to a mirror that is already cached. I would love any feedback on this approach.

The code so far is up on GitHub [1].

We are also thinking of examining "Digest: ..." headers. If a response has a "Location: ..." URL that is not already cached and a "Digest: ..." header, then the plugin would check the cache for content with a matching digest. If found, it would rewrite the "Location: ..." header to the cached URL.
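The digest idea could look something like this sketch. The digest index mapping (algorithm, value) pairs to cached URLs is an assumption for illustration, not an existing Traffic Server API:

```python
# Sketch of the proposed Digest-based lookup. The digest_index mapping
# is hypothetical; a real plugin would need the cache to maintain it.
import base64
import hashlib

def digest_headers_for(body):
    """Compute "Digest: ..." style values (base64-encoded binary digests,
    as in the example response above) for a cached body."""
    return {
        "MD5": base64.b64encode(hashlib.md5(body).digest()).decode(),
        "SHA-256": base64.b64encode(hashlib.sha256(body).digest()).decode(),
    }

def find_by_digest(digest_index, response_digests):
    """Return a cached URL whose stored digest matches one of the
    response's "Digest: ..." values, or None if nothing matches."""
    for algo, value in response_digests.items():
        url = digest_index.get((algo, value))
        if url:
            return url
    return None

# Example: index one cached body, then look it up by its digest.
body = b"example payload"
digest_index = {
    ("SHA-256", digest_headers_for(body)["SHA-256"]):
        "http://mirror.example/pub/file.tar.gz",
}
print(find_by_digest(digest_index, digest_headers_for(body)))
```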

This plugin is motivated by a similar problem to the one in your writeup. We run a caching proxy here at a rural village in Rwanda to improve our slow internet access, but many web sites don't predictably redirect users to the same download mirror, which defeats our cache.

> When you say "we're using Metalink as the mirror list", what do you
> mean?  One annoying item in my setup is the parsing of the HTML mirror
> page - you wouldn't happen to know of a way of retrieving the mirror
> list in XML format?

You can retrieve a Metalink/XML resource that describes where a file is mirrored, in XML format. I think the correct way to *discover* this resource is through a 'Link: <...>; rel=describedby; type="application/metalink4+xml"' header. Can anyone (Anthony?) confirm this?
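Discovery along these lines could be sketched as below (an illustration of the header matching, not plugin code; the example URLs are made up):

```python
# Sketch: find the Metalink/XML URL among a response's Link headers by
# looking for rel=describedby with the metalink4 media type.
import re

def find_metalink(link_headers):
    """Return the first describedby URL with the metalink4 media type,
    or None if the response advertises no Metalink/XML resource."""
    for value in link_headers:
        m = re.match(r'\s*<([^>]+)>(.*)', value)
        if not m:
            continue
        url, params = m.group(1), m.group(2)
        if ("rel=describedby" in params
                and 'type="application/metalink4+xml"' in params):
            return url
    return None

# Example headers (made up): one describedby link, one duplicate link.
headers = [
    '<http://example.org/file.tar.gz.meta4>; rel=describedby; '
    'type="application/metalink4+xml"',
    '<http://mirror.example/file.tar.gz>; rel=duplicate; pri=1',
]
print(find_metalink(headers))
```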

So for example, in the above download.services.openoffice.org example: http://download.services.openoffice.org/files/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz.meta4

However, I can't seem to get these same headers from download.opensuse.org. Both download.services.openoffice.org and download.opensuse.org seem to use MirrorBrain, so does anyone know why download.services.openoffice.org responses include a 'Link: <...>; rel=describedby; type="application/metalink4+xml"' header but download.opensuse.org responses do not?

$ curl -D - -o /dev/null -s \
    http://download.opensuse.org/factory/repo/oss/suse/x86_64/CharLS-devel-1.0-2.3.x86_64.rpm
HTTP/1.1 302 Found
Date: Sat, 02 Jun 2012 07:22:30 GMT
Server: Apache/2.2.12 (Linux/SUSE)
X-Prefix: 41.197.0.0/16
X-AS: 36934
X-MirrorBrain-Mirror: ftp5.gwdg.de
X-MirrorBrain-Realm: other_country
Location: 
http://ftp5.gwdg.de/pub/opensuse/factory/repo/oss/suse/x86_64/CharLS-devel-1.0-2.3.x86_64.rpm
Content-Length: 368
Content-Type: text/html; charset=iso-8859-1

$

More information on the "Link: <...>; rel=duplicate" and 'Link: <...>; rel=describedby; type="application/metalink4+xml"' headers is in RFC 6249, Metalink/HTTP: Mirrors and Hashes [2]. More information on the XML format that describes where a file is mirrored is in RFC 5854, The Metalink Download Description Format [3].
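To answer the mirror-list-in-XML question concretely: the RFC 5854 format is straightforward to parse. A minimal sketch, with a made-up sample document (the namespace and the priority/location attributes are from RFC 5854):

```python
# Sketch: extract (priority, url) pairs from a Metalink/XML (.meta4)
# document as defined in RFC 5854.
import xml.etree.ElementTree as ET

NS = {"m": "urn:ietf:params:xml:ns:metalink"}

def mirror_urls(meta4_text):
    """Return a list of (priority, url) pairs, lowest priority first."""
    root = ET.fromstring(meta4_text)
    urls = []
    for url in root.findall("m:file/m:url", NS):
        # RFC 5854: priority ranges 1-999999; treat a missing attribute
        # as lowest preference.
        urls.append((int(url.get("priority", "999999")), url.text))
    return sorted(urls)

# Made-up sample document in the RFC 5854 format.
sample = """<?xml version="1.0" encoding="UTF-8"?>
<metalink xmlns="urn:ietf:params:xml:ns:metalink">
  <file name="example.tar.gz">
    <url priority="1" location="de">http://ftp.halifax.rwth-aachen.de/example.tar.gz</url>
    <url priority="2" location="de">http://ftp5.gwdg.de/pub/example.tar.gz</url>
  </file>
</metalink>"""

for pri, url in mirror_urls(sample):
    print(pri, url)
```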

> Switching off segmented downloading is interesting too, but I wanted an
> environment where the regular openSUSE install process would work with
> zero modifications.  For instance, imagine a student wanting to install
> a PC in the lab - grab the NET-install ISO, copy it to a USB stick and
> boot.  No need to know the proxy, no need to know about a switch for
> segmented downloading, just pop in the USB stick and go with the
> defaults.  Same goes for later updates and additional software - that
> Squid is helping out in the background should be 100% transparent.

I've only considered complete downloads so far, although I can see that segmented downloads will be an issue for our cache too. I'm not sure what the current status of partial-response support in Traffic Server is. I know it is an issue: it comes up on the mailing list fairly regularly, and some improvements to handling partial responses have recently been made.

It would be neat if, once the cache is aware of requests for the same content from different mirrors, and once it is able to cache segmented downloads, it could also be made aware of requests for the same segment from different mirrors. Then, after one client assembled a complete download from segments fetched from possibly many different mirrors, the cache would contain the complete content and could respond to requests from subsequent clients for any segment from any mirror.
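A speculative sketch of that idea: key cached segments by a canonical content identifier rather than by mirror URL, so segments fetched from different mirrors assemble into one entry. The canonical identifier (derived, say, from rel=duplicate or Digest headers) is an assumption here, not an existing Traffic Server feature:

```python
# Speculative sketch: a segment cache keyed by (canonical content id,
# byte range) instead of mirror URL, so segments from different mirrors
# of the same file assemble into one complete body.

class SegmentCache:
    def __init__(self):
        self.segments = {}  # (content_id, first_byte, last_byte) -> bytes

    def store(self, content_id, first, last, data):
        self.segments[(content_id, first, last)] = data

    def assemble(self, content_id, total_length):
        """Return the full body if contiguous segments cover all of
        total_length bytes, else None."""
        parts = sorted((k[1], k[2], v) for k, v in self.segments.items()
                       if k[0] == content_id)
        pos, chunks = 0, []
        for first, last, data in parts:
            if first != pos:
                return None  # gap: the download is not yet complete
            chunks.append(data)
            pos = last + 1
        return b"".join(chunks) if pos == total_length else None

# Example: two segments of one file, fetched from different mirrors,
# assemble under the same (hypothetical) canonical id.
cache = SegmentCache()
cache.store("sha256:abc", 0, 4, b"hello")  # first half, from mirror A
cache.store("sha256:abc", 5, 9, b"world")  # second half, from mirror B
print(cache.assemble("sha256:abc", 10))
```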

Your solution of logging partial downloads and then downloading them completely sounds like a good workaround.

[1] https://github.com/jablko/dedup
[2] http://tools.ietf.org/html/rfc6249
[3] http://tools.ietf.org/html/rfc5854


_______________________________________________
mirrorbrain mailing list
Archive: http://mirrorbrain.org/archive/mirrorbrain/
