Hello Per, this writeup is really well done, thank you for it!

The approach taken so far by the Apache Traffic Server plugin is to examine "Link: <...>; rel=duplicate" response headers. For example, here are the response headers from download.services.openoffice.org, which also uses MirrorBrain:

$ curl -D - -o /dev/null -s \
    http://download.services.openoffice.org/files/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz
HTTP/1.1 302 Found
Date: Sat, 02 Jun 2012 06:24:15 GMT
Server: Apache/2.2.22 (Linux/SUSE)
X-Prefix: 41.197.0.0/16
X-AS: 36934
X-MirrorBrain-Mirror: halifax.rwth-aachen.de
X-MirrorBrain-Realm: other_country
Link: 
<http://download.services.openoffice.org/files/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz.meta4>;
 rel=describedby; type="application/metalink4+xml"
Link: 
<http://download.services.openoffice.org/files/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz.torrent>;
 rel=describedby; type="application/x-bittorrent"
Link: 
<http://ftp.halifax.rwth-aachen.de/openoffice/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz>;
 rel=duplicate; pri=1; geo=de
Link: 
<http://ftp5.gwdg.de/pub/openoffice/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz>;
 rel=duplicate; pri=2; geo=de
Link: 
<http://ftp3.gwdg.de/pub/openoffice/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz>;
 rel=duplicate; pri=3; geo=de
Link: 
<http://ftp.cc.uoc.gr/openoffice.org/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz>;
 rel=duplicate; pri=4; geo=gr
Link: 
<http://ftp.ntua.gr/pub/OpenOffice/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz>;
 rel=duplicate; pri=5; geo=gr
Digest: MD5=chZROzRjy791zYb5mUhk3A==
Digest: SHA=nRgEtguiGxDlu8PKSxyBSc7TlGw=
Digest: SHA-256=VO2S9pgCq1lqgTFTKssVj6amn0npNdagtjI8ziDtiRQ=
Location: 
http://ftp.halifax.rwth-aachen.de/openoffice/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz
Content-Length: 395
Connection: close
Content-Type: text/html; charset=iso-8859-1

$

If a response has a "Location: ..." header and a "Link: <...>; rel=duplicate" header, then the Traffic Server plugin will check whether the URLs in these headers are already cached. If the "Location: ..." URL is not already cached but one of the "Link: <...>; rel=duplicate" URLs is, then the plugin will rewrite the "Location: ..." header to the cached URL.
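The rewrite decision can be sketched roughly as follows (this is an illustration, not the actual plugin code; is_cached is a hypothetical stand-in for the cache lookup):

```python
# Sketch of the Location-rewrite logic described above. is_cached is a
# hypothetical callable standing in for the proxy's cache check.
import re

def pick_location(location, link_headers, is_cached):
    """Return the URL the client should be redirected to.

    location     -- the URL from the "Location: ..." header
    link_headers -- list of raw "Link: ..." header values
    is_cached    -- callable(url) -> bool, a stand-in for the cache check
    """
    if is_cached(location):
        return location  # already cached, leave the redirect alone
    for value in link_headers:
        m = re.match(r'\s*<([^>]+)>\s*;\s*rel=duplicate', value)
        if m and is_cached(m.group(1)):
            return m.group(1)  # rewrite the redirect to the cached mirror
    return location  # nothing cached, pass the original redirect through

# Example: the second mirror is already cached, so the redirect is rewritten.
cached = {"http://ftp5.gwdg.de/pub/openoffice/example.tar.gz"}
print(pick_location(
    "http://ftp.halifax.rwth-aachen.de/openoffice/example.tar.gz",
    ["<http://ftp5.gwdg.de/pub/openoffice/example.tar.gz>; "
     "rel=duplicate; pri=2; geo=de"],
    cached.__contains__))
```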

This should redirect clients that are not Metalink aware to a mirror that is already cached. I would love any feedback on this approach.

The code so far is up on GitHub [1].

We are also thinking of examining "Digest: ..." headers. If a response has a "Location: ..." URL that is not already cached and a "Digest: ..." header, then the plugin would check the cache for content with a matching digest. If found, it would rewrite the "Location: ..." header to the cached URL.
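The digest idea could look something like this sketch. The digest index mapping (algorithm, value) pairs to cached URLs is an assumption for illustration, not an existing Traffic Server API:

```python
# Sketch of the proposed Digest-based lookup. The digest_index mapping
# is hypothetical; a real plugin would need the cache to maintain it.
import base64
import hashlib

def digest_headers_for(body):
    """Compute "Digest: ..." style values (base64-encoded binary digests,
    as in the example response above) for a cached body."""
    return {
        "MD5": base64.b64encode(hashlib.md5(body).digest()).decode(),
        "SHA-256": base64.b64encode(hashlib.sha256(body).digest()).decode(),
    }

def find_by_digest(digest_index, response_digests):
    """Return a cached URL whose stored digest matches one of the
    response's "Digest: ..." values, or None if nothing matches."""
    for algo, value in response_digests.items():
        url = digest_index.get((algo, value))
        if url:
            return url
    return None

# Example: index one cached body, then look it up by its digest.
body = b"example payload"
digest_index = {
    ("SHA-256", digest_headers_for(body)["SHA-256"]):
        "http://mirror.example/pub/file.tar.gz",
}
print(find_by_digest(digest_index, digest_headers_for(body)))
```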

This plugin is motivated by a similar problem to the one in your writeup. We run a caching proxy here at a rural village in Rwanda to improve our slow internet access, but many web sites don't predictably redirect users to the same download mirror, which defeats our cache.

> When you say "we're using Metalink as the mirror list", what do you
> mean?  One annoying item in my setup is the parsing of the HTML mirror
> page - you wouldn't happen to know of a way of retrieving the mirror
> list in XML format?

You can retrieve a Metalink/XML resource that describes where a file is mirrored, in XML format. I think the correct way to *discover* this resource is through a 'Link: <...>; rel=describedby; type="application/metalink4+xml"' header. Can anyone (Anthony?) confirm this?
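Discovery along these lines could be sketched as below (an illustration of the header matching, not plugin code; the example URLs are made up):

```python
# Sketch: find the Metalink/XML URL among a response's Link headers by
# looking for rel=describedby with the metalink4 media type.
import re

def find_metalink(link_headers):
    """Return the first describedby URL with the metalink4 media type,
    or None if the response advertises no Metalink/XML resource."""
    for value in link_headers:
        m = re.match(r'\s*<([^>]+)>(.*)', value)
        if not m:
            continue
        url, params = m.group(1), m.group(2)
        if ("rel=describedby" in params
                and 'type="application/metalink4+xml"' in params):
            return url
    return None

# Example headers (made up): one describedby link, one duplicate link.
headers = [
    '<http://example.org/file.tar.gz.meta4>; rel=describedby; '
    'type="application/metalink4+xml"',
    '<http://mirror.example/file.tar.gz>; rel=duplicate; pri=1',
]
print(find_metalink(headers))
```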

So for example, in the above download.services.openoffice.org example: http://download.services.openoffice.org/files/stable/3.3.0/OOo-SDK_3.3.0_Linux_x86-64_install-deb_en-US.tar.gz.meta4

However, I can't seem to get these same headers from download.opensuse.org. Both download.services.openoffice.org and download.opensuse.org seem to use MirrorBrain, so does anyone know why download.services.openoffice.org responses include a 'Link: <...>; rel=describedby; type="application/metalink4+xml"' header but download.opensuse.org responses do not?

$ curl -D - -o /dev/null -s \
    http://download.opensuse.org/factory/repo/oss/suse/x86_64/CharLS-devel-1.0-2.3.x86_64.rpm
HTTP/1.1 302 Found
Date: Sat, 02 Jun 2012 07:22:30 GMT
Server: Apache/2.2.12 (Linux/SUSE)
X-Prefix: 41.197.0.0/16
X-AS: 36934
X-MirrorBrain-Mirror: ftp5.gwdg.de
X-MirrorBrain-Realm: other_country
Location: 
http://ftp5.gwdg.de/pub/opensuse/factory/repo/oss/suse/x86_64/CharLS-devel-1.0-2.3.x86_64.rpm
Content-Length: 368
Content-Type: text/html; charset=iso-8859-1

$

More information on the "Link: <...>; rel=duplicate" and 'Link: <...>; rel=describedby; type="application/metalink4+xml"' headers is in RFC 6249, Metalink/HTTP: Mirrors and Hashes [2]. More information on the XML format that describes where a file is mirrored is in RFC 5854, The Metalink Download Description Format [3].
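To answer the mirror-list-in-XML question concretely: the RFC 5854 format is straightforward to parse. A minimal sketch, with a made-up sample document (the namespace and the priority/location attributes are from RFC 5854):

```python
# Sketch: extract (priority, url) pairs from a Metalink/XML (.meta4)
# document as defined in RFC 5854.
import xml.etree.ElementTree as ET

NS = {"m": "urn:ietf:params:xml:ns:metalink"}

def mirror_urls(meta4_text):
    """Return a list of (priority, url) pairs, lowest priority first."""
    root = ET.fromstring(meta4_text)
    urls = []
    for url in root.findall("m:file/m:url", NS):
        # RFC 5854: priority ranges 1-999999; treat a missing attribute
        # as lowest preference.
        urls.append((int(url.get("priority", "999999")), url.text))
    return sorted(urls)

# Made-up sample document in the RFC 5854 format.
sample = """<?xml version="1.0" encoding="UTF-8"?>
<metalink xmlns="urn:ietf:params:xml:ns:metalink">
  <file name="example.tar.gz">
    <url priority="1" location="de">http://ftp.halifax.rwth-aachen.de/example.tar.gz</url>
    <url priority="2" location="de">http://ftp5.gwdg.de/pub/example.tar.gz</url>
  </file>
</metalink>"""

for pri, url in mirror_urls(sample):
    print(pri, url)
```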

> Switching off segmented downloading is interesting too, but I wanted an
> environment where the regular openSUSE install process would work with
> zero modifications.  For instance, imagine a student wanting to install
> a PC in the lab - grab the NET-install ISO, copy it to a USB stick and
> boot.  No need to know the proxy, no need to know about a switch for
> segmented downloading, just pop in the USB stick and go with the
> defaults.  Same goes for later updates and additional software - that
> Squid is helping out in the background should be 100% transparent.

I've only considered complete downloads so far, although I can see that segmented downloads will be an issue for our cache too. I'm not sure what the current status of partial-response support in Traffic Server is. I know it is an issue: it comes up on the mailing list fairly regularly, and some improvements to handling partial responses have recently been made.

It would be neat if, once the cache is aware of requests for the same content from different mirrors, and once it is able to cache segmented downloads, it could also be made aware of requests for the same segment from different mirrors. Then, after one client assembled a complete download from segments fetched from possibly many different mirrors, the cache would contain the complete content and could respond to requests from subsequent clients for any segment from any mirror.
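A speculative sketch of that idea: key cached segments by a canonical content identifier rather than by mirror URL, so segments fetched from different mirrors assemble into one entry. The canonical identifier (derived, say, from rel=duplicate or Digest headers) is an assumption here, not an existing Traffic Server feature:

```python
# Speculative sketch: a segment cache keyed by (canonical content id,
# byte range) instead of mirror URL, so segments from different mirrors
# of the same file assemble into one complete body.

class SegmentCache:
    def __init__(self):
        self.segments = {}  # (content_id, first_byte, last_byte) -> bytes

    def store(self, content_id, first, last, data):
        self.segments[(content_id, first, last)] = data

    def assemble(self, content_id, total_length):
        """Return the full body if contiguous segments cover all of
        total_length bytes, else None."""
        parts = sorted((k[1], k[2], v) for k, v in self.segments.items()
                       if k[0] == content_id)
        pos, chunks = 0, []
        for first, last, data in parts:
            if first != pos:
                return None  # gap: the download is not yet complete
            chunks.append(data)
            pos = last + 1
        return b"".join(chunks) if pos == total_length else None

# Example: two segments of one file, fetched from different mirrors,
# assemble under the same (hypothetical) canonical id.
cache = SegmentCache()
cache.store("sha256:abc", 0, 4, b"hello")  # first half, from mirror A
cache.store("sha256:abc", 5, 9, b"world")  # second half, from mirror B
print(cache.assemble("sha256:abc", 10))
```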

Your solution of logging partial downloads and then downloading them completely sounds like a good workaround.

[1] https://github.com/jablko/dedup
[2] http://tools.ietf.org/html/rfc6249
[3] http://tools.ietf.org/html/rfc5854


_______________________________________________
mirrorbrain mailing list
Archive: http://mirrorbrain.org/archive/mirrorbrain/
