Hi Jack, I once created a similar thing, but it requires the "owner" of the file to host the MD5 they think it should be. It then generates a Metalink based on all the MD5/SHA-1/SHA-256 hashes in the database.
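For anyone curious what consuming an owner-hosted MD5SUMS file involves, here is a minimal sketch. It assumes the common coreutils `md5sum` output format ("&lt;hex digest&gt;  &lt;filename&gt;", with an optional `*` for binary mode); the function name and example URL are illustrative, not from dynmirror's actual code.

```python
# Sketch: parse an MD5SUMS file into (md5, absolute URL) pairs,
# the raw material for hash/link records in a database.
# Assumes coreutils-style lines: "<32 hex chars>  <filename>".
from urllib.parse import urljoin

def parse_md5sums(text, base_url):
    """Yield (md5_hex, absolute_url) pairs from an MD5SUMS file."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        digest, _, name = line.partition("  ")
        if len(digest) == 32 and name:
            # md5sum prefixes binary-mode filenames with "*"
            yield digest.lower(), urljoin(base_url, name.lstrip("*"))

sums = "d41d8cd98f00b204e9800998ecf8427e  empty.tar.gz\n"
print(list(parse_md5sums(sums, "http://example.org/downloads/")))
```

A spider would fetch each mirror's MD5SUMS URL periodically and merge the resulting pairs into the hash database.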
The idea is that anybody can step up and start a mirror by hosting the files and an MD5SUMS file, and have the service spider the MD5SUMS file. You can find the service at: http://www.dynmirror.net/

It might be a good idea to join up the databases or collaborate somewhere. Let's see what we can do. For instance, I could add a mintiply URL collection or something like that? Or maybe I could have dynmirror register the hash/link combinations at mintiply? Let me know what you think.

Currently, I think I'm the only user of dynmirror.net (at http://www.logfish.net/pr/ccbuild/downloads/ ). I'd also be happy to dig up and publish the code somewhere if I haven't already.

Greets,
Bram

On Tue, Aug 14, 2012 at 8:30 AM, Jack Bates <[email protected]> wrote:
> Hi, what do you think about a Google App Engine app that generates Metalinks
> for URLs? Maybe something like this already exists?
>
> The first time you visit, e.g.
> http://mintiply.appspot.com/http://apache.osuosl.org/trafficserver/trafficserver-3.2.0.tar.bz2
> it downloads the content and computes a digest. App Engine has *lots* of
> bandwidth, so this is snappy. Then it sends a response with "Digest:
> SHA-256=..." and "Location: ..." headers, similar to MirrorBrain.
>
> It also records the digest in Google's Datastore, so on subsequent visits
> it doesn't download the content or recompute the digest.
>
> Finally, it also checks the Datastore for other URLs with a matching digest,
> and sends a "Link: <...>; rel=duplicate" header for each of these. So if you
> visit, e.g.
> http://mintiply.appspot.com/http://mirror.nexcess.net/apache/trafficserver/trafficserver-3.2.0.tar.bz2
> it sends "Link:
> <http://apache.osuosl.org/trafficserver/trafficserver-3.2.0.tar.bz2>;
> rel=duplicate"
>
> The idea is that this could be useful for sites that don't yet generate
> Metalinks, like SourceForge. You could always prefix a URL that you pass to
> a Metalink client with "http://mintiply.appspot.com/" to get a Metalink.
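[Editor's note: the "Digest: SHA-256=..." header described above is an RFC 3230 instance digest, i.e. the base64 encoding of the raw hash bytes, which is the form MirrorBrain sends. A minimal sketch of building that header value, with a helper name of my own choosing:]

```python
# Sketch: build an RFC 3230 "Digest" header value from content bytes.
# The value is base64 of the *raw* SHA-256 digest, not the hex form.
import base64
import hashlib

def digest_header(content: bytes) -> str:
    raw = hashlib.sha256(content).digest()
    return "SHA-256=" + base64.b64encode(raw).decode("ascii")

print(digest_header(b"hello"))
# → SHA-256=LPJNul+wow4m6DsqxbninhsWHlwfp0JecwQzYpOLmCQ=
```

In a response this would be emitted alongside "Location:" and any "Link: &lt;...&gt;; rel=duplicate" headers.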
> Alternatively, if a Metalink client noticed that it was downloading a large
> file without mirror or hash metadata, it could try to get more mirrors from
> this app while it continued downloading the file. As long as someone else
> had previously tried the same URL, or App Engine can download the file
> faster than the client, it should get more mirrors in time to help
> finish the download. Popular downloads should have the most complete lists
> of mirrors, since those URLs should have been tried the most.
>
> Right now it only downloads a URL once and remembers the digest forever,
> which assumes that the content at the URL never changes. This is true for
> many downloads, but in the future it could respect cache control headers.
>
> Also, right now it only generates HTTP Metalinks with a whole-file digest,
> but in the future it could conceivably generate XML Metalinks with partial
> digests.
>
> A major limitation of this proof of concept is that I ran into some App
> Engine errors with downloads of any significant size, like Ubuntu ISOs. The
> App Engine maximum response size is 32 MB. The app overcomes this with byte
> ranges, downloading files in 32 MB segments. This works on my local
> machine with the App Engine dev server, but in production Google apparently
> kills the process after downloading just a few segments, because it uses too
> much memory. This seems wrong, since the app throws away each segment after
> adding it to the digest. So if it has enough memory to download one segment,
> it shouldn't require any more memory for additional segments. Maybe this
> could be worked around by manually calling the Python garbage collector, or
> by shrinking the segment size...
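[Editor's note: the segmented approach described above — fetch the file in fixed-size byte ranges, feed each segment into a running hash, and discard it so memory stays bounded at one segment — can be sketched outside App Engine like this. `urllib` stands in for App Engine's urlfetch, the function names are illustrative, and the sketch assumes the server honors Range requests with 206 responses:]

```python
# Sketch: compute SHA-256 of a remote file in fixed-size Range segments,
# keeping at most one segment in memory at a time.
import hashlib
import urllib.request
from urllib.error import HTTPError

SEGMENT = 32 * 1024 * 1024  # App Engine's 32 MB response limit

def http_range(url, start, end):
    """Fetch bytes [start, end] of url; empty bytes once past EOF.
    Assumes the server honors Range requests."""
    req = urllib.request.Request(
        url, headers={"Range": "bytes=%d-%d" % (start, end)})
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.read()
    except HTTPError as e:
        if e.code == 416:  # requested range starts past end of file
            return b""
        raise

def ranged_sha256(url, fetch=http_range, segment=SEGMENT):
    digest = hashlib.sha256()
    offset = 0
    while True:
        chunk = fetch(url, offset, offset + segment - 1)
        if not chunk:
            break
        digest.update(chunk)  # the segment itself is discarded after this
        offset += len(chunk)
        if len(chunk) < segment:
            break  # short read: end of file
    return digest.hexdigest()
```

Because `chunk` is rebound on every iteration and only the 32-byte hash state persists, peak memory should stay at roughly one segment regardless of file size — which is why the production kills Jack describes look like an App Engine quirk rather than a flaw in the approach.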
>
> Also I ran into a second bug with App Engine URL Fetch and downloads of any
> significant size:
> http://code.google.com/p/googleappengine/issues/detail?id=7732#c6
>
> Another thought is whether any web crawlers already maintain a database of
> digests that an app like this could exploit?
>
> Here is the code:
> https://github.com/jablko/mintiply/blob/master/mintiply.py
>
> What are your thoughts? Maybe something like this already exists, or was
> already tried in the past...
>
> --
> You received this message because you are subscribed to the Google Groups
> "Metalink Discussion" group.
> To view this discussion on the web visit
> https://groups.google.com/d/msg/metalink-discussion/-/r7cq8sL0LuMJ.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/metalink-discussion?hl=en.
