A single page export will not work, for sure, but in any case I was thinking about moving data out of dynmirror and into mintiply.
For example, if you don't want to download the complete file before you
have a metalink, you could check
http://www.dynmirror.net/metalink/?url=http://example.com to see if
dynmirror has any metalink information. You could use dynmirror as a
kind of caching backend for downloads.

Another thing I could do is have dynmirror redirect to mintiply if there
is no hash information available; maybe that would be a good approach...
I'm not really sure it would add anything, but technically it should be
possible, and I think it might be good to get some code commits on
dynmirror anyway ;)

Greets,

Bram

On Sun, Aug 19, 2012 at 9:58 AM, Jack Bates <[email protected]> wrote:
> On Thursday, August 16, 2012 10:44:19 PM UTC-7, Jack Bates wrote:
>>
>> On Tuesday, August 14, 2012 1:58:22 PM UTC-7, Bram Neijt wrote:
>>>
>>> Hi Jack,
>>>
>>> I once created a similar thing, but it required the "owner" of the
>>> file to host the MD5 he/she thinks it should be. It then generates a
>>> metalink based on all the MD5/SHA-1/SHA-256 hashes in the database.
>>>
>>> The idea is that anybody can step up and start a mirror by hosting
>>> the files and the MD5SUMS and having the service spider the MD5SUMS
>>> file.
>>>
>>> You can find the service at: http://www.dynmirror.net/
>>
>> Cool! The design of this site is impressive. I like how it shows
>> analytics, like recent downloads, on the front page.
>>
>>> It might be a good idea to join up the databases or do some
>>> collaboration somewhere. Let's see what we can do. For instance, I
>>> could add a mintiply URL collection or something like that? Or maybe
>>> I could have dynmirror register the hash/link combinations at
>>> mintiply?
>>
>> Great idea, thanks for suggesting it. The first thing that comes to
>> mind is: how would you like to get data out of Mintiply (and into
>> Dynmirror)? Is there an API that Mintiply could provide that would
>> make this as easy as possible?
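Bram's idea of checking a metalink service before (or while) downloading a file comes down to fetching a service URL and reading the response headers described in this thread ("Digest: SHA-256=..." and "Link: <...>; rel=duplicate"). Here is a minimal, hypothetical sketch of the client-side header parsing; the helper function is an illustration, not part of mintiply or dynmirror:

```python
MINTIPLY_PREFIX = "http://mintiply.appspot.com/"

def parse_metalink_headers(headers):
    """Extract the whole-file digest and any duplicate-URL mirrors from
    a list of (name, value) response header pairs.

    Returns (algorithm, base64_digest, [mirror_url, ...]).
    """
    algorithm = digest = None
    mirrors = []
    for name, value in headers:
        if name.lower() == "digest":
            # "Digest: SHA-256=<base64>": partition on the first "="
            # only, because the base64 value may itself end in "="
            # padding characters.
            algorithm, _, digest = value.partition("=")
        elif name.lower() == "link" and "rel=duplicate" in value:
            # Header form: Link: <http://mirror.example/file>; rel=duplicate
            mirrors.append(value.split(">", 1)[0].lstrip("<"))
    return algorithm, digest, mirrors
```

A client could prefix the original URL with MINTIPLY_PREFIX, issue a HEAD request, and pass the response's header pairs to this helper; if a digest comes back, the download can be verified, and the mirror list used, without waiting for a separate Metalink file.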
>
> Hi Bram, and thanks again for inviting me to collaborate,
>
> As an experiment, I just added a page to export all of the data from
> Mintiply, in Metalink format. Let me know what you think. Could this
> be useful to a project like Dynmirror? Or would you prefer a different
> format, or different data?
>
> There isn't much data in the app yet, so dumping everything in one
> Metalink response works fine. If the amount of data ever gets large,
> we may need to rethink this.
>
> Here is the page: http://mintiply.appspot.com/export
>
>>> Let me know what you think. Currently, I think I'm the only user of
>>> dynmirror.net (at http://www.logfish.net/pr/ccbuild/downloads/ ).
>>>
>>> I'd also be happy to dig up and publish the code somewhere if I
>>> haven't already.
>>>
>>> Greets,
>>>
>>> Bram
>>
>> Thanks very much for inviting me to collaborate.
>>
>>> On Tue, Aug 14, 2012 at 8:30 AM, Jack Bates <[email protected]> wrote:
>>> > Hi, what do you think about a Google App Engine app that generates
>>> > Metalinks for URLs? Maybe something like this already exists?
>>> >
>>> > The first time you visit, e.g.
>>> >
>>> > http://mintiply.appspot.com/http://apache.osuosl.org/trafficserver/trafficserver-3.2.0.tar.bz2
>>> >
>>> > it downloads the content and computes a digest. App Engine has
>>> > *lots* of bandwidth, so this is snappy. Then it sends a response
>>> > with "Digest: SHA-256=..." and "Location: ..." headers, similar to
>>> > MirrorBrain.
>>> >
>>> > It also records the digest with Google's Datastore, so on
>>> > subsequent visits it doesn't download or recompute the digest.
>>> >
>>> > Finally, it also checks the Datastore for other URLs with a
>>> > matching digest, and sends "Link: <...>; rel=duplicate" headers
>>> > for each of these. So if you visit, e.g.
>>> >
>>> > http://mintiply.appspot.com/http://mirror.nexcess.net/apache/trafficserver/trafficserver-3.2.0.tar.bz2
>>> >
>>> > it sends "Link:
>>> > <http://apache.osuosl.org/trafficserver/trafficserver-3.2.0.tar.bz2>;
>>> > rel=duplicate".
>>> >
>>> > The idea is that this could be useful for sites that don't yet
>>> > generate Metalinks, like SourceForge. You could always prefix a
>>> > URL that you pass to a Metalink client with
>>> > "http://mintiply.appspot.com/" to get a Metalink. Alternatively,
>>> > if a Metalink client noticed that it was downloading a large file
>>> > without mirror or hash metadata, it could try to get more mirrors
>>> > from this app while it continued downloading the file. As long as
>>> > someone else had previously tried the same URL, or App Engine can
>>> > download the file faster than the client, it should get more
>>> > mirrors in time to help finish the download. Popular downloads
>>> > should have the most complete list of mirrors, since these URLs
>>> > should have been tried the most.
>>> >
>>> > Right now it only downloads a URL once, and remembers the digest
>>> > forever, which assumes that the content at the URL never changes.
>>> > This is true for many downloads, but in future it could respect
>>> > cache control headers.
>>> >
>>> > Also, right now it only generates HTTP Metalinks with a whole-file
>>> > digest, but in future it could conceivably generate XML Metalinks
>>> > with partial digests.
>>> >
>>> > A major limitation of this proof of concept is that I ran into
>>> > some App Engine errors with downloads of any significant size,
>>> > like Ubuntu ISOs. The App Engine maximum response size is 32 MB.
>>> > The app overcomes this with byte ranges, downloading files in
>>> > 32 MB segments.
>>> > This works on my local machine with the App Engine dev server,
>>> > but in production Google apparently kills the process after
>>> > downloading just a few segments, because it uses too much memory.
>>> > This seems wrong, since the app throws away each segment after
>>> > adding it to the digest. So if it has enough memory to download
>>> > one segment, it shouldn't require any more memory for additional
>>> > segments. Maybe this could be worked around by manually calling
>>> > the Python garbage collector, or by shrinking the segment size...
>>> >
>>> > I also ran into a second bug with App Engine URL Fetch and
>>> > downloads of any significant size:
>>> > http://code.google.com/p/googleappengine/issues/detail?id=7732#c6
>>> >
>>> > Another thought: do any web crawlers already maintain a database
>>> > of digests that an app like this could exploit?
>>> >
>>> > Here is the code:
>>> > https://github.com/jablko/mintiply/blob/master/mintiply.py
>>> >
>>> > What are your thoughts? Maybe something like this already exists,
>>> > or was already tried in the past...
>>> >
>>> > --
>>> > You received this message because you are subscribed to the
>>> > Google Groups "Metalink Discussion" group.
>>> > To view this discussion on the web visit
>>> > https://groups.google.com/d/msg/metalink-discussion/-/r7cq8sL0LuMJ.
>>> > To post to this group, send email to
>>> > [email protected].
>>> > To unsubscribe from this group, send email to
>>> > [email protected].
>>> > For more options, visit this group at
>>> > http://groups.google.com/group/metalink-discussion?hl=en.
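The segmented-download approach Jack describes (fetch 32 MB byte ranges, fold each segment into a running SHA-256, then throw the segment away, so peak memory stays at roughly one segment regardless of file size) can be sketched as follows. This is an illustration, not the actual mintiply code: plain urllib stands in for App Engine's URL Fetch, and the pluggable `opener` parameter is an addition for testing.

```python
import hashlib
import urllib.request

SEGMENT_SIZE = 32 * 1024 * 1024  # App Engine's maximum response size

def segmented_sha256(url, segment_size=SEGMENT_SIZE,
                     opener=urllib.request.urlopen):
    """Compute a whole-file SHA-256 one byte range at a time."""
    digest = hashlib.sha256()
    offset = 0
    while True:
        request = urllib.request.Request(url, headers={
            "Range": "bytes=%d-%d" % (offset, offset + segment_size - 1)})
        with opener(request) as response:
            segment = response.read()
        if not segment:
            break                        # server returned an empty range
        digest.update(segment)           # fold the segment into the hash
        offset += len(segment)           # ...then drop the reference
        if len(segment) < segment_size:  # short read means end of file
            break
    # Note: a production version would also handle a 416 response, which
    # a strict server sends when the file size is an exact multiple of
    # the segment size and the final range starts past the end.
    return digest.hexdigest()
```

Because only one segment is referenced at a time, memory use should stay flat, which is exactly why the production kills Jack observed seem wrong.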
