Hi everybody,

I took the time to look up my code and found out I never published dynmirror.net.
The code is now online at https://github.com/bneijt/dynmirror.net

I'll still have to publish correct licensing information etc., and find a good way to clean up having jinja2 in the git repo as well, but as I have a few other projects going on I don't think I'll get to that any time soon.

If you have any questions regarding the code, feel free to mail me directly.

Greets,

Bram

On Tue, Aug 21, 2012 at 7:45 AM, Jack Bates <[email protected]> wrote:
> On Sunday, August 19, 2012 2:15:46 PM UTC-7, Bram Neijt wrote:
>>
>> A single page export will not work, for sure, but with that in mind I was
>> thinking about moving data out of dynmirror to mintiply.
>>
>> For example, if you don't want to download the complete file before
>> you have a metalink, you could check at
>> http://www.dynmirror.net/metalink/?url=http://example.com
>> to see if dynmirror has any metalink information. You could use
>> dynmirror as a kind of caching backend for downloads.
>>
>> Another thing I could do is have dynmirror redirect to mintiply if
>> there is no hash information available; maybe that would be a good
>> approach...
>>
>> I'm not really sure it would add anything, but technically it should
>> be possible and I think it might be good to get some code commits on
>> dynmirror anyway ;)
>
> That sounds like a good idea. Please let me know if there's anything I can
> do to help with this.
>
> Cheers
>
>> Greets,
>>
>> Bram
>>
>> On Sun, Aug 19, 2012 at 9:58 AM, Jack Bates <[email protected]> wrote:
>> > On Thursday, August 16, 2012 10:44:19 PM UTC-7, Jack Bates wrote:
>> >>
>> >> On Tuesday, August 14, 2012 1:58:22 PM UTC-7, Bram Neijt wrote:
>> >>>
>> >>> Hi Jack,
>> >>>
>> >>> I once created a similar thing, but it required the "owner" of the
>> >>> file to host the MD5 he/she thinks it should be. It then generates a
>> >>> metalink based on all the md5/sha1/sha256 hashes in the database.
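The pre-download check suggested in the quoted message above could look like the following minimal sketch. The `/metalink/?url=...` endpoint is the one quoted in the thread; the helper names, and the assumption that a miss surfaces as an HTTP error, are hypothetical.

```python
import urllib.error
import urllib.parse
import urllib.request

# The /metalink/?url=... endpoint comes from the message above;
# everything else here is a hypothetical sketch.
DYNMIRROR_ENDPOINT = "http://www.dynmirror.net/metalink/"

def metalink_lookup_url(download_url):
    """Build the dynmirror query URL for a given download URL."""
    return DYNMIRROR_ENDPOINT + "?" + urllib.parse.urlencode({"url": download_url})

def fetch_metalink(download_url):
    """Ask dynmirror whether it already has metalink information before
    committing to a full download. Assumes (hypothetically) that
    "no metalink known" surfaces as an HTTP error status."""
    try:
        with urllib.request.urlopen(metalink_lookup_url(download_url)) as response:
            return response.read()  # the metalink document, if any
    except urllib.error.HTTPError:
        return None  # fall back to a plain download
```

A client would call `fetch_metalink` first and only start a plain download when it returns `None`, which is the "caching backend for downloads" use suggested above.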
>> >>>
>> >>> The idea is that anybody can step up and start a mirror by hosting the
>> >>> files and the MD5SUMS and have the service spider the MD5SUMS file.
>> >>>
>> >>> You can find the service at: http://www.dynmirror.net/
>> >>
>> >> Cool! The design of this site is impressive. I like how it shows
>> >> analytics, like recent downloads, on the front page.
>> >>
>> >>> It might be a good idea to join up the databases or do some
>> >>> collaboration somewhere. Let's see what we can do. For instance, I
>> >>> could add a mintiply URL collection or something like that? Or maybe I
>> >>> could have dynmirror register the hash/link combinations at mintiply?
>> >>
>> >> Great idea, thanks for suggesting it. The first thing that comes to
>> >> mind is: how would you like to get data out of Mintiply (and into
>> >> Dynmirror)? Is there an API that Mintiply could provide that would
>> >> make this as easy as possible?
>> >
>> > Hi Bram, and thanks again for inviting me to collaborate,
>> >
>> > As an experiment, I just added a page to export all of the data from
>> > Mintiply, in Metalink format. Let me know what you think. Could this be
>> > useful to a project like Dynmirror? Or would you prefer a different
>> > format, or different data?
>> >
>> > There isn't much data in the app yet, so dumping everything in one
>> > Metalink response works fine. If the amount of data ever gets large,
>> > we may need to rethink this.
>> >
>> > Here is the page: http://mintiply.appspot.com/export
>> >
>> >>> Let me know what you think. Currently, I think I'm the only user of
>> >>> dynmirror.net (at http://www.logfish.net/pr/ccbuild/downloads/ ).
>> >>>
>> >>> I'd also be happy to dig up and publish the code somewhere if I haven't
>> >>> already.
>> >>>
>> >>> Greets,
>> >>>
>> >>> Bram
>> >>
>> >> Thanks very much for inviting me to collaborate.
>> >>
>> >>> On Tue, Aug 14, 2012 at 8:30 AM, Jack Bates <[email protected]> wrote:
>> >>> > Hi, what do you think about a Google App Engine app that generates
>> >>> > Metalinks for URLs? Maybe something like this already exists?
>> >>> >
>> >>> > The first time you visit, e.g.
>> >>> > http://mintiply.appspot.com/http://apache.osuosl.org/trafficserver/trafficserver-3.2.0.tar.bz2
>> >>> > it downloads the content and computes a digest. App Engine has *lots*
>> >>> > of bandwidth, so this is snappy. Then it sends a response with
>> >>> > "Digest: SHA-256=..." and "Location: ..." headers, similar to
>> >>> > MirrorBrain.
>> >>> >
>> >>> > It also records the digest with Google's Datastore, so on subsequent
>> >>> > visits, it doesn't download or recompute the digest.
>> >>> >
>> >>> > Finally, it also checks the Datastore for other URLs with a matching
>> >>> > digest, and sends "Link: <...>; rel=duplicate" headers for each of
>> >>> > these. So if you visit, e.g.
>> >>> > http://mintiply.appspot.com/http://mirror.nexcess.net/apache/trafficserver/trafficserver-3.2.0.tar.bz2
>> >>> > it sends "Link:
>> >>> > <http://apache.osuosl.org/trafficserver/trafficserver-3.2.0.tar.bz2>;
>> >>> > rel=duplicate"
>> >>> >
>> >>> > The idea is that this could be useful for sites that don't yet
>> >>> > generate Metalinks, like SourceForge. You could always prefix a URL
>> >>> > that you pass to a Metalink client with "http://mintiply.appspot.com/"
>> >>> > to get a Metalink. Alternatively, if a Metalink client noticed that
>> >>> > it was downloading a large file without mirror or hash metadata, it
>> >>> > could try to get more mirrors from this app, while it continued
>> >>> > downloading the file.
>> >>> > As long as someone else had previously tried the same URL, or App
>> >>> > Engine can download the file faster than the client, then it should
>> >>> > get more mirrors in time to help finish the download. Popular
>> >>> > downloads should have the most complete list of mirrors, since these
>> >>> > URLs should have been tried the most.
>> >>> >
>> >>> > Right now it only downloads a URL once, and remembers the digest
>> >>> > forever, which assumes that the content at the URL never changes.
>> >>> > This is true for many downloads, but in future it could respect
>> >>> > cache control headers.
>> >>> >
>> >>> > Also right now it only generates HTTP Metalinks with a whole file
>> >>> > digest. But in future it could conceivably generate XML Metalinks
>> >>> > with partial digests.
>> >>> >
>> >>> > A major limitation with this proof of concept is that I ran into
>> >>> > some App Engine errors with downloads of any significant size, like
>> >>> > Ubuntu ISOs. The App Engine maximum response size is 32 MB. The app
>> >>> > overcomes this with byte ranges and downloading files in 32 MB
>> >>> > segments. This works on my local machine with the App Engine dev
>> >>> > server, but in production Google apparently kills the process after
>> >>> > downloading just a few segments, because it uses too much memory.
>> >>> > This seems wrong, since the app throws away each segment after
>> >>> > adding it to the digest. So if it has enough memory to download one
>> >>> > segment, it shouldn't require any more memory for additional
>> >>> > segments. Maybe this could be worked around by manually calling the
>> >>> > Python garbage collector, or by shrinking the segment size...
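The segment-by-segment digest described in the quoted message above can be sketched as follows. This is a minimal illustration, not the actual mintiply code: the helper names are hypothetical, the sketch assumes the server honours Range requests, and the header value follows the RFC 3230 base64 form that MirrorBrain-style "Digest: SHA-256=..." headers use.

```python
import base64
import hashlib
import urllib.error
import urllib.request

SEGMENT_SIZE = 32 * 1024 * 1024  # App Engine's 32 MB response limit

def range_segments(url, segment_size=SEGMENT_SIZE):
    """Yield a remote file as byte-range segments of at most
    segment_size bytes, so no whole copy is ever held in memory.
    Assumes the server honours Range requests."""
    offset = 0
    while True:
        request = urllib.request.Request(url, headers={
            "Range": "bytes=%d-%d" % (offset, offset + segment_size - 1)})
        try:
            with urllib.request.urlopen(request) as response:
                segment = response.read()
        except urllib.error.HTTPError as error:
            if error.code == 416:  # requested a range past end of file
                return
            raise
        if not segment:
            return
        yield segment
        if len(segment) < segment_size:
            return  # short segment: end of file
        offset += segment_size

def sha256_of_segments(segments):
    """Fold segments into one whole-file SHA-256; each segment can be
    discarded (and garbage collected) as soon as it has been added to
    the digest, so memory use stays at one segment."""
    digest = hashlib.sha256()
    for segment in segments:
        digest.update(segment)
    return digest

def digest_header(digest):
    """Format the digest as an RFC 3230 style header value: the base64
    of the binary digest, not the hex form."""
    return "SHA-256=" + base64.b64encode(digest.digest()).decode("ascii")
```

A response would then carry `"Digest: " + digest_header(sha256_of_segments(range_segments(url)))`, and the design keeps peak memory at one segment regardless of file size, which is exactly the property the production kills described above seem to violate.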
>> >>> >
>> >>> > Also I ran into a second bug with App Engine URL Fetch and
>> >>> > downloads of any significant size:
>> >>> > http://code.google.com/p/googleappengine/issues/detail?id=7732#c6
>> >>> >
>> >>> > Another thought is whether any web crawlers already maintain a
>> >>> > database of digests that an app like this could exploit?
>> >>> >
>> >>> > Here is the code:
>> >>> > https://github.com/jablko/mintiply/blob/master/mintiply.py
>> >>> >
>> >>> > What are your thoughts? Maybe something like this already exists,
>> >>> > or was already tried in the past...
>> >>> >
>> >>> > --
>> >>> > You received this message because you are subscribed to the Google
>> >>> > Groups "Metalink Discussion" group.
>> >>> > To view this discussion on the web visit
>> >>> > https://groups.google.com/d/msg/metalink-discussion/-/r7cq8sL0LuMJ.
>> >>> > To post to this group, send email to [email protected].
>> >>> > To unsubscribe from this group, send email to
>> >>> > [email protected].
>> >>> > For more options, visit this group at
>> >>> > http://groups.google.com/group/metalink-discussion?hl=en.
