> 
> Another thought is whether any web crawlers already maintain a database of
> digests that an app like this could exploit?
> 
> Here is the code:
> https://github.com/jablko/mintiply/blob/master/mintiply.py
> 
> What are your thoughts? Maybe something like this already exists, or was
> already tried in the past...

I've written a crawler that collects .metalink files.  It's pretty dumb but
it gets the job done.  The code is available here:

http://metalinks.svn.sourceforge.net/viewvc/metalinks/crawler/

You can see the results here:

http://www.nabber.org/projects/metalink/crawler/list.php

I imagine it wouldn't be hard to modify it so that, instead of just grabbing
the .metalink files, it parses them and dumps the results into your
database.  One advantage of this method is that URLs that are now dead are
still listed in the .metalink files alongside their mirrors, so your
AppEngine code could recognize a dead URL and redirect a "dumb" browser to
a working download location instead.
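To illustrate the parsing step, here is a minimal sketch in Python.  It
assumes Metalink 3.0 files (namespace http://www.metalinker.org/); Metalink 4
uses a different namespace and layout and would need its own paths.  The
names parse_metalink, index_by_digest and SAMPLE are made up for this
example, the sample digest and mirror URLs are placeholders, and the
"database" is just a dict standing in for whatever store you use.

```python
import xml.etree.ElementTree as ET

# Assumption: Metalink 3.0 documents. Metalink 4 uses another namespace.
NS = {"ml": "http://www.metalinker.org/"}

# Placeholder document for illustration only (not a real download).
SAMPLE = """<metalink version="3.0" xmlns="http://www.metalinker.org/">
  <files>
    <file name="example.iso">
      <verification>
        <hash type="sha1">A9993E364706816ABA3E25717850C26C9CD0D89D</hash>
      </verification>
      <resources>
        <url type="http">http://mirror1.example.org/example.iso</url>
        <url type="ftp">ftp://mirror2.example.org/example.iso</url>
      </resources>
    </file>
  </files>
</metalink>"""


def parse_metalink(xml_text):
    """Return one record per <file>: its name, digests and mirror URLs."""
    root = ET.fromstring(xml_text)
    records = []
    for f in root.findall("ml:files/ml:file", NS):
        # Digests are lowercased so lookups are case-insensitive.
        hashes = {
            h.get("type"): (h.text or "").strip().lower()
            for h in f.findall("ml:verification/ml:hash", NS)
        }
        urls = [
            (u.text or "").strip()
            for u in f.findall("ml:resources/ml:url", NS)
        ]
        records.append({"name": f.get("name"), "hashes": hashes, "urls": urls})
    return records


def index_by_digest(records, hash_type="sha1"):
    """Map digest -> list of URLs, a stand-in for the real database."""
    index = {}
    for rec in records:
        digest = rec["hashes"].get(hash_type)
        if digest:
            index.setdefault(digest, []).extend(rec["urls"])
    return index


records = parse_metalink(SAMPLE)
index = index_by_digest(records)
```

With an index like this, a request for one URL can be matched to its digest
and then to every other mirror recorded for the same file, which is what
would make the redirect away from a dead link possible.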

As for a hash database, I've been researching options for my Appupdater
project.  There are some hash search sites out there, but I don't think
they will be useful in this case since I haven't seen any that track URLs;
it's usually just file size, version, product name, etc.  There seem to be
plenty of datasets out there for installers from the various download
websites, like sourceforge.net, softpedia, oldapps.com, etc.  However, from
what I can tell there is no way to download a database from any of these;
you'd have to parse the individual web pages.  While possible, that doesn't
seem to be a very efficient way of doing things, since you'd need to
customize it for each website.  Probably the better and easier way is to
build a crawler for .exe, .msi, etc. files that downloads each file and
computes your own hashes.  It will take a lot of time and bandwidth, but
you'd get a really good dataset that way.  In other words, have a crawler
that feeds your AppEngine code URLs to process.
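The "download and compute your own hashes" step could look roughly like the
sketch below.  It is only an illustration: hash_chunks and hash_url are
hypothetical names, the choice of md5/sha1/sha256 is an assumption, and a
real crawler would also need URL discovery, retries and politeness limits.
The download is streamed in chunks so a large installer never has to fit in
memory, and all digests are computed in a single pass.

```python
import hashlib
import urllib.request


def hash_chunks(chunks, algorithms=("md5", "sha1", "sha256")):
    """Feed an iterable of byte chunks to several hashers in one pass."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    size = 0
    for chunk in chunks:
        size += len(chunk)
        for h in hashers.values():
            h.update(chunk)
    digests = {name: h.hexdigest() for name, h in hashers.items()}
    digests["size"] = size  # total bytes seen, handy as a sanity check
    return digests


def hash_url(url, chunk_size=1 << 20, timeout=30):
    """Stream a URL and hash it without holding the whole file in memory."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        # iter(callable, sentinel) keeps reading until read() returns b"".
        return hash_chunks(iter(lambda: resp.read(chunk_size), b""))
```

Calling hash_url on each URL the crawler finds would give a dict of digests
plus the file size, which can then be stored next to the URL.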

Neil
