> Another thought is whether any web crawlers already maintain a database of
> digests that an app like this could exploit?
>
> Here is the code:
> https://github.com/jablko/mintiply/blob/master/mintiply.py
>
> What are your thoughts? Maybe something like this already exists, or was
> already tried in the past...

I've written a metalink crawler for .metalink files. It's pretty dumb but it gets the job done. The code is available here:
http://metalinks.svn.sourceforge.net/viewvc/metalinks/crawler/

You can see the results here:
http://www.nabber.org/projects/metalink/crawler/list.php

I imagine it wouldn't be hard to modify it so that, instead of just grabbing the .metalink files, it parses them and dumps the contents into your database. One advantage of this method is that URLs that are now dead are still captured in the .metalink files, so your AppEngine code could detect a dead link and redirect a "dumb" browser to a working download location instead.

As for a hash database, I've been researching options for my Appupdater project. There are some hash-search sites out there, but I don't think they will be useful in this case, since I haven't seen any that track URLs. They usually track just file size, version, product name, etc.

There seem to be plenty of datasets out there for installers from the various download websites, like sourceforge.net, softpedia, oldapps.com, etc. However, from what I can tell there is no way to download a database from any of these; you'd have to parse the individual web pages. While possible, that doesn't seem to be a very efficient way of doing things, since you'd need to customize the parser for each website.

Actually, the better and easier way is probably to build a crawler for .exe, .msi, etc. files, download each file, and compute your own hashes. It will take a lot of time and bandwidth, but you'd get a really good dataset that way. In other words, have a crawler that feeds your AppEngine code URLs to process.

Neil
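To make the "parse the .metalink files and dump them into your database" idea concrete, here is a minimal sketch in Python using only the standard library. It is not taken from the crawler linked above. The function names, the record layout, and the sample document (including the example.com mirrors) are made up for illustration, and it assumes the files follow the usual Metalink layout of `<file>` elements containing `<hash type="...">` and `<url>` children.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def parse_metalink(xml_text):
    """Return one record per <file> element: name, hashes, and URLs."""
    root = ET.fromstring(xml_text)
    records = []
    for elem in root.iter():
        if local_name(elem.tag) != "file":
            continue
        record = {"name": elem.get("name"), "hashes": {}, "urls": []}
        for child in elem.iter():
            name = local_name(child.tag)
            text = (child.text or "").strip()
            # Whole-file hashes carry a type attribute; per-piece hashes
            # do not, so they are skipped here.
            if name == "hash" and child.get("type") and text:
                record["hashes"][child.get("type")] = text.lower()
            elif name == "url" and text:
                record["urls"].append(text)
        records.append(record)
    return records


# Hypothetical sample document, for illustration only.
SAMPLE = """<metalink version="3.0" xmlns="http://www.metalinker.org/">
  <files>
    <file name="example-1.0.exe">
      <verification>
        <hash type="sha1">DA39A3EE5E6B4B0D3255BFEF95601890AFD80709</hash>
      </verification>
      <resources>
        <url type="http">http://mirror1.example.com/example-1.0.exe</url>
        <url type="ftp">ftp://mirror2.example.com/example-1.0.exe</url>
      </resources>
    </file>
  </files>
</metalink>
"""

if __name__ == "__main__":
    for record in parse_metalink(SAMPLE):
        print(record)
```

Each record keeps every mirror URL next to the file's hashes, so a lookup by any one URL (dead or alive) can return the others as redirect candidates.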
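For the "download the file and compute your own hashes" approach, the hashing step might look roughly like the sketch below. The `hash_stream` helper and its parameters are hypothetical names, not part of mintiply or Appupdater. It reads any binary file-like object in chunks so a large installer never has to sit in memory, and it computes several digests in a single pass.

```python
import hashlib
import io


def hash_stream(stream, algorithms=("md5", "sha1", "sha256"), chunk_size=64 * 1024):
    """Read a binary stream in chunks; return (size, {algorithm: hex digest})."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    size = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        size += len(chunk)
        # Feed the same chunk to every hasher so the data is read only once.
        for hasher in hashers.values():
            hasher.update(chunk)
    return size, {name: h.hexdigest() for name, h in hashers.items()}


if __name__ == "__main__":
    # Stand-in for a downloaded installer. A real crawler might instead pass
    # the response object returned by urllib.request.urlopen(url).
    size, digests = hash_stream(io.BytesIO(b"hello world"))
    print(size, digests["sha1"])
```

Recording the size alongside the digests costs nothing extra and gives a cheap first check before comparing hashes.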
