> > I did find data on the internet archive which looks like amazon crawls,
> > but not sure how new/old it is and how and what was parsed into ol already.

Another source of data is Common Crawl. They recently published a partial index of the URLs in their data, suitable for easy extraction:

http://commoncrawl.org/common-crawl-url-index/

And here's a worked-out example of using it:

http://commoncrawl.org/analysis-of-the-ncsu-library-urls-in-the-common-crawl-index/

At some point soonish Common Crawl will issue a complete URL index, and will also host a URL index of blekko's crawl in a similar format.

I can give you access to pull pages out of blekko's cache. At this point we don't know who has more Amazon pages or how much the two crawls overlap, but at some point we will.

-- greg (CTO, blekko)
