Re: [ol-tech] [ol-discuss] Amazon crawler code for open library

Greg Lindahl Sat, 16 Feb 2013 18:33:47 -0800

> > I did find data on the internet archive which looks like amazon crawls,
> > but not sure how new/old it is and how and what was parsed into ol already.


Another source of data is Common Crawl -- they recently published a
partial index of the URLs in their data, suitable for easy extraction:

http://commoncrawl.org/common-crawl-url-index/

And here's a worked-out example of using it:

http://commoncrawl.org/analysis-of-the-ncsu-library-urls-in-the-common-crawl-index/

At some point soonish Common Crawl will issue a complete URL index,
and will also be hosting a URL index, in a similar format, of blekko's
crawl. I can give you access to pull pages out of blekko's cache. At
this point we don't know who has more Amazon pages or how much the two
crawls overlap, but at some point we will.

-- greg (CTO, blekko)

_______________________________________________
Ol-tech mailing list
[email protected]
http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
To unsubscribe from this mailing list, send email to 
[email protected]
