Sorry Ryan,

I should have clarified that I am using Nutch as my crawler. There is a script for Nutch that does whole-web crawling, but it is not compatible with Hadoop.


Eric Osgood
---------------------------------------------
Cal Poly - Computer Engineering
Moon Valley Software
---------------------------------------------
eosg...@calpoly.edu
e...@lakemeadonline.com
---------------------------------------------
www.calpoly.edu/eosgood
www.lakemeadonline.com

On Oct 6, 2009, at 12:24 PM, Ryan Smith wrote:

This isn't a script per se, but it may help.

http://code.google.com/p/hbase-writer

It's a plugin for the Heritrix 2 web crawler that writes crawled site data to HBase tables, which run on Hadoop. Each URL is written as a row key in the HBase
table.

HTH,
-Ryan

On Tue, Oct 6, 2009 at 3:02 PM, Eric <e...@lakemeadonline.com> wrote:

Has anyone written a script for whole-web crawling using Hadoop? The script for Nutch doesn't work because the data is inside HDFS (tail -f won't work
with this).

Thanks,

Eric

