Sorry Ryan,
I should have clarified that I am using Nutch as my crawler. There is
a script for Nutch to do whole-web crawling, but it is not compatible
with Hadoop.
Eric Osgood
---------------------------------------------
Cal Poly - Computer Engineering
Moon Valley Software
---------------------------------------------
eosg...@calpoly.edu
e...@lakemeadonline.com
---------------------------------------------
www.calpoly.edu/eosgood
www.lakemeadonline.com
On Oct 6, 2009, at 12:24 PM, Ryan Smith wrote:
This isn't a script per se, but this may help.
http://code.google.com/p/hbase-writer
It's a plugin for the Heritrix 2 web crawler that writes crawled site data to
HBase tables, which run on Hadoop. Each URL is written as a row key in the
HBase table.
HTH,
-Ryan
On Tue, Oct 6, 2009 at 3:02 PM, Eric <e...@lakemeadonline.com> wrote:
Has anyone written a script for whole-web crawling using Hadoop? The script
for Nutch doesn't work since the data is inside HDFS (tail -f won't work
with this).
Thanks,
Eric