:) a bit misleading....

First: Hadoop evolved from the "Nutch Distributed File System" (NDFS).

It is based on Google's file system (GFS) and lets you keep all your data in a
distributed file system, which suits Nutch very well.

Wherever you see bin/nutch ndfs -ls, write bin/hadoop dfs -ls instead.

Now to create the seeds:

Create the urls.txt file in a folder called seeds, i.e. seeds/urls.txt, then run:

bin/hadoop dfs -put seeds seeds

This copies the seeds folder into the Hadoop file system.
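A minimal sketch of the seed setup. The seed URL here is just a hypothetical example; substitute whatever pages you actually want to crawl. The Hadoop commands need a running Hadoop setup, so they are shown commented out:

```shell
# Create the local seed folder and list (example URL, replace with your own)
mkdir -p seeds
echo "http://lucene.apache.org/nutch/" > seeds/urls.txt

# Copy the folder into the Hadoop file system and check that it arrived:
#   bin/hadoop dfs -put seeds seeds
#   bin/hadoop dfs -ls
```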

And now run the crawl:

bin/nutch crawl seeds -dir crawled -depth 3 >& crawl.log
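One note on the redirection: ">&" sends both stdout and stderr to crawl.log, but that form is csh/bash syntax. In a plain POSIX shell the equivalent is explicit. A small sketch, using echo as a stand-in for the crawl command:

```shell
# '>& file' (csh/bash) == '> file 2>&1' (POSIX sh): both streams go to the log.
# 'echo' stands in for the long-running bin/nutch crawl here.
sh -c 'echo "fetching page"; echo "some warning" >&2' > crawl.log 2>&1

# Both lines now land in crawl.log
cat crawl.log
```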

Happy crawling.

Gal.


On Wed, 2006-02-22 at 01:05 -0800, Foong Yie wrote:
> matt
> 
> as the tutorial stated ..
> 
> bin/nutch crawl urls -dir crawled -depth 3 >& crawl.log
> 
> the urls file is in .txt, right? i created it and put it inside c:/nutch-0.7.1
> 
> Stephanie
> 




_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
