Boss,
----- Original Message ----
> From: Jason Boss <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Monday, April 21, 2008 6:42:28 PM
> Subject: hadoop
>
> Hi guys,
>
> Working on getting hadoop up and running. I have the master up and
> running. At this point I am not sure what the next steps are. I am
> trying to do a whole web search.

Whole web? What for, if not a secret?

> [EMAIL PROTECTED] search]# bin/nutch generate crawl/crawldb

Suggestion: don't run things as root.

> crawl/segments -topN 10000
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl/segments/20080421151646
> Generator: filtering: true
> Generator: topN: 10000
> Generator: org.apache.hadoop.mapred.InvalidInputException: Input path
> doesnt exist : /user/root/crawl/crawldb/current

Have you formatted the filesystem? Can you run bin/hadoop fs -ls /user/root/crawl ?

>     at org.apache.hadoop.mapred.InputFormatBase.validateInput(InputFormatBase.java:138)
>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:326)
>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:543)
>     at org.apache.nutch.crawl.Generator.generate(Generator.java:416)
>     at org.apache.nutch.crawl.Generator.run(Generator.java:548)
>     at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
>     at org.apache.nutch.crawl.Generator.main(Generator.java:511)
>
> Now that I got somewhat set up, is there a guide for full web crawling?
> If I have a list of urls how do I inject? Does the database live in
> the hadoop file system?

Oh, if you have not injected any URLs, there is nothing to crawl in your
crawldb. Run bin/nutch and you will see "inject" as one of the options.
Yes, the DB lives in HDFS.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
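P.S. To make the "inject before generate" point concrete, here is a rough sketch of the usual sequence on HDFS. The seed-file and directory names (urls/, seed.txt) are just illustrative, and exact syntax can vary a bit between Nutch releases:

```shell
# Put a seed list into HDFS (seed.txt: one URL per line) -- names are illustrative
bin/hadoop fs -mkdir urls
bin/hadoop fs -put seed.txt urls

# Inject the seeds; this is what creates crawl/crawldb/current
bin/nutch inject crawl/crawldb urls

# Sanity check: the crawldb should now exist under your user dir
bin/hadoop fs -ls /user/root/crawl/crawldb

# Only now will generate have something to select from
bin/nutch generate crawl/crawldb crawl/segments -topN 10000
```

Run bin/nutch with no arguments to see the full list of subcommands (inject, generate, fetch, updatedb, ...).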
