Boss,
----- Original Message ----
> From: Jason Boss <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Monday, April 21, 2008 6:42:28 PM
> Subject: hadoop
>
> Hi guys,
>
> Working on getting hadoop up and running. I have the master up and
> running. At this point I am not sure what the next steps are. I am
> trying to do a whole web search.

Whole web? What for, if not a secret?

> [EMAIL PROTECTED] search]# bin/nutch generate crawl/crawldb

Suggestion: don't run things as root.

> crawl/segments -topN 10000
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl/segments/20080421151646
> Generator: filtering: true
> Generator: topN: 10000
> Generator: org.apache.hadoop.mapred.InvalidInputException: Input path
> doesnt exist : /user/root/crawl/crawldb/current

Have you formatted the filesystem? Can you run bin/hadoop fs -ls /user/root/crawl ?

>     at org.apache.hadoop.mapred.InputFormatBase.validateInput(InputFormatBase.java:138)
>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:326)
>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:543)
>     at org.apache.nutch.crawl.Generator.generate(Generator.java:416)
>     at org.apache.nutch.crawl.Generator.run(Generator.java:548)
>     at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
>     at org.apache.nutch.crawl.Generator.main(Generator.java:511)
>
> Now that I got somewhat set up, is there a guide for full web crawling?
> If I have a list of urls how do I inject? Does the database live in
> the hadoop file system?

Oh, if you have not injected any URLs, there is nothing to crawl in your
crawldb. Run bin/nutch and you will see "inject" as one of the options.
Yes, the DB lives in HDFS.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
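P.S. To make the "inject before generate" point concrete, here is a rough sketch of the usual sequence on HDFS. The seed-file and directory names (urls/, seed.txt) are just illustrative, and exact syntax can vary a bit between Nutch releases:

```shell
# Put a seed list into HDFS (seed.txt: one URL per line) -- names are illustrative
bin/hadoop fs -mkdir urls
bin/hadoop fs -put seed.txt urls

# Inject the seeds; this is what creates crawl/crawldb/current
bin/nutch inject crawl/crawldb urls

# Sanity check: the crawldb should now exist under your user dir
bin/hadoop fs -ls /user/root/crawl/crawldb

# Only now will generate have something to select from
bin/nutch generate crawl/crawldb crawl/segments -topN 10000
```

Run bin/nutch with no arguments to see the full list of subcommands (inject, generate, fetch, updatedb, ...).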
