Hi guys,
I'm working on getting Hadoop up and running, and the master is now up.
At this point I'm not sure what the next steps are. I am trying to do a
whole-web crawl.
[EMAIL PROTECTED] search]# bin/nutch generate crawl/crawldb
crawl/segments -topN 10000
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20080421151646
Generator: filtering: true
Generator: topN: 10000
Generator: org.apache.hadoop.mapred.InvalidInputException: Input path
doesnt exist : /user/root/crawl/crawldb/current
at
org.apache.hadoop.mapred.InputFormatBase.validateInput(InputFormatBase.java:138)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:326)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:543)
at org.apache.nutch.crawl.Generator.generate(Generator.java:416)
at org.apache.nutch.crawl.Generator.run(Generator.java:548)
at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
at org.apache.nutch.crawl.Generator.main(Generator.java:511)
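I'm guessing I can check what actually exists in DFS with something like
this (the path is just what the error message points at):

  # List what is actually under the crawl directory in HDFS.
  bin/hadoop dfs -ls /user/root/crawl
  bin/hadoop dfs -ls /user/root/crawl/crawldb

If I'm reading the error right, crawldb/current was simply never created.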
Now that I am somewhat set up, is there a guide for whole-web crawling?
If I have a list of URLs, how do I inject them? (My current guess is
sketched below.) Does the crawl database live in the Hadoop file system?
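For what it's worth, here is my guess at the inject step, assuming a
local urls/ directory holding a plain-text seed list (the directory and
file names here are just placeholders I made up):

  # Hypothetical seed list; one URL per line.
  mkdir urls
  echo "http://www.example.com/" > urls/seed.txt

  # Copy the seed directory into HDFS so the MapReduce job can read it.
  bin/hadoop dfs -put urls urls

  # Inject the seeds; as far as I understand, this is what creates crawl/crawldb.
  bin/nutch inject crawl/crawldb urls

Is that roughly right, or am I missing a step?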
Any help would be great.
Thanks,
Jason