The script creates a 'crawl' directory in the present working directory. Where is your Nutch directory, and from where are you running the script? I usually change to the top-level Nutch directory, put the script in the 'bin' directory, run 'chmod a+x bin/crawl', and then invoke it as 'bin/crawl'. With this setup, the crawl_generate directory is created in: crawl/segments/<segment-number>/crawl_generate (a typical segment-number: 20080102215525).
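To make the layout concrete, here is a small illustration-only sketch that recreates the directory structure described above so you can see where crawl_generate lands. In a real run these directories are created by 'bin/nutch generate', not by mkdir; the segment timestamp is the example from the thread.

```shell
#!/bin/sh
# Illustration only: mimic the layout bin/crawl produces under the current
# working directory. The real directories come from 'bin/nutch generate'.
segment=20080102215525            # example segment timestamp from the thread
mkdir -p "crawl/segments/$segment/crawl_generate"
find crawl -type d | sort
```

Running this from the top-level Nutch directory shows the same relative paths the fetcher later reads.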
Your error seems to come from this statement in the script:

    $NUTCH_HOME/bin/nutch fetch $segment -threads $threads

The fetcher tries to access $segment/crawl_generate at the beginning. In your case the fetcher is trying to open:

    /user/nutch/-threads/crawl_generate

So it seems the above statement resolved to:

    $NUTCH_HOME/bin/nutch fetch /user/nutch/-threads $threads

This means your $segment is /user/nutch, and a space is missing between $segment and -threads. Have you modified the script and altered the paths, but accidentally missed the space?

I hope this information and the script help you resolve the problem. Whatever the result is, please let us know; that would help us improve the script if needed.

Regards,
Susam Pal

On Jan 13, 2008 11:19 AM, Manoj Bist <[EMAIL PROTECTED]> wrote:
> Thanks for the response.
> I tried this with nutch-0.9. The script seems to be accessing non-existent
> file/dirs.
>
> Fetcher: org.apache.hadoop.mapred.InvalidInputException: Input path doesnt
> exist : /user/nutch/-threads/crawl_generate
>         at org.apache.hadoop.mapred.InputFormatBase.validateInput(InputFormatBase.java:138)
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:326)
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:543)
>         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:470)
>         at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:505)
>         at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
>         at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:477)
>
> On Jan 12, 2008 9:00 PM, Susam Pal <[EMAIL PROTECTED]> wrote:
>
> > You can try the crawl script: http://wiki.apache.org/nutch/Crawl
> >
> > Regards,
> > Susam Pal
> >
> > On Jan 13, 2008 8:36 AM, Manoj Bist <[EMAIL PROTECTED]> wrote:
> > > Hi,
> > >
> > > When I run crawl the second time, it always complains that 'crawled'
> > > already exists. I always need to remove this directory using
> > > 'hadoop dfs -rm crawled' to get going.
> > > Is there some way to avoid this error and tell Nutch that it's a
> > > recrawl?
> > >
> > > bin/nutch crawl urls -dir crawled -depth 1 2>&1 | tee /tmp/foo.log
> > >
> > > Exception in thread "main" java.lang.RuntimeException: crawled already
> > > exists.
> > >         at org.apache.nutch.crawl.Crawl.main(Crawl.java:85)
> > >
> > > Thanks,
> > >
> > > Manoj.
> > >
> > > --
> > > Tired of reading blogs? Listen to your favorite blogs at
> > > http://www.blogbard.com !!!!
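The missing-space failure mode described in the reply above can be reproduced in isolation. This sketch (NUTCH_HOME and the segment path are hypothetical) only builds and echoes the two command lines rather than running Nutch, so you can compare what the fetcher would receive in each case:

```shell
#!/bin/sh
# Hypothetical values; nothing is actually fetched, we only print the
# command strings to compare them.
NUTCH_HOME=/opt/nutch
segment=crawl/segments/20080102215525
threads=10

# As the script writes it: the segment path and -threads stay separate
# arguments.
good="$NUTCH_HOME/bin/nutch fetch $segment -threads $threads"
echo "$good"

# With the space dropped, -threads fuses onto the segment path, so the
# fetcher is handed a single mangled segment argument.
bad="$NUTCH_HOME/bin/nutch fetch $segment-threads $threads"
echo "$bad"
```

Printing the resolved command line like this (or running the script with `sh -x`) is a quick way to catch such quoting and spacing slips before the fetcher fails.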

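On the original "crawled already exists" question quoted in the thread: 'bin/nutch crawl' refuses to write into an existing output directory, and the workaround Manoj already uses is to delete it first. A minimal local sketch of that pre-run cleanup follows; on HDFS the equivalent is the 'hadoop dfs -rm crawled' command from the thread, and the directory name 'crawled' comes from the -dir option shown above:

```shell
#!/bin/sh
# Sketch of the pre-run cleanup the thread describes: remove the leftover
# output directory before re-running the crawl. Local rm -rf stands in for
# the HDFS removal used in the thread.
dir=crawled
mkdir -p "$dir"                 # simulate a leftover directory from a prior run
if [ -d "$dir" ]; then
    rm -rf "$dir"
fi
if [ -d "$dir" ]; then
    echo "still exists"
else
    echo "removed: safe to re-run crawl"
fi
```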