What I find interesting is that I got the "urls.txt" approach to work when I installed Nutch 0.9 rather than the nightly build, e.g.:
bin/nutch crawl urls.txt -dir z/sf911truth
That works. I just haven't gotten it to work with a nightly build. I'm still trying to figure out how to compile the nightly build under cygwin.

--Kai Middleton

----- Original Message ----
From: feran <[EMAIL PROTECTED]>
To: [email protected]
Sent: Saturday, July 28, 2007 10:06:06 AM
Subject: Re: cygwin - Input path doesnt exist

Nutch resolves relative paths against the directory where you ran the command, as shown in the tutorial. Try using relative paths: go up one directory, set up your directories there, then run bin/nutch with your parameters.

----- Original Message -----
From: "Kai_testing Middleton" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Friday, July 27, 2007 7:00 PM
Subject: Re: cygwin - Input path doesnt exist

Hmmm, creating a seed directory with a file in it named urls doesn't seem to work.

[EMAIL PROTECTED] /cygdrive/c/nutch-2007-07-26_04-01-20/conf
$ nutch crawl /cygdrive/c/nutch-2007-07-26_04-01-20/seed -dir /cygdrive/c/nutch-2007-07-26_04-01-20/zzzz/sf911truth -depth 3 -topN 200
crawl started in: /cygdrive/c/nutch-2007-07-26_04-01-20/zzzz/sf911truth
rootUrlDir = /cygdrive/c/nutch-2007-07-26_04-01-20/seed
threads = 10
depth = 3
topN = 200
Injector: starting
Injector: crawlDb: /cygdrive/c/nutch-2007-07-26_04-01-20/zzzz/sf911truth/crawldb
Injector: urlDir: /cygdrive/c/nutch-2007-07-26_04-01-20/seed
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path doesnt exist : /cygdrive/c/nutch-2007-07-26_04-01-20/seed
        at org.apache.hadoop.mapred.InputFormatBase.validateInput(InputFormatBase.java:138)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:326)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:543)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:162)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)

Maybe, as I speculated before, it's some kind of pathing problem with cygwin in hadoop.
Maybe I'll try Susam Pal's suggestion of installing a JDK within cygwin that thinks in terms of unix paths.

--Kai

----- Original Message ----
From: feran <[EMAIL PROTECTED]>
To: [email protected]
Sent: Friday, July 27, 2007 6:20:56 AM
Subject: Re: cygwin - Input path doesnt exist

This is the problem:

Injector: urlDir: /cygdrive/c/nutch-2007-07-26_04-01-20/content/urls.txt

urls.txt is not a directory. Crawl takes a directory parameter, not the file itself. Inside the directory, it checks for a flat file with no extension called urls.

- feran_a

----- Original Message -----
From: "Kai_testing Middleton" <[EMAIL PROTECTED]>
To: "nutch user" <[email protected]>
Sent: Friday, July 27, 2007 2:56 AM
Subject: cygwin - Input path doesnt exist

I've freshly installed a nutch nightly build onto my laptop using an up-to-date cygwin. Basically I just downloaded the .tar.gz, ran ant, and verified that $NUTCH_HOME/bin/nutch works (gives me the help screen). I set up nutch-site.xml and urls.txt and attempted to crawl. However, I get an org.apache.hadoop.mapred.InvalidInputException. The hadoop.log doesn't report the error, just the command line crawl command. Anyone seen this before?

$ nutch crawl /cygdrive/c/nutch-2007-07-26_04-01-20/content/urls.txt -dir /cygdrive/c/nutch-2007-07-26_04-01-20/content/sf911truth -depth 3 -topN 200
crawl started in: /cygdrive/c/nutch-2007-07-26_04-01-20/content/sf911truth
rootUrlDir = /cygdrive/c/nutch-2007-07-26_04-01-20/content/urls.txt
threads = 10
depth = 3
topN = 200
Injector: starting
Injector: crawlDb: /cygdrive/c/nutch-2007-07-26_04-01-20/content/sf911truth/crawldb
Injector: urlDir: /cygdrive/c/nutch-2007-07-26_04-01-20/content/urls.txt
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path doesnt exist : /cygdrive/c/nutch-2007-07-26_04-01-20/content/urls.txt
        at org.apache.hadoop.mapred.InputFormatBase.validateInput(InputFormatBase.java:138)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:326)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:543)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:162)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)
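
For reference, here is a rough sketch of the layout feran describes in this thread: a seed directory holding a flat file named urls, passed to the crawl by relative path from the directory where the command is run. The names SEED_DIR and CRAWL_DIR and the example.com URL are placeholders I made up, not values from the thread. Note that Kai reports above that this layout still failed with the nightly build under cygwin, so treat it as the expected layout rather than a confirmed fix.

```shell
#!/bin/sh
# Sketch only: SEED_DIR, CRAWL_DIR and the example.com URL are placeholders,
# not values taken from the thread.
SEED_DIR="seed"
CRAWL_DIR="crawl/sf911truth"

# The crawl argument is a directory, not the URL file itself.
mkdir -p "$SEED_DIR"

# One URL per line, in a flat file named "urls" with no extension.
printf '%s\n' 'http://example.com/' > "$SEED_DIR/urls"

# Fail early with a clear message if the seed path is not a directory.
if [ ! -d "$SEED_DIR" ]; then
    echo "seed path is not a directory: $SEED_DIR" >&2
    exit 1
fi

# Run from the Nutch install directory so the relative paths resolve.
if [ -x bin/nutch ]; then
    bin/nutch crawl "$SEED_DIR" -dir "$CRAWL_DIR" -depth 3 -topN 200
else
    echo "bin/nutch not found here; run this from the Nutch install directory" >&2
fi
```

The -depth and -topN values are the ones used in the commands quoted above.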
