Hi, I have been following the Nutch tutorial (http://wiki.apache.org/nutch/NutchTutorial) over the past few days, but I always get the following error message:

MaohuaLiu-MacBook-Pro:apache-nutch-1.6 carya$ bin/crawl urls/seed.txt TestCrawl http://localhost:8983/solr/ 2
Injector: starting at 2013-04-23 22:00:46
Injector: crawlDb: TestCrawl/crawldb
Injector: urlDir: urls/seed.txt
Injector: Converting injected urls to crawl db entries.
2013-04-23 22:00:46.562 java[1047:1903] Unable to load realm info from SCDynamicStore
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 1
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-04-23 22:01:01, elapsed: 00:00:14
2013年 4月23日 星期二 22时01分01秒 CST : Iteration 1 of 2
Generating a new segment
2013-04-23 22:01:01.888 java[1055:1903] Unable to load realm info from SCDynamicStore
Generator: starting at 2013-04-23 22:01:02
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: Partitioning selected urls for politeness.
Generator: segment: TestCrawl/segments/20130423220110
Generator: finished at 2013-04-23 22:01:17, elapsed: 00:00:15
Operating on segment : drwxr-xr-xnn3ncaryannstaffnn102nn4n23n22:01n20130423220110
Fetching : drwxr-xr-xnn3ncaryannstaffnn102nn4n23n22:01n20130423220110
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2013-04-23 22:01:18
Fetcher: segment: TestCrawl/segments/drwxr-xr-xnn3ncaryannstaffnn102nn4n23n22:01n20130423220110
Fetcher Timelimit set for : 1366736478177
2013-04-23 22:01:18.308 java[1068:1903] Unable to load realm info from SCDynamicStore
Fetcher: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: drwxr-xr-xnn3ncaryannstaffnn102nn4n23n22:01n20130423220110
    at org.apache.hadoop.fs.Path.initialize(Path.java:148)
    at org.apache.hadoop.fs.Path.<init>(Path.java:126)
    at org.apache.hadoop.fs.Path.<init>(Path.java:50)
    at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1084)
    at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
    at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
    at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
    at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
    at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
    at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
    at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
    at org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:1023)
    at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:987)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:177)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)
    at org.apache.nutch.fetcher.Fetcher$InputFormat.getSplits(Fetcher.java:105)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:989)
    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:981)
    at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261)
    at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1332)
    at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1368)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1341)
Caused by: java.net.URISyntaxException: Relative path in absolute URI: drwxr-xr-xnn3ncaryannstaffnn102nn4n23n22:01n20130423220110
    at java.net.URI.checkPath(URI.java:1788)
    at java.net.URI.<init>(URI.java:734)
    at org.apache.hadoop.fs.Path.initialize(Path.java:145)
    ... 30 more

All I did was follow the tutorial, as follows:

1. Downloaded the Nutch binary from: http://mirror.esocc.com/apache/nutch/1.6/apache-nutch-1.6-bin.zip

2. Unzipped it and changed into the directory apache-nutch-1.6.

3. In my home directory, set JAVA_HOME in .bash_profile like this:

   JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home
   export JAVA_HOME

4. Changed the content of conf/nutch-site.xml to the following:

   <configuration>
     <property>
       <name>http.agent.name</name>
       <value>NutchSpider</value>
     </property>
   </configuration>

5. Under the directory apache-nutch-1.6, executed:

   mkdir -p urls
   cd urls
   touch seed.txt

6. Edited seed.txt so that it contains:

   http://nutch.apache.org/

7. Edited the file conf/regex-urlfilter.txt and replaced

   # accept anything else
   +.

   with

   +^http://([a-z0-9]*\.)*nutch.apache.org/

8. Finally, ran this command under the directory apache-nutch-1.6:

   MaohuaLiu-MacBook-Pro:apache-nutch-1.6 carya$ bin/crawl urls/seed.txt TestCrawl http://localhost:8983/solr/ 2

9. The run ended with the error message shown above.

Please help me solve this problem. Thanks very much.

My Java version:

MaohuaLiu-MacBook-Pro:apache-nutch-1.6 carya$ java -version
java version "1.6.0_43"
Java(TM) SE Runtime Environment (build 1.6.0_43-b01-447-11M4203)
Java HotSpot(TM) 64-Bit Server VM (build 20.14-b01-447, mixed mode)

Mac OS X version: 10.7.5

Best Regards.
--------------------------------------
Maohua Liu
Email: [email protected]

