Hi,

Maybe this problem is caused by the crawl script. The command that sets the SEGMENT parameter looks like this:

  SEGMENT=`ls -l $CRAWL_PATH/segments/ | sed -e "s/ /\\n/g" | egrep 20[0-9]+ | sort -n | tail -n 1`

You can run this command in your terminal:

  ls -l TestCrawl/segments/ | sed -e "s/ /\\n/g" | egrep 20[0-9]+ | sort -n | tail -n 1

It may print something like this:

  drwxr-xr-xnn3ncaryannstaffnn102nn4n23n22:01n20130423220110

This command is used to generate the segment name that is passed to the fetcher as an input parameter, but I don't know why you get this output. The correct SEGMENT value is probably 20130423220110.

On Tue, Apr 23, 2013 at 10:17 PM, Maohua Liu <[email protected]> wrote:

> Hi,
>
> These days I have been following the Nutch tutorial:
> http://wiki.apache.org/nutch/NutchTutorial, but I always get the error
> message as follows:
>
> MaohuaLiu-MacBook-Pro:apache-nutch-1.6 carya$ bin/crawl urls/seed.txt TestCrawl http://localhost:8983/solr/ 2
> Injector: starting at 2013-04-23 22:00:46
> Injector: crawlDb: TestCrawl/crawldb
> Injector: urlDir: urls/seed.txt
> Injector: Converting injected urls to crawl db entries.
> 2013-04-23 22:00:46.562 java[1047:1903] Unable to load realm info from SCDynamicStore
> Injector: total number of urls rejected by filters: 0
> Injector: total number of urls injected after normalization and filtering: 1
> Injector: Merging injected urls into crawl db.
> Injector: finished at 2013-04-23 22:01:01, elapsed: 00:00:14
> Tuesday, April 23, 2013, 22:01:01 CST : Iteration 1 of 2
> Generating a new segment
> 2013-04-23 22:01:01.888 java[1055:1903] Unable to load realm info from SCDynamicStore
> Generator: starting at 2013-04-23 22:01:02
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: false
> Generator: normalizing: true
> Generator: topN: 50000
> Generator: Partitioning selected urls for politeness.
> Generator: segment: TestCrawl/segments/20130423220110
> Generator: finished at 2013-04-23 22:01:17, elapsed: 00:00:15
> Operating on segment : drwxr-xr-xnn3ncaryannstaffnn102nn4n23n22:01n20130423220110
> Fetching : drwxr-xr-xnn3ncaryannstaffnn102nn4n23n22:01n20130423220110
> Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
> Fetcher: starting at 2013-04-23 22:01:18
> Fetcher: segment: TestCrawl/segments/drwxr-xr-xnn3ncaryannstaffnn102nn4n23n22:01n20130423220110
> Fetcher Timelimit set for : 1366736478177
> 2013-04-23 22:01:18.308 java[1068:1903] Unable to load realm info from SCDynamicStore
> Fetcher: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: drwxr-xr-xnn3ncaryannstaffnn102nn4n23n22:01n20130423220110
>     at org.apache.hadoop.fs.Path.initialize(Path.java:148)
>     at org.apache.hadoop.fs.Path.<init>(Path.java:126)
>     at org.apache.hadoop.fs.Path.<init>(Path.java:50)
>     at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1084)
>     at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
>     at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
>     at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
>     at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
>     at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
>     at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
>     at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
>     at org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:1023)
>     at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:987)
>     at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:177)
>     at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)
>     at org.apache.nutch.fetcher.Fetcher$InputFormat.getSplits(Fetcher.java:105)
>     at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:989)
>     at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:981)
>     at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
>     at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
>     at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261)
>     at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1332)
>     at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1368)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>     at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1341)
> Caused by: java.net.URISyntaxException: Relative path in absolute URI: drwxr-xr-xnn3ncaryannstaffnn102nn4n23n22:01n20130423220110
>     at java.net.URI.checkPath(URI.java:1788)
>     at java.net.URI.<init>(URI.java:734)
>     at org.apache.hadoop.fs.Path.initialize(Path.java:145)
>     ... 30 more
>
> All I did was follow the tutorial, as follows:
> 1. download the Nutch binary from:
> http://mirror.esocc.com/apache/nutch/1.6/apache-nutch-1.6-bin.zip
> 2. unzip and step into the dir: apache-nutch-1.6
> 3. in my home dir I set up JAVA_HOME in .bash_profile like:
> JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home
> export JAVA_HOME
> 4. change the content of conf/nutch-site.xml to the following:
> <configuration>
> <property>
> <name>http.agent.name</name>
> <value>NutchSpider</value>
> </property>
> </configuration>
>
> 5. under dir: apache-nutch-1.6, execute:
> mkdir -p urls
> cd urls
> touch seed.txt
> 6. edit seed.txt with content:
>
> http://nutch.apache.org/
>
> 7. then edit file conf/regex-urlfilter.txt and replace
>
>
> # accept anything else
> +.
>
> with
>
>
> +^http://([a-z0-9]*\.)*nutch.apache.org/
>
> 8. finally, I run the command under dir: apache-nutch-1.6 as follows:
> MaohuaLiu-MacBook-Pro:apache-nutch-1.6 carya$ bin/crawl urls/seed.txt TestCrawl http://localhost:8983/solr/ 2
>
> 9. at the end it shows the error message mentioned before.
>
>
> Please help me to solve this problem, thanks very much.
>
> my java version:
> MaohuaLiu-MacBook-Pro:apache-nutch-1.6 carya$ java -version
> java version "1.6.0_43"
> Java(TM) SE Runtime Environment (build 1.6.0_43-b01-447-11M4203)
> Java HotSpot(TM) 64-Bit Server VM (build 20.14-b01-447, mixed mode)
>
> Mac OS X version 10.7.5
>
>
>
> Best Regards.
> --------------------------------------
> Maohua Liu
> Email: [email protected]
>
>

-- Don't Grow Old, Grow Up... :-)
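
P.S. A possible workaround for the SEGMENT command discussed at the top of this reply, as a rough sketch only. The mangled value looks like an `ls -l` line with every space replaced by the letter n, which is what I would expect if the sed on Mac OS X does not turn `\n` in the replacement into a newline (that is my assumption, I have not checked it on 10.7.5). Listing bare names avoids sed entirely. The function name latest_segment and the temporary test directory are made up for this example and are not part of bin/crawl.

```shell
#!/bin/sh
# Hypothetical helper (not part of bin/crawl): print the newest segment name.
# Assumes segment directories are named with a numeric timestamp such as
# 20130423220110, as in the Generator log quoted above.
latest_segment() {
  # `ls -1` prints one bare name per line, so no sed step is needed
  # to split the permission/owner columns that `ls -l` would add.
  ls -1 "$1/segments/" | grep -E '^20[0-9]+$' | sort -n | tail -n 1
}

# Demonstration against a throwaway directory tree.
CRAWL_PATH=$(mktemp -d "${TMPDIR:-/tmp}/segdemo.XXXXXX")
mkdir -p "$CRAWL_PATH/segments/20130422101500" \
         "$CRAWL_PATH/segments/20130423220110"

SEGMENT=$(latest_segment "$CRAWL_PATH")
echo "Operating on segment : $SEGMENT"

# Clean up the throwaway tree.
rm -rf "$CRAWL_PATH"
```

If this prints a plain timestamp on your machine, the same pipeline could be tried in place of the SEGMENT line in your local copy of the crawl script.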

