The crawl script picks up the name of the segment created by the
generate phase of Nutch using some shell code:

124 if [ $mode = "local" ]; then
125   SEGMENT=`ls -l $CRAWL_PATH/segments/ | sed -e "s/ /\\n/g" | egrep 20[0-9]+ | sort -n | tail -n 1`
126 else
127   SEGMENT=`hadoop fs -ls $CRAWL_PATH/segments/ | grep segments | sed -e "s/\//\\n/g" | egrep 20[0-9]+ | sort -n | tail -n 1`
128 fi
129
130 echo "Operating on segment : $SEGMENT"

For some reason on your setup it gives incorrect output:
Operating on segment : drwxr-xr-xnn3ncaryannstaffnn102nn4n23n22:01n20130423220110

You are running in local mode, so line 125 above applies. Try running
"ls -l TestCrawl/segments/" in a shell. I think it is producing
incompatible output, which causes this. Ideally you should get output
like the example shown at [0].

[0] : http://www.computerhope.com/unix/uls.htm
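One likely culprit (an assumption on my part, not something the script confirms): GNU sed interprets "\n" in the replacement string as a newline, while BSD sed (the one shipped with Mac OS X) inserts a literal "n" -- which would turn the spaces in the "ls -l" output into exactly the "drwxr-xr-xnn3n..." garbage you are seeing. A minimal sketch of a more portable approach, which avoids parsing "ls -l" output entirely (the /tmp/TestCrawl path and timestamps below are just demo values, not the script's real ones):

```shell
# Demo setup with hypothetical segment directories:
CRAWL_PATH=/tmp/TestCrawl
mkdir -p "$CRAWL_PATH/segments/20130423215900" "$CRAWL_PATH/segments/20130423220110"

# Pick the newest segment by directory name alone. Segment names are
# timestamps like 20130423220110, so a numeric sort gives the latest one,
# and no "ls -l" formatting (which varies by platform and locale) is involved.
SEGMENT=$(ls "$CRAWL_PATH/segments/" | grep -E '^20[0-9]+$' | sort -n | tail -n 1)
echo "Operating on segment : $SEGMENT"
```

Since the directory listing here is bare names only, the result should be the same under GNU and BSD userlands.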

Thanks,
Tejas

On Tue, Apr 23, 2013 at 7:17 AM, Maohua Liu <[email protected]> wrote:

> Hi,
>
> These days I have been following the Nutch tutorial:
> http://wiki.apache.org/nutch/NutchTutorial, but I always get the
> following error message:
>
> MaohuaLiu-MacBook-Pro:apache-nutch-1.6 carya$ bin/crawl urls/seed.txt
> TestCrawl http://localhost:8983/solr/ 2
> Injector: starting at 2013-04-23 22:00:46
> Injector: crawlDb: TestCrawl/crawldb
> Injector: urlDir: urls/seed.txt
> Injector: Converting injected urls to crawl db entries.
> 2013-04-23 22:00:46.562 java[1047:1903] Unable to load realm info from
> SCDynamicStore
> Injector: total number of urls rejected by filters: 0
> Injector: total number of urls injected after normalization and filtering:
> 1
> Injector: Merging injected urls into crawl db.
> Injector: finished at 2013-04-23 22:01:01, elapsed: 00:00:14
> Tue Apr 23 22:01:01 CST 2013 : Iteration 1 of 2
> Generating a new segment
> 2013-04-23 22:01:01.888 java[1055:1903] Unable to load realm info from
> SCDynamicStore
> Generator: starting at 2013-04-23 22:01:02
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: false
> Generator: normalizing: true
> Generator: topN: 50000
> Generator: Partitioning selected urls for politeness.
> Generator: segment: TestCrawl/segments/20130423220110
> Generator: finished at 2013-04-23 22:01:17, elapsed: 00:00:15
> Operating on segment :
> drwxr-xr-xnn3ncaryannstaffnn102nn4n23n22:01n20130423220110
> Fetching : drwxr-xr-xnn3ncaryannstaffnn102nn4n23n22:01n20130423220110
> Fetcher: Your 'http.agent.name' value should be listed first in
> 'http.robots.agents' property.
> Fetcher: starting at 2013-04-23 22:01:18
> Fetcher: segment:
> TestCrawl/segments/drwxr-xr-xnn3ncaryannstaffnn102nn4n23n22:01n20130423220110
> Fetcher Timelimit set for : 1366736478177
> 2013-04-23 22:01:18.308 java[1068:1903] Unable to load realm info from
> SCDynamicStore
> Fetcher: java.lang.IllegalArgumentException: java.net.URISyntaxException:
> Relative path in absolute URI:
> drwxr-xr-xnn3ncaryannstaffnn102nn4n23n22:01n20130423220110
> at org.apache.hadoop.fs.Path.initialize(Path.java:148)
> at org.apache.hadoop.fs.Path.<init>(Path.java:126)
> at org.apache.hadoop.fs.Path.<init>(Path.java:50)
> at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1084)
> at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
> at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
> at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
> at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
> at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
> at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
> at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
> at org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:1023)
> at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:987)
> at
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:177)
> at
> org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)
> at org.apache.nutch.fetcher.Fetcher$InputFormat.getSplits(Fetcher.java:105)
> at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:989)
> at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:981)
> at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
> at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
> at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
> at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261)
> at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1332)
> at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1368)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1341)
> Caused by: java.net.URISyntaxException: Relative path in absolute URI:
> drwxr-xr-xnn3ncaryannstaffnn102nn4n23n22:01n20130423220110
> at java.net.URI.checkPath(URI.java:1788)
> at java.net.URI.<init>(URI.java:734)
> at org.apache.hadoop.fs.Path.initialize(Path.java:145)
> ... 30 more
>
> All I did was follow the tutorial, as follows:
> 1. download nutch bin from:
> http://mirror.esocc.com/apache/nutch/1.6/apache-nutch-1.6-bin.zip
> 2. unzip and step into the dir: apache-nutch-1.6
> 3. in my home dir I set up JAVA_HOME in .bash_profile like:
> JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home
> export JAVA_HOME
> 4. change the content in conf/nutch-site.xml to follows:
> <configuration>
>     <property>
>         <name>http.agent.name</name>
>         <value>NutchSpider</value>
>     </property>
> </configuration>
>
> 5. under dir: apache-nutch-1.6, execute:
> mkdir -p urls
> cd urls
> touch seed.txt
> 6. edit seed.txt with content:
>
> http://nutch.apache.org/
>
> 7. then edit file conf/regex-urlfilter.txt and replace
>
> # accept anything else
> +.
>
> with
>
> +^http://([a-z0-9]*\.)*nutch.apache.org/
>
> 8. finally, I run the command under dir apache-nutch-1.6 as follows:
> MaohuaLiu-MacBook-Pro:apache-nutch-1.6 carya$ bin/crawl urls/seed.txt
> TestCrawl http://localhost:8983/solr/ 2
>
> 9. at the end it shows the error message mentioned before.
>
>
> Please help me solve this problem, thanks very much.
>
> my java version:
> MaohuaLiu-MacBook-Pro:apache-nutch-1.6 carya$ java -version
> java version "1.6.0_43"
> Java(TM) SE Runtime Environment (build 1.6.0_43-b01-447-11M4203)
> Java HotSpot(TM) 64-Bit Server VM (build 20.14-b01-447, mixed mode)
>
> Mac OS X version 10.7.5
>
>
>
> Best Regards.
> --------------------------------------
> Maohua Liu
> Email: [email protected]
>
>
