Error when running Nutch, please help

Maohua Liu Tue, 23 Apr 2013 07:21:29 -0700

Hi,

I have been following the Nutch tutorial at http://wiki.apache.org/nutch/NutchTutorial,
but I always get the following error:


MaohuaLiu-MacBook-Pro:apache-nutch-1.6 carya$ bin/crawl urls/seed.txt TestCrawl http://localhost:8983/solr/ 2
Injector: starting at 2013-04-23 22:00:46
Injector: crawlDb: TestCrawl/crawldb
Injector: urlDir: urls/seed.txt
Injector: Converting injected urls to crawl db entries.
2013-04-23 22:00:46.562 java[1047:1903] Unable to load realm info from SCDynamicStore
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 1
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-04-23 22:01:01, elapsed: 00:00:14
2013年 4月23日 星期二 22时01分01秒 CST : Iteration 1 of 2
Generating a new segment
2013-04-23 22:01:01.888 java[1055:1903] Unable to load realm info from SCDynamicStore
Generator: starting at 2013-04-23 22:01:02
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: Partitioning selected urls for politeness.
Generator: segment: TestCrawl/segments/20130423220110
Generator: finished at 2013-04-23 22:01:17, elapsed: 00:00:15
Operating on segment : drwxr-xr-xnn3ncaryannstaffnn102nn4n23n22:01n20130423220110
Fetching : drwxr-xr-xnn3ncaryannstaffnn102nn4n23n22:01n20130423220110
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2013-04-23 22:01:18
Fetcher: segment: TestCrawl/segments/drwxr-xr-xnn3ncaryannstaffnn102nn4n23n22:01n20130423220110
Fetcher Timelimit set for : 1366736478177
2013-04-23 22:01:18.308 java[1068:1903] Unable to load realm info from SCDynamicStore
Fetcher: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: drwxr-xr-xnn3ncaryannstaffnn102nn4n23n22:01n20130423220110
        at org.apache.hadoop.fs.Path.initialize(Path.java:148)
        at org.apache.hadoop.fs.Path.<init>(Path.java:126)
        at org.apache.hadoop.fs.Path.<init>(Path.java:50)
        at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1084)
        at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
        at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
        at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
        at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
        at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
        at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
        at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
        at org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:1023)
        at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:987)
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:177)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)
        at org.apache.nutch.fetcher.Fetcher$InputFormat.getSplits(Fetcher.java:105)
        at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:989)
        at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:981)
        at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261)
        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1332)
        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1368)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1341)
Caused by: java.net.URISyntaxException: Relative path in absolute URI: drwxr-xr-xnn3ncaryannstaffnn102nn4n23n22:01n20130423220110
        at java.net.URI.checkPath(URI.java:1788)
        at java.net.URI.<init>(URI.java:734)
        at org.apache.hadoop.fs.Path.initialize(Path.java:145)
        ... 30 more
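
One thing I noticed, in case it helps: the segment name in the log looks like a
line of `ls -l` output in which every space has been replaced by the letter n.
The sketch below only illustrates that observation. The listing line and its
spacing are my assumption, reconstructed from the garbled name, and I do not
know which part of bin/crawl produces it.

```shell
# Hypothetical `ls -l` line for the segment directory (the spacing is assumed).
listing='drwxr-xr-x  3 carya  staff  102  4 23 22:01 20130423220110'

# Replacing every space with the letter n gives a single long "word".
garbled=$(printf '%s' "$listing" | tr ' ' 'n')
echo "$garbled"

# Taking only the last whitespace-separated field gives a plain directory name.
segment=$(printf '%s\n' "$listing" | awk '{print $NF}')
echo "$segment"
```

The first echo prints the same string that the Fetcher reports as the segment,
and the second prints 20130423220110, which is the name the Generator logged.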

All I did was follow the tutorial, as follows:
1. Download the Nutch binary from:
http://mirror.esocc.com/apache/nutch/1.6/apache-nutch-1.6-bin.zip
2. Unzip it and change into the directory apache-nutch-1.6
3. In my home directory, I set JAVA_HOME in .bash_profile like this:
JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home
export JAVA_HOME
4. Change the content of conf/nutch-site.xml to the following:
<configuration>
    <property>
        <name>http.agent.name</name>
        <value>NutchSpider</value>
    </property>
</configuration>

5. Under the apache-nutch-1.6 directory, execute:
mkdir -p urls
cd urls
touch seed.txt
6. Edit seed.txt so that it contains:
http://nutch.apache.org/
7. Then edit the file conf/regex-urlfilter.txt and replace
# accept anything else
+.
with
+^http://([a-z0-9]*\.)*nutch.apache.org/
8. Finally, I ran the following command under the apache-nutch-1.6 directory:
MaohuaLiu-MacBook-Pro:apache-nutch-1.6 carya$ bin/crawl urls/seed.txt TestCrawl http://localhost:8983/solr/ 2

9. At the end, the error message shown above appears.
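
For reference, steps 5 to 7 can be sketched as a few shell commands. This is
only a rough check, not part of the tutorial: the check_url helper is a
hypothetical name of mine, and grep -E is only an approximation of how Nutch
applies the pattern from conf/regex-urlfilter.txt (without the leading +).

```shell
# Steps 5 and 6: create the seed list with a single URL.
mkdir -p urls
printf '%s\n' 'http://nutch.apache.org/' > urls/seed.txt

# Step 7: the accept pattern, without the leading "+" used in the filter file.
pattern='^http://([a-z0-9]*\.)*nutch.apache.org/'

# check_url is a hypothetical helper: it reports whether a URL matches the pattern.
check_url() {
  if printf '%s\n' "$1" | grep -Eq "$pattern"; then
    echo "accepted: $1"
  else
    echo "rejected: $1"
  fi
}

# Check every URL in the seed list.
while read -r url; do
  check_url "$url"
done < urls/seed.txt
```

With the seed above, the loop reports the URL as accepted, so the filter does
not seem to be what rejects it (the Injector log also shows 0 URLs rejected).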


Please help me solve this problem. Thank you very much.

My Java version:
MaohuaLiu-MacBook-Pro:apache-nutch-1.6 carya$ java -version
java version "1.6.0_43"
Java(TM) SE Runtime Environment (build 1.6.0_43-b01-447-11M4203)
Java HotSpot(TM) 64-Bit Server VM (build 20.14-b01-447, mixed mode)

Mac OS X version 10.7.5



Best Regards.
--------------------------------------
Maohua Liu
Email: [email protected]
