I am having trouble getting Nutch to work using the DFS. I pulled Nutch 0.8
from SVN and built it just fine using Eclipse. I was able to set it up on a
Whitebox Enterprise Linux 3 Respin 2 box (800 MHz, 512 MB of RAM) and do a
crawl using the local filesystem. I was also able to set up the WAR inside
of Tomcat and search the local index.
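For reference, the local crawl and webapp setup were along these lines
(typed from memory, so treat the exact paths and names as approximate;
"urls" is a directory holding my seed list):

  # crawl against the local filesystem
  bin/nutch crawl urls -dir crawl.local -depth 3 -topN 50
  # deploy the search webapp by replacing Tomcat's ROOT app with the Nutch WAR
  rm -rf $CATALINA_HOME/webapps/ROOT*
  cp nutch-0.8-dev.war $CATALINA_HOME/webapps/ROOT.war
  # then point searcher.dir in the webapp's nutch-site.xml at crawl.local
  # and restart Tomcat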
I then tried to switch to using the DFS. I was running everything as a
nutch user, and I have a password-less login to the local machine. I am
using the options below in my hadoop-site.xml file. When I run start-all.sh
I get some weird output, but doing a ps -ef | grep java shows two java
processes running. Then when I try to do a crawl it errors out.
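In case it matters, the password-less login and the DFS startup were done
roughly like this (again from memory, so the exact commands are
approximate):

  # as the nutch user
  ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
  cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  chmod 600 ~/.ssh/authorized_keys
  ssh localhost true            # verify there is no password prompt
  # format the name directory once, then start all the daemons
  bin/hadoop namenode -format
  bin/start-all.sh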
Anybody got any ideas?
Dennis
hadoop-site.xml
----------------------------------------------------------------------------
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->

<configuration>

  <property>
    <name>fs.default.name</name>
    <value>localhost:9000</value>
    <description>
      The name of the default file system. Either the literal string
      "local" or a host:port for NDFS.
    </description>
  </property>

  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
    <description>
      The host and port that the MapReduce job tracker runs at. If
      "local", then jobs are run in-process as a single map and
      reduce task.
    </description>
  </property>

  <property>
    <name>mapred.map.tasks</name>
    <value>2</value>
    <description>
      Define mapred.map.tasks to be the number of slave hosts.
    </description>
  </property>

  <property>
    <name>mapred.reduce.tasks</name>
    <value>2</value>
    <description>
      Define mapred.reduce.tasks to be the number of slave hosts.
    </description>
  </property>

  <property>
    <name>dfs.name.dir</name>
    <value>/nutch/filesystem/name</value>
  </property>

  <property>
    <name>dfs.data.dir</name>
    <value>/nutch/filesystem/data</value>
  </property>

  <property>
    <name>mapred.local.dir</name>
    <value>/nutch/filesystem/mapreduce</value>
  </property>

</configuration>
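After start-all.sh I have been sanity-checking things like this (I am not
sure this is the right way to verify the DFS is up, so corrections are
welcome):

  # are the namenode (9000) and jobtracker (9001) listening?
  netstat -ln | grep -E ':900[01]'
  # does the DFS answer a simple request?
  bin/hadoop dfs -ls /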
log of startup
----------------------------------------------------------------------------
localhost:9000: command-line: line 0: Bad configuration option: ConnectTimeout
devcluster02:9000: command-line: line 0: Bad configuration option: ConnectTimeout
starting namenode, logging to /nutch/search/bin/../logs/hadoop-nutch-namenode-devcluster01.visvo.com.log
: command not foundadoop: line 2:
: command not foundadoop: line 7:
: command not foundadoop: line 10:
: command not foundadoop: line 13:
: command not foundadoop: line 16:
: command not foundadoop: line 19:
: command not foundadoop: line 22:
: command not foundadoop: line 25:
: command not foundadoop: line 28:
: command not foundadoop: line 31:
starting jobtracker, logging to /nutch/search/bin/../logs/hadoop-nutch-jobtracker-devcluster01.visvo.com.log
: command not foundadoop: line 2:
: command not foundadoop: line 7:
: command not foundadoop: line 10:
: command not foundadoop: line 13:
: command not foundadoop: line 16:
: command not foundadoop: line 19:
: command not foundadoop: line 22:
: command not foundadoop: line 25:
: command not foundadoop: line 28:
: command not foundadoop: line 31:
localhost:9000: command-line: line 0: Bad configuration option: ConnectTimeout
devcluster02:9000: command-line: line 0: Bad configuration option: ConnectTimeout
ps -ef | grep java
----------------------------------------------------------------------------
[EMAIL PROTECTED] search]$ ps -ef | grep java
nutch 9907 1 2 17:26 pts/0 00:00:02 /usr/java/jdk1.5.0_06/bin/java -Xmx1000m -classpath /nutch/search/conf:/usr/java/jdk1.5.0_06/lib/tools.jar:/nutch/search:/nutch/search/hadoop-*.jar:/nutch/search/lib/commons-lang-2.1.jar:/nutch/search/lib/commons-logging-api-1.0.4.jar:/nutch/search/lib/concurrent-1.3.4.ja
nutch 9945 1 8 17:27 pts/0 00:00:07 /usr/java/jdk1.5.0_06/bin/java -Xmx1000m -classpath /nutch/search/conf:/usr/java/jdk1.5.0_06/lib/tools.jar:/nutch/search:/nutch/search/hadoop-*.jar:/nutch/search/lib/commons-lang-2.1.jar:/nutch/search/lib/commons-logging-api-1.0.4.jar:/nutch/search/lib/concurrent-1.3.4.ja
nutch 10028 9771 0 17:28 pts/0 00:00:00 grep java
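For what it's worth, I expected start-all.sh on a single box to launch four
daemons (namenode, datanode, jobtracker, tasktracker), so seeing only two
java processes makes me think the datanode and tasktracker never started.
The per-daemon logs should say why; something like this (the datanode log
name is guessed from the namenode log name above):

  ls -l /nutch/search/logs/
  tail -50 /nutch/search/logs/hadoop-nutch-datanode-devcluster01.visvo.com.log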
Errors when running crawl
----------------------------------------------------------------------------
[EMAIL PROTECTED] search]$ bin/nutch crawl urls -depth 3 -topN 50
060316 173158 parsing jar:file:/nutch/search/lib/hadoop-0.1-dev.jar!/hadoop-default.xml
060316 173158 parsing file:/nutch/search/conf/nutch-default.xml
060316 173159 parsing file:/nutch/search/conf/crawl-tool.xml
060316 173159 parsing jar:file:/nutch/search/lib/hadoop-0.1-dev.jar!/mapred-default.xml
060316 173159 parsing file:/nutch/search/conf/nutch-site.xml
060316 173159 parsing file:/nutch/search/conf/hadoop-site.xml
060316 173159 Client connection to 127.0.0.1:9000: starting
060316 173159 crawl started in: crawl-20060316173159
060316 173159 rootUrlDir = urls
060316 173159 threads = 10
060316 173159 depth = 3
060316 173159 topN = 50
060316 173159 Injector: starting
060316 173159 Injector: crawlDb: crawl-20060316173159/crawldb
060316 173159 Injector: urlDir: urls
060316 173159 Injector: Converting injected urls to crawl db entries.
060316 173159 parsing jar:file:/nutch/search/lib/hadoop-0.1-dev.jar!/hadoop-default.xml
060316 173159 parsing file:/nutch/search/conf/nutch-default.xml
060316 173159 parsing file:/nutch/search/conf/crawl-tool.xml
060316 173159 parsing jar:file:/nutch/search/lib/hadoop-0.1-dev.jar!/mapred-default.xml
060316 173159 parsing jar:file:/nutch/search/lib/hadoop-0.1-dev.jar!/mapred-default.xml
060316 173159 parsing file:/nutch/search/conf/nutch-site.xml
060316 173159 parsing file:/nutch/search/conf/hadoop-site.xml
060316 173200 Client connection to 127.0.0.1:9001: starting
060316 173200 Client connection to 127.0.0.1:9000: starting
060316 173200 parsing jar:file:/nutch/search/lib/hadoop-0.1-dev.jar!/hadoop-default.xml
060316 173200 parsing file:/nutch/search/conf/hadoop-site.xml
Exception in thread "main" java.io.IOException: Cannot create file /tmp/hadoop/mapred/system/submit_wdapr7/job.jar on client DFSClient_1136455260
        at org.apache.hadoop.ipc.Client.call(Client.java:301)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:141)
        at org.apache.hadoop.dfs.$Proxy0.create(Unknown Source)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:587)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.<init>(DFSClient.java:556)
        at org.apache.hadoop.dfs.DFSClient.create(DFSClient.java:99)
        at org.apache.hadoop.dfs.DistributedFileSystem.createRaw(DistributedFileSystem.java:71)
        at org.apache.hadoop.fs.FSDataOutputStream$Summer.<init>(FSDataOutputStream.java:39)
        at org.apache.hadoop.fs.FSDataOutputStream.<init>(FSDataOutputStream.java:128)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:180)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:168)
        at org.apache.hadoop.dfs.DistributedFileSystem.doFromLocalFile(DistributedFileSystem.java:156)
        at org.apache.hadoop.dfs.DistributedFileSystem.copyFromLocalFile(DistributedFileSystem.java:131)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:247)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:294)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:114)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:104)
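Since the exception happens while the job client copies job.jar into the
DFS, I suspect any plain write into the DFS would fail the same way. If it
helps anyone reproduce this, something like the following should be a
simpler test (assuming this build's DFS shell supports -copyFromLocal):

  echo "hello" > /tmp/hello.txt
  bin/hadoop dfs -copyFromLocal /tmp/hello.txt /hello.txt
  bin/hadoop dfs -ls /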