Hello,
I cannot index my website with Nutch and Hadoop.
I have spent 7 days trying to get Nutch 0.8.1 to work, with no success.
Using Nutch 0.8.1 with Hadoop is very hard.
I use:
* jdk1.5.0_10
* Nutch-0.8.1
* Hadoop-0.9.2 or 0.9.3-dev (file hadoop-2006-12-27.tar.gz)
I did the configuration following http://wiki.apache.org/nutch/NutchHadoopTutorial
I have ONLY one server.
I start Hadoop-0.9.2 (start-all.sh) with no errors in the logs.
The crawls and urls directories are created.
I used the 'hadoop dfs -put' command line to put them into the DFS file system.
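The put commands were along these lines (I am reconstructing the local source paths here, so they may differ slightly from what I actually typed):

# copy the local urls and crawls directories into DFS (relative destinations end up under /user/webadm)
[nutch-0.8.1]$ /opt/hadoop-0.9.2/bin/hadoop dfs -put /opt/nutch-0.8.1/urls urls
[nutch-0.8.1]$ /opt/hadoop-0.9.2/bin/hadoop dfs -put /opt/nutch-0.8.1/crawls crawls

Afterwards 'hadoop dfs -ls' shows them: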
[nutch-0.8.1]$ /opt/hadoop-0.9.2/bin/hadoop dfs -ls
06/12/28 12:57:02 INFO ipc.Client: org.apache.hadoop.io.ObjectWritable Connection Culler maxidletime= 1000ms
06/12/28 12:57:02 INFO ipc.Client: org.apache.hadoop.io.ObjectWritable Connection Culler: starting
Found 2 items
/user/webadm/crawls <dir>
/user/webadm/urls <dir>
And when I crawl with Nutch 0.8.1, I get this error message:
[nutch-0.8.1]$ bin/nutch crawl /opt/nutch-0.8.1/urls/url-mywebsite -dir
/opt/nutch-0.8.1/crawls/crawl-mywebsite -depth 5
crawl started in: /opt/nutch-0.8.1/crawls/crawl-mywebsite
rootUrlDir = /opt/nutch-0.8.1/urls/url-mywebsite
threads = 10
depth = 5
Injector: starting
Injector: crawlDb: /opt/nutch-0.8.1/crawls/crawl-mywebsite/crawldb
Injector: urlDir: /opt/nutch-0.8.1/urls/url-mywebsite
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: failure closing block of file /opt/nutch-0.8.1/HDFS/mapred/system/submit_svdjjy/.job.jar.crc to node 192.168.1.2:50010
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.internalClose(DFSClient.java:1063)
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.endBlock(DFSClient.java:1027)
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.close(DFSClient.java:1105)
at java.io.FilterOutputStream.close(FilterOutputStream.java:143)
at java.io.FilterOutputStream.close(FilterOutputStream.java:143)
at java.io.FilterOutputStream.close(FilterOutputStream.java:143)
at org.apache.hadoop.fs.FSDataOutputStream$Summer.close(FSDataOutputStream.java:96)
at java.io.FilterOutputStream.close(FilterOutputStream.java:143)
at java.io.FilterOutputStream.close(FilterOutputStream.java:143)
at java.io.FilterOutputStream.close(FilterOutputStream.java:143)
at org.apache.hadoop.fs.FileUtil.copyContent(FileUtil.java:154)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:74)
at org.apache.hadoop.dfs.DistributedFileSystem.copyFromLocalFile(DistributedFileSystem.java:186)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:254)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
at org.apache.nutch.crawl.Injector.inject(Injector.java:138)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)
Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
at java.io.DataInputStream.readFully(DataInputStream.java:176)
at java.io.DataInputStream.readLong(DataInputStream.java:380)
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.internalClose(DFSClient.java:1057)
*******************************************************************
In the nutch-0.8.1/logs/hadoop.log file, I have this error:
2006-12-28 12:57:10,230 INFO crawl.Crawl - crawl started in: /opt/nutch-0.8.1/crawls/crawl-mywebsite
2006-12-28 12:57:10,232 INFO crawl.Crawl - rootUrlDir = /opt/nutch-0.8.1/urls/url-mywebsite
2006-12-28 12:57:10,232 INFO crawl.Crawl - threads = 10
2006-12-28 12:57:10,232 INFO crawl.Crawl - depth = 5
2006-12-28 12:57:10,236 INFO crawl.Injector - Injector: starting
2006-12-28 12:57:10,237 INFO crawl.Injector - Injector: crawlDb: /opt/nutch-0.8.1/crawls/crawl-mywebsite/crawldb
2006-12-28 12:57:10,237 INFO crawl.Injector - Injector: urlDir: /opt/nutch-0.8.1/urls/url-mywebsite
2006-12-28 12:57:10,237 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
2006-12-28 12:58:40,708 WARN fs.DFSClient - Problem renewing lease for DFSClient_1418597997: java.net.SocketTimeoutException: timed out waiting for rpc response
at org.apache.hadoop.ipc.Client.call(Client.java:312)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:150)
at org.apache.hadoop.dfs.$Proxy0.renewLease(Unknown Source)
at org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:437)
at java.lang.Thread.run(Thread.java:595)
2006-12-28 13:00:10,638 WARN fs.DFSClient - Error while writing.
java.io.IOException: failure closing block of file /opt/nutch-0.8.1/HDFS/mapred/system/submit_svdjjy/.job.jar.crc to node 192.168.1.2:50010
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.internalClose(DFSClient.java:1063)
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.endBlock(DFSClient.java:1027)
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.close(DFSClient.java:1105)
at java.io.FilterOutputStream.close(FilterOutputStream.java:143)
at java.io.FilterOutputStream.close(FilterOutputStream.java:143)
at java.io.FilterOutputStream.close(FilterOutputStream.java:143)
at org.apache.hadoop.fs.FSDataOutputStream$Summer.close(FSDataOutputStream.java:96)
at java.io.FilterOutputStream.close(FilterOutputStream.java:143)
at java.io.FilterOutputStream.close(FilterOutputStream.java:143)
at java.io.FilterOutputStream.close(FilterOutputStream.java:143)
at org.apache.hadoop.fs.FileUtil.copyContent(FileUtil.java:154)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:74)
at org.apache.hadoop.dfs.DistributedFileSystem.copyFromLocalFile(DistributedFileSystem.java:186)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:254)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
at org.apache.nutch.crawl.Injector.inject(Injector.java:138)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)
Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
at java.io.DataInputStream.readFully(DataInputStream.java:176)
at java.io.DataInputStream.readLong(DataInputStream.java:380)
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.internalClose(DFSClient.java:1057)
... 16 more
****************************************
In the hadoop-0.10.0/logs/hadoop-webadm-datanode-192.168.1.2.log file, I have this error:
2006-12-28 13:06:10,677 WARN org.apache.hadoop.dfs.DataNode: DataXCeiver
java.io.EOFException: EOF reading from Socket[addr=/192.168.1.2,port=37623,localport=50010]
at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:793)
at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:563)
at java.lang.Thread.run(Thread.java:595)
****************************************
In my /opt/nutch-0.8.1/conf/hadoop-env.sh and
/opt/hadoop-0.9.2/conf/hadoop-env.sh,
I have:
export HADOOP_HOME=/opt/hadoop-0.9.2
export JAVA_HOME=/logiciels/java/jdk1.5.0_10
export HADOOP_HEAPSIZE=2000
export HADOOP_LOG_DIR=${HADOOP_HOME}/logs
export HADOOP_SLAVES=${HADOOP_HOME}/conf/slaves
export HADOOP_PID_DIR=/opt/hadoop-0.9.2/pids
****************************************
In my /opt/nutch-0.8.1/conf/hadoop-site.xml and
/opt/hadoop-0.9.2/conf/hadoop-site.xml,
I have these values, each wrapped in the usual XML property tags of course (a sample of the exact form follows the list):
hadoop.tmp.dir = /opt/nutch-0.8.1/HDFS
fs.default.name = 192.168.1.2:9000
dfs.name.dir = /opt/nutch-0.8.1/HDFS/dfs/name
dfs.client.buffer.dir = /opt/nutch-0.8.1/HDFS/dfs/tmp
dfs.data.dir = /opt/nutch-0.8.1/HDFS/dfs/data
dfs.replication = 1
mapred.job.tracker = 192.168.1.2:9001
mapred.local.dir = /opt/nutch-0.8.1/HDFS/mapred/local
mapred.system.dir = /opt/nutch-0.8.1/HDFS/mapred/system
mapred.temp.dir = /opt/nutch-0.8.1/HDFS/mapred/temp
mapred.map.tasks = 2
mapred.reduce.tasks = 2
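Written out in hadoop-site.xml, the entries are in the standard Hadoop property form; I am only showing the first two here, the other properties follow the same pattern:

<?xml version="1.0"?>
<configuration>
  <!-- only the first two values from the list above are shown -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/nutch-0.8.1/HDFS</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>192.168.1.2:9000</value>
  </property>
  <!-- ... the remaining properties are written the same way ... -->
</configuration>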
In my conf/crawl-urlfilter.txt, I have this line to accept my website:
+^http://www.mywebsite.com
In my urls/url-mywebsite file, I have this seed URL:
http://www.mywebsite.com/index.htm
In my conf/nutch-site.xml, I have this line:
searcher.dir = crawls/crawl-mywebsite
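That is, in nutch-site.xml the property is written in the usual form:

<property>
  <name>searcher.dir</name>
  <value>crawls/crawl-mywebsite</value>
</property>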
Thanks in advance.
Yannick LE NY