Hello Olive,

Quoting Olive g <[EMAIL PROTECTED]>:

Hi Monu,

Thank you for your help. I double-checked and I have plenty of disk space, and /tmp was not filled up either. For my test case, I used only 200 URLs.

No problem. Don't let me pretend to be an expert; I'm just on a different part
of the same steep learning curve :)

Also, is the string "670052811" in the path right? I did not see any directory /user/root/test/crawldb/670052811/, although /user/root/test/crawldb/part-00000/data was there. Or was it just some temporary directory used by Nutch? If so, why would it fail when I had plenty of free space?

The sequence of generate, fetch, updatedb, invertlinks works for me.  I index
later.
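
For what it's worth, one full round in my setup looks roughly like this. It's only a sketch using my own path names (crawl/db, segments, linkdb), so adjust them to whatever you use; the segment name is whatever the generate step printed.

Generate the next batch of URLs from the crawldb:
# bin/nutch generate crawl/db segments -topN 1250000

Fetch the segment that generate just created:
# bin/nutch fetch segments/20060330035131

Fold the fetch results back into the crawldb:
# bin/nutch updatedb crawl/db segments/20060330035131

Invert the links for that segment (linkdb is just the name I happen to use):
# bin/nutch invertlinks linkdb segments/20060330035131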

The structure of the segments "tree" looks like this in my case:

segments/20060330035131
segments/20060330035131/content
segments/20060330035131/crawl_fetch
segments/20060330035131/crawl_generate
segments/20060330035131/crawl_parse
segments/20060330035131/parse_data
segments/20060330035131/parse_text

Here, the name of each segment is derived from the date and time, and this seems
to be the default behaviour of Nutch 0.8 with Hadoop 0.1:

segments/20060330035131/parse_text/part-00000
segments/20060330035131/parse_text/part-00001
segments/20060330035131/parse_text/part-00002
segments/20060330035131/parse_text/part-00003
segments/20060330035131/parse_text/part-00004
segments/20060330035131/parse_text/part-00005
segments/20060330035131/parse_text/part-00006
segments/20060330035131/parse_text/part-00007
segments/20060330035131/parse_text/part-00008
segments/20060330035131/parse_text/part-00009
segments/20060330035131/parse_text/part-00010
segments/20060330035131/parse_text/part-00011
segments/20060330035131/parse_text/part-00012

As you can see above, I haven't had a problem with the number of "parts". Again,
the above was created with the default behaviour, using commands such as:

# bin/nutch generate crawl/db segments -topN 1250000

and

# bin/nutch fetch segments/20060330035131

I don't know where this error comes from and maybe someone else can shed some
light on it.

java.rmi.RemoteException: java.io.IOException: Cannot create file /user/root/test/crawldb/670052811/part-00000/data on client DFSClient_-1133147307 at

How many reduce and map tasks did you use? I have been struggling with this issue for a while, and it seems that Nutch can't handle more than 5 parts.

I am using a cluster of 1 x jobtracker and 6 x tasktrackers. Each has a single
Xeon 3 GHz processor, 2 GB of RAM, Gigabit Ethernet (over copper) and twin 400 GB
WD4000KD disks LVM'ed together.

In this configuration I've had the best performance using:

mapred.map.tasks - 61 (because the book says approx 10 x tasktrackers)
mapred.reduce.tasks - 6 (because it seems to work faster than 2 x tasktrackers)
mapred.tasktracker.tasks.maximum - 1 (because that's how many processors I have)

BTW, I got the last two figures from a conversation between YOU and Doug! :)
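
In case it's useful, those three settings sit in my hadoop-site.xml and look roughly like this (same property names as above; the values are just what happens to work on my 6-node setup, not recommendations):

<property>
 <name>mapred.map.tasks</name>
 <value>61</value>
</property>

<property>
 <name>mapred.reduce.tasks</name>
 <value>6</value>
</property>

<property>
 <name>mapred.tasktracker.tasks.maximum</name>
 <value>1</value>
</property>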

Good luck,

Monu

Because of this, I am not able to run incremental crawling. Please see my previous message:
http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg04150.html

Does anybody have any insight?
Thanks!

Olive



From: [EMAIL PROTECTED]
Reply-To: [email protected]
To: [email protected], Olive g <[EMAIL PROTECTED]>
Subject: Re: Could someone please share your experience with 0.8 step-by-step crawl??
Date: Tue, 18 Apr 2006 16:36:24 +0100

Hello Olive,

IIRC I got a similar message when the /tmp partition on my disks filled up. I then reconfigured the locations of all the directories in hadoop-site.xml to a
more spacious area of my disk.

Hope that helps; see below:

<property>
 <name>dfs.name.dir</name>
 <value>/home/nutch/hadoop/dfs/name</value>
</property>

<property>
 <name>dfs.data.dir</name>
 <value>/home/nutch/hadoop/dfs/data</value>
</property>

<property>
 <name>mapred.local.dir</name>
 <value>/home/nutch/hadoop/mapred/local</value>
</property>

<property>
 <name>mapred.system.dir</name>
 <value>/home/nutch/hadoop/mapred/system</value>
</property>

<property>
 <name>mapred.temp.dir</name>
 <value>/home/nutch/hadoop/mapred/temp</value>
</property>
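
After moving everything out of /tmp I also do a quick sanity check that the new location really has the space I think it does (plain Unix, nothing Nutch-specific):

# df -h /home/nutch/hadoop
# du -sh /home/nutch/hadoop/dfs /home/nutch/hadoop/mapred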


Quoting Olive g <[EMAIL PROTECTED]>:

Hi,

Are you guys able to run step-by-step crawl on 0.8 successfully?

I am using Nutch 0.8 (3/31 build) with DFS. I followed the 0.8 tutorial for step-by-step crawling and got errors from updatedb. I used two reduce tasks and two map tasks. Here are the exact steps that I did:

1. bin/nutch inject test/crawldb urls
2. bin/nutch generate test/crawldb test/segments
3. bin/nutch fetch test/segments/20060415143555
4. bin/nutch updatedb test/crawldb test/segments/20060415143555

Fetch one more round:
5. bin/nutch generate test/crawldb test/segments -topN 100
6. bin/nutch fetch test/segments/20060415150130
7. bin/nutch updatedb test/crawldb test/segments/20060415150130

Fetch one more round:
8. bin/nutch generate test/crawldb test/segments -topN 100
9. bin/nutch fetch test/segments/20060415151309

The steps above ran successfully, and I kept checking the directories in DFS and
running nutch readdb; everything appeared to be fine.
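
The checks I ran were along these lines (from memory, so the exact flags may be slightly off):

bin/hadoop dfs -ls test/crawldb
bin/hadoop dfs -ls test/segments
bin/nutch readdb test/crawldb -stats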

Then:
10. bin/nutch updatedb test/crawldb test/segments/20060415151309

It failed with the following error on both reduce tasks (the log below is from one
of them):

java.rmi.RemoteException: java.io.IOException: Cannot create file /user/root/test/crawldb/670052811/part-00000/data on client DFSClient_-1133147307
    at org.apache.hadoop.dfs.NameNode.create(NameNode.java:137)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:615)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:237)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:216)
    at org.apache.hadoop.ipc.Client.call(Client.java:303)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:141)
    at org.apache.hadoop.dfs.$Proxy1.create(Unknown Source)
    at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:587)
    at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.<init>(DFSClient.java:554)
    at org.apache.hadoop.dfs.DFSClient.create(DFSClient.java:99)
    at org.apache.hadoop.dfs.DistributedFileSystem.createRaw(DistributedFileSystem.java:83)
    at org.apache.hadoop.fs.FSDataOutputStream$Summer.<init>(FSDataOutputStream.java:39)
    at org.apache.hadoop.fs.FSDataOutputStream.<init>(FSDataOutputStream.java:128)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:180)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:168)
    at org.apache.hadoop.io.SequenceFile$Writer.<init>(SequenceFile.java:96)
    at org.apache.hadoop.io.MapFile$Writer.<init>(MapFile.java:101)
    at org.apache.hadoop.io.MapFile$Writer.<init>(MapFile.java:76)
    at org.apache.hadoop.mapred.MapFileOutputFormat.getRecordWriter(MapFileOutputFormat.java:38)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:265)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:709)


Anything wrong with my steps? Is this a known bug?

Thank you for your help.

Olive
