nutch - start me up - help please

Philip Brown Thu, 24 Aug 2006 06:45:02 -0700

I am having a little trouble getting Nutch running and would appreciate any help.

I am using Nutch 0.8.

I have altered my conf/crawl-urlfilter.txt to accept my local server:


# accept hosts in MY.DOMAIN.NAME
+^http://philfedora5:8080/
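
As a sanity check that this rule matches my seed URL, a quick shell sketch like the one below can be used. It assumes that grep -E treats this simple pattern the same way Nutch's regex filter does, which I have not verified in general; the variable names are only for illustration.

```shell
# Pattern copied from conf/crawl-urlfilter.txt, without the leading "+".
pattern='^http://philfedora5:8080/'
# Seed URL that the crawler should accept.
seed='http://philfedora5:8080/tinysite/A.html'

# grep -E is only a stand-in for Nutch's own regex matching (assumption).
if printf '%s\n' "$seed" | grep -Eq "$pattern"; then
  filter_result="accepted"
else
  filter_result="rejected"
fi
echo "$filter_result"
```

If this reports "rejected", the filter line itself would be the first thing to look at.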

When it came to the urls file, I was a little uncertain what to create.

When I created only a "urls" file, I received a message that it was invalid. I managed to get Nutch running by creating a "urls" folder with a "nutch" file in it, containing the root URL from which to populate the initial fetchlist.


Contents of the "nutch" file:
http://philfedora5:8080/tinysite/A.html
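
For reference, the seed layout described above can be recreated with a few shell commands. This is only a sketch of what I did by hand; it assumes the commands are run from the Nutch install directory, so that "urls" sits next to "bin".

```shell
# Create the seed directory and a single seed file inside it.
mkdir -p urls
printf '%s\n' 'http://philfedora5:8080/tinysite/A.html' > urls/nutch
# Show what the crawler will read as its initial URL list.
cat urls/nutch
```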

I then run:
bin/nutch crawl urls -dir crawl-tinysite -depth 3  -topN 50

A crawl-tinysite folder is created.

I then run:
bin/nutch readdb crawl-tinysite/crawldb/ -stats
There is a bit of churning, then nothing is printed and I am returned to the prompt.

I then run:
bin/nutch readdb crawl-tinysite/crawldb/ -dump dumpfile

Inside the dumpfile folder I find part-000000, whose contents are:

http://philfedora5:8080/tinysite/A.html    Version: 4
Status: 1 (DB_unfetched)
Fetch time: Thu Aug 24 14:36:06 CEST 2006
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.0
Signature: null
Metadata: null

Shouldn't there also be records for B.html, C.html and C-duplicate.html?
Also, this line looks suspicious: Status: 1 (DB_unfetched)

I have been using these tutorials:
http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html
http://lucene.apache.org/nutch/tutorial8.html
Any help would be appreciated.
