I am having a little trouble getting Nutch running and would appreciate any help. I am using Nutch 0.8.

I have altered my conf/crawl-urlfilter.txt to accept my local server:

    # accept hosts in MY.DOMAIN.NAME
    +^http://philfedora5:8080/

When it came to the urls file, I was a little uncertain what to create. When I created only a "urls" file, I received a message that it was invalid. I managed to get Nutch running by creating a "urls" folder with a "nutch" file in it, containing the root URL from which to populate the initial fetchlist.

Contents of the "nutch" file:

    http://philfedora5:8080/tinysite/A.html

I then run:

    bin/nutch crawl urls -dir crawl-tinysite -depth 3 -topN 50

A crawl-tinysite folder is created. I then run:

    bin/nutch readdb crawl-tinysite/crawldb/ -stats

There is a bit of churning, then nothing is printed and I am returned to the prompt. I then run:

    bin/nutch readdb crawl-tinysite/crawldb/ -dump dumpfile

Inside the dumpfile folder I find part-000000, whose contents are:

    http://philfedora5:8080/tinysite/A.html Version: 4
    Status: 1 (DB_unfetched)
    Fetch time: Thu Aug 24 14:36:06 CEST 2006
    Modified time: Thu Jan 01 01:00:00 CET 1970
    Retries since fetch: 0
    Retry interval: 30.0 days
    Score: 1.0
    Signature: null
    Metadata: null

Shouldn't there also be records for B.html, C.html and C-duplicate.html? Also, this looks suspicious:

> Status: 1 (DB_unfetched)

I have been using these tutorials:

http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html
http://lucene.apache.org/nutch/tutorial8.html

Any help would be appreciated.

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
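
For reference, the setup steps described above can be sketched as a small shell script. This is only a rough reproduction, assuming it is run from the Nutch 0.8 install directory. The variable names SEED_URL and FILTER_PATTERN are my own, and the grep check is only an approximation of the URL filter, since Nutch evaluates its patterns as Java regular expressions rather than with grep.

```shell
#!/bin/sh
# Sketch of the seed setup from this post. SEED_URL and FILTER_PATTERN are
# illustrative names, not anything Nutch itself defines.
set -e

SEED_URL="http://philfedora5:8080/tinysite/A.html"
# Accept pattern from conf/crawl-urlfilter.txt, without the leading "+".
FILTER_PATTERN='^http://philfedora5:8080/'

# In my setup a plain "urls" file was rejected, so the seed list lives in
# a file inside a "urls" directory.
mkdir -p urls
printf '%s\n' "$SEED_URL" > urls/nutch

# Rough sanity check (grep -E stands in for Nutch's Java regex filter):
# the seed URL should match the accept pattern.
if printf '%s\n' "$SEED_URL" | grep -Eq "$FILTER_PATTERN"; then
    echo "seed URL matches filter"
else
    echo "seed URL rejected by filter"
fi

# Crawl and print crawldb statistics, but only when a Nutch launcher is
# actually present in the current directory.
if [ -x bin/nutch ]; then
    bin/nutch crawl urls -dir crawl-tinysite -depth 3 -topN 50
    bin/nutch readdb crawl-tinysite/crawldb/ -stats
fi
```

If the check reports a mismatch, the seed would presumably be dropped by the filter before anything is fetched, which is one thing worth ruling out here.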
