http://wiki.media-style.com/display/nutchDocu/Home


Roeland Weve wrote:

Hi,

I've installed Nutch 0.7.1 today on Windows XP with Cygwin and tried to follow the tutorial at:
http://lucene.apache.org/nutch/tutorial.html
But this tutorial seems to be written for another version of Nutch. First of all, the DmozParser is not available (I couldn't find it in the nutch-0.7.1.jar file, neither under 'crawl' nor 'tools' nor anywhere else):
java.lang.NoClassDefFoundError: org/apache/nutch/crawl/DmozParser
java.lang.NoClassDefFoundError: org/apache/nutch/tools/DmozParser
Since I'm not really interested in Dmoz data, I continued by injecting URLs of my own into the database (in the dmoz dir, the file is called 'urls', with one URL per line). Unfortunately, I got stuck again. I tried to execute:
bin/nutch inject crawl/crawldb dmoz
The error is:
060225 212634 parsing file:/D:/cygwin/home/roeland/nutch-0.7.1/conf/nutch-default.xml
060225 212635 parsing file:/D:/cygwin/home/roeland/nutch-0.7.1/conf/nutch-site.xml
Usage: WebDBInjector (-local | -ndfs <namenode:port>) <db_dir> (-urlfile <url_file> | -dmozfile <dmoz_file>) [-subset <subsetDenominator>] [-includeAdultMaterial] [-skew skew] [-noDmozDesc] [-topicFile <topic list file>] [-topic <topic> [-topic <topic> [...]]]

So I tried to adjust the parameters, with something like:
bin/nutch inject crawl/crawldb -urlfile dmoz/urls
But this leads to an exception:
Exception in thread "main" java.io.FileNotFoundException: crawl\crawldb\webdb\pagesByURL\data

There are some files in the crawldb dir, but there is no webdb dir. Is it possible to create an empty or default database? Or do I need Nutch 0.8? If so, where can I download it? Hopefully this can be done with Nutch 0.7.1, because I'm no hero at compiling stuff on Cygwin.

The only thing I want is to inject URLs from a plain text file with one URL per line. The next step is to crawl those URLs. The URLs are all different, so I am not interested in the intranet option of Nutch.

Hopefully someone can help me out with this problem.

Roeland


