Hi,
I've installed Nutch 0.7.1 today on Windows XP with Cygwin and tried to
follow the tutorial at:
http://lucene.apache.org/nutch/tutorial.html
But this tutorial seems to be written for another version of Nutch.
First of all, the DmozParser is not available (I couldn't find
it in the nutch-0.7.1.jar file, not under 'crawl', 'tools' or anywhere
else):
java.lang.NoClassDefFoundError: org/apache/nutch/crawl/DmozParser
java.lang.NoClassDefFoundError: org/apache/nutch/tools/DmozParser
Since I'm not really interested in Dmoz data, I continued by injecting
URLs of my own into the database (the file is called 'urls', in the
dmoz dir, with one URL per line). Unfortunately, I got stuck again. I
tried to execute:
bin/nutch inject crawl/crawldb dmoz
The error is:
> 060225 212634 parsing file:/D:/cygwin/home/roeland/nutch-0.7.1/conf/nutch-default.xml
> 060225 212635 parsing file:/D:/cygwin/home/roeland/nutch-0.7.1/conf/nutch-site.xml
> Usage: WebDBInjector (-local | -ndfs <namenode:port>) <db_dir>
(-urlfile <url_file> | -dmozfile <dmoz_file>) [-subset
<subsetDenominator>] [-includeAdultMaterial] [-skew skew] [-noDmozDesc]
[-topicFile <topic list file>] [-topic <topic> [-topic <topic> [...]]]
So I tried to adjust the parameters, with something like:
> bin/nutch inject crawl/crawldb -urlfile dmoz/urls
But this leads to an exception:
Exception in thread "main" java.io.FileNotFoundException:
crawl\crawldb\webdb\pagesByURL\data
There are some files in the crawldb dir, but there is no webdb dir. Is
there a way to create an empty or default database? Or do I need Nutch
0.8? If so, where can I download it?
Hopefully this can be done with Nutch 0.7.1, because I'm no hero at
compiling stuff on Cygwin.
The only thing I want is to inject URLs from a plain text file with one
URL per line. The next step is to crawl those URLs. The URLs are all
different, so I am not interested in the intranet option of Nutch.
Hopefully someone can help me out with this problem.
Roeland