Hi All,
I have been using Nutch 0.7.1 for some time (although I am certainly not an
expert) and am now in the process of switching over to Nutch 0.8. However, I
have run into a couple of problems along the way and am hoping that those of
you who have been using Nutch 0.8 for a while will take a quick look at what
I have done and see if you can figure out why I am running into them.
Thanks ahead of time for any help you can offer!!
__________________________
The two problems I am having are essentially as follows (more detail
provided below):
1. So far I have been able to run a test crawl using "bin/nutch crawl", but
when I go to my Nutch search page (:8080) and try a search, I always get zero
results returned, even though I am able to open the index using Luke and
verify that there are approximately 200 documents and approximately 40,000
search terms in my index and there are no errors in the Tomcat logs.
2. I am unable to get through the whole-web crawl in the nutch-0.8 tutorial.
Specifically, I get stuck on the "bin/nutch invertlinks" step, where I get
the message:
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:342)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:203)
at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:305)
_______________________________
** Details **
These are the steps I took to install nutch 0.8.
1. Downloaded Nutch 0.8 (dev)
I was previously using the release copy of nutch 0.7.1, so this was the
first time I had to build a release of nutch using ant. I downloaded ant and
then installed the current trunk of nutch 0.8 (thinking it would be more
stable than the nightly build). To do this I did the following from my home
directory:
$ svn checkout http://svn.apache.org/repos/asf/lucene/nutch/trunk
$ mv trunk nutch-8d
$ export ANT_HOME=/usr/local/ant/apache-ant-1.6.5
$ export PATH=${PATH}:${ANT_HOME}/bin
$ cd nutch-8d
$ ant
2. Compiled Nutch 0.8 war file and then replaced ROOT Tomcat directory
I then did the following from my nutch-8d directory:
$ ant war
$ mv /usr/local/jakarta-tomcat-4.1.31/webapps/ROOT /usr/local/jakarta-tomcat-4.1.31/webapps/ROOT_nutch-0.7/
$ cp build/nutch-0.8-dev.war /usr/local/jakarta-tomcat-4.1.31/webapps/ROOT.war
3. Tried first Nutch 0.8 crawl using the CrawlTool
I first created a urls file at ../nutch-8d/test/urls and then set the
crawl-urlfilter.txt file to allow essentially all URLs.
I then did a round of fetching using the following call:
$ bin/nutch crawl test -dir crawl3 -depth 2 -topN 50
It seemed like everything worked correctly (although unlike nutch 0.7.1, no
output was generated).
I then did the following:
$ cd crawl3
$ /usr/local/jakarta-tomcat-4.1.31/bin/catalina.sh stop
Using CATALINA_BASE: /usr/local/jakarta-tomcat-4.1.31
Using CATALINA_HOME: /usr/local/jakarta-tomcat-4.1.31
Using CATALINA_TMPDIR: /usr/local/jakarta-tomcat-4.1.31/temp
Using JAVA_HOME: /usr/local/j2sdk1.4.2_08
$ /usr/local/jakarta-tomcat-4.1.31/bin/catalina.sh start
Using CATALINA_BASE: /usr/local/jakarta-tomcat-4.1.31
Using CATALINA_HOME: /usr/local/jakarta-tomcat-4.1.31
Using CATALINA_TMPDIR: /usr/local/jakarta-tomcat-4.1.31/temp
Using JAVA_HOME: /usr/local/j2sdk1.4.2_08
Everything seemed to be working correctly, but when I went to my nutch
search page (i.e. :8080), no matter what search term I enter, I get zero
results returned.
I then did the following to troubleshoot the situation:
1. Reviewed the tomcat logs (no error messages of any sort).
2. Looked at the following segments stats:
$ bin/nutch segread -list -dir crawl3/segments
NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
20060613200213  3          2006-06-13T20:02:20  2006-06-13T20:02:22  3        3
20060613200226  214        2006-06-13T20:02:32  2006-06-13T20:04:48  217      181
3. Opened the index I am trying to search using Luke, which allowed me to
verify that there are approximately 200 documents and approximately 40,000
search terms in my index (including search terms that were returning zero
results when I was searching for them).
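The segment stats from step 2 above can be totalled with a short awk filter, to compare against the document count Luke reports. This is only a sketch: it assumes each segment is printed on a single line with the six columns shown in the header, and `sum_segments` is a hypothetical helper name, not a Nutch command.

```shell
# Sketch only: sum the FETCHED and PARSED columns of a segment listing
# read from stdin. Assumes one segment per line with six columns:
# NAME GENERATED FETCHER-START FETCHER-END FETCHED PARSED.
sum_segments() {
    awk '$1 ~ /^[0-9]+$/ && NF == 6 { fetched += $5; parsed += $6 }
         END { printf "fetched=%d parsed=%d\n", fetched, parsed }'
}
```

Fed the two rows listed above, this prints `fetched=220 parsed=184`, which is roughly in line with the approximately 200 documents Luke shows.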
I HAVE NO IDEA WHY ZERO SEARCH RESULTS ARE ALWAYS BEING RETURNED -- PLEASE
HELP.
4. Trying a Whole-Web Crawl
After I couldn't figure out why I was always getting zero search results, I
tried to follow the instructions for a whole-web crawl, just for the hell of
it. Things seemed to be going fine until I got to the invertlinks step, at
which point I always get an error message. Below are the command calls that
I made (and the error message). Please let me know what I am doing wrong:
I first made sure that the test/urls and regex-urlfilter.txt files have
valid entries, which they do.
-bash-2.05b$ bin/nutch inject testcrawl/crawldb test
-bash-2.05b$ bin/nutch generate testcrawl/crawldb testcrawl/segments
-bash-2.05b$ s1=`ls -d testcrawl/segments/2* | tail -1`
-bash-2.05b$ echo $s1
testcrawl/segments/20060615190036
-bash-2.05b$ bin/nutch fetch $s1
-bash-2.05b$ bin/nutch updatedb testcrawl/crawldb $s1
-bash-2.05b$ bin/nutch generate testcrawl/crawldb testcrawl/segments -topN 100
-bash-2.05b$ s2=`ls -d testcrawl/segments/2* | tail -1`
-bash-2.05b$ echo $s2
testcrawl/segments/20060615190956
-bash-2.05b$ bin/nutch fetch $s2
-bash-2.05b$ bin/nutch updatedb testcrawl/crawldb $s2
-bash-2.05b$ bin/nutch invertlinks testcrawl/linkdb testcrawl/segments
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:342)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:203)
at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:305)
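To make the sequence easier to reproduce, the commands above can be wrapped in a small shell function that echoes each step and reports the first one that fails. This is only a sketch: `NUTCH`, `latest_segment`, `step` and `whole_web_crawl` are hypothetical names of my own, not part of Nutch, and the nutch subcommands and arguments are exactly the ones shown above.

```shell
#!/bin/sh
# Sketch only: replays the whole-web crawl steps from this message.
# NUTCH is a hypothetical variable pointing at the nutch launcher script.
NUTCH=${NUTCH:-bin/nutch}

# Print the newest segment directory under crawl dir $1. Segment names
# are timestamps, so the last one in sorted order is the most recent.
latest_segment() {
    ls -d "$1"/segments/2* | tail -1
}

# Run one nutch command, echoing it first; say which step failed.
step() {
    echo "+ $NUTCH $*"
    "$NUTCH" "$@" || { echo "FAILED: $NUTCH $*" >&2; return 1; }
}

# $1 = crawl directory (e.g. testcrawl), $2 = seed directory (e.g. test)
whole_web_crawl() {
    crawl=$1
    seeds=$2
    step inject "$crawl/crawldb" "$seeds" || return 1
    step generate "$crawl/crawldb" "$crawl/segments" || return 1
    s1=$(latest_segment "$crawl")
    step fetch "$s1" || return 1
    step updatedb "$crawl/crawldb" "$s1" || return 1
    step generate "$crawl/crawldb" "$crawl/segments" -topN 100 || return 1
    s2=$(latest_segment "$crawl")
    step fetch "$s2" || return 1
    step updatedb "$crawl/crawldb" "$s2" || return 1
    step invertlinks "$crawl/linkdb" "$crawl/segments"
}
```

Calling `whole_web_crawl testcrawl test` from the nutch-8d directory would run the same commands in the same order as the transcript above.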
Any Suggestions Are Much Appreciated!
Thanks,
Bryan