Does not locate my urls or filter problem.

Lukas, Ray Wed, 25 Feb 2009 12:40:31 -0800

Invalid indexes are generated {newbie question}

Could you please help? I am trying to get Nutch to work from Java. I
wish to crawl a web page, generate Lucene indexes, and then use the
NutchBean to query them. I located an example in the Nutch distribution
and have it working, or so I thought.
I am executing org.apache.nutch.crawl.Crawl. The code seems to run
fine, but does not seem to use my URLs.
I get the following directory:
C:\EclipseWorkspaces\Nutch\crawl-20090225143955\crawldb\current\part-00000
This directory contains the following files:
.data.crc, .index.crc, data, and index. All are 1MB. I was doing a very
shallow search, but even so...
Excited, I then attempted to open them in Luke, but I am not able to:
"There is no valid Lucene index in this directory".
The output ends with:
2009-02-25 15:08:56,899 WARN  crawl.Generator (Generator.java:generate(425)) - Generator: 0 records selected for fetching, exiting ...
2009-02-25 15:10:50,670 INFO  crawl.Crawl (Crawl.java:main(144)) - Stopping at depth=0 - no more URLs to fetch.
2009-02-25 15:11:20,280 WARN  crawl.Crawl (Crawl.java:main(161)) - No URLs to fetch - check your seed list and URL filters.


My crawlURLFilter.txt file contains: 
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*apache.org/
This is my filter, right?

And I have a directory urlsDir which contains one file holding the
string "http://ant.apache.org/" followed by a blank line. This is my
seed list, right?
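
One way to rule out the pattern itself is to test it against the seed
outside of Nutch. The following is a minimal standalone sketch, not
Nutch's own filter code: the class FilterCheck and the method applyRule
are hypothetical names, and it assumes a rule line is a '+' or '-' sign
followed by a regular expression that only needs to match part of the
URL.

```java
import java.util.regex.Pattern;

public class FilterCheck {
    // Hypothetical helper (not a Nutch API): applies one "+regex" or
    // "-regex" rule line to a URL.
    // Returns TRUE if the rule matches and accepts, FALSE if it matches
    // and rejects, and null if the rule does not match the URL at all.
    static Boolean applyRule(String rule, String url) {
        char sign = rule.charAt(0);
        Pattern pattern = Pattern.compile(rule.substring(1));
        // Assumption: partial-match semantics (find), not a full match.
        if (!pattern.matcher(url).find()) {
            return null;
        }
        return sign == '+';
    }

    public static void main(String[] args) {
        // Rule and seed copied from the filter file and seed file above.
        String rule = "+^http://([a-z0-9]*\\.)*apache.org/";
        String seed = "http://ant.apache.org/";
        System.out.println("rule accepts seed: " + applyRule(rule, seed));
    }
}
```

Under those assumptions the rule accepts the seed, so the pattern and
the seed agree with each other. The check says nothing about whether
Nutch actually reads crawlURLFilter.txt or the seed file as intended.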

I know it is reading my urlsDir file: if I remove the http://, Nutch
complains about an unknown protocol.

I am running V0.9. I know it is something small, but I just don't see
it.
Thanks in advance.
ray
