Invalid indexes are generated {newbie question}

Please help if you can. I am trying to get Nutch to work from Java. I want to crawl a web page, generate Lucene indexes, and then query them with the NutchBean. I found an example in the Nutch distribution and have it working, or so I thought.

I am executing org.apache.nutch.crawl.Crawl. The code seems to run fine, but it does not seem to use my URLs. I get the following directory:

C:\EclipseWorkspaces\Nutch\crawl-20090225143955\crawldb\current\part-00000

This directory contains the following files: .data.crc, .index.crc, data, and index. All are 1MB. I was doing a very shallow crawl, but even so...

Excited, I then tried to open them in Luke, but could not: "There is no valid Lucene index in this directory".

The output ends with:

2009-02-25 15:08:56,899 WARN crawl.Generator (Generator.java:generate(425)) - Generator: 0 records selected for fetching, exiting ...
2009-02-25 15:10:50,670 INFO crawl.Crawl (Crawl.java:main(144)) - Stopping at depth=0 - no more URLs to fetch.
2009-02-25 15:11:20,280 WARN crawl.Crawl (Crawl.java:main(161)) - No URLs to fetch - check your seed list and URL filters.

My crawlURLFilter.txt file contains:

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*apache.org/

This is my filter, right?

I also have a directory, urlsDir, which contains one file holding the string "http://ant.apache.org/" followed by a blank line. This is my seed list, right? I know Nutch is reading the file in urlsDir, because if I remove the http:// it complains about an unknown protocol.

I am running V0.9. I know it is something small, but I just don't see it.

Thanks in advance.
ray
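
One way to sanity-check the filter rule against the seed, independent of Nutch, is a small sketch like the one below. It does not call any Nutch API. It assumes the filter file is read as an ordered list of rules, each a "+" or "-" sign followed by a Java regular expression, that the first rule found anywhere in the URL decides the outcome, and that a URL matching no rule is rejected. The class name SeedFilterCheck and the method accepts are made up for this illustration.

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone check; it only mimics the assumed rule format
// ("+regex" accepts, "-regex" rejects, first match wins).
public class SeedFilterCheck {

    // Returns true if the first matching rule starts with '+'.
    // Returns false if the first matching rule starts with '-'
    // or if no rule matches at all (assumed default: reject).
    public static boolean accepts(String[] rules, String url) {
        for (String rule : rules) {
            String line = rule.trim();
            // Skip blank lines and comments.
            if (line.length() == 0 || line.startsWith("#")) {
                continue;
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                continue; // not a rule in the assumed format
            }
            String regex = line.substring(1);
            // find() looks for the pattern anywhere in the URL;
            // the leading ^ in the rule anchors it to the start.
            if (Pattern.compile(regex).matcher(url).find()) {
                return sign == '+';
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] rules = {
            "# accept hosts in MY.DOMAIN.NAME",
            "+^http://([a-z0-9]*\\.)*apache.org/"
        };
        // The seed from urlsDir.
        System.out.println(accepts(rules, "http://ant.apache.org/"));
    }
}
```

Under those assumptions the seed passes the rule (the program prints true), which would suggest the pattern itself is not what rejects the URL. Whether the crawl actually reads this particular filter file is a separate question that the sketch does not answer.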