I should add that what I really want to do is toss all previous crawl information and reindex everything every night. It's just a few servers and very low impact. My crawl on 1.0 takes about 10 minutes.
On Thu, Apr 22, 2010 at 4:59 AM, Phil Barnett <ph...@philb.us> wrote:
> I'm having a problem where shouldFetch is rejecting everything. I have
> deleted the crawl directory and started the entire crawl from scratch with:
>
> rm -rf crawl
> mkdir crawl
> mkdir segments
>
> I'm absolutely baffled by how this scheduler works.
>
> Is there documentation?
>
> Is the fetch time saved somewhere other than the crawl database?
>
> I have tried lowering:
>
> db.default.fetch.interval to 0
> db.fetch.interval.default to many lower values
> db.fetch.interval.max to different levels
>
> With those changed, it crawls the top page over and over again. If I make
> them a little larger, it rejects the top page.
>
> I'd really like to see how this Tika parser works, but I can't get any web
> pages into the crawl database.
>
> The last thing I tried was removing the entire /opt/nutch-1.1 directory
> and starting from scratch. It made no difference.
>
> Is this a bug, or am I doing something stupid?
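For what it's worth, one way to set up the nightly toss-and-recrawl you describe is to override the fetch interval in conf/nutch-site.xml (values in nutch-default.xml should be left alone) and then discard the crawl directory before each run. A rough sketch, assuming the standard Nutch 1.x layout and a seed list in urls/ — the interval value and crawl depth here are illustrative, and the sketch writes to a temp dir so it's self-contained:

```shell
# Sketch only: write the fetch-interval override the way nutch-site.xml
# would carry it. db.fetch.interval.default is in seconds; 86400 = 1 day,
# so pages become eligible for refetch by the next nightly run.
conf_dir=$(mktemp -d)
cat > "$conf_dir/nutch-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>db.fetch.interval.default</name>
    <value>86400</value>
  </property>
</configuration>
EOF
cat "$conf_dir/nutch-site.xml"

# The nightly cron job would then throw away all previous crawl state and
# recrawl from the seed list (depth/topN are assumptions for a small site):
#   rm -rf crawl
#   bin/nutch crawl urls -dir crawl -depth 3 -topN 1000
```

Since the crawldb is deleted each night, the scheduler never sees a prior fetch time at all, so shouldFetch has nothing to reject; the interval setting only matters if you ever recrawl without wiping state.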