I'm having a problem where shouldFetch is rejecting everything. I deleted the crawl directory and restarted the entire crawl from scratch by running:

    rm -rf crawl
    mkdir crawl
    mkdir segments

I'm absolutely baffled by how this scheduler works. Is there documentation? Is the fetch time saved somewhere other than the crawl database?

I have tried changing these properties:

  - db.default.fetch.interval: lowered to 0
  - db.fetch.interval.default: many lower values
  - db.fetch.interval.max: different levels

With those changed, it crawls the top page over and over again. If I make them a little larger, it rejects the top page.

I'd really like to see how this Tika parser works, but I can't get any web pages into the crawl database. The last thing I tried was to remove the entire /opt/nutch-1.1 directory and start from scratch. It made no difference.

Is this a bug or am I doing something stupid?