I've been wondering about this problem. When you did the invertlinks and index steps, did you run them just on the current (most recent) segment, or on all the segments?
Presumably this is why you tried to do a merge?

Alex

2009/8/10 Paul Tomblin <ptomb...@xcski.com>:
> After applying the patch I sent earlier, I got it so that it correctly
> skips downloading pages that haven't changed. And after doing the
> generate/fetch/updatedb loop, and merging the segments with mergesegs,
> dumping the segment file seems to show that it still has the old
> content as well as the new content. But when I then ran the
> invertlinks and index steps, the resulting index consists of very small
> files compared to the files from the previous crawl, indicating that
> it only indexed the stuff it had newly fetched. I tried the
> NutchBean, and sure enough it could only find things I knew were on
> the newly loaded pages, and couldn't find things that occur hundreds
> of times on the pages that haven't changed. "merge" doesn't seem to
> help, since the resulting merged index is still the same size as
> before merging.
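For comparison, running the link inversion and indexing steps across every segment (rather than only the newest one) would look roughly like this. This is a sketch assuming a standard Nutch 0.9/1.0-style crawl directory under ./crawl; the paths are illustrative, not taken from your setup.

```shell
# Invert links over ALL segments via -dir, not just the latest segment:
bin/nutch invertlinks crawl/linkdb -dir crawl/segments

# Index every segment so pages fetched in earlier rounds stay searchable:
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
```

If only the most recent segment is passed to these steps, the resulting index will contain just the newly fetched pages, which would match the symptoms you describe.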