I've been wondering about this problem. When you did the invertlinks
and index steps, did you run them on just the current/most recent
segment or on all the segments?

Presumably this is why you tried to do a merge?
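
For comparison, here is a rough sketch of what running both steps over
every segment might look like. It assumes the usual crawl directory
layout (crawldb, linkdb, segments, indexes under one directory, here
called "crawl") and the Nutch 1.0 command usage as I remember it, so
check it against the usage output of bin/nutch before relying on it.
The function name reindex_cmds is made up for illustration, and it only
prints the commands rather than running them.

```shell
#!/bin/sh
# Dry run: print the invertlinks and index commands that cover ALL segments.
# Assumption: $1 is a crawl directory holding crawldb/, linkdb/, segments/
# and indexes/ (hypothetical layout; adjust to your own paths).
reindex_cmds() {
    crawl="$1"
    # -dir tells invertlinks to read every segment under segments/,
    # rather than a single named segment.
    echo "bin/nutch invertlinks $crawl/linkdb -dir $crawl/segments"
    # The glob stays unexpanded inside the quotes so the line can be
    # inspected first; when actually run, the shell expands it to one
    # argument per segment directory.
    echo "bin/nutch index $crawl/indexes $crawl/crawldb $crawl/linkdb $crawl/segments/*"
}

reindex_cmds crawl
```

If the index step was only given the newest segment, that might explain
the small index files, but that is a guess until we know what was run.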

Alex

2009/8/10 Paul Tomblin <ptomb...@xcski.com>:
> After applying the patch I sent earlier, I got it so that it correctly
> skips downloading pages that haven't changed.  And after doing the
> generate/fetch/updatedb loop, and merging the segments with mergesegs,
> dumping the segment file seems to show that it still has the old
> content as well as the new content.  But when I then ran the
> invertlinks and index step, the resulting index consists of very small
> files compared to the files from the previous crawl, indicating that
> it only indexed the stuff that it had newly fetched.  I tried the
> NutchBean, and sure enough it could only find things I knew were on
> the newly loaded pages, and couldn't find things that occur hundreds
> of times on the pages that haven't changed.  "merge" doesn't seem to
> help, since the resulting merged index is still the same size as
> before merging.
