cybercouf wrote:
If I'm not wrong, Nutch uses segments to store the parsed data, then
updates the crawldb from them, and finally builds an index.
But once the crawl is finished, does the next recrawl only need the latest
crawldb, and not my old segments?
And to build the new index, it only needs my new indexes and the old
index, not the old segments.
(And it seems the search engine part uses segments only for "show
page cache copy"?)
Deleting the segments would save a nice amount of space, but is my argument
right?
Well, your argument is actually not correct. The crawldb only holds
information about the crawl status of each URL, not its contents. And in
the index, the contents of a URL are not stored, just indexed. So how
would you generate summaries without the segments? You can delete the
segments only if you do not need them for cached copies or summaries.
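
To make the rule concrete, here is a minimal sketch in Python. It is only an
illustration of the reasoning above, not anything Nutch provides: the names
SearchNeeds and can_delete_segments are made up for this example, and it
assumes cached copies and summaries are the only search-time features that
read the segments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchNeeds:
    """Hypothetical description of what the search front end must serve."""
    cached_pages: bool  # "show cached copy" links
    summaries: bool     # result snippets built from the page text


def can_delete_segments(needs: SearchNeeds) -> bool:
    """Return True only if nothing at search time reads the segments.

    Assumption: the crawldb holds only crawl status per URL and the index
    holds only indexed (not stored) content, so page text lives solely in
    the segments.
    """
    return not (needs.cached_pages or needs.summaries)


if __name__ == "__main__":
    print(can_delete_segments(SearchNeeds(cached_pages=False, summaries=True)))
    print(can_delete_segments(SearchNeeds(cached_pages=False, summaries=False)))
```

In other words, old segments are safe to remove only when both flags are
false for your setup.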