As the size of my data keeps growing, and the indexing time grows even
faster, I'm trying to switch from a "reindex all at every crawl" model to an
incremental indexing one. I intend to keep the segments separate, but I
want to index only the segment fetched during the last cycle, and then merge
indexes and perhaps linkdb. I have a few questions:
1. In an incremental scenario, how do I remove references to expired
segments from the indexes?
2. Looking at http://wiki.apache.org/nutch/MergeCrawl , it appears that
I can call "bin/nutch merge" with only two parameters: the original index
directory as the destination, and the directory to be merged into it:
$nutch_dir/nutch merge $index_dir $new_indexes
But when I do that, the merged data end up in a subdirectory called
$index_dir/merge_output . Shouldn't I instead create a new, empty
destination directory, do the merge, and then replace the original with the
newly merged directory:
merged_indexes=$crawl_dir/merged_indexes
rm -rf $merged_indexes # just in case it's already there
$nutch_dir/nutch merge $merged_indexes $index_dir $new_indexes
rm -rf $index_dir.old # just in case it's already there
mv $index_dir $index_dir.old
mv $merged_indexes $index_dir
rm -rf $index_dir.old
3. Regarding the linkdb: does it make sense to run "$nutch_dir/nutch
invertlinks" on the latest segment only, and then merge the resulting
linkdb into the current one with "$nutch_dir/nutch mergelinkdb", rather
than recreating the linkdb afresh from the whole set of segments every
time? In other words, can invertlinks work incrementally, or does it need a
view of all segments in order to work correctly?
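If incremental inversion is valid, the workflow I have in mind would look
roughly like this (just a sketch: $last_segment is a placeholder for the
segment fetched in the last cycle, and I haven't verified the exact
argument order mergelinkdb expects):
new_linkdb=$crawl_dir/linkdb_new
merged_linkdb=$crawl_dir/linkdb_merged
# invert links from the latest segment only
$nutch_dir/nutch invertlinks $new_linkdb $last_segment
# merge the new linkdb with the existing one into a fresh directory
$nutch_dir/nutch mergelinkdb $merged_linkdb $crawl_dir/linkdb $new_linkdb
# swap in the merged linkdb, same as with the indexes above
mv $crawl_dir/linkdb $crawl_dir/linkdb.old
mv $merged_linkdb $crawl_dir/linkdb
rm -rf $crawl_dir/linkdb.old $new_linkdb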
Thanks,
Enzo