As my data keeps growing, and the indexing time grows even faster, I'm
trying to switch from a "reindex everything at every crawl" model to an
incremental indexing one. I intend to keep the segments separate, but to
index only the segment fetched during the last cycle, and then merge the
indexes and perhaps the linkdb.
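
For reference, the per-cycle sequence I have in mind looks roughly like
this (just a sketch, not my actual script; the paths, and the way I pick
out the newest segment, are placeholders):

 $nutch_dir/nutch generate $crawl_dir/crawldb $crawl_dir/segments
 segment=$crawl_dir/segments/`ls $crawl_dir/segments | tail -1`
 $nutch_dir/nutch fetch $segment
 $nutch_dir/nutch updatedb $crawl_dir/crawldb $segment
 # ...update linkdb somehow -- see question 3 below...
 new_indexes=$crawl_dir/new_indexes
 $nutch_dir/nutch index $new_indexes $crawl_dir/crawldb $crawl_dir/linkdb $segment
 # ...merge $new_indexes into the existing index -- see question 2 below...

I have a few questions: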

1. In an incremental scenario, how do I remove from the indexes the
references to segments that have expired?

2. Looking at http://wiki.apache.org/nutch/MergeCrawl , it would appear
that I can call "bin/nutch merge" with only two parameters: the original
index directory as the destination, and the directory to be merged into
it:

 $nutch_dir/nutch merge $index_dir $new_indexes

But when I do that, the merged data are left in a subdirectory called
$index_dir/merge_output . Shouldn't I instead create a new, empty
destination directory, do the merge there, and then replace the original
with the newly merged directory:

 merged_indexes=$crawl_dir/merged_indexes
 rm -rf $merged_indexes # just in case it's already there
 $nutch_dir/nutch merge $merged_indexes $index_dir $new_indexes
 rm -rf $index_dir.old # just in case it's already there
 mv $index_dir $index_dir.old # keep the old index around until the swap is done
 mv $merged_indexes $index_dir # put the merged index in its place
 rm -rf $index_dir.old # now safe to discard the old index

3. Regarding the linkdb: rather than recreating it afresh from the whole
set of segments every time, does it make sense to run "$nutch_dir/nutch
invertlinks" on the latest segment only, and then merge the resulting
linkdb into the current one with "$nutch_dir/nutch mergelinkdb"? In other
words, can invertlinks work incrementally, or does it need a view of all
the segments in order to work correctly?
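
To make that concrete, here is the sort of thing I have in mind (just a
sketch; $latest_segment stands for the segment fetched during the last
cycle, and I'm assuming mergelinkdb takes the output directory as its
first argument, the way merge does):

 new_linkdb=$crawl_dir/new_linkdb
 merged_linkdb=$crawl_dir/merged_linkdb
 $nutch_dir/nutch invertlinks $new_linkdb $latest_segment
 $nutch_dir/nutch mergelinkdb $merged_linkdb $crawl_dir/linkdb $new_linkdb
 # then swap $merged_linkdb into place, as with the indexes above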

Thanks,

Enzo

