As the size of my data keeps growing, and the indexing time grows even
faster, I'm trying to switch from a "reindex all at every crawl" model to an
incremental indexing one. I intend to keep the segments separate, but I
want to index only the segment fetched during the last cycle, and then merge
indexes and perhaps linkdb. I have a few questions:
1. In an incremental scenario, how do I remove references to expired
segments from the indexes?
2. Looking at http://wiki.apache.org/nutch/MergeCrawl , it appears that
I can call "bin/nutch merge" with only two parameters: the original index
directory as the destination, and the directory to be merged into it:
$nutch_dir/nutch merge $index_dir $new_indexes
But when I do that, the merged data end up in a subdirectory called
$index_dir/merge_output . Shouldn't I instead create a new, empty
destination directory, do the merge, and then replace the original with the
newly merged directory:
merged_indexes=$crawl_dir/merged_indexes
rm -rf $merged_indexes # just in case it's already there
$nutch_dir/nutch merge $merged_indexes $index_dir $new_indexes
rm -rf $index_dir.old # just in case it's already there
mv $index_dir $index_dir.old
mv $merged_indexes $index_dir
rm -rf $index_dir.old
3. Regarding the linkdb: does it make sense to run "$nutch_dir/nutch
invertlinks" on the latest segment only, and then merge the resulting
linkdb into the current one with "$nutch_dir/nutch mergelinkdb", rather
than recreating the linkdb afresh from the whole set of segments every
time? In other words, can invertlinks work incrementally, or does it need a
view of all segments in order to work correctly?
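If incremental inversion is valid, the workflow I have in mind would look
roughly like this (just a sketch: $last_segment is a placeholder for the
segment fetched in the last cycle, and I haven't verified the exact
argument order mergelinkdb expects):
new_linkdb=$crawl_dir/linkdb_new
merged_linkdb=$crawl_dir/linkdb_merged
# invert links from the latest segment only
$nutch_dir/nutch invertlinks $new_linkdb $last_segment
# merge the new linkdb with the existing one into a fresh directory
$nutch_dir/nutch mergelinkdb $merged_linkdb $crawl_dir/linkdb $new_linkdb
# swap in the merged linkdb, same as with the indexes above
mv $crawl_dir/linkdb $crawl_dir/linkdb.old
mv $merged_linkdb $crawl_dir/linkdb
rm -rf $crawl_dir/linkdb.old $new_linkdb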
Thanks,
Enzo