I just had a look at the script for merging two different crawls, and I'm confused by some of the steps. It says:
...

$nutch_dir/nutch mergelinkdb $linkdb_dir $crawl_1/linkdb $crawl_2/linkdb
==> So far it's OK: it merges both linkdbs into a new linkdb.

$nutch_dir/nutch mergedb $webdb_dir $crawl_1/crawldb $crawl_2/crawldb
==> Still OK: it merges both crawldbs into a new crawldb.

$nutch_dir/nutch mergesegs $segments_dir $segments_1 $segments_2
==> Still OK: it merges all segments from both crawls into a new segment.

$nutch_dir/nutch invertlinks $linkdb_dir -dir $segments_dir
==> This is where it starts to get confusing: why do we have to run invertlinks when we just merged the linkdbs in the first step?

$nutch_dir/nutch index $new_indexes $webdb_dir $linkdb_dir $segment
==> So I guess we create a new index based on the single merged segment.

$nutch_dir/nutch dedup $new_indexes
==> Still OK: it eliminates all duplicates.

$nutch_dir/nutch merge $index_dir $new_indexes
==> This is confusing again: we just created a new index above from the merged segments, so why do we have to merge this index?

Could you please help me understand?

Thanks
