Hi guys, I would really appreciate it if you could help me get an answer to my question below.
Besides, I understand that we have the following working directories:

    crawl/crawldb
    crawl/linkdb
    crawl/segments
    crawl/index
    crawl/indexes

Do we really need crawl/index? It was used to merge the per-segment indexes, but as far as I understand that is no longer the case, is it?

---------- Forwarded message ----------
From: Emmanuel JOKE <[EMAIL PROTECTED]>
Date: 5 Jul 2007 22:56
Subject: Merge Question
To: nutch-user <[email protected]>

I just had a look at the script to merge two different crawls, and I'm confused by some of the steps. It says:

    ...
    $nutch_dir/nutch mergelinkdb $linkdb_dir $crawl_1/linkdb $crawl_2/linkdb
==> So far it's OK: it merges both linkdbs into a new linkdb.

    $nutch_dir/nutch mergedb $webdb_dir $crawl_1/crawldb $crawl_2/crawldb
==> Still OK: it merges both crawldbs into a new crawldb.

    $nutch_dir/nutch mergesegs $segments_dir $segments_1 $segments_2
==> Still OK: it merges all segments from both crawls into a new segment.

    $nutch_dir/nutch invertlinks $linkdb_dir -dir $segments_dir
==> This is where it starts to get confusing: why do we have to run invertlinks when we just merged the linkdbs in the first step?

    $nutch_dir/nutch index $new_indexes $webdb_dir $linkdb_dir $segment
==> So I guess we create a new index based on the single merged segment.

    $nutch_dir/nutch dedup $new_indexes
==> Still OK: it eliminates all duplicates.

    $nutch_dir/nutch merge $index_dir $new_indexes
==> This is confusing again: we just created a new index above from all the merged segments, so why do we have to merge this index?

Could you please help me understand?

Thanks
