I just had a look at the script that merges two different crawls, and I'm
confused by some of the steps.
It says:

...

$nutch_dir/nutch mergelinkdb $linkdb_dir $crawl_1/linkdb $crawl_2/linkdb
==> So far it's OK: it merges both linkdbs into a new linkdb

$nutch_dir/nutch mergedb $webdb_dir $crawl_1/crawldb $crawl_2/crawldb
==> Still OK: it merges both crawldbs into a new crawldb

$nutch_dir/nutch mergesegs $segments_dir $segments_1 $segments_2
==> Still OK: it merges all segments from both crawls into a new segment

$nutch_dir/nutch invertlinks $linkdb_dir -dir $segments_dir
==> This is where it starts to get confusing: why do we have to run
invertlinks when we just merged the linkdbs in the first step above?

$nutch_dir/nutch index $new_indexes $webdb_dir $linkdb_dir $segment
==> So I guess we create a new index based on the single merged segment

$nutch_dir/nutch dedup $new_indexes
==> Still OK: it eliminates all duplicates

$nutch_dir/nutch merge $index_dir $new_indexes
==> This is confusing again: we just created a new index above from all
the merged segments, so why do we have to merge this index?
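
To make the order easier to discuss, the seven steps above can be sketched
as a dry run that only prints each command. Every path below is a made-up
placeholder, and $segment is assumed to be the single segment written by
mergesegs; neither comes from the real script.

```shell
#!/bin/sh
# Dry run of the merge sequence: prints each command instead of running it.
# ASSUMPTION: every path below is a hypothetical placeholder.
nutch_dir=/opt/nutch/bin
crawl_1=crawl1
crawl_2=crawl2
segments_1="$crawl_1/segments/SEGMENT_A"
segments_2="$crawl_2/segments/SEGMENT_B"
linkdb_dir=merged/linkdb
webdb_dir=merged/crawldb
segments_dir=merged/segments
new_indexes=merged/newindexes
index_dir=merged/index
# ASSUMPTION: $segment is the single segment that mergesegs wrote under $segments_dir.
segment="$segments_dir/MERGED_SEGMENT"

# Print a command line; swap 'echo' for real execution once the paths are real.
run() {
    echo "$@"
}

print_steps() {
    run "$nutch_dir/nutch" mergelinkdb "$linkdb_dir" "$crawl_1/linkdb" "$crawl_2/linkdb"
    run "$nutch_dir/nutch" mergedb "$webdb_dir" "$crawl_1/crawldb" "$crawl_2/crawldb"
    run "$nutch_dir/nutch" mergesegs "$segments_dir" "$segments_1" "$segments_2"
    run "$nutch_dir/nutch" invertlinks "$linkdb_dir" -dir "$segments_dir"
    run "$nutch_dir/nutch" index "$new_indexes" "$webdb_dir" "$linkdb_dir" "$segment"
    run "$nutch_dir/nutch" dedup "$new_indexes"
    run "$nutch_dir/nutch" merge "$index_dir" "$new_indexes"
}

print_steps
```

Printing the commands first shows which directory each step reads and
writes, for example that invertlinks writes to the same $linkdb_dir that
mergelinkdb already filled.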

Could you please help me understand?

Thanks