Merging CrawlDBs

Kamil Mroczek Wed, 01 Feb 2023 16:55:40 -0800

Hi,

I am testing how merging crawls works and found this script
https://cwiki.apache.org/confluence/display/NUTCH/MergeCrawl.


Is this script advisable to use? I plan to use it for
crawls of non-overlapping URLs.

I am wary of using it since it is located under "Archive & Legacy" on the
wiki. But after running some tests it seems to function correctly. I only
had to remove the merge command ($nutch_dir/nutch merge $index_dir
$new_indexes), since that is no longer a command.

I am not necessarily looking for a list of potential issues (if the list is
long), just trying to understand why it might be under the archive.

Kamil
