Hello,
You can merge the segments of these two crawls using "nutch mergesegs"; in fact, you can simply copy all the segment directories to one place. It would not be a "full" merge of the crawls, though, as right now there is no way to merge the WebDBs of the two crawls. You can deduplicate the result using "nutch dedup" ("nutch mergesegs" does deduplication for you as well).

But you should probably try a different approach altogether: intranet crawling was meant as an easy way to crawl small sites, starting from scratch every time. If you do not want to start from scratch, you should follow the "Whole-web Crawling" tutorial and limit it to your site/sites in the URL filter config file.
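For example, a rough sketch of the copy + dedup route (assuming the test1/test2 layout from your commands below; the exact arguments of "dedup" and "merge" differ between Nutch versions, so check the usage message each command prints for your release):

  # put the segments of both crawls in one place
  cp -r test2/segments/* test1/segments/

  # delete duplicate documents across the combined per-segment
  # indexes, then rebuild the merged index the searcher reads
  # (the "segments dedup.tmp" form is what the 0.7 tutorial shows)
  bin/nutch dedup test1/segments dedup.tmp
  rm -rf test1/index
  bin/nutch merge test1/index test1/segments/*

And to limit a whole-web crawl to your own sites, the URL filter file (conf/regex-urlfilter.txt, or conf/crawl-urlfilter.txt for the intranet crawl tool) takes one pattern per line; for a hypothetical mysite.com:

  # accept anything on mysite.com, skip everything else
  +^http://([a-z0-9]*\.)*mysite.com/
  -.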
Regards
Piotr

Benny Lin wrote:
Hi,

I am looking to see if there are ways to merge
different crawl results.

Let's say I have two URL sets in two different files.

I use the following commands:

bin/nutch crawl URLs1.txt -dir test1 -depth 0 >& test1.log

bin/nutch crawl URLs2.txt -dir test2 -depth 0 >& test2.log


Then I have two folders test1 and test2.

My questions are:

1. Is there a way to merge the two sets of results above?
If so, what is the command?

2. If the two sets above contain duplicate URLs, how do
I make the merged results unique?


Or maybe I can do this a different way, if I want to do
accumulative indexing without having to redo everything
from the beginning each time.

Can someone help out?

Thanks a lot.

Benny