Hello,
You can merge the segments from these two crawls using "nutch mergesegs";
in fact, you can simply copy all segment directories to one place. However,
it would not be a "full" merge of the crawls, as there is currently no way
to merge the WebDBs of the two crawls. You can deduplicate the result using
"nutch dedup" ("nutch mergesegs" does deduplication for you as well).
But in fact you should probably try a different approach: intranet
crawling (the one-shot "crawl" command) was meant as an easy way to crawl
small sites, starting from scratch every time. If you do not want to start
from scratch, you should follow the "Whole-web crawling" tutorial instead,
limiting it to your site/sites only in the URL filter config file; see the
sketch below.
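The cycle then looks roughly like this (a sketch following the whole-web
tutorial; the example.com pattern and the urls.txt file name are
placeholders, and argument forms may differ slightly in your Nutch
version):

  # In the URL filter config (conf/crawl-urlfilter.txt or
  # conf/regex-urlfilter.txt), accept only your own sites, e.g.:
  #   +^http://([a-z0-9]*\.)*example.com/

  bin/nutch admin db -create            # create the WebDB once
  bin/nutch inject db -urlfile urls.txt # seed it with your start URLs

  # Repeat generate/fetch/updatedb to grow the crawl incrementally
  # instead of starting from scratch each time:
  bin/nutch generate db segments
  s=`ls -d segments/2* | tail -1`       # pick the newest segment
  bin/nutch fetch $s
  bin/nutch updatedb db $s

  # Index the new segment and remove duplicates across segments:
  bin/nutch index $s
  bin/nutch dedup segments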
Regards
Piotr
Benny Lin wrote:
Hi,
I am looking to see if there are ways to merge different crawl results.
Let's say I have two URL sets in two different files, and I use the
following commands:
bin/nutch crawl URLs1.txt -dir test1 -depth 0 >& test1.log
bin/nutch crawl URLs2.txt -dir test2 -depth 0 >& test2.log
Then I have two folders test1 and test2.
My questions are:
1. Is there a way to merge the two result sets above?
If so, what is the command string?
2. If the two sets above contain duplicate URLs, how can
I make the merged results unique?
Or maybe I can do it a different way, if I want to do
accumulative indexing without having to redo everything
from the beginning each time.
Can someone help out?
Thanks a lot.
Benny