Re: [Nutch-general] Multiple Crawl and Merging Methods

Andrzej Bialecki Fri, 23 Jun 2006 07:57:20 -0700

Murat Ali Bayir wrote:
> Andrzej Bialecki wrote:
>
>> Murat Ali Bayir wrote:
>>
>>> Hi, I want to know whether there is any method for merging the outputs 
>>> of multiple crawls. Assume that we have one main
>>> crawler with time period 4T:
>>> /MainCrawler/crawldb
>>> /MainCrawler/segments
>>> /MainCrawler/linkdb
>>> Then we have a topic-specific focused crawler with time period T:
>>> /FocusedCrawler/crawldb
>>> /FocusedCrawler/segments
>>> /FocusedCrawler/linkdb
>>> I want to know whether there is any way to merge these two databases. 
>>> Another question: do I need to merge them for
>>> indexing and querying purposes? Can anyone suggest an architecture 
>>> for this?
>>
>>
>> "mergedb" and "mergelinkdb" serve exactly this purpose. Yes, you need 
>> to merge them if you want to index the segments into a single index 
>> (and you need the merged linkdb on the searcher if you want to use 
>> anchors.jsp).
>>
> Is it possible to do that without stopping the main crawl? Or do you 
> have any other architecture suggestions?


Yes, of course - please remember that all data files in Nutch are 
essentially "write once", so they are never modified - if any tool needs 
to modify them, then new files are created, and then only their name is 
changed. In case of merge tools, the output directory can be specified 
on the command line, and in fact may not be the same as the input 
directory - so you can safely run them in parallel with other jobs.
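
To illustrate the "write once, then rename" idea in isolation, here is a minimal Python sketch. It is not Nutch code; the function name publish_atomically and the file name used in the usage note are made up for the example, and it only shows the general pattern of writing a complete new file and then switching the name.

```python
import os
import tempfile


def publish_atomically(final_path: str, data: bytes) -> None:
    """Write data to a fresh temporary file, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(final_path))
    # Create the temp file next to the target so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        # os.replace swaps the name in one step, so a reader that opens
        # final_path gets either the old file or the new one, never a partial write.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Clean up the half-written temp file if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Calling publish_atomically("part-00000", b"...") twice leaves a single file holding the second payload. A job that opened the first version before the rename keeps reading that version, which is the property that lets tools run side by side.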

The way I do it is to keep the merged-db and merged-linkdb around, and 
just prior to indexing I update them with the latest DBs from the other 
crawls. Then I take the newest segments from each crawl and index them 
together, and merge the resulting index with the total 
merged-index (incrementally merged over all previous cycles). Finally, 
you need to run deduplication, and then deploy the new segments, the new 
merged-index, and the new merged-linkdb. A bit messy, but it works.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


