Murat Ali Bayir wrote:
> Andrzej Bialecki wrote:
>
>> Murat Ali Bayir wrote:
>>
>>> Hi, I want to know whether there is any method for merging the
>>> outputs of multiple crawls. Assume that we have one main crawler
>>> with time period 4T:
>>>   /MainCrawler/crawldb
>>>   /MainCrawler/segments
>>>   /MainCrawler/linkdb
>>> Then we have a topic-specific focused crawler with time period T:
>>>   /FocusedCrawler/crawldb
>>>   /FocusedCrawler/segments
>>>   /FocusedCrawler/linkdb
>>> Is there any way to merge these two databases? Another question:
>>> do I need to merge them for indexing and querying purposes? Can
>>> anyone suggest an architecture for this?
>>
>> "mergedb" and "mergelinkdb" serve exactly this purpose. Yes, you need
>> to merge them if you want to index segments to form a single index
>> (and you need the merged linkdb on the searcher if you want to use
>> anchors.jsp).
>
> Is it possible to do that without stopping the main crawl? Or are
> there any other architecture suggestions?

Yes, of course - please remember that all data files in Nutch are essentially "write once", so they are never modified in place. If a tool needs to change them, it creates new files, and then only their names are changed. For the merge tools, the output directory is specified on the command line, and in fact it must not be the same as the input directory - so you can safely run them in parallel with other jobs.

The way I do it is to keep the merged-db and merged-linkdb around, and just prior to indexing I update them with the latest DBs from the other crawls. Then I take the newest segments from each crawl and index them together, and merge the resulting index into the total merged-index (which has been merged incrementally over all previous cycles). Finally, you need to run deduplication, and then deploy the new segments, the new merged-index, and the new merged-linkdb.

A bit messy, but it works.

-- 
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
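
The DB-update part of the cycle described above can be sketched as a small dry-run planner. This is only an illustration under stated assumptions: it builds the command lines for "mergedb" and "mergelinkdb" without running anything, it assumes the usual argument order of output directory first and input DBs after it (check the usage message printed by your own bin/nutch), and the /Merged root, the "next" tag, and the helper names are hypothetical. The indexing, index-merge, and deduplication steps are left out on purpose.

```python
from typing import List, Sequence


def merge_argv(tool: str, output: str, inputs: Sequence[str],
               nutch: str = "bin/nutch") -> List[str]:
    """Build one merge command line; nothing is executed here.

    Assumption: the tool takes the output directory first, then the inputs.
    """
    if not inputs:
        raise ValueError("need at least one input DB")
    if output in inputs:
        # Write-once rule: a merge never writes over one of its own inputs.
        raise ValueError(f"output {output!r} is also an input")
    return [nutch, tool, output, *inputs]


def plan_db_merge(merged: str, crawls: Sequence[str], tag: str) -> List[List[str]]:
    """Plan fresh crawldb/linkdb merges for one cycle.

    merged: hypothetical root holding the long-lived merged DBs
    crawls: roots of the individual crawls, e.g. /MainCrawler
    tag:    suffix that keeps the new output separate from the current DBs
    """
    plan = []
    for tool, name in (("mergedb", "crawldb"), ("mergelinkdb", "linkdb")):
        output = f"{merged}/{name}-{tag}"
        inputs = [f"{merged}/{name}"] + [f"{c}/{name}" for c in crawls]
        plan.append(merge_argv(tool, output, inputs))
    return plan


if __name__ == "__main__":
    for argv in plan_db_merge("/Merged", ["/MainCrawler", "/FocusedCrawler"], "next"):
        print(" ".join(argv))
```

Because each merge writes to a new directory, the running crawls keep reading and replacing their own DBs undisturbed; once a merge finishes, the tagged output can be renamed to take the place of the previous merged DB.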
