Hi all,

I just committed a couple of new tools, and I'd like to briefly explain their purpose and intended use.

* CrawlDbMerger: available from bin/nutch as 'mergedb'. It merges several existing DBs into one, which comes in handy if you ran several partial crawls and would like to combine their DBs. Optionally, you can run the current URLFilters on the URLs in the databases to filter out unwanted URLs. This also works with just one input DB, so you can use this tool to weed out unwanted URLs from a single DB.

* LinkDbMerger: available from bin/nutch as 'mergelinkdb', with a similar purpose and similar options as above. Please note that URLFilters, if activated, apply to both target and source URLs. This tool is useful if you built partial linkdbs from groups of segments and now need to integrate them into one (e.g. for indexing or for searching). You can also use it with a single linkdb, just to filter out unwanted URLs.

* SegmentMerger: available as 'mergesegs'. This tool merges several input segments into one or more output segments, with optional filtering as above. Optionally, the output data can be divided into several smaller segments of fixed size. There are many dos and don'ts regarding the use of this tool, described in the Javadoc; please be sure to read them before using it. Typical uses are re-shaping your segments (in preparation for deployment to search servers), filtering out unwanted data, or minimizing the number of active segments.
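
To make the semantics above a bit more concrete, here is a rough conceptual sketch in Python. It is not the Nutch code (the real tools are Java jobs operating on Nutch's on-disk formats); all function names, the in-memory dict layout, and the `keep` predicate standing in for URLFilters are hypothetical. Two behaviours are assumptions rather than something stated above: which record wins when a URL appears in several DBs, and whether a target whose inlinks are all filtered out is dropped.

```python
from typing import Callable, Dict, Iterable, List, Set

# Hypothetical stand-in for URLFilters: returns True if the URL should be kept.
UrlFilter = Callable[[str], bool]


def keep_all(url: str) -> bool:
    """Default filter: keep every URL (filtering is optional)."""
    return True


def merge_crawldbs(dbs: Iterable[Dict[str, dict]],
                   keep: UrlFilter = keep_all) -> Dict[str, dict]:
    """Merge several URL -> record maps into one, dropping filtered URLs."""
    merged: Dict[str, dict] = {}
    for db in dbs:
        for url, record in db.items():
            if keep(url):
                # Assumption: on duplicates the later input wins.
                # The real tool may resolve duplicates differently.
                merged[url] = record
    return merged


def merge_linkdbs(dbs: Iterable[Dict[str, Set[str]]],
                  keep: UrlFilter = keep_all) -> Dict[str, Set[str]]:
    """Merge target -> {source URLs} maps; the filter applies to both sides."""
    merged: Dict[str, Set[str]] = {}
    for db in dbs:
        for target, sources in db.items():
            if not keep(target):
                continue  # target URL rejected
            kept_sources = {s for s in sources if keep(s)}  # source URLs filtered too
            if kept_sources:
                # Assumption: targets left with no inlinks are dropped.
                merged.setdefault(target, set()).update(kept_sources)
    return merged


def slice_records(records: List, size: int) -> List[List]:
    """Split merged output into consecutive slices of at most `size` records."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [records[i:i + size] for i in range(0, len(records), size)]


if __name__ == "__main__":
    def not_spam(url: str) -> bool:
        return "spam" not in url

    # Running with a single input DB just filters it.
    single = {"http://a.example/": {"status": "fetched"},
              "http://spam.example/": {"status": "unfetched"}}
    print(merge_crawldbs([single], not_spam))
```

Note that passing a single DB through `merge_crawldbs` or `merge_linkdbs` with a filter is the "weeding out" use case, and `slice_records` mirrors the idea of dividing the merged output into smaller fixed-size segments.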

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
