Hi all,
I just committed a couple of new tools, and I'd like to briefly explain
their purpose and intended use.
* CrawlDbMerger: available from bin/nutch as 'mergedb'. It merges
several existing DBs into one, which comes in handy if you ran several
partial crawls and would like to combine the DBs. Optionally, you can
run the currently configured URLFilters on the URLs in the databases to
filter out unwanted URLs. This also works with just one input DB, which
means that you can use this tool to weed out unwanted URLs from a single DB.
* LinkDbMerger: available from bin/nutch as 'mergelinkdb', with a
similar purpose and similar options to the above. Please note that
URLFilters, if activated, apply to both target and source URLs.
This tool is useful if you built partial linkdbs from groups of
segments and now need to integrate them into one (e.g. for
indexing or for searching). You can also use it with a single linkdb,
just to filter out unwanted URLs.
* SegmentMerger: available as 'mergesegs'. This tool merges several
input segments into one or more output segments, with optional filtering
as above. Optionally, the output data can be divided into several
smaller segments of fixed size. There are many dos and don'ts regarding
the use of this tool, described in its Javadoc; please be sure to read
them before using it. Typical uses are to re-shape your segments
(in preparation for deployment to search servers), to filter out
unwanted data, or to minimize the number of active segments.
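
For anyone who prefers a mental model in code, here is a rough
conceptual sketch of the merge-with-optional-filter idea and of slicing
output into fixed-size pieces. It is not the Nutch implementation: the
dict-based "DB", the names merge_dbs and slice_records, and the rule
that the entry with the latest fetch time wins are all assumptions made
for illustration only.

```python
from typing import Callable, Dict, Iterable, List, Optional

# Hypothetical stand-in for a DB: maps a URL to its last fetch time.
Db = Dict[str, int]


def merge_dbs(dbs: Iterable[Db],
              url_filter: Optional[Callable[[str], bool]] = None) -> Db:
    """Merge several DBs into one, optionally dropping filtered URLs.

    Assumption for this sketch: when a URL appears in more than one
    input, the entry with the latest fetch time is kept.
    """
    merged: Db = {}
    for db in dbs:
        for url, fetch_time in db.items():
            # The filter is optional; with a single input DB this
            # simply weeds out unwanted URLs.
            if url_filter is not None and not url_filter(url):
                continue
            if url not in merged or fetch_time > merged[url]:
                merged[url] = fetch_time
    return merged


def slice_records(records: List[str], slice_size: int) -> List[List[str]]:
    """Split records into consecutive slices of at most slice_size items."""
    if slice_size <= 0:
        raise ValueError("slice_size must be positive")
    return [records[i:i + slice_size]
            for i in range(0, len(records), slice_size)]


if __name__ == "__main__":
    db_a = {"http://example.com/": 10, "http://example.com/spam": 5}
    db_b = {"http://example.com/": 20, "http://example.org/": 7}

    def no_spam(url: str) -> bool:
        return "spam" not in url

    merged = merge_dbs([db_a, db_b], no_spam)
    print(merged)
    print(slice_records(sorted(merged), 1))
```

Note that in this sketch the last slice may hold fewer records than the
others; how the real tool handles that case is covered by the Javadoc
mentioned above, not by this example.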
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com