Hi,
I've just committed a new version of SegmentMergeTool. Please give it a try - it should speed up the merging process by a factor of 2-3 compared with the old version. Here's a short synopsis:
* the new version doesn't need per-segment indexes in order to deduplicate - it builds its own small index. The output segment(s) can be optionally indexed, as before, when the merging process is finished.
* thanks to the new SegmentReader API, corrupt segments are fixed on the fly, if this is at all possible - if not, they are just skipped.
* optionally, the output can be split into many segments, which should make it easier to deploy on multiple search servers.
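To make the three points above concrete, here is a rough sketch of the general idea, not the actual SegmentMergeTool code. Everything in it is a hypothetical stand-in: the `Page` class, the `merge` method, the use of the URL as the deduplication key, and a `null` list standing in for a segment that is corrupt beyond repair. The real tool works on Nutch segment files through the SegmentReader API, which this sketch does not use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SegmentMergeSketch {

    /** Hypothetical stand-in for one fetched page in a segment. */
    static final class Page {
        final String url;
        final String content;

        Page(String url, String content) {
            this.url = url;
            this.content = content;
        }
    }

    /**
     * Merges the input segments into output segments of at most
     * maxPerSegment pages each, keeping the first page seen per URL.
     * Assumption: a null segment models one that cannot be repaired,
     * and a null page (or null URL) models a damaged record.
     */
    static List<List<Page>> merge(List<List<Page>> segments, int maxPerSegment) {
        if (maxPerSegment < 1) {
            throw new IllegalArgumentException("maxPerSegment must be >= 1");
        }
        // The merge builds its own small index of keys already written,
        // so the inputs do not need per-segment indexes.
        Set<String> seen = new HashSet<>();
        List<List<Page>> output = new ArrayList<>();
        List<Page> current = new ArrayList<>();

        for (List<Page> segment : segments) {
            if (segment == null) {
                continue; // unrecoverable segment: skip it entirely
            }
            for (Page page : segment) {
                if (page == null || page.url == null) {
                    continue; // damaged record: drop it, keep the rest
                }
                if (!seen.add(page.url)) {
                    continue; // duplicate of a page already written
                }
                current.add(page);
                if (current.size() == maxPerSegment) {
                    output.add(current); // output segment is full, start a new one
                    current = new ArrayList<>();
                }
            }
        }
        if (!current.isEmpty()) {
            output.add(current);
        }
        return output;
    }

    public static void main(String[] args) {
        List<List<Page>> input = new ArrayList<>();
        input.add(Arrays.asList(new Page("http://a/", "1"), new Page("http://b/", "2")));
        input.add(null); // models a segment that could not be fixed
        input.add(Arrays.asList(new Page("http://a/", "dup"), new Page("http://c/", "3")));

        List<List<Page>> merged = merge(input, 2);
        for (int i = 0; i < merged.size(); i++) {
            System.out.println("output segment " + i + ": " + merged.get(i).size() + " page(s)");
        }
    }
}
```

Keeping only the keys in memory is what makes the index "small" in this sketch; whether the real tool keys on URL, content hash, or both is not stated here, so treat that choice as an assumption.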
Please also note that the command-line arguments are slightly different - check your scripts!
In a couple of days I will have an even better version :-) that can work with non-parsed segments - the current version treats them as corrupt. I believe that using non-parsed segments should be the default mode of operation, because it saves a lot of disk I/O, and parsing in a separate stage is more bullet-proof anyway.
-- Best regards, Andrzej Bialecki
------------------------------------------------- Software Architect, System Integration Specialist CEN/ISSS EC Workshop, ECIMF project chair EU FP6 E-Commerce Expert/Evaluator ------------------------------------------------- FreeBSD developer (http://www.freebsd.org)
_______________________________________________ Nutch-developers mailing list https://lists.sourceforge.net/lists/listinfo/nutch-developers
