[Nutch-dev] New SegmentMergeTool in CVS now

Andrzej Bialecki Sun, 14 Nov 2004 13:43:07 -0800

Hi,

I just committed the new version of SegmentMergeTool. Please give it a try - it should speed up the merging process by a factor of 2-3 as compared with the old version. Here's a short synopsis:

* the new version doesn't need per-segment indexes in order to deduplicate - it builds its own small index. The output segment(s) can be optionally indexed, as before, when the merging process is finished.

* thanks to the new SegmentReader API, corrupt segments are fixed on the fly where possible - segments that cannot be repaired are simply skipped.

* optionally, the output can be split into many segments, which should make it easier to deploy on multiple search servers.
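
To make the first and third points concrete, here is a minimal sketch of the general idea, not the actual SegmentMergeTool code. Everything in it is hypothetical: the MergeSketch and Entry names, the mergeAndSplit method, the choice of a content hash as the deduplication key, and the fixed entries-per-segment limit are all assumptions made for illustration. It builds one small in-memory index of keys already seen, so no per-segment index is needed, and starts a new output segment whenever the current one is full.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; none of these names come from Nutch.
class MergeSketch {

    /** Stand-in for one segment entry: a URL plus a hash of its content. */
    static final class Entry {
        final String url;
        final String contentHash;

        Entry(String url, String contentHash) {
            this.url = url;
            this.contentHash = contentHash;
        }
    }

    /**
     * Merges the input segments, dropping entries whose content hash was
     * already seen, and splits the result into output segments holding at
     * most maxPerSegment entries each.
     */
    static List<List<Entry>> mergeAndSplit(List<List<Entry>> segments, int maxPerSegment) {
        if (maxPerSegment <= 0) {
            throw new IllegalArgumentException("maxPerSegment must be positive");
        }
        // The "small index": only the dedup keys, not the entries themselves.
        Set<String> seen = new HashSet<>();
        List<List<Entry>> output = new ArrayList<>();
        List<Entry> current = new ArrayList<>();

        for (List<Entry> segment : segments) {
            for (Entry entry : segment) {
                // Set.add returns false when the key is already present.
                if (!seen.add(entry.contentHash)) {
                    continue; // duplicate, keep only the first occurrence
                }
                if (current.size() == maxPerSegment) {
                    output.add(current); // current output segment is full
                    current = new ArrayList<>();
                }
                current.add(entry);
            }
        }
        if (!current.isEmpty()) {
            output.add(current);
        }
        return output;
    }
}
```

In this sketch the first occurrence of a duplicate wins; a real tool might prefer the most recently fetched copy instead, which would need a map from key to entry rather than a plain set.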

Please also note that the command-line arguments are slightly different from the old version's - check your scripts!

In a couple of days I will have an even better version :-) that can work with non-parsed segments; the current version treats them as corrupted. I believe that using non-parsed segments should be the default mode of operation, because it saves a lot of disk IO, and parsing in a separate stage is more robust anyway.

--
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)

