http://issues.apache.org/bugzilla/show_bug.cgi?id=28183

           Summary: [Patch] replace DocumentWriter with InvertedDocument for performance
           Product: Lucene
           Version: unspecified
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: Enhancement
          Priority: Other
         Component: Index
        AssignedTo: [EMAIL PROTECTED]
        ReportedBy: [EMAIL PROTECTED]

I've found a way to improve Lucene's indexing performance by about 45% for my dataset.

Currently, the indexing process works like this:

- use DocumentWriter to create an inverted index and serialize a one-document segment to a RAMDirectory
- when enough documents have been read, deserialize the one-document segments in the RAMDirectory and merge them, writing the merged segment to disk.

What I've done instead is create a new class, InvertedDocument, that keeps the inverted index in a Map and can also be used directly as input for a merge. This avoids the serialization/deserialization step, and the RAMDirectory is no longer used when indexing.

The patch applies to the contents of CVS as of today (April 3). (It's a big patch and includes some minor style tweaks that aren't directly related.)

I did the performance testing using a simple application that creates an index from a file containing messages extracted from a bulletin board. It could index about 100 kilobytes/second with Lucene 1.3, and 145 kilobytes/second with the patch. This is on a 700 MHz eMac, which is pretty slow at Java, and the documents being indexed are, on average, less than a screenful.
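
For readers who want a picture of the idea, here is a minimal sketch of keeping each document's inverted index in a Map and merging those maps directly, with no serialization step in between. This is not the code from the patch and it does not use any Lucene API: the class name InvertedDocumentSketch, the methods invert and merge, the whitespace-style tokenizer, and the use of modern Java collections are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration only; not the InvertedDocument class from the patch. */
class InvertedDocumentSketch {

    /** Builds one document's inverted index in memory: term -> positions in that document. */
    static Map<String, List<Integer>> invert(String text) {
        Map<String, List<Integer>> postings = new TreeMap<>();
        // Assumption: a trivial tokenizer that lowercases and splits on non-word characters.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("\\W+");
        int position = 0;
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // split can yield a leading empty token
            }
            postings.computeIfAbsent(token, k -> new ArrayList<>()).add(position++);
        }
        return postings;
    }

    /**
     * Merges buffered per-document maps into one segment-like structure:
     * term -> (document number -> positions). The maps are read directly,
     * so nothing is written out and read back before the merge.
     */
    static Map<String, Map<Integer, List<Integer>>> merge(List<Map<String, List<Integer>>> docs) {
        Map<String, Map<Integer, List<Integer>>> segment = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (Map.Entry<String, List<Integer>> e : docs.get(docId).entrySet()) {
                segment.computeIfAbsent(e.getKey(), k -> new TreeMap<>())
                       .put(docId, e.getValue());
            }
        }
        return segment;
    }

    public static void main(String[] args) {
        List<Map<String, List<Integer>>> buffered = new ArrayList<>();
        buffered.add(invert("fast indexing is fast"));
        buffered.add(invert("indexing with Lucene"));
        // In a real indexer the merged result would be written to disk as a segment.
        System.out.println(merge(buffered));
    }
}
```

The sorted maps stand in for the sorted term order a merge needs; a real implementation would also have to track fields, term frequencies, and norms, which this sketch leaves out.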