DO NOT REPLY [Bug 28183] New: - [Patch] replace DocumentWriter with InvertedDocument for performance

bugzilla Sat, 03 Apr 2004 14:36:07 -0800

DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG 
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://issues.apache.org/bugzilla/show_bug.cgi?id=28183>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND 
INSERTED IN THE BUG DATABASE.


http://issues.apache.org/bugzilla/show_bug.cgi?id=28183

[Patch] replace DocumentWriter with InvertedDocument for performance

           Summary: [Patch] replace DocumentWriter with InvertedDocument for
                    performance
           Product: Lucene
           Version: unspecified
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: Enhancement
          Priority: Other
         Component: Index
        AssignedTo: [EMAIL PROTECTED]
        ReportedBy: [EMAIL PROTECTED]


I've found a way to improve Lucene's indexing performance by about 45% for my dataset.

Here's how it works. Currently the indexing process goes like this:

- use DocumentWriter to create an inverted index and serialize a one-document segment to a RAMDirectory
- when enough documents have been read, deserialize the one-document segments in the RAMDirectory and merge them, writing the merged segment to disk.
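
As a rough illustration of the two steps above, here is a minimal plain-Java sketch of the round trip. It does not use Lucene's classes: BufferedSegmentSketch, invert, serialize and merge are hypothetical names, a byte array stands in for a segment held in the RAMDirectory, and the on-disk write is left out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the buffered path; this is not Lucene's DocumentWriter.
class BufferedSegmentSketch {

    // Invert one document: term -> number of occurrences in that document.
    static Map<String, Integer> invert(String text) {
        Map<String, Integer> freqs = new TreeMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                freqs.merge(token, 1, Integer::sum);
            }
        }
        return freqs;
    }

    // Serialize a one-document "segment" to bytes (assumed stand-in for the RAMDirectory write).
    static byte[] serialize(int docId, Map<String, Integer> freqs) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(docId);
            out.writeInt(freqs.size());
            for (Map.Entry<String, Integer> entry : freqs.entrySet()) {
                out.writeUTF(entry.getKey());
                out.writeInt(entry.getValue());
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return bytes.toByteArray();
    }

    // Deserialize every buffered segment and merge into term -> list of {docId, freq}.
    static Map<String, List<int[]>> merge(List<byte[]> segments) {
        Map<String, List<int[]>> merged = new TreeMap<>();
        for (byte[] segment : segments) {
            try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(segment))) {
                int docId = in.readInt();
                int termCount = in.readInt();
                for (int i = 0; i < termCount; i++) {
                    String term = in.readUTF();
                    int freq = in.readInt();
                    merged.computeIfAbsent(term, k -> new ArrayList<>())
                          .add(new int[] {docId, freq});
                }
            } catch (IOException ex) {
                throw new UncheckedIOException(ex);
            }
        }
        return merged;
    }
}
```

Every document is encoded once and decoded once before it reaches the merge, which is the overhead the patch is aimed at.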

What I've done instead is create a new class, InvertedDocument, that keeps the inverted index in a Map and can also be used directly as input for a merge.  This avoids the serialization/deserialization step, and the RAMDirectory is no longer used when indexing.
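
To make the idea concrete, here is a minimal plain-Java sketch of a per-document inverted index held in a Map and fed straight into a merge. It is not the code from the patch: InMemoryInvertedDoc and DirectMergeSketch are hypothetical names, the tokenizer is a naive split on non-word characters, and the merged result stays in memory rather than being written out as a segment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; the patch's InvertedDocument is not reproduced here.
class InMemoryInvertedDoc {
    final int docId;
    // term -> positions of that term within this document
    final Map<String, List<Integer>> postings = new TreeMap<>();

    InMemoryInvertedDoc(int docId, String text) {
        this.docId = docId;
        int position = 0;
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(token, k -> new ArrayList<>()).add(position++);
        }
    }
}

class DirectMergeSketch {
    // Merge in-memory documents directly into term -> list of {docId, freq},
    // with no intermediate serialized form.
    static Map<String, List<int[]>> merge(List<InMemoryInvertedDoc> docs) {
        Map<String, List<int[]>> merged = new TreeMap<>();
        for (InMemoryInvertedDoc doc : docs) {
            for (Map.Entry<String, List<Integer>> entry : doc.postings.entrySet()) {
                merged.computeIfAbsent(entry.getKey(), k -> new ArrayList<>())
                      .add(new int[] {doc.docId, entry.getValue().size()});
            }
        }
        return merged;
    }
}
```

Because the merge reads the Map entries directly, no byte-level encoding or decoding happens between inverting a document and merging it.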

The patch applies to the contents of CVS as of today (April 3).  (It's a big patch and includes some minor style tweaks that aren't directly related.)

I did the performance testing using a simple application that creates an index from a file containing messages extracted from a bulletin board.  It could index about 100 kilobytes/second with Lucene 1.3, and about 145 kilobytes/second with the patch.  This is on a 700 MHz eMac, which is pretty slow at Java, and the documents being indexed are, on average, less than a screenful.

