I did some performance comparison testing of Lucene 2.0 vs. trunk (with LUCENE-843). I'm seeing at least a 4X increase in indexing rate with the new DocumentsWriter in LUCENE-843 (still doing single-threaded indexing). Better yet, the total time to build the index is much shorter because I can now build the entire 3GB index (900K docs) in one segment in RAM (using FSDirectory) and flush it to disk at the end. Before, I had to build smaller segments (20K docs), merge after 20 segments and then optimize at the end. The memory usage with LUCENE-843 is much lower, presumably because stored fields and term vectors no longer sit in RAM.
I also observed a 20-25% gain by reusing the Field objects. Implementing my own Fieldable class was too complicated, so I simply extended the Field class (after removing final) and added 2 setter methods:

    public void setValue(String value) { this.fieldsData = value; }
    public void setValue(byte[] value) { this.fieldsData = value; }

Since this improved performance significantly, I would vote to either add setters to Field or make it extendable.

Kudos to Mike for this huge improvement!

Peter

On 7/13/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
"Grant Ingersoll" <[EMAIL PROTECTED]> wrote:

> This is good stuff... Might be good to put a organized version of
> this up on the Wiki under Best Practices

I agree! I will update the ImproveIndexingSpeed page:

    http://wiki.apache.org/lucene-java/ImproveIndexingSpeed

with these suggestions.

> On Jul 13, 2007, at 8:13 AM, Michael McCandless wrote:
>
> > Yeah it's not so easy now: Field.java does not have setters.
> >
> > You have to make your own class that implements Fieldable (or
> > subclasses AbstractField) and adds your own setters. Field.java is
> > also [currently] final so you can't subclass it.
> >
>
> Should we consider putting in these changes? I think it might be a
> little weird on the Search side to have setters for Field and it
> sounds like it could cause trouble for people esp. in a threaded
> indexing situation, but maybe I am mistaken?

I think adding setters would be reasonable, if we document clearly
that they are advanced, be careful about threads, use at your own risk
sort of methods?

Are there any concerns with that approach? If not I'll open an issue
and do it... this just makes it easier for people to maximize indexing
performance "out of the box".

Mike
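
For anyone skimming the thread, the Field-reuse idea discussed above can be sketched roughly as follows. This is only an illustration of the pattern, not Lucene code: ReusableField, ToyIndexer and indexAll are hypothetical stand-ins invented for this example, and the sketch assumes the indexer copies the field's value during addDocument, so that overwriting the value afterwards is safe.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the reuse pattern; none of these classes are part of Lucene.
class FieldReuseSketch {

    /** A field whose value can be replaced between documents, mirroring the two setters in the thread. */
    static final class ReusableField {
        private final String name;
        private Object fieldsData; // holds either a String or a byte[]

        ReusableField(String name) { this.name = name; }

        void setValue(String value) { this.fieldsData = value; }
        void setValue(byte[] value) { this.fieldsData = value; }

        String name() { return name; }
        String stringValue() { return fieldsData instanceof String ? (String) fieldsData : null; }
        byte[] binaryValue() { return fieldsData instanceof byte[] ? (byte[]) fieldsData : null; }
    }

    /** Toy indexer (assumption): it copies the value out immediately, so the field may be reused. */
    static final class ToyIndexer {
        final List<String> indexed = new ArrayList<>();

        void addDocument(ReusableField field) {
            indexed.add(field.name() + "=" + field.stringValue());
        }
    }

    /** Indexes every body text using one field instance instead of allocating one per document. */
    static List<String> indexAll(List<String> bodies) {
        ToyIndexer indexer = new ToyIndexer();
        ReusableField body = new ReusableField("body"); // allocated once, outside the loop
        for (String text : bodies) {
            body.setValue(text);       // overwrite the previous document's value
            indexer.addDocument(body); // the indexer has consumed the value when this returns
        }
        return indexer.indexed;
    }

    public static void main(String[] args) {
        System.out.println(indexAll(Arrays.asList("first doc", "second doc")));
    }
}
```

The field instance here is mutable shared state, which is the threading concern raised in the thread: in a multi-threaded indexer each thread would presumably need its own field instances.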