I did some performance comparison testing of Lucene 2.0 vs. trunk (with
LUCENE-843). I'm seeing at least a 4X increase in indexing rate with the new
DocumentsWriter in LUCENE-843 (still doing single-threaded indexing). Better
yet, the total time to build the index is much shorter because I can now
build the entire 3GB index (900K docs) in one segment in RAM (using
FSDirectory) and flush it to disk at the end. Before, I had to build smaller
segments (20K docs), merge after 20 segments and then optimize at the end.
The memory usage with LUCENE-843 is much lower, presumably because stored
fields and term vectors no longer sit in RAM.

I also observed a 20-25% gain by reusing the Field objects. Implementing my
own Fieldable class was too complicated, so I simply extended the Field
class (after removing its final modifier) and added two setter methods:

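     // Overwrite the stored value with a new String.
     public void setValue(String value) {
       this.fieldsData = value;
     }

     // Overwrite the stored value with new binary content.
     public void setValue(byte[] value) {
       this.fieldsData = value;
     }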

Since this improved performance significantly, I would vote to either add
setters to Field or make it extendable.
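
For anyone who wants to see the reuse pattern in isolation, here is a minimal sketch. It does not use Lucene: ReusableField, SketchWriter and FieldReuseSketch are hypothetical stand-ins made up for illustration, and the sketch assumes the writer consumes the field's value at add time, so the same field instance can be overwritten for the next document.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a Field subclass with the two setters above.
// It is NOT Lucene's Field; it only mimics the name/value shape.
class ReusableField {
    private final String name;
    private Object fieldsData; // holds either a String or a byte[]

    ReusableField(String name, String value) {
        this.name = name;
        this.fieldsData = value;
    }

    public void setValue(String value) { this.fieldsData = value; }
    public void setValue(byte[] value) { this.fieldsData = value; }

    public String name() { return name; }

    // Returns the value if it is a String, otherwise null.
    public String stringValue() {
        return (fieldsData instanceof String) ? (String) fieldsData : null;
    }
}

// Hypothetical stand-in for an index writer. It copies the value out of the
// field when the document is added, so later mutation of the field is safe.
class SketchWriter {
    final List<String> indexed = new ArrayList<String>();

    void addDocument(ReusableField field) {
        indexed.add(field.name() + "=" + field.stringValue());
    }
}

class FieldReuseSketch {
    // Index every value through ONE field instance instead of allocating a
    // new field object per document.
    static List<String> indexAll(String[] values) {
        SketchWriter writer = new SketchWriter();
        ReusableField body = new ReusableField("body", "");
        for (String v : values) {
            body.setValue(v);          // overwrite the previous document's value
            writer.addDocument(body);  // writer consumes the value immediately
        }
        return writer.indexed;
    }

    public static void main(String[] args) {
        System.out.println(indexAll(new String[] {"first doc", "second doc"}));
    }
}
```

The only point of the sketch is that the per-document allocation disappears from the loop; whether that yields the same 20-25% in another setup is something to measure rather than assume.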

Kudos to Mike for this huge improvement!

Peter

On 7/13/07, Michael McCandless <[EMAIL PROTECTED]> wrote:

"Grant Ingersoll" <[EMAIL PROTECTED]> wrote:

> This is good stuff...  Might be good to put an organized version of
> this up on the Wiki under Best Practices

I agree!  I will update the ImproveIndexingSpeed page:

    http://wiki.apache.org/lucene-java/ImproveIndexingSpeed

with these suggestions.

> On Jul 13, 2007, at 8:13 AM, Michael McCandless wrote:
>
> > Yeah it's not so easy now: Field.java does not have setters.
> >
> > You have to make your own class that implements Fieldable (or
> > subclasses AbstractField) and adds your own setters.  Field.java is
> > also [currently] final so you can't subclass it.
> >
>
> Should we consider putting in these changes?  I think it might be a
> little weird on the Search side to have setters for Field and it
> sounds like it could cause trouble for people esp. in a threaded
> indexing situation, but maybe I am mistaken?

I think adding setters would be reasonable, if we document clearly
that they are advanced, "be careful about threads, use at your own
risk" methods.  Are there any concerns with that approach?  If not
I'll open an issue and do it... this just makes it easier for people
to maximize indexing performance "out of the box".
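
One way to address the threading concern is to never share a mutable field between threads. The sketch below is only an illustration of that idea, not Lucene code: MutableField and PerThreadFieldSketch are hypothetical names, and it assumes each indexing thread sets a value and hands it off before setting the next one.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical mutable field; a stand-in, not a Lucene class.
class MutableField {
    private String value = "";

    void setValue(String value) { this.value = value; }
    String stringValue() { return value; }
}

class PerThreadFieldSketch {
    // One reusable field per thread, created lazily on first use, so no two
    // threads ever mutate the same instance.
    private static final ThreadLocal<MutableField> FIELD =
        ThreadLocal.withInitial(MutableField::new);

    // Starts `threads` workers; each sets and reads back `docsPerThread`
    // values through its own field. Returns how many read-backs did not
    // match what the same thread had just written.
    static int indexConcurrently(int threads, int docsPerThread) {
        AtomicInteger mismatches = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                MutableField field = FIELD.get(); // this thread's own instance
                for (int i = 0; i < docsPerThread; i++) {
                    String expected = id + ":" + i;
                    field.setValue(expected);
                    if (!expected.equals(field.stringValue())) {
                        mismatches.incrementAndGet();
                    }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                throw new RuntimeException(e);
            }
        }
        return mismatches.get();
    }

    public static void main(String[] args) {
        System.out.println("mismatches: " + indexConcurrently(4, 1000));
    }
}
```

Sharing a single instance across threads instead would need external synchronization around both the setter call and the add, which is the kind of caveat the documentation for such setters would have to spell out.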

Mike


