Re: Lucene update performance

Adrien Grand Tue, 09 May 2017 07:23:33 -0700

addDocument can be significantly faster than updateDocument, because the
primary-key lookup on a unique field has a cost that is not negligible
compared to indexing a document, especially if the indexing chain is simple
(no large text fields with complex analyzers). Reindexing in place will also
cause more merging, because the deleted copies have to be merged away.
Overall I find the 3x factor a bit high, but not too surprising if documents
and the analysis chain are simple, and/or if storage is slow.
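
To make the difference concrete, here is a minimal sketch of the two code
paths, not a benchmark. The class name, the helper makeDoc, and the field
names "path" and "body" are hypothetical; it assumes Lucene 5.x or later on
the classpath (in 5.x StandardAnalyzer comes from the analyzers-common
module). The same 100 keys are indexed twice: addDocument appends blindly,
while updateDocument first deletes any document matching the key term.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class AddVsUpdateSketch {

    // Hypothetical document layout: "path" is the unique key (not analyzed),
    // "body" is the analyzed text.
    static Document makeDoc(String path, String body) {
        Document doc = new Document();
        doc.add(new StringField("path", path, Field.Store.YES));
        doc.add(new TextField("body", body, Field.Store.NO));
        return doc;
    }

    // Indexes the same 100 keys in two passes and returns the number of live
    // documents. The temporary index directory is left on disk for simplicity.
    static int indexAndCount(boolean useUpdate) throws IOException {
        Path indexPath = Files.createTempDirectory("lucene-sketch");
        try (Directory directory = FSDirectory.open(indexPath);
             IndexWriter writer = new IndexWriter(
                     directory, new IndexWriterConfig(new StandardAnalyzer()))) {
            for (int pass = 0; pass < 2; pass++) {
                for (int i = 0; i < 100; i++) {
                    String path = "/files/" + i;
                    Document doc = makeDoc(path, "contents of file " + i + " pass " + pass);
                    if (useUpdate) {
                        // Looks up the key term, deletes matches, then adds.
                        writer.updateDocument(new Term("path", path), doc);
                    } else {
                        // No lookup: cheaper, but duplicates accumulate.
                        writer.addDocument(doc);
                    }
                }
            }
            writer.commit();
            try (DirectoryReader reader = DirectoryReader.open(directory)) {
                return reader.numDocs();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("addDocument live docs:    " + indexAndCount(false));
        System.out.println("updateDocument live docs: " + indexAndCount(true));
    }
}
```

The add-only path is only correct when the index starts empty, which is why
a rebuild from scratch can skip the key lookup entirely.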


On Tue, May 9, 2017 at 4:06 PM, Rob Audenaerde <[email protected]> wrote:

> As far as I know, the updateDocument method on IndexWriter does a delete
> and then an add. See also the javadoc:
>
> [..] Updates a document by first deleting the document(s)
>     containing term and then adding the new
>     document.  The delete and then add are atomic as seen
>     by a reader on the same index (flush may happen only after
>     the add). [..]
>
>
> On Tue, May 9, 2017 at 3:37 PM, Kudrettin Güleryüz <[email protected]>
> wrote:
>
> > I do update the entire document each time. Furthermore, this sometimes
> > means deleting compressed archives, which are stored as multiple documents
> > per archive file, and re-adding them.
> >
> > Is there an update method, and does it perform better than remove then
> > add? I was simply removing modified files from the index (which doesn't
> > seem to take long) and re-adding them.
> >
> > On Tue, May 9, 2017 at 9:33 AM Rob Audenaerde <[email protected]>
> > wrote:
> >
> > > Do you update each entire document? (vs updating numeric docvalues?)
> > >
> > > That is implemented as 'delete and add' so I guess that will be slower
> > > than clean sheet indexing. Not sure if it is 3x slower, that seems a bit
> > > much?
> > >
> > > On Tue, May 9, 2017 at 3:24 PM, Kudrettin Güleryüz <[email protected]>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > For a 5.2.1 index that contains around 1.2 million documents, updating
> > > > the index with 1.3 million files seems to take 3X longer than indexing
> > > > from scratch. (Files are crawled over NFS; indexes are stored locally
> > > > on a mechanical disk (Btrfs).)
> > > >
> > > > Is this expected for Lucene's update index logic, or should I further
> > > > debug my part of the code for update performance?
> > > >
> > > > Thank you,
> > > > Kudret
> > > >
> > >
> >
>
