Fair enough, however, I see this:

$ cat log
Tue May  9 07:19:45 EDT 2017: Indexing starts
Tue May  9 07:32:33 EDT 2017: Deletion starts with a list of 1278635 files
Tue May  9 07:49:47 EDT 2017: Deletion complete, Addition starts with 1272334 files
$ date
Tue May  9 13:12:58 EDT 2017

I am using a two-phase commit model. The deletion logic above uses
writer.deleteDocuments(query), and the addition uses writer.addDocument(doc).
Judging simply from this log, deletion doesn't seem to be taking long. What
am I missing?

On Tue, May 9, 2017 at 10:23 AM Adrien Grand <jpou...@gmail.com> wrote:

> addDocument can be a significant gain compared to updateDocument, as doing
> a PK lookup on a unique field has a cost that is not negligible compared
> to indexing a document, especially if the indexing chain is simple (no
> large text fields with complex analyzers). Reindexing in place will also
> cause more merging. Overall I find the 3x factor a bit high, but not too
> surprising if documents and the analysis chain are simple, and/or if
> storage is slow.
>
> On Tue, May 9, 2017 at 4:06 PM, Rob Audenaerde <rob.audenae...@gmail.com>
> wrote:
>
> > As far as I know, the updateDocument method on the IndexWriter does a
> > delete and an add. See also the javadoc:
> >
> > [..] Updates a document by first deleting the document(s) containing
> > term and then adding the new document. The delete and then add are
> > atomic as seen by a reader on the same index (flush may happen only
> > after the add). [..]
> >
> > On Tue, May 9, 2017 at 3:37 PM, Kudrettin Güleryüz <kudret...@gmail.com>
> > wrote:
> >
> > > I do update the entire document each time. Furthermore, this sometimes
> > > means deleting compressed archives, which are stored as multiple
> > > documents per compressed archive file, and re-adding them.
> > >
> > > Is there an update method, and does it perform better than remove then
> > > add? I was simply removing modified files from the index (which
> > > doesn't seem to take long) and re-adding them.
> > >
> > > On Tue, May 9, 2017 at 9:33 AM Rob Audenaerde <rob.audenae...@gmail.com>
> > > wrote:
> > >
> > > > Do you update each entire document? (vs updating numeric docvalues?)
> > > >
> > > > That is implemented as 'delete and add', so I guess that will be
> > > > slower than clean-sheet indexing. Not sure if it is 3x slower,
> > > > though; that seems a bit much.
> > > >
> > > > On Tue, May 9, 2017 at 3:24 PM, Kudrettin Güleryüz
> > > > <kudret...@gmail.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > For a 5.2.1 index that contains around 1.2 million documents,
> > > > > updating the index with 1.3 million files seems to take 3x longer
> > > > > than doing a scratch indexing. (Files are crawled over NFS;
> > > > > indexes are stored on a local mechanical disk (Btrfs).)
> > > > >
> > > > > Is this expected for Lucene's update index logic, or should I
> > > > > further debug my part of the code for update performance?
> > > > >
> > > > > Thank you,
> > > > > Kudret
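[For readers following along: the two approaches discussed in this thread can be sketched as below. This is a minimal illustration, not the poster's actual code; the field name "path" and the helper method names are hypothetical, but updateDocument, deleteDocuments, and addDocument are the real IndexWriter methods being compared.]

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;

class RefreshSketch {
    // (a) Atomic update: IndexWriter looks up documents matching the term,
    // deletes them, and adds the new document in one call. The per-document
    // term lookup is the cost Adrien refers to.
    static void refreshAtomic(IndexWriter writer, Document doc, String path)
            throws java.io.IOException {
        writer.updateDocument(new Term("path", path), doc);
    }

    // (b) Explicit delete-then-add, as in the two-phase approach described
    // above. Not atomic: a reader refreshed between the two calls may see
    // the document missing.
    static void refreshTwoPhase(IndexWriter writer, Document doc, String path)
            throws java.io.IOException {
        writer.deleteDocuments(new TermQuery(new Term("path", path)));
        writer.addDocument(doc);
    }
}
```

Either way, updating in place deletes and re-adds documents, which produces more segment merging than indexing into an empty index, so some slowdown relative to scratch indexing is expected.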