Fair enough, however, I see this:

$ cat log
Tue May  9 07:19:45 EDT 2017: Indexing starts
Tue May  9 07:32:33 EDT 2017: Deletion starts with a list of 1278635 files
Tue May  9 07:49:47 EDT 2017: Deletion complete, Addition starts with 1272334 files
$ date
Tue May  9 13:12:58 EDT 2017

I am using a two-phase commit model. The deletion logic above uses
writer.deleteDocuments(query), and the addition uses writer.addDocument(doc).
Judging simply from this log, deletion doesn't seem to be taking long. What
am I missing?

On Tue, May 9, 2017 at 10:23 AM Adrien Grand <jpou...@gmail.com> wrote:

> addDocument can be a significant gain compared to updateDocument, as doing
> a PK lookup on a unique field has a cost that is not negligible compared
> to indexing a document, especially if the indexing chain is simple (no
> large text fields with complex analyzers). Reindexing in place will also
> cause more merging. Overall I find the 3x factor a bit high, but not too
> surprising if documents and the analysis chain are simple, and/or if
> storage is slow.
>
> On Tue, May 9, 2017 at 4:06 PM, Rob Audenaerde <rob.audenae...@gmail.com>
> wrote:
>
> > As far as I know, the updateDocument method on the IndexWriter does a
> > delete and an add. See also the javadoc:
> >
> > [..] Updates a document by first deleting the document(s) containing
> > term and then adding the new document. The delete and then add are
> > atomic as seen by a reader on the same index (flush may happen only
> > after the add). [..]
> >
> > On Tue, May 9, 2017 at 3:37 PM, Kudrettin Güleryüz <kudret...@gmail.com>
> > wrote:
> >
> > > I do update the entire document each time. Furthermore, this sometimes
> > > means deleting compressed archives, which are stored as multiple
> > > documents per compressed archive file, and re-adding them.
> > >
> > > Is there an update method, and does it perform better than remove then
> > > add? I was simply removing modified files from the index (which
> > > doesn't seem to take long) and re-adding them.
> > >
> > > On Tue, May 9, 2017 at 9:33 AM Rob Audenaerde <rob.audenae...@gmail.com>
> > > wrote:
> > >
> > > > Do you update each entire document? (vs updating numeric docvalues?)
> > > >
> > > > That is implemented as 'delete and add', so I guess that will be
> > > > slower than clean-sheet indexing. Not sure if it is 3x slower,
> > > > though; that seems a bit much.
> > > >
> > > > On Tue, May 9, 2017 at 3:24 PM, Kudrettin Güleryüz
> > > > <kudret...@gmail.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > For a 5.2.1 index that contains around 1.2 million documents,
> > > > > updating the index with 1.3 million files seems to take 3x longer
> > > > > than doing a scratch indexing. (Files are crawled over NFS;
> > > > > indexes are stored on a local mechanical disk (Btrfs).)
> > > > >
> > > > > Is this expected for Lucene's update index logic, or should I
> > > > > further debug my part of the code for update performance?
> > > > >
> > > > > Thank you,
> > > > > Kudret
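[For readers following along: the two approaches discussed in this thread can be sketched as below. This is a minimal illustration, not the poster's actual code; the field name "path" and the helper method names are hypothetical, but updateDocument, deleteDocuments, and addDocument are the real IndexWriter methods being compared.]

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;

class RefreshSketch {
    // (a) Atomic update: IndexWriter looks up documents matching the term,
    // deletes them, and adds the new document in one call. The per-document
    // term lookup is the cost Adrien refers to.
    static void refreshAtomic(IndexWriter writer, Document doc, String path)
            throws java.io.IOException {
        writer.updateDocument(new Term("path", path), doc);
    }

    // (b) Explicit delete-then-add, as in the two-phase approach described
    // above. Not atomic: a reader refreshed between the two calls may see
    // the document missing.
    static void refreshTwoPhase(IndexWriter writer, Document doc, String path)
            throws java.io.IOException {
        writer.deleteDocuments(new TermQuery(new Term("path", path)));
        writer.addDocument(doc);
    }
}
```

Either way, updating in place deletes and re-adds documents, which produces more segment merging than indexing into an empty index, so some slowdown relative to scratch indexing is expected.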