I see, that makes more sense now. The query is a BooleanQuery. Here is what I do: https://gist.github.com/Kudret/56879bf30fa129e752895305e1db5a80
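For context, the overall shape is: for each modified file, build a small BooleanQuery that identifies its document(s), hand it to deleteDocuments, re-add the fresh copy with addDocument, and commit at the end. Below is a stripped-down sketch of that pattern against the 5.2.x API; the field names ("path", "project", "contents") and the one-query-per-file loop are placeholders for illustration, not the exact fields and clauses from the gist.

    import java.io.FileReader;
    import java.io.IOException;
    import java.util.List;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    class UpdateSketch {
        static void updateFiles(IndexWriter writer, String projectName,
                                List<String> modifiedPaths) throws IOException {
            for (String path : modifiedPaths) {
                // One small BooleanQuery per modified file; "path" and "project"
                // are placeholder field names.
                BooleanQuery deleteQuery = new BooleanQuery();  // mutable BooleanQuery, 5.2.x API
                deleteQuery.add(new TermQuery(new Term("path", path)), BooleanClause.Occur.MUST);
                deleteQuery.add(new TermQuery(new Term("project", projectName)), BooleanClause.Occur.MUST);

                // Buffered immediately; resolving the query to docIDs (the costly
                // part) happens lazily at a merge, refresh, or commit.
                writer.deleteDocuments(deleteQuery);

                // Re-add the fresh copy of the file.
                Document doc = new Document();
                doc.add(new StringField("path", path, Field.Store.YES));
                doc.add(new StringField("project", projectName, Field.Store.YES));
                doc.add(new TextField("contents", new FileReader(path)));
                writer.addDocument(doc);
            }
            writer.commit();  // buffered deletes get resolved by the time this returns
        }
    }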
On Wed, May 10, 2017 at 1:31 PM Michael McCandless <luc...@mikemccandless.com>
wrote:

> IndexWriter simply buffers the Query you passed to deleteDocuments, so
> that's very fast.
>
> Only later on will it (lazily) resolve that Query to the docIDs to delete,
> which is the costly part, when a merge wants to kick off, or a refresh, or
> a commit.
>
> What Query are you using to identify documents to delete?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Tue, May 9, 2017 at 1:13 PM, Kudrettin Güleryüz <kudret...@gmail.com>
> wrote:
>
>> Fair enough, however, I see this:
>>
>> $ cat log
>> Tue May 9 07:19:45 EDT 2017: Indexing starts
>> Tue May 9 07:32:33 EDT 2017: Deletion starts with a list of 1278635 files
>> Tue May 9 07:49:47 EDT 2017: Deletion complete, Addition starts with 1272334 files
>>
>> $ date
>> Tue May 9 13:12:58 EDT 2017
>>
>> I am using the two-phase commit model. The deletion logic above uses
>> writer.deleteDocuments(query), and the addition uses
>> writer.addDocument(doc). Judging simply from this log, deletion doesn't
>> seem to be taking long. What am I missing?
>>
>> On Tue, May 9, 2017 at 10:23 AM Adrien Grand <jpou...@gmail.com> wrote:
>>
>> > addDocument can be a significant gain compared to updateDocument, as
>> > doing a PK lookup on a unique field has a cost that is not negligible
>> > compared to indexing a document, especially if the indexing chain is
>> > simple (no large text fields with complex analyzers). Reindexing in
>> > place will also cause more merging. Overall I find the 3x factor a bit
>> > high, but not too surprising if documents and the analysis chain are
>> > simple, and/or if storage is slow.
>> >
>> > On Tue, May 9, 2017 at 4:06 PM, Rob Audenaerde <rob.audenae...@gmail.com>
>> > wrote:
>> >
>> > > As far as I know, the updateDocument method on the IndexWriter does a
>> > > delete and an add. See also the javadoc:
>> > >
>> > > [..] Updates a document by first deleting the document(s)
>> > > containing term and then adding the new
>> > > document. The delete and then add are atomic as seen
>> > > by a reader on the same index (flush may happen only after
>> > > the add). [..]
>> > >
>> > > On Tue, May 9, 2017 at 3:37 PM, Kudrettin Güleryüz <kudret...@gmail.com>
>> > > wrote:
>> > >
>> > > > I do update the entire document each time. Furthermore, this sometimes
>> > > > means deleting compressed archives, which are stored as multiple
>> > > > documents per compressed archive file, and re-adding them.
>> > > >
>> > > > Is there an update method, and does it perform better than remove then
>> > > > add? I was simply removing modified files from the index (which doesn't
>> > > > seem to take long) and re-adding them.
>> > > >
>> > > > On Tue, May 9, 2017 at 9:33 AM Rob Audenaerde <rob.audenae...@gmail.com>
>> > > > wrote:
>> > > >
>> > > > > Do you update each entire document? (vs. updating numeric docvalues?)
>> > > > >
>> > > > > That is implemented as 'delete and add', so I guess that will be slower
>> > > > > than clean-sheet indexing. Not sure if it is 3x slower, though; that
>> > > > > seems a bit much?
>> > > > >
>> > > > > On Tue, May 9, 2017 at 3:24 PM, Kudrettin Güleryüz <kudret...@gmail.com>
>> > > > > wrote:
>> > > > >
>> > > > > > Hi,
>> > > > > >
>> > > > > > For a 5.2.1 index that contains around 1.2 million documents, updating
>> > > > > > the index with 1.3 million files seems to take 3X longer than doing a
>> > > > > > scratch indexing. (Files are crawled over NFS, indexes are stored on a
>> > > > > > mechanical disk locally (Btrfs).)
>> > > > > >
>> > > > > > Is this expected for Lucene's update index logic, or should I further
>> > > > > > debug my part of the code for update performance?
>> > > > > >
>> > > > > > Thank you,
>> > > > > > Kudret
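For comparison, the updateDocument route Rob mentioned folds the delete and the add into one call that is atomic as seen by readers of the same index. A minimal sketch, reusing the same imports and the placeholder "path" and "contents" fields from the sketch above:

    // updateDocument deletes whatever matches the given term, then adds the
    // new document; "path" is assumed to be a unique, non-tokenized key field.
    Document doc = new Document();
    doc.add(new StringField("path", path, Field.Store.YES));
    doc.add(new TextField("contents", new FileReader(path)));
    writer.updateDocument(new Term("path", path), doc);

Internally this is still a delete followed by an add, so per Adrien's point it also pays the term lookup and the extra merging; it mainly saves building and resolving a separate delete query per file.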