We have indexer machines which are fed documents by other machines. If an error occurs (a machine crashing, for example), the same document may be sent to an indexer multiple times. Serial ids are assigned before documents reach the indexer, so a document may be in the index multiple times, each time with the same serial id.
When the index gets large enough, the indexer stops writing to it and uploads it to another machine, which keeps the index forever. Before we upload the index, we call forceMerge(1) on it and gather some stats about it, such as the max and min serial id and the total number of documents. While calculating the max and min serial id, if we see a duplicate serial id, we call IndexReader.deleteDocument(int docId) on the duplicate. We could check for duplicate serial ids while indexing, but that is racy and not as efficient.

Thanks,

Sean

On Thu, Jul 12, 2012 at 12:42 AM, Simon Willnauer <simon.willna...@gmail.com> wrote:
> On Thu, Jul 12, 2012 at 3:09 AM, Sean Bridges <sean.brid...@gmail.com> wrote:
>> Is it possible to delete by docId in lucene 4? I can delete by docid
>> in lucene 3 using IndexReader.deleteDocument(int docId), but that
>> method is gone in lucene 4, and IndexWriter only allows deleting by
>> Term or Query.
>
> that is correct. In lucene 4 IndexReader is really just a reader!
>
>>
>> This is our use case - In our system, each document is identified by
>> a unique serial id. If an error occurs, we may index the same message
>> multiple times. When an index grows large enough, we stop adding to
>> it, and optimize the index. During optimization, if we see multiple
>> docs with the same serialid, we delete all but the first, as all
>> documents with the same serialid are the same.
>
> I am wondering why you don't use the IW#updateDocument(Term,Doc)
> method? do you rely on multiple versions of the same doc? With Lucene
> 4 relying on the doc id can become very tricky. If you use multiple
> threads you create a lot of segments which can be merged in any order.
> You can't tell if a document ID maintains happened-before semantics at
> all.
>
> Can you tell us more about your usecase and why you are using deleteByDocID
>
> simon
>
>>
>> Thanks,
>>
>> Sean
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
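
To make the pre-upload pass described at the top of this message concrete, here is a rough sketch in plain Java. It makes no Lucene calls: the class name SerialIdStats, its fields, and the scan method are made up for illustration, and the input array stands in for however the serial id of each document is actually read from the merged index. It only shows the bookkeeping: one pass in docId order that tracks min, max and count, and collects the docIds of every copy after the first.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene: one pass over the serial ids of a
// merged index, visited in docId order.
final class SerialIdStats {
    long min = Long.MAX_VALUE;   // smallest serial id seen
    long max = Long.MIN_VALUE;   // largest serial id seen
    int uniqueDocs = 0;          // documents that survive deduplication
    final List<Integer> duplicateDocIds = new ArrayList<>(); // docIds to delete

    // Assumption: serialIdByDocId[i] holds the serial id stored in document i.
    static SerialIdStats scan(long[] serialIdByDocId) {
        SerialIdStats stats = new SerialIdStats();
        Set<Long> seen = new HashSet<>();
        for (int docId = 0; docId < serialIdByDocId.length; docId++) {
            long serialId = serialIdByDocId[docId];
            if (!seen.add(serialId)) {
                // This serial id already appeared at a lower docId:
                // keep that first copy and mark this one for deletion.
                stats.duplicateDocIds.add(docId);
                continue;
            }
            stats.min = Math.min(stats.min, serialId);
            stats.max = Math.max(stats.max, serialId);
            stats.uniqueDocs++;
        }
        return stats;
    }
}
```

In Lucene 3, each docId collected in duplicateDocIds would then be passed to IndexReader.deleteDocument(int docId). The sketch keeps every serial id in memory in a HashSet, which is an assumption about index size and not something we have measured.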