Thanks for the tip. Does using updateDocument instead of addDocument affect indexing/search performance?

Sean

On Thu, Jul 12, 2012 at 9:27 AM, Uwe Schindler <u...@thetaphi.de> wrote:
> The trick is to index not with addDocument(Document) but instead with
> updateDocument(Term, Document). Lucene then adds the document atomically
> while deleting any previous documents with the given term (which is your
> unique ID). If the key does not exist it simply indexes without deleting
> anything.
> By this you always have only one document with the same Term (==your unique
> ID).
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
>> -----Original Message-----
>> From: Sean Bridges [mailto:sean.brid...@gmail.com]
>> Sent: Thursday, July 12, 2012 5:42 PM
>> To: java-user@lucene.apache.org; simon.willna...@gmail.com
>> Subject: Re: delete by docid in lucene 4
>>
>> We have indexer machines which are fed documents by other machines.
>> If an error occurs (machine crashing etc) the same document may be sent
>> to an indexer multiple times. Serial ids are assigned before documents
>> reach the indexer, so a document may be in the index multiple times,
>> each time with the same serial id.
>>
>> When the index gets large enough, the indexer will stop writing to the
>> index, and upload it to another machine, which keeps the index forever.
>> Before we upload the index, we forceMerge(1) on it, and gather some
>> stats about the index like max, min serial id, total documents. While
>> calculating max and min serial id, if we see a duplicate serial id, we
>> call IndexReader.deleteByDocId(...).
>>
>> We could check for duplicate serial ids while indexing, but that is
>> racy, and not as efficient.
>>
>> Thanks,
>>
>> Sean
>>
>>
>> On Thu, Jul 12, 2012 at 12:42 AM, Simon Willnauer
>> <simon.willna...@gmail.com> wrote:
>> > On Thu, Jul 12, 2012 at 3:09 AM, Sean Bridges <sean.brid...@gmail.com> wrote:
>> >> Is it possible to delete by docId in lucene 4? I can delete by docid
>> >> in lucene 3 using IndexReader.deleteDocument(int docId), but that
>> >> method is gone in lucene 4, and IndexWriter only allows deleting by
>> >> Term or Query.
>> >
>> > that is correct. In lucene 4 IndexReader is really just a reader!
>> >>
>> >> This is our use case - In our system, each document is identified by
>> >> a unique serial id. If an error occurs, we may index the same
>> >> message multiple times. When an index grows large enough, we stop
>> >> adding to it, and optimize the index. During optimization, if we see
>> >> multiple docs with the same serialid, we delete all but the first, as
>> >> all documents with the same serialid are the same.
>> >
>> > I am wondering why you don't use the IW#updateDocument(Term,Doc)
>> > method? do you rely on multiple versions of the same doc? With Lucene
>> > 4 relying on the doc id can become very tricky. If you use multiple
>> > threads you create a lot of segments which can be merged in any order.
>> > You can't tell if a document ID maintains happened-before semantics at
>> > all.
>> >
>> > Can you tell us more about your usecase and why you are using
>> > deleteByDocID
>> >
>> > simon
>> >
>> >
>> >>
>> >> Thanks,
>> >>
>> >> Sean
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
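
For readers following the thread, the difference Uwe describes between addDocument(Document) and updateDocument(Term, Document) can be pictured with a toy in-memory model. This is only a sketch of the delete-then-add semantics, not Lucene code: the ToyIndex class, its Doc type, and the method bodies are hypothetical stand-ins invented for illustration, and the sketch says nothing about the performance question asked above.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-in for an index. NOT the Lucene API; all names here are hypothetical. */
final class ToyIndex {
    /** Minimal document: a serial id (the unique key) plus a body. */
    static final class Doc {
        final String serialId;
        final String body;
        Doc(String serialId, String body) {
            this.serialId = serialId;
            this.body = body;
        }
    }

    private final List<Doc> docs = new ArrayList<>();

    /** Models addDocument: appends blindly, so a redelivered document is stored twice. */
    synchronized void addDocument(Doc doc) {
        docs.add(doc);
    }

    /** Models updateDocument: removes every doc with the same key, then adds, as one step. */
    synchronized void updateDocument(String serialId, Doc doc) {
        docs.removeIf(d -> d.serialId.equals(serialId));
        docs.add(doc);
    }

    /** Number of stored docs carrying the given serial id. */
    synchronized int count(String serialId) {
        int n = 0;
        for (Doc d : docs) {
            if (d.serialId.equals(serialId)) {
                n++;
            }
        }
        return n;
    }

    synchronized int size() {
        return docs.size();
    }

    public static void main(String[] args) {
        ToyIndex withAdd = new ToyIndex();
        ToyIndex withUpdate = new ToyIndex();
        // Serial id 42 is delivered twice, as it might be after a feeder crash and retry.
        String[] deliveries = {"41", "42", "42", "43"};
        for (String id : deliveries) {
            Doc doc = new Doc(id, "message " + id);
            withAdd.addDocument(doc);
            withUpdate.updateDocument(id, doc);
        }
        System.out.println("addDocument:    " + withAdd.size() + " docs, "
                + withAdd.count("42") + " with id 42");
        System.out.println("updateDocument: " + withUpdate.size() + " docs, "
                + withUpdate.count("42") + " with id 42");
    }
}
```

In the toy model the redelivered serial id ends up stored twice with addDocument and once with updateDocument, which is the property that would make the duplicate sweep after forceMerge(1) unnecessary. In real Lucene the key would be a Term on the serial id field passed to IndexWriter.updateDocument, as described in Uwe's message.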