On Thu, Jul 12, 2012 at 6:55 PM, Sean Bridges <sean.brid...@gmail.com> wrote:
> Thanks for the tip.
>
> Does using updateDocument instead of addDocument affect
> indexing/search performance?

It does affect indexing performance compared to addDocument, but that
cost is probably minor compared to your analysis chain. I wouldn't worry
about updateDocument; it's really the only sensible way to use Lucene.
Why didn't you use it before, any reason? What is your ingest rate / doc
throughput, and at what point would you get concerned?

simon

>
> Sean
>
> On Thu, Jul 12, 2012 at 9:27 AM, Uwe Schindler <u...@thetaphi.de> wrote:
>> The trick is to index not with addDocument(Document) but instead with
>> updateDocument(Term, Document). Lucene then adds the document atomically
>> while deleting any previous documents with the given term (which is your
>> unique ID). If the key does not exist it simply indexes without deleting
>> anything.
>> By this you always have only one document with the same Term (==your unique
>> ID).
>>
>> Uwe
>>
>> -----
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: u...@thetaphi.de
>>
>>
>>> -----Original Message-----
>>> From: Sean Bridges [mailto:sean.brid...@gmail.com]
>>> Sent: Thursday, July 12, 2012 5:42 PM
>>> To: java-user@lucene.apache.org; simon.willna...@gmail.com
>>> Subject: Re: delete by docid in lucene 4
>>>
>>> We have indexer machines which are fed documents by other machines.
>>> If an error occurs (machine crashing etc) the same document may be sent to an
>>> indexer multiple times. Serial ids are assigned before documents reach the
>>> indexer, so a document may be in the index multiple times, each time with the
>>> same serial id.
>>>
>>> When the index gets large enough, the indexer will stop writing to the index,
>>> and upload it to another machine, which keeps the index forever. Before we
>>> upload the index, we forceMerge(1) on it, and gather some stats about the
>>> index like max,min serial id, total documents. While calculating max and min
>>> serial id, if we see a duplicate serial id, we call IndexReader.deleteByDocId(...).
>>>
>>> We could check for duplicate serial ids while indexing, but that is racy, and not
>>> as efficient.
>>>
>>> Thanks,
>>>
>>> Sean
>>>
>>>
>>> On Thu, Jul 12, 2012 at 12:42 AM, Simon Willnauer
>>> <simon.willna...@gmail.com> wrote:
>>> > On Thu, Jul 12, 2012 at 3:09 AM, Sean Bridges <sean.brid...@gmail.com> wrote:
>>> >> Is it possible to delete by docId in lucene 4? I can delete by docid
>>> >> in lucene 3 using IndexReader.deleteDocument(int docId), but that
>>> >> method is gone in lucene 4, and IndexWriter only allows deleting by
>>> >> Term or Query.
>>> >
>>> > that is correct. In lucene 4 IndexReader is really just a reader!
>>> >>
>>> >> This is our use case - In our system, each document is identified by
>>> >> a unique serial id. If an error occurs, we may index the same
>>> >> message multiple times. When an index grows large enough, we stop
>>> >> adding to it, and optimize the index. During optimization, if we see
>>> >> multiple docs with the same serialid, we delete all but the first, as
>>> >> all documents with the same serialid are the same.
>>> >
>>> > I am wondering why you don't use the IW#updateDocument(Term,Doc)
>>> > method? do you rely on multiple versions of the same doc? With Lucene
>>> > 4 relying on the doc id can become very tricky. If you use multiple
>>> > threads you create a lot of segments which can be merged in any order.
>>> > You can't tell if a document ID maintains happened-before semantics at
>>> > all.
>>> >
>>> > Can you tell us more about your use case and why you are using
>>> > deleteByDocID?
>>> >
>>> > simon
>>> >
>>> >
>>> >>
>>> >> Thanks,
>>> >>
>>> >> Sean
>>> >>
>>> >> ---------------------------------------------------------------------
>>> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
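
For readers of the archive, the add-versus-update behaviour Uwe describes in the thread can be sketched with a toy in-memory model. This is not Lucene code and says nothing about how Lucene implements it: ToyIndex, its methods, and the "serialId" field name are hypothetical, and only the Java standard library is used. The sketch only illustrates why redelivered documents pile up under add semantics but not under update-by-unique-term semantics.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Toy model of add vs. update-by-term semantics (hypothetical, not Lucene). */
final class ToyIndex {
    // Each "document" is a map of field name to value, kept in insertion order.
    private final List<Map<String, String>> docs = new ArrayList<>();

    /** Models add semantics: always appends, so a redelivered doc is duplicated. */
    void addDocument(Map<String, String> doc) {
        docs.add(doc);
    }

    /**
     * Models update semantics: remove every doc whose field equals value,
     * then append the new doc. If nothing matches, this is a plain add.
     */
    void updateDocument(String field, String value, Map<String, String> doc) {
        docs.removeIf(d -> value.equals(d.get(field)));
        docs.add(doc);
    }

    /** Number of docs whose field equals value. */
    long count(String field, String value) {
        return docs.stream().filter(d -> value.equals(d.get(field))).count();
    }

    /** Total number of docs in the toy index. */
    int size() {
        return docs.size();
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        Map<String, String> doc = Map.of("serialId", "42", "body", "hello");

        // The same document delivered twice with add semantics.
        index.addDocument(doc);
        index.addDocument(doc);
        System.out.println("after two adds: " + index.count("serialId", "42"));

        // Redelivery with update semantics replaces the earlier copies.
        index.updateDocument("serialId", "42", doc);
        System.out.println("after update:   " + index.count("serialId", "42"));
    }
}
```

In Lucene itself the corresponding call is the IndexWriter updateDocument(Term, Document) method named in the thread, with the Term built from the unique serial id field.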