Sorry for the delay in responding. Holidays and all that. :) The retry approach did work; our process finished in the end. At some point, I suppose we'll just live with the chance that this might happen and dump a bunch of exceptions into the log, if the effort to fix it is too high. Being pragmatic and all.
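For the archives, in case anyone else hits this: stripped of our scheduling machinery, the retry amounts to something like the sketch below. This is a simplified illustration rather than our actual code (in production the whole pass is re-scheduled by our system instead of looping in-process, the class and method names are made up for the example, and the bookkeeping around serialId is more involved), but the shape is the same: if tryDeleteDocument ever returns false, throw the reader away, reopen it from the writer, and run the pass again.

// Simplified sketch, not production code: one de-duplication pass over an NRT
// reader, retried from scratch whenever tryDeleteDocument reports a stale reader.
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.util.Bits;

public class DedupWithRetry {

  public static void dedup(IndexWriter writer) throws IOException {
    // Keep retrying until a full pass completes without hitting a stale reader.
    while (!dedupOnePass(writer)) {
      // In our real system this "retry" is a re-scheduled job, not a tight loop.
    }
    writer.forceMerge(1, true);
    writer.commit();
  }

  // Returns false if tryDeleteDocument failed, meaning the reader went stale.
  private static boolean dedupOnePass(IndexWriter writer) throws IOException {
    DirectoryReader reader = DirectoryReader.open(writer, false);
    try {
      Bits liveDocs = MultiFields.getLiveDocs(reader);
      Set<String> seen = new HashSet<String>();
      for (int docId = 0; docId < reader.maxDoc(); ++docId) {
        if (liveDocs != null && !liveDocs.get(docId)) {
          continue; // skip docs already deleted
        }
        String serialId = reader.document(docId).get("serialId");
        if (!seen.add(serialId)) {
          // Duplicate serialId: keep the first occurrence, delete this one.
          if (!writer.tryDeleteDocument(reader, docId)) {
            return false; // reader is stale; caller retries with a fresh reader
          }
        }
      }
      return true;
    } finally {
      reader.close();
    }
  }
}

Deleting by docId this way is the only real option for us, since the duplicate documents are term-for-term identical (more on that in the quoted thread below).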
You are correct that preventing the duplicate indexing is hard. We do have things in place to try to prevent it, emphasis on the "try". Occasionally things go wrong and we get a small number of duplicates, but on at least one occasion that number was anything but small. ;)

I'm as sure as I can be that there were no merges running, since we're locking that directory before running this process. Everything we have that indexes uses that same lock, so unless merges happen in a background thread within Lucene, rather than in the calling thread that's adding new documents to the index, there should be no merges going on outside of this lock. In that case, calling waitForMerges shouldn't have any effect.

I know you've mentioned the infoStream a couple of times :) but I don't think turning it on would be a good idea in our case. We've only had this problem crop up once, so there's no guarantee at all that it'll happen again, and the infoStream logging would be a lot of data with all the indexing we're doing. Unfortunately, I just don't think it's feasible.

Thanks very much for the suggestion about FilterIndexReader with addIndices. That sounds very promising. I'm going to investigate doing our duplicate filtering that way instead; I've put a rough sketch of what I have in mind at the bottom of this mail, below the quoted thread.

Thanks again for the help.

Cheers :)
Derek Lewis

On Sat, Dec 21, 2013 at 5:13 AM, Michael McCandless <luc...@mikemccandless.com> wrote:
> OK, I see; so deleting by Term or Query is a no-go. I suppose the "retry" approach is actually fine: deleting by docID should be so fast that having to retry if any single docID failed is probably still plenty fast. Out of curiosity, if you have the numbers handy, how much time does it take to do all of your deletions (when it succeeds)?
>
> Maybe try to prevent indexing so many duplicate documents in the first place? But I assume that's hard for some reason.
>
> You could also make a FilterIndexReader subclass that filters out the duplicates, and pass that to addIndices(IR[]) to build the new de-duped index. There is also DuplicatesFilter...
>
> Indeed, I don't think tryDeleteDocument will ever trigger a new merge. But are you certain merges were not already running when you started? Maybe call IW.waitForMerges first? And turn on the infoStream ...
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Fri, Dec 20, 2013 at 1:50 PM, Derek Lewis <de...@lewisd.com> wrote:
> > I'll see if I can explain the scenario a bit more simply in a moment, but there's one other thing I thought worth mentioning.
> >
> > I'm not sure it's possible for me to fall back to Term/Query deleting. Basically, if there are two documents in the index that have the same serialId, it's the result of the same thing being indexed twice, so all the terms are going to be the same. If I understand right, the fallback method of deletion would then delete all the identical documents. I need to leave one (and only one) document in the index for each serialId, so I think deleting by docId is my only option.
> >
> > A simpler (though incomplete) description of the scenario:
> >
> > I have an index containing a bunch of segments, with millions of documents, each with a unique ID in a docField. However, due to some other conditions, I've ended up with some input documents indexed multiple times (hundreds or more), with the same serialId. I need to remove all those duplicates when I merge the indexes.
> > The code I have that does this (samples in the original email) never explicitly adds any documents to the index; it just creates the reader from the writer, calls tryDeleteDocument probably millions of times, and then force-merges everything. Somewhere along this process, while I'm still doing the deletes, it appears a segment is being merged away. I've walked through the code for tryDeleteDocument, and the things it calls, fairly deeply, and I can't figure out why it would be merging away segments. I've tried creating some test scenarios, but I never see it happen.
> >
> > On Fri, Dec 20, 2013 at 10:28 AM, Michael McCandless <luc...@mikemccandless.com> wrote:
> > > I couldn't quite follow the scenario ... but if there's any chance a merge could run in that IndexWriter, it can lead to this. Could it just be a merge that was already running at the start of your deletion process?
> > >
> > > Maybe turn on IndexWriter's infoStream to see what merges are kicking off?
> > >
> > > Really, your app should not consider this an "error" (it sounds like it throws an exception and retries again later until it succeeds)... it's better to delete those documents "the old-fashioned way". Relying on when IW starts/finishes merges is fragile (it's an implementation detail...).
> > >
> > > Mike McCandless
> > >
> > > http://blog.mikemccandless.com
> > >
> > > On Fri, Dec 20, 2013 at 1:06 PM, Derek Lewis <de...@lewisd.com> wrote:
> > > > Hi Mike,
> > > >
> > > > Thanks for the response. I realize that merging could cause segments to be deleted, resulting in tryDeleteDocument returning false. However, I've been unable to figure out why the scenario I've described would cause segments to be merged. I've tried duplicating it by writing indexes with many segments and deleting all the documents in them, but I haven't had any luck.
> > > >
> > > > Can you suggest any ways the scenario I've outlined would cause merges?
> > > >
> > > > Cheers,
> > > > Derek
> > > >
> > > > On Fri, Dec 20, 2013 at 9:50 AM, Michael McCandless <luc...@mikemccandless.com> wrote:
> > > > > tryDeleteDocument will return false if the IndexReader is "stale", i.e. the segment that contains the docID you are trying to delete has been merged by IndexWriter.
> > > > >
> > > > > In this case you need to fall back to deleting by Term/Query.
> > > > >
> > > > > Mike McCandless
> > > > >
> > > > > http://blog.mikemccandless.com
> > > > >
> > > > > On Fri, Dec 20, 2013 at 12:12 PM, Derek Lewis <de...@lewisd.com> wrote:
> > > > > > Hello,
> > > > > >
> > > > > > I have a problem where IndexWriter.tryDeleteDocument is returning false unexpectedly. Unfortunately, it's in production, on indexes that have since been merged and shunted around all over, and I've been unable to create a scenario that duplicates the problem in any development environment. It also means I haven't been able to find out exact details about the scenario, so some of this is extrapolation.
> > > > > >
> > > > > > The basic scenario is, I think, this: There is a Lucene index with millions of documents and a bunch of segments. Each of the documents has an associated "serialId" stored. There are many, many duplicates, due to a transient error that occurred.
> > > > > > Our system attempts to perform a process whereby it merges the index segments and deletes the documents with duplicate serialIds, so that at the end of the process we have only one segment and, for each serialId, only one document.
> > > > > >
> > > > > > We have an IndexWriter we created with:
> > > > > >
> > > > > > writer = new IndexWriter(
> > > > > >     FSDirectory.open(indexdir),
> > > > > >     config);
> > > > > >
> > > > > > We create a DirectoryReader:
> > > > > >
> > > > > > final DirectoryReader nearRealtimeReader = DirectoryReader.open(writer, false);
> > > > > >
> > > > > > which we use to iterate over the documents with:
> > > > > >
> > > > > > for (int docId = 0; docId < nearRealtimeReader.maxDoc(); ++docId) {
> > > > > >
> > > > > > For any document whose serialId indicates it's a duplicate (i.e. we've already seen that serialId), we delete it:
> > > > > >
> > > > > > final boolean deletionSuccessful = writer.tryDeleteDocument(nearRealtimeReader, docId);
> > > > > >
> > > > > > This works the vast majority of the time; however, in this case, which I haven't been able to reproduce, it returns false, which we check, and we throw an exception.
> > > > > >
> > > > > > What I found particularly interesting is that when our system re-schedules this process and tries again, it eventually succeeds, despite nothing else in our system writing to this index in the meantime. (Before indexes are shunted off to this merging process, they're "closed" to the rest of the system.) This seems to hint to me that maybe something is merging the segments of this index, even though we throw an exception before we get to the part of our code that calls:
> > > > > >
> > > > > > writer.forceMerge(1, true);
> > > > > > writer.commit();
> > > > > >
> > > > > > Any ideas as to why this might be happening?
> > > > > >
> > > > > > We're using Lucene 4.4.0, on Java 7 64-bit, on Solaris.
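P.S. Since I said I'd look into the FilterIndexReader/addIndices route, here's the rough sketch I have in mind, as promised above. Treat it as an untested illustration, not working code: I'm assuming the class to subclass in 4.4 is FilterAtomicReader (the old FilterIndexReader name), that addIndexes(IndexReader...) honours the liveDocs of the readers it's given, and that the analyzer in the destination IndexWriterConfig doesn't matter for addIndexes. "serialId" is our real stored field; the class names, directories, and everything else are made up for the example. The idea is to wrap each segment reader so that duplicate serialIds look deleted, then let addIndexes copy only the surviving documents into a fresh index.

// Untested sketch: build a de-duplicated copy of an index by hiding duplicate
// serialIds behind a custom liveDocs, then letting addIndexes() do the copying.
import java.io.File;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.FilterAtomicReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Bits;
import org.apache.lucene.util.FixedBitSet;

public class DedupByAddIndexes {

  // Wraps a segment reader so that any doc not marked in 'keep' looks deleted.
  private static final class DedupReader extends FilterAtomicReader {
    private final FixedBitSet keep;
    DedupReader(AtomicReader in, FixedBitSet keep) {
      super(in);
      this.keep = keep;
    }
    @Override public Bits getLiveDocs() { return keep; }
    @Override public int numDocs() { return keep.cardinality(); }
  }

  public static void dedup(File srcDir, File dstDir, IndexWriterConfig config) throws IOException {
    DirectoryReader reader = DirectoryReader.open(FSDirectory.open(srcDir));
    try {
      Set<String> seen = new HashSet<String>();
      IndexReader[] wrapped = new IndexReader[reader.leaves().size()];
      int i = 0;
      for (AtomicReaderContext ctx : reader.leaves()) {
        AtomicReader leaf = ctx.reader();
        Bits live = leaf.getLiveDocs();
        FixedBitSet keep = new FixedBitSet(leaf.maxDoc());
        for (int doc = 0; doc < leaf.maxDoc(); doc++) {
          if (live != null && !live.get(doc)) {
            continue; // respect deletions already in the source index
          }
          String serialId = leaf.document(doc).get("serialId");
          if (seen.add(serialId)) {
            keep.set(doc); // first occurrence of this serialId wins
          }
        }
        wrapped[i++] = new DedupReader(leaf, keep);
      }
      IndexWriter writer = new IndexWriter(FSDirectory.open(dstDir), config);
      try {
        writer.addIndexes(wrapped); // copies only the docs still marked live
        writer.forceMerge(1, true);
        writer.commit();
      } finally {
        writer.close();
      }
    } finally {
      reader.close();
    }
  }
}

If this pans out, it would replace the tryDeleteDocument pass entirely, and the staleness problem should go away because nothing is deleting against a live writer while the copy is being built.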