On Tue, Apr 28, 2009 at 6:09 AM, Shai Erera <ser...@gmail.com> wrote: > Hi > > I think I've hit a bug in ConcurrentMergeScheduler, but I'd like those who > are more familiar with the code to review it. I ran > TestStressSort.testSort() and started to get AIOOB exceptions from > MergeThread, the CPU spiked to 98-100% and did not end for a couple of > minutes, until I was able to regain control and kill the process (looks like > an infinite loop). > > To reproduce it all you need is to add the following line to > PQ.initialize(): size = maxSize, and then you'll get the aforementioned > exceptions. I did it acceindentally, but I'm sure there's a way to reproduce > it with a JUnit test or something so that it will happen consistently.
It sounds like this caused every merge to always hit an exception while merging? > When I debugged-trace the test, I noticed that MergeThread are just spawned > forever. The reason is this: In CMS.merge(IndexWriter) there's a 'while > (true)' loop which does 'while (mergeThreadCount() >= maxThreadCount)' and > if false just spawns a new MergeThread. On the other hand, in > MergeThread.run there's a try-finally which executes whatever it needs to > execute and in the finally block removes this thread from the list of > threads. That causes CMS to spawn a new thread, which will hit another > exception, remove itself from the queue and CMS will spawn a new thread. > That puts the code into an infinite loop. Unfortunately, this is tricky to fix correctly, because IW/MergeScheduler knows so little about what actually failed. Right now, if a merge hits an exception, IW simply undoes everything it did (removes partial files, allows new merges to try merging the segments again, etc). If it was some transient IO error, or say a transient disk full situation, retrying the merge seems good. But if it's some [temporary] bug in Lucene, and every merge will always hit an exception, then retrying is hopeless. Likewise a corrupt index, a disk full that won't clear up, sudden permission denied errors on opening new files, etc., retrying is hopeless. > That sounds like a bug to me ... I think that if MergeThread hits any > exception, the merge should fail? Anyway, the exception is added to an > exceptions List, which is a private member of CMS but is never chceked by > CMS. Perhaps merge(IndexWriter) should check if the exceptions list is not > empty and fail the merge in such case? Actually the merge does "fail", and IW "undoes" its changes, but then MergePolicy is free to pick merges again, and in turn picks the very same merge. The exceptions list is normally only checked during Lucene's unit tests, but apps could check it as well. > Anyway, I'll fix PQ's code now to continue my work, but if you want to > reproduce it, it's as easy as adding size = maxSize to initialize() and run > TestStressSort. > > I don't mind to open an issue and fix it (though I'm not sure what should > the fix be at the moment, but I'll figure it out), but it will have to wait, > so if you know the code and can put a patch together quickly, don't wait up > for me :) I think maybe the best/simplest fix is to simply sleep for a bit (250 msec?) on hitting an exception while merging? This way CPU won't be pegged, you won't suddenly see zillions of exceptions streaming by, etc. Mike --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org