As far as the build system goes, I implemented the two ideas mentioned earlier in this message (not creating a new Formatter for each test, and not spawning 26 jvms for each batch)
Jira is down, but if you want to help test you can try a patch here: http://pastebin.com/iqwb73H2 (click Raw/Download) Additionally this cuts 1:20 off the total Solrcene 'ant clean test' for me. before: BUILD SUCCESSFUL Total time: 7 minutes 42 seconds after: BUILD SUCCESSFUL Total time: 6 minutes 23 seconds On Wed, Apr 28, 2010 at 12:25 PM, Michael McCandless < [email protected]> wrote: > I think this are good changes to NativeFSLockFactory. > > But: the chances that N JVMs launched at once would conflict on the > randomly generated lock file name should be miniscule... though it > does depend on how good new Random() is at seeding itself. Do we > really think this explains your exceptions Shai? (And, if so, even w/ > these changes, the conflict could still happen?) Maybe we should > explicitly seed it? > > Mike > > On Wed, Apr 28, 2010 at 11:22 AM, Shai Erera <[email protected]> wrote: > > I'd like to summarize the IRC discussion Mark and I had: > > > > The lock file's existence in the directory should not fail obtain() from > > retrieving obtaining a lock. That's the whole difference between Simple > and > > Native. So we should make a best-effort to delete it. If the delete fails > on > > release(), then ok. On obtain(), we won't return false if the lock > exists, > > but attempt to really obtain it and fail appropriately. > > > > While the previously proposed fix (add "&& path.exists()" to release()) > > might work most of the times, it will only work "most of the times". > I.e., > > between release() and delete(), an external process, like AntiVirus, > might > > lock the file, and delete will fail, but the file will still be there, > and > > we'll throw an exception still. > > > > So, the proposed changes are: > > * release() is allowed to fail to delete the lock file. > > * obtain() should not return false if the lock file exists - it should > > really attempt to obtain it. > > * in acquireTestLock(), if after release() is called, the lock file still > > exists, we'll retry the delete few ms later, and if that fails, call > > deleteOnExit. > > > > How's that sound? > > > > Shai > > > > On Wed, Apr 28, 2010 at 5:58 PM, Mark Miller <[email protected]> > wrote: > >> > >> I don't follow. The simple lock impl must delete the file, but the > native > >> impl should not have to. The file has nothing to do with the lock - its > just > >> the medium to ask for and release the lock. If it already exists, you > don't > >> have to create it - you can just use it to try and get a native lock. > >> Likewise, it doesn't need to be removed to release a native lock - you > >> simply call unlock on it. > >> > >> On 4/28/10 10:34 AM, Shai Erera wrote: > >>> > >>> But this method is called also for the regular lock file - if release() > >>> won't delete the file, then the next l.obtain() will return false. > >>> > >>> Shai > >>> > >>> On Wed, Apr 28, 2010 at 5:31 PM, Mark Miller <[email protected] > >>> <mailto:[email protected]>> wrote: > >>> > >>> It shouldn't need too though - the native lock file is simply a > >>> dummy file to apply the lock too - shouldn't matter if it already > >>> exists or not (though it seems to in the current code). > >>> > >>> > >>> On 4/28/10 10:22 AM, Shai Erera wrote: > >>> > >>> If you won't delete the file, the next obtain will fail? > >>> > >>> On Wed, Apr 28, 2010 at 5:12 PM, Mark Miller > >>> <[email protected] <mailto:[email protected]> > >>> <mailto:[email protected] <mailto:[email protected]>>> > >>> wrote: > >>> > >>> I wonder if not being able to delete the file should throw a > >>> release > >>> failed exception at all. You have actually released the > >>> native lock > >>> - you where just not able to clean up - but that seems more > >>> like a > >>> warning situation than a failure. > >>> > >>> > >>> -- > >>> - Mark > >>> > >>> http://www.lucidimagination.com > >>> > >>> On 4/28/10 9:53 AM, Shai Erera wrote: > >>> > >>> I've hit it again and here's the full stacktrace (at > least > >>> what's printed): > >>> > >>> [junit] Exception in thread "main" > >>> java.lang.RuntimeException: > >>> Failed to acquire random test lock; please verify > >>> filesystem for > >>> lock > >>> directory > >>> 'C:\DOCUME~1\shaie\LOCALS~1\Temp\lucene_junit_lock' > >>> supports > >>> locking > >>> [junit] at > >>> > >>> > >>> > > org.apache.lucene.store.NativeFSLockFactory.acquireTestLock(NativeFSLockFactory.java:88) > >>> [junit] at > >>> > >>> > >>> > > org.apache.lucene.store.NativeFSLockFactory.makeLock(NativeFSLockFactory.java:127) > >>> [junit] at > >>> > >>> > >>> > > org.apache.lucene.util.LuceneJUnitResultFormatter.<init>(LuceneJUnitResultFormatter.java:74) > >>> [junit] at > >>> java.lang.J9VMInternals.newInstanceImpl(Native Method) > >>> [junit] at > >>> java.lang.Class.newInstance(Class.java:1325) > >>> [junit] at > >>> > >>> > >>> > > org.apache.tools.ant.taskdefs.optional.junit.FormatterElement.createFormatter(FormatterElement.java:248) > >>> [junit] at > >>> > >>> > >>> > > org.apache.tools.ant.taskdefs.optional.junit.FormatterElement.createFormatter(FormatterElement.java:214) > >>> [junit] at > >>> > >>> > >>> > > org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.transferFormatters(JUnitTestRunner.java:819) > >>> [junit] at > >>> > >>> > >>> > > org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:909) > >>> [junit] at > >>> > >>> > >>> > > org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:743) > >>> [junit] Caused by: > >>> org.apache.lucene.store.LockReleaseFailedException: > >>> failed to delete > >>> > >>> > >>> > C:\DOCUME~1\shaie\LOCALS~1\Temp\lucene_junit_lock\lucene-wn1v4z-test.lock > >>> [junit] at > >>> > >>> > >>> > org.apache.lucene.store.NativeFSLock.release(NativeFSLockFactory.java:311) > >>> [junit] at > >>> > >>> > >>> > > org.apache.lucene.store.NativeFSLockFactory.acquireTestLock(NativeFSLockFactory.java:86) > >>> [junit] ... 9 more > >>> > >>> The exception is thrown from NativeFSLock.release() b/c > >>> it fails to > >>> delete the lock file. I think I know what the problem is > >>> - and > >>> it must > >>> be related to the large number of JVMs that are created > >>> w/ the > >>> parallel > >>> tests: > >>> * Suppose that JVM1 draws the number '1' for the test > >>> lock file - it > >>> thus creates lock1. > >>> * Now suppose that JVM2 draws the same number, magically > >>> somehow > >>> - it > >>> thus creates lock1 as well. > >>> * The code of acquireTestLock in NativeFSLockFactory > >>> looks like > >>> this: > >>> Lock l = makeLock(randomLockName); > >>> try { > >>> l.obtain(); > >>> l.release(); > >>> --> both will create the same test Lock file. Then > >>> l.obtain() > >>> probably > >>> returns false for one of them, but it's not checked. > >>> * Then in release there are a couple of things to note: > >>> 1) the method is synced on the instance, which does not > >>> affect > >>> the two JVMs. > >>> 2) suppose that both JVMs pass through the if (exists()) > >>> check. Then > >>> JVM1 releases the lock, and deletes the file. > >>> 3) Now JVM2 kicks in, calls lock.release() which has no > >>> effect > >>> (from the > >>> jdoc: "If this lock object is invalid then invoking this > >>> method > >>> has no > >>> effect." ). Then when it comes to path.delete(), the > >>> file isn't > >>> there, > >>> the method returns false and thus an exception is thrown > >>> ... > >>> > >>> This situation is extremely unlikely to happen, but > >>> still, it > >>> happens on > >>> my machine quite frequently since the parallel tests. > I'm > >>> thinking that > >>> acquireTestLock should be less strict, but perhaps we > >>> can fix it > >>> if we > >>> replace the line: > >>> if (!path.delete()) (line 310) > >>> with this > >>> if (!path.delete() && path.exists()) > >>> > >>> I.e., if the lock file fails to delete but is still > >>> there, throw the > >>> exception ... > >>> > >>> What do you think? > >>> > >>> Shai > >>> > >>> On Tue, Apr 27, 2010 at 10:21 PM, Robert Muir > >>> <[email protected] <mailto:[email protected]> > >>> <mailto:[email protected] <mailto:[email protected]>> > >>> <mailto:[email protected] <mailto:[email protected]> > >>> <mailto:[email protected] <mailto:[email protected]>>>> wrote: > >>> > >>> > >>> > >>> On Tue, Apr 27, 2010 at 3:06 PM, Andi Vajda > >>> <[email protected] <mailto:[email protected]> > >>> <mailto:[email protected] <mailto:[email protected] > >> > >>> <mailto:[email protected] <mailto:[email protected] > > > >>> > >>> <mailto:[email protected] > >>> <mailto:[email protected]>>>> wrote: > >>> > >>> > >>> I've had similar random failures on Mac OS X > >>> 10.6. They > >>> started > >>> happening recently, about two weeks ago. > >>> > >>> > >>> Thats just too randomly close to when i last worked > >>> on this > >>> build > >>> system stuff for LUCENE-1709... perhaps I made it > >>> worse > >>> instead of > >>> better. > >>> > >>> -- > >>> Robert Muir > >>> [email protected] <mailto:[email protected]> > >>> <mailto:[email protected] <mailto:[email protected]>> > >>> <mailto:[email protected] <mailto:[email protected]> > >>> <mailto:[email protected] <mailto:[email protected]>>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> --------------------------------------------------------------------- > >>> To unsubscribe, e-mail: [email protected] > >>> <mailto:[email protected]> > >>> <mailto:[email protected] > >>> <mailto:[email protected]>> > >>> > >>> For additional commands, e-mail: [email protected] > >>> <mailto:[email protected]> > >>> <mailto:[email protected] > >>> <mailto:[email protected]>> > >>> > >>> > >>> > >>> > >>> -- > >>> - Mark > >>> > >>> http://www.lucidimagination.com > >>> > >>> > --------------------------------------------------------------------- > >>> To unsubscribe, e-mail: [email protected] > >>> <mailto:[email protected]> > >>> For additional commands, e-mail: [email protected] > >>> <mailto:[email protected]> > >>> > >>> > >> > >> > >> -- > >> - Mark > >> > >> http://www.lucidimagination.com > >> > >> --------------------------------------------------------------------- > >> To unsubscribe, e-mail: [email protected] > >> For additional commands, e-mail: [email protected] > >> > > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [email protected] > For additional commands, e-mail: [email protected] > > -- Robert Muir [email protected]
