NightOwl888 commented on issue #935: URL: https://github.com/apache/lucenenet/issues/935#issuecomment-2118228819
Interesting. I tried with `net8.0` (which has been changed as the default IDE test platform since the last time I checked) with the same configuration you used and it doesn't fail there. But then I changed the test target to `net6.0` and it failed on the first run. So, that probably explains why you are not reproducing it. I think I worked out what is happening, though. ### Issue 1 The error indicates that `dwpt` is `null` on this line: https://github.com/NightOwl888/lucenenet/blob/878261fa920a961288639863da28d7c9f5d0d800/src/Lucene.Net/Index/DocumentsWriterFlushControl.cs#L471 Digging into the `DocumentsWriterPerThreadPool`, there is only 1 way it could return `null`, and that is if this method is called: https://github.com/NightOwl888/lucenenet/blob/878261fa920a961288639863da28d7c9f5d0d800/src/Lucene.Net/Index/DocumentsWriterPerThreadPool.cs#L91 We cannot pass [this line](https://github.com/NightOwl888/lucenenet/blob/878261fa920a961288639863da28d7c9f5d0d800/src/Lucene.Net/Index/DocumentsWriterFlushControl.cs#L465) unless `perThread.IsInitialized` returns `true`. And that can only happen if `dwpt` is not `null`. Therefore, to fail in this way, `DocumentsWriterPerThreadPool.Reset(perThread, closed)` must be called on a different thread while [these lines](https://github.com/NightOwl888/lucenenet/blob/878261fa920a961288639863da28d7c9f5d0d800/src/Lucene.Net/Index/DocumentsWriterFlushControl.cs#L467-L470) are executing. This explains why it is an exceedingly rare event. I checked and there are 4 calls to `DocumentsWriterPerThreadPool.Reset(perThread, closed)`. All of them are called within `UninterruptableMonitor.Enter(this)` except for this one: https://github.com/NightOwl888/lucenenet/blob/878261fa920a961288639863da28d7c9f5d0d800/src/Lucene.Net/Index/DocumentsWriterFlushControl.cs#L818 So, moving the `UninterruptableMonitor.Enter(this)` outside of that call should fix the problem. ### Issue 2 After that fix, we still got some failures. But it was also clear from the stack trace what was happening. If `TryLock()` returns `false`, we end up on this line: https://github.com/NightOwl888/lucenenet/blob/fix/GH-935-documentswriter-concurrency/src/Lucene.Net/Index/DocumentsWriterFlushControl.cs#L484 The [documentation for ReentrantLock.tryLock()](https://docs.oracle.com/javase%2F8%2Fdocs%2Fapi%2F%2F/java/util/concurrent/locks/ReentrantLock.html#tryLock--) states: > Even when this lock has been set to use a fair ordering policy, a call to tryLock() will immediately acquire the lock if it is available, whether or not other threads are currently waiting for the lock. This "barging" behavior can be useful in certain circumstances, even though it breaks fairness. If you want to honor the fairness setting for this lock, then use [tryLock(0, TimeUnit.SECONDS) ](https://docs.oracle.com/javase%2F8%2Fdocs%2Fapi%2F%2F/java/util/concurrent/locks/ReentrantLock.html#tryLock-long-java.util.concurrent.TimeUnit-)which is almost equivalent (it also detects interruption). > > If the current thread already holds this lock then the hold count is incremented by one and the method returns true. > > If the lock is held by another thread then this method will return immediately with the value false. So, it can return false, but I don't think it ever does in practice since a `null` return value from `InternalTryCheckOutForFlush()` will cause an error (assert in test, NullReferenceException in production). So, apparently Lucene relies on the "barging" behavior of this method to be put in the front of the queue. .NET has no such behavior, but I don't think it really matters which order the locks are taken in as long as `ReentrantLock.TryLock()` returns `true`. So, I changed it to call `UninterruptableMonitor.Enter(object, ref bool)` instead of `UninterruptableMonitor.TryEnter(object)`, which will always acquire the lock and return `true`. This probably won't perform as well because the method is supposed to return immediately and may end up deadlocking, but so far it seems to perform better and never deadlock. ----------------------- I am running the tests now to see if the above changes work on all platforms, but locally this seems promising. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@lucenenet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org