NightOwl888 commented on issue #935:
URL: https://github.com/apache/lucenenet/issues/935#issuecomment-2118228819

   Interesting. I tried with `net8.0` (which has been changed as the default 
IDE test platform since the last time I checked) with the same configuration 
you used and it doesn't fail there. But then I changed the test target to 
`net6.0` and it failed on the first run. So, that probably explains why you are 
not reproducing it.
   
   I think I worked out what is happening, though. 
   
   ### Issue 1
   
   The error indicates that `dwpt` is `null` on this line:
   
   
https://github.com/NightOwl888/lucenenet/blob/878261fa920a961288639863da28d7c9f5d0d800/src/Lucene.Net/Index/DocumentsWriterFlushControl.cs#L471
   
   Digging into the `DocumentsWriterPerThreadPool`, there is only 1 way it 
could return `null`, and that is if this method is called: 
https://github.com/NightOwl888/lucenenet/blob/878261fa920a961288639863da28d7c9f5d0d800/src/Lucene.Net/Index/DocumentsWriterPerThreadPool.cs#L91
   
   We cannot pass [this 
line](https://github.com/NightOwl888/lucenenet/blob/878261fa920a961288639863da28d7c9f5d0d800/src/Lucene.Net/Index/DocumentsWriterFlushControl.cs#L465)
 unless `perThread.IsInitialized` returns `true`. And that can only happen if 
`dwpt` is not `null`.
   
   Therefore, to fail in this way, 
`DocumentsWriterPerThreadPool.Reset(perThread, closed)` must be called on a 
different thread while [these 
lines](https://github.com/NightOwl888/lucenenet/blob/878261fa920a961288639863da28d7c9f5d0d800/src/Lucene.Net/Index/DocumentsWriterFlushControl.cs#L467-L470)
 are executing. This explains why it is an exceedingly rare event.
   
   I checked and there are 4 calls to 
`DocumentsWriterPerThreadPool.Reset(perThread, closed)`. All of them are called 
within `UninterruptableMonitor.Enter(this)` except for this one:
   
   
https://github.com/NightOwl888/lucenenet/blob/878261fa920a961288639863da28d7c9f5d0d800/src/Lucene.Net/Index/DocumentsWriterFlushControl.cs#L818
   
   So, moving the `UninterruptableMonitor.Enter(this)` outside of that call 
should fix the problem.
   
   ### Issue 2
   
   After that fix, we still got some failures. But it was also clear from the 
stack trace what was happening. If `TryLock()` returns `false`, we end up on 
this line:
   
   
https://github.com/NightOwl888/lucenenet/blob/fix/GH-935-documentswriter-concurrency/src/Lucene.Net/Index/DocumentsWriterFlushControl.cs#L484
   
   The [documentation for 
ReentrantLock.tryLock()](https://docs.oracle.com/javase%2F8%2Fdocs%2Fapi%2F%2F/java/util/concurrent/locks/ReentrantLock.html#tryLock--)
 states:
   
   > Even when this lock has been set to use a fair ordering policy, a call to 
tryLock() will immediately acquire the lock if it is available, whether or not 
other threads are currently waiting for the lock. This "barging" behavior can 
be useful in certain circumstances, even though it breaks fairness. If you want 
to honor the fairness setting for this lock, then use [tryLock(0, 
TimeUnit.SECONDS) 
](https://docs.oracle.com/javase%2F8%2Fdocs%2Fapi%2F%2F/java/util/concurrent/locks/ReentrantLock.html#tryLock-long-java.util.concurrent.TimeUnit-)which
 is almost equivalent (it also detects interruption).
   >
   > If the current thread already holds this lock then the hold count is 
incremented by one and the method returns true.
   >
   > If the lock is held by another thread then this method will return 
immediately with the value false.
   
   So, it can return false, but I don't think it ever does in practice since a 
`null` return value from `InternalTryCheckOutForFlush()` will cause an error 
(assert in test, NullReferenceException in production). So, apparently Lucene 
relies on the "barging" behavior of this method to be put in the front of the 
queue.
   
   .NET has no such behavior, but I don't think it really matters which order 
the locks are taken in as long as `ReentrantLock.TryLock()` returns `true`. So, 
I changed it to call `UninterruptableMonitor.Enter(object, ref bool)` instead 
of `UninterruptableMonitor.TryEnter(object)`, which will always acquire the 
lock and return `true`.
   
   This probably won't perform as well because the method is supposed to return 
immediately and may end up deadlocking, but so far it seems to perform better 
and never deadlock.
   
   -----------------------
   
   I am running the tests now to see if the above changes work on all 
platforms, but locally this seems promising.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to