The curse of TestSearcherManager_Mem()

Van Den Berghe, Vincent Thu, 27 Jul 2017 02:30:46 -0700

I give up.

I think I found at least one of the reason why the TestSearcherManager_Mem() 
sporadically fails. But I can't prove it's the only one. But I give up looking 
for others, since investing more time would seriously damage my mental health. 
So I'm going to explain my reasoning in the hope others can validate it. And 
I'll give up looking for others, if any.


It's simple enough to know the answer to the immediate questions "why" the test 
fails: in the MockDirectoryWrapper.Dispose(bool) an exception is thrown because 
the OpenFiles collection is not empty.
Calls to MockDirectoryWrapper.AddFileHandle add entries to OpenFiles or 
increment their reference count, and calls to 
MockDirectoryWrapper.RemoveOpenFle decrements the reference count and removes 
an entry when its reference count drops to 0. It's easy enough.
Therefore, there are 3 possibilities:

1.       calls to RemoveOpenFile are being missed.

2.       calls to RemoveOpenFile are not being missed, but the argument used to 
look up the entries (IDisposable c) is incorrect.

3.       Calls to AddFileHandle are done with an incorrect argument 
(IDisposable c) to add/update the entries.

I've identified one instance of possibility 3 (see a previous e-mail). This 
improved the situation somewhat, but didn't correct it.
I checked possibility 2 in the case of test failure, but found no instances. 
All calls have their correct argument.
I've checked possibility 1, and calls are indeed being missed. In one case of 
failure, there were 214 calls to AddFileHandle, but only 205 calls to 
RemoveOpenFile. This yielded 9 open files, which was exactly the number of 
items in the OpenFiles dictionary. These numbers are variable, since the 
parameters of the test are randomized.
There is another dictionary in MockDirectoryWrapper called OpenFilesDeleted: 
this holds the number of open files for which deletion is asked, but the files 
are still open. In case of failure, this collection is always empty. So no 
files are being deleted while open.
So somewhere in Lucene.net, Inputs are not being closed. But where?
It's difficult to reproduce the exact same failure every time, since at 
(pseudo)random times the tests will create different decoders, docvalues etc or 
reuse different searches.  In addition, it's difficult to find out a why a call 
is never made.
Since the debugger can't break on something that's doesn't happen, maybe we can 
trace where something happens: this could give us a clue on what doesn't 
happen. After dozens of runs, I found that the common stack trace seems to be 
this when a file is opened (that never closes):

   at Lucene.Net.Index.SegmentReader..ctor(SegmentCommitInfo si, Int32 
termInfosIndexDivisor, IOContext context) in 
D:\SOURCE\GitHub\lucenenet\src\Lucene.Net\Index\SegmentReader.cs:line 112
   at Lucene.Net.Index.ReadersAndUpdates.GetReader(IOContext context) in 
D:\SOURCE\GitHub\lucenenet\src\Lucene.Net\Index\ReadersAndUpdates.cs:line 170
   at Lucene.Net.Index.ReadersAndUpdates.GetReadOnlyClone(IOContext context) in 
D:\SOURCE\GitHub\lucenenet\src\Lucene.Net\Index\ReadersAndUpdates.cs:line 301
   at Lucene.Net.Index.StandardDirectoryReader.Open(IndexWriter writer, 
SegmentInfos infos, Boolean applyAllDeletes) in 
D:\SOURCE\GitHub\lucenenet\src\Lucene.Net\Index\StandardDirectoryReader.cs:line 
126
   at Lucene.Net.Index.IndexWriter.GetReader(Boolean applyAllDeletes) in 
D:\SOURCE\GitHub\lucenenet\src\Lucene.Net\Index\IndexWriter.cs:line 388
   at Lucene.Net.Index.StandardDirectoryReader.DoOpenFromWriter(IndexCommit 
commit) in 
D:\SOURCE\GitHub\lucenenet\src\Lucene.Net\Index\StandardDirectoryReader.cs:line 
369
   at Lucene.Net.Index.StandardDirectoryReader.DoOpenIfChanged(IndexCommit 
commit) in 
D:\SOURCE\GitHub\lucenenet\src\Lucene.Net\Index\StandardDirectoryReader.cs:line 
336
   at Lucene.Net.Index.StandardDirectoryReader.DoOpenIfChanged() in 
D:\SOURCE\GitHub\lucenenet\src\Lucene.Net\Index\StandardDirectoryReader.cs:line 
325
   at Lucene.Net.Index.DirectoryReader.OpenIfChanged(DirectoryReader oldReader) 
in D:\SOURCE\GitHub\lucenenet\src\Lucene.Net\Index\DirectoryReader.cs:line 176
   at Lucene.Net.Search.SearcherManager.RefreshIfNeeded(IndexSearcher 
referenceToRefresh) in 
D:\SOURCE\GitHub\lucenenet\src\Lucene.Net\Search\SearcherManager.cs:line 126
   at Lucene.Net.Search.ReferenceManager`1.DoMaybeRefresh() in 
D:\SOURCE\GitHub\lucenenet\src\Lucene.Net\Search\ReferenceManager.cs:line 203
   at Lucene.Net.Search.ReferenceManager`1.MaybeRefresh() in 
D:\SOURCE\GitHub\lucenenet\src\Lucene.Net\Search\ReferenceManager.cs:line 263
   at Lucene.Net.Search.TestSearcherManager.get_CurrentSearcher() in 
D:\SOURCE\GitHub\lucenenet\src\Lucene.Net.Tests\Search\TestSearcherManager.cs:line
 198
   at 
Lucene.Net.Index.ThreadedIndexingAndSearchingTestCase.ThreadAnonymousInnerClassHelper2.Run()
 in 
D:\SOURCE\GitHub\lucenenet\src\Lucene.Net.TestFramework\Index\ThreadedIndexingAndSearchingTestCase.cs:line
 455
   at System.Threading.ThreadHelper.ThreadStart_Context(Object state)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext 
executionContext, ContextCallback callback, Object state, Boolean 
preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, 
ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, 
ContextCallback callback, Object state)
   at System.Threading.ThreadHelper.ThreadStart()


It looks like file are being opened in the SegmentReader constructor. Just to 
make our lives miserable, the matter is made more complex by having actual 
close operations hide behind a reference counting scheme.
But further down the stack, it's triggered in 
TestSearcherManager.CurrentSearcher property by a call to:

                    if (mgr.MaybeRefresh())

In that same CurrentSearcher property, the searcher is being tracked by a 
SearcherLifetimeManager instance using the Record(s) method:

                        long token = lifetimeMGR.Record(s);

And in the implementation of that method is this:

            var tracker = _searchers.GetOrAdd(version, l => { 
factoryMethodCalled = true; return new SearcherTracker(searcher); });

But (as you boys and girls now by now) this is incorrect! In GetOrAdd for a 
ConcurrentDictionary, there is ABSOLUTELY NO GUARANTEE that the factory method 
is called exactly once only when the element is not present.
And since new SearcherTracker(searcher) does:

                searcher.IndexReader.IncRef();

... we *may* end up with incorrect counts. But not always. Sometimes. Every now 
and then. Occasionally. On and off. Enough to drive you nuts.
The standard solution is to wrap the value as a Lazy<>, like this:

        private readonly ConcurrentDictionary<long, Lazy<SearcherTracker>> 
_searchers = new ConcurrentDictionary<long, Lazy<SearcherTracker>>();

and then change the code in the Record() as follows:

       var tracker = _searchers.GetOrAdd(version, l => new 
Lazy<SearcherTracker>(() => { factoryMethodCalled = true; return new 
SearcherTracker(searcher); })).Value;

This will guarantee a single call to the SearchTracker constructor every time.
In the Acquire method, we change the if() statement as follows:

            Lazy<SearcherTracker> tracker;
            if (_searchers.TryGetValue(version, out tracker) && 
tracker.IsValueCreated && tracker.Value.Searcher.IndexReader.TryIncRef())
            {
                return tracker.Value.Searcher;
            }

Note that the test "tracker!=null" has been replaced by tracker.IsValueCreated. 
 In .NET, it is never the case that tracker is null if TryGetValue succeeds, 
since values added are never null, so the test tracked!=null made no sense.
I'm not sure tracker.IsValueCreated makes much sense either: this would mean 
that the entry would exist, but the call to Acquire() would happen between the 
return of the GetOrAdd method and the call to the Value property in the 
Record() method. The probability of this is quite small, but if we want to make 
sure all trackers are created in the Record() method, we need this test.

The rest of the changes is trivial:

In the Prune method, we need to collect the trackers a bit differently:

     var trackers = _searchers.Values.Select(item => item.Value).ToList();

... and try to remove it with the lazy value

Lazy<SearcherTracker> _;
       _searchers.TryRemove(tracker.Version, out _);

Ihe Dispose() method constructs the list of items to close as follows:

                IList<SearcherTracker> toClose = new 
List<SearcherTracker>(_searchers.Values.Select(item => item.Value));

(thank you, Linq)
... and try to remove it with the lazy value, just like in the Prune method.

Lazy<SearcherTracker> _;
       _searchers.TryRemove(tracker.Version, out _);


The alternative solution is to trade in concurrency for a "locked dictionary 
approach", which is not really as sexy and may create a contention point where 
none existed.

I've checked the remainder of the lucene source for instances of .GetOrAdd and 
.AddOrUpdate with an add factory, and there are a few. But this, I think, was 
the one to catch. I'll leave the others are an exercise for the reader.

And now if you excuse me, I need a drink. Stick a fork in me, I'm done.

Vincent

The curse of TestSearcherManager_Mem()

Reply via email to