Vincent,

If you have the time, I'd appreciate your assistance with a fix for a 
long-standing concurrency bug. I have been putting together a wrapper console 
application for the various utilities that ship with Lucene and discovered that 
2 of them are non-functional because of this bug. On the upside, there is now 
a reliable way to reproduce it. I suspect the bug is also causing some of the 
random test failures that we are seeing on certain FSDirectory 
implementations.

I have pushed the WIP application to my fork 
(https://github.com/NightOwl888/lucenenet/tree/cli/src/tools/lucene-cli). It 
only runs on .NET Core and in Visual Studio 2015 Update 3. I don't think it 
makes sense to support .NET Framework for this utility, since .NET Core will 
run side-by-side with .NET Framework anyway.

You can run specific commands directly on the command line or in Visual 
Studio 2015. There is a server that needs to be started first, and then a 
client that connects to it. The problem seems to be in the server.

Command Line

dotnet lucene-cli.dll lock verify-server 127.0.0.4 10

dotnet lucene-cli.dll lock stress-test 3 127.0.0.4 <THE_PORT> NativeFSLockFactory F:\temp2 50 10

Note that the port is chosen dynamically by the server at runtime and 
displayed on the console.

Visual Studio 2015

In Visual Studio 2015, you can just copy everything after "dotnet 
lucene-cli.dll" and paste it into the project properties > Debug > Application 
Arguments text box. Note that I am not sure whether those argument values are 
optimal, or even whether they are contributing to the issue.

What I Have Found

When the client calls the server, the server blocks at LockVerifyServer.cs 
line 129 
(https://github.com/NightOwl888/lucenenet/blob/cli/src/Lucene.Net/Store/LockVerifyServer.cs#L129).
I tried removing that line; the server then gets a bit further before crashing 
with this error:

An unhandled exception of type 'System.Exception' occurred in System.Private.CoreLib.ni.dll

Additional information: System.IO.IOException: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host. ---> System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host
   at System.Net.Sockets.Socket.Receive(Byte[] buffer, Int32 offset, Int32 size, SocketFlags socketFlags)
   at System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 size)
   --- End of inner exception stack trace ---
   at System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 size)
   at System.IO.Stream.ReadByte()
   at System.IO.BinaryReader.InternalReadOneChar()
   at Lucene.Net.Store.LockVerifyServer.ThreadAnonymousInnerClassHelper.Run() in F:\Projects\lucenenet\src\Lucene.Net\Store\LockVerifyServer.cs:line 135


I suspect the crash happens because removing the wait throws the timing off. 
However, I compared the thread handling code to some similar tests and it 
looks the same (including the call to Wait()), so I haven't worked out why 
that method call isn't completing in this case.

I believe this bug is related to a couple of intermittently failing tests that 
also seem to indicate the LockFactory is broken.

https://teamcity.jetbrains.com/viewLog.html?buildId=1101813&tab=buildResultsDiv&buildTypeId=LuceneNet_PortableBuilds_TestOnNet451
https://teamcity.jetbrains.com/viewLog.html?buildId=1084071&tab=buildResultsDiv&buildTypeId=LuceneNet_PortableBuilds_TestOnNet451
https://teamcity.jetbrains.com/viewLog.html?buildId=1071425&tab=buildResultsDiv&buildTypeId=LuceneNet_PortableBuilds_TestOnNet451

The tests in question are TestLockFactory.StressTestLocks and 
TestLockFactory.TestStressLocksNativeFSLockFactory.


FYI, the TestIndexWriter.TestTwoThreadsInterruptDeadlock test also fails 
intermittently, and is apparently concurrency related. I don't recall exactly 
which tests were involved, but I discovered a while back that putting the 
[Repeat(20)] attribute on them makes them fail more consistently. I also 
noticed that they always fail if MMapDirectory is made the only option 
provided by the test framework.

Anyway, I would really appreciate it if you could have a look to see if you 
can work out what is going on.


Thanks,
Shad Storhaug (NightOwl888)

