As promised, here is more info on [2].
If you apply the NotAuthorizedException patch from the previous mail, and you
add a call to file.Flush(true); in FSIndexOutput.Dispose, like this:
    protected override void Dispose(bool disposing)
    {
        if (disposing)
        {
            parent.OnIndexOutputClosed(this);
            // only close the file if it has not been closed yet
            if (isOpen)
            {
                IOException priorE = null;
                try
                {
                    base.Dispose(disposing);
                }
                catch (IOException ioe)
                {
                    priorE = ioe;
                }
                finally
                {
                    isOpen = false;
                    file.Flush(true);
                    IOUtils.CloseWhileHandlingException(priorE, file);
                }
            }
        }
    }
... and you comment out the body of the Fsync method of the class, which turned
out to be the source of the problem for me:
    protected virtual void Fsync(string name)
    {
        //IOUtils.Fsync(Path.Combine(m_directory.FullName, name), false);
    }
... the previous program has the same behavior in both the 32-bit and 64-bit
implementations: "access denied", which suggests that we need to catch that
exception in more than one place.
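For readers unfamiliar with the Flush(true) overload used above: it asks
FileStream to write its buffers through to the disk instead of leaving them in
the OS cache. The following is a minimal standalone sketch of that call, not
the Lucene.NET code; the class name FlushToDiskSketch and the helper
WriteDurably are hypothetical names used only for illustration.

```csharp
using System;
using System.IO;

public static class FlushToDiskSketch
{
    // Hypothetical helper: writes the payload and forces it to disk
    // before the file handle is released.
    public static void WriteDurably(string path, byte[] payload)
    {
        using (var file = new FileStream(path, FileMode.Create, FileAccess.Write, FileShare.None))
        {
            file.Write(payload, 0, payload.Length);
            // Flush(true) flushes the stream buffer and also asks the OS
            // to flush its intermediate file buffers to the device.
            file.Flush(true);
        }
    }

    public static void Main()
    {
        // Use a throwaway file in the temp directory for the demonstration.
        string path = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
        try
        {
            WriteDurably(path, new byte[] { 1, 2, 3 });
            Console.WriteLine("Bytes on disk: " + new FileInfo(path).Length);
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

Flushing while the stream is still open avoids the need to reopen the file by
name afterwards just to sync it, which is the step the commented-out Fsync
method performed.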
Be careful: start with an empty directory before you run it! Once files are
corrupted, there is no recovery from it.
To be continued.
Vincent
From: Shad Storhaug [mailto:[email protected]]
Sent: Thursday, July 06, 2017 11:32 PM
To: Van Den Berghe, Vincent <[email protected]>
Cc: [email protected]
Subject: Debugging Help Requested
Vincent,
If you have the time, I'd appreciate your assistance with a fix for a
long-standing concurrency bug. I have been putting together a wrapper console
application for the various utilities that ship with Lucene and discovered that
2 of them are non-functional because of this bug, but the upside is that there
is now a reliable way to reproduce it. I suspect the bug is also causing some
of the random test failures that we are seeing on certain FSDirectory
implementations.
I have pushed the WIP application to my local repository
(https://github.com/NightOwl888/lucenenet/tree/cli/src/tools/lucene-cli). It
only runs on .NET Core and in Visual Studio 2015 Update 3. I don't think it
makes sense to support .NET Framework for this utility since .NET Core will run
side-by-side with .NET Framework anyway.
You can run specific commands directly on the command line or in Visual
Studio 2015. There is a server that needs to be started first, and then a
client that connects. The problem seems to be the server.
Command Line
dotnet lucene-cli.dll lock verify-server 127.0.0.4 10
dotnet lucene-cli.dll lock stress-test 3 127.0.0.4 <THE_PORT> NativeFSLockFactory F:\temp2 50 10
Note the port is dynamically chosen by the server at runtime and displayed on
the console.
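As background on how a server can end up with a runtime-chosen port: one common
approach in .NET is to bind a TcpListener to port 0 and read back the port the
OS assigned. The sketch below illustrates that general technique only; it is an
assumption that LockVerifyServer works along these lines, and the names
DynamicPortSketch and StartOnFreePort are hypothetical. It binds to
IPAddress.Loopback rather than 127.0.0.4 so that it runs on any platform.

```csharp
using System;
using System.Net;
using System.Net.Sockets;

public static class DynamicPortSketch
{
    // Hypothetical helper: binds to port 0 so the OS picks a free port,
    // starts listening, and returns the listener to the caller.
    public static TcpListener StartOnFreePort(IPAddress address)
    {
        var listener = new TcpListener(address, 0);
        listener.Start();
        return listener;
    }

    public static void Main()
    {
        TcpListener listener = StartOnFreePort(IPAddress.Loopback);
        try
        {
            // After Start(), LocalEndpoint carries the port actually assigned.
            int port = ((IPEndPoint)listener.LocalEndpoint).Port;
            Console.WriteLine("Listening on port " + port);
        }
        finally
        {
            listener.Stop();
        }
    }
}
```

The printed port is the value a client would pass in place of <THE_PORT>.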
Visual Studio 2015
In Visual Studio 2015, you can just copy everything after "dotnet
lucene-cli.dll" and paste it into the project properties > Debug > Application
Arguments text box. Do note I am not sure if those options are optimal (or even
if they may be causing the issue).
What I Have Found
When the client calls the server, the server locks up at LockVerifyServer.cs line
129
(https://github.com/NightOwl888/lucenenet/blob/cli/src/Lucene.Net/Store/LockVerifyServer.cs#L129).
I tried removing that line, and it gets a bit further and then crashes with
this error:
An unhandled exception of type 'System.Exception' occurred in
System.Private.CoreLib.ni.dll
Additional information: System.IO.IOException: Unable to read data from the
transport connection: An existing connection was forcibly closed by the remote
host. ---> System.Net.Sockets.SocketException: An existing connection was
forcibly closed by the remote host
at System.Net.Sockets.Socket.Receive(Byte[] buffer, Int32 offset, Int32
size, SocketFlags socketFlags)
at System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32
size)
--- End of inner exception stack trace ---
at System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32
size)
at System.IO.Stream.ReadByte()
at System.IO.BinaryReader.InternalReadOneChar()
at Lucene.Net.Store.LockVerifyServer.ThreadAnonymousInnerClassHelper.Run()
in F:\Projects\lucenenet\src\Lucene.Net\Store\LockVerifyServer.cs:line 135
I suspect that has something to do with removing the wait so the timing is off,
but I compared the thread handling code to some similar tests and it looks the
same (including the call to Wait()), so I haven't worked out why that method
call isn't completing in this case.
I believe this bug is related to a couple of intermittently failing tests that
also seem to indicate the LockFactory is broken.
https://teamcity.jetbrains.com/viewLog.html?buildId=1101813&tab=buildResultsDiv&buildTypeId=LuceneNet_PortableBuilds_TestOnNet451
https://teamcity.jetbrains.com/viewLog.html?buildId=1084071&tab=buildResultsDiv&buildTypeId=LuceneNet_PortableBuilds_TestOnNet451
https://teamcity.jetbrains.com/viewLog.html?buildId=1071425&tab=buildResultsDiv&buildTypeId=LuceneNet_PortableBuilds_TestOnNet451
Namely, the TestLockFactory.StressTestLocks and
TestLockFactory.TestStressLocksNativeFSLockFactory tests.
FYI, the TestIndexWriter.TestTwoThreadsInterruptDeadlock test also fails
intermittently, and is apparently concurrency related. I don't recall which
tests they were, but I discovered a while back that if you put the [Repeat(20)]
attribute on them, they would fail more consistently. I also noticed that they
always fail if MMapDirectory is made the only option provided by the test
framework.
Anyway, I would really appreciate it if you could have a look to see if you can
work out what is going on.
Thanks,
Shad Storhaug (NightOwl888)