Hi

Has anyone experienced problems with Lucene indexes on a shared SMB2
network drive?

We've hit a scenario where it seems the FS cache refuses to check for
existence of files on the shared network drive. Specifically, we hit the
following exception:

java.io.FileNotFoundException: Z:\index\segments_p8 (The system cannot find the file specified.)
    at java.io.RandomAccessFile.open(Native Method)
    at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
    at org.apache.lucene.store.FSDirectory$FSIndexInput$Descriptor.<init>(FSDirectory.java:552)
    at org.apache.lucene.store.FSDirectory$FSIndexInput.<init>(FSDirectory.java:582)
    at org.apache.lucene.store.FSDirectory.openInput(FSDirectory.java:488)
    at org.apache.lucene.store.FSDirectory.openInput(FSDirectory.java:482)
    at org.apache.lucene.index.SegmentInfos$2.doBody(SegmentInfos.java:369)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:653)
    at org.apache.lucene.index.SegmentInfos.readCurrentVersion(SegmentInfos.java:366)
    at org.apache.lucene.index.DirectoryIndexReader.isCurrent(DirectoryIndexReader.java:188)
    at org.apache.lucene.index.MultiReader.isCurrent(MultiReader.java:352)

The environment:
* 3 Windows Server 2008 machines
** Machine A - hosts the index
** Machine B - indexes and searches
** Machine C - searches only
* Machines B and C map Machine A as drive Z.
* The exception happens on Machine C only, i.e. on the machine that does
just 'search'.

According to my understanding, FindSegmentsFile attempts to determine the
latest segments_N generation from segments.gen and from the directory
listing, and if there is a problem, it will do a gen-lookahead until it
succeeds or defaultGenLookaheadCount is exhausted.

So by hitting this exception we thought of the following explanation: the FS
cache 'decides' the file does not exist, due to a stale directory cache, and
refuses to check whether the file actually exists on the remote machine.

Does that sound reasonable?

Some more information:
* We use Lucene 2.4.0
* Other runs are executed on those machines currently, and so it will take
about a week until we can run the same scenario again. I thought that
perhaps we can discuss this until then.
* Unfortunately we weren't able to get an infoStream output before the
machines started another run, so we hope to get it next time. Anyway, it's
not easily reproduced.
* No other process touches this directory, so nothing else could have
removed index files.

We know the same code runs well on NFS (v4). We haven't yet checked whether
SMB 1.0 works OK. Some pointers we've found:

A known issue discussed on a Microsoft newsgroup, with some C++ fixes:
http://www.microsoft.com/communities/newsgroups/en-us/default.aspx?dg=microsoft.public.win32.programmer.networks&tid=69e63e38-7d91-4306-ab6e-a615e1c6afaa&cat=en_US_bc89adf4-f184-4d3d-aaee-122567385744&lang=en&cr=US&sloc=&p=1

Info on how to disable SMB 2.0 on Windows:
http://www.petri.co.il/how-to-disable-smb-2-on-windows-vista-or-server-2008.htm

For now, we plan to work around the problem by wrapping the calls to
isCurrent and reopen in a try-catch for FileNotFoundException and, when it
is thrown, continuing to use the reader we already have. Later, we will
attempt isCurrent again. Since SMB caching seems to be time-controlled, we
expect the cache to be refreshed after several seconds, at which point those
calls should succeed.
I wonder, though, whether this could leave us hitting the exception
'forever'. E.g., imagine a system which indexes at very high rates. Isn't it
possible that we'll hit this exception every time we call isCurrent?
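
To make the workaround concrete, here is a rough sketch of what we have in
mind. It is not tested against Lucene: to keep it self-contained it uses a
hypothetical Reader interface standing in for IndexReader, and the class and
method names (StaleTolerantRefresh, refreshOrKeep) are made up for
illustration. Only isCurrent() and reopen() correspond to real IndexReader
calls.

```java
import java.io.FileNotFoundException;
import java.io.IOException;

public class StaleTolerantRefresh {

    /** Hypothetical stand-in for the two IndexReader calls used here. */
    public interface Reader {
        boolean isCurrent() throws IOException;
        Reader reopen() throws IOException;
    }

    /**
     * Returns a refreshed reader when possible. If the segments file cannot
     * be found (assumed here to be a stale SMB directory cache), keeps the
     * reader we already have so that searches continue on the older view.
     */
    public static Reader refreshOrKeep(Reader current) throws IOException {
        try {
            if (current.isCurrent()) {
                return current;
            }
            return current.reopen();
        } catch (FileNotFoundException e) {
            // Assumption: the failure is transient; the caller retries later.
            return current;
        }
    }
}
```

With the real IndexReader, the caller would still have to close the old
reader whenever reopen returns a different instance. The sketch also does
nothing about the 'forever' concern; counting consecutive failures and
logging after some threshold would at least make that case visible.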

I'm not sure there is anything we can do in Lucene itself, besides sleeping
in FindSegmentsFile for several seconds, which is not reasonable.
One way out might be to have FindSegmentsFile try to read ahead and then
backwards. At some point, we ought to find a segments file that's readable,
even if it is an old one, no?
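
A rough sketch of the ahead-then-backwards idea, using plain java.io rather
than Lucene's Directory API. This is not how FindSegmentsFile is implemented;
SegmentsProbe and its methods are hypothetical names. The only Lucene detail
it relies on is that, as far as I know, the generation in segments_N is
encoded in base 36 (so segments_p8 is generation 908).

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

public class SegmentsProbe {

    /** Builds a segments_N name, e.g. 908 -> segments_p8 (base 36). */
    public static String segmentsFileName(long gen) {
        return "segments_" + Long.toString(gen, Character.MAX_RADIX);
    }

    /** True only if the file can actually be opened for reading. */
    public static boolean canOpen(File f) {
        try {
            RandomAccessFile raf = new RandomAccessFile(f, "r");
            raf.close();
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    /**
     * Probes startGen .. startGen + lookahead first, then walks backwards
     * from startGen - 1 down to 1. Returns the first generation whose
     * segments file opens, or -1 if none does.
     */
    public static long findReadableGeneration(File dir, long startGen, int lookahead) {
        for (long gen = startGen; gen <= startGen + lookahead; gen++) {
            if (canOpen(new File(dir, segmentsFileName(gen)))) {
                return gen;
            }
        }
        for (long gen = startGen - 1; gen >= 1; gen--) {
            if (canOpen(new File(dir, segmentsFileName(gen)))) {
                return gen;
            }
        }
        return -1;
    }
}
```

The backward walk is unbounded here for simplicity; over a slow network
share it would probably need a cap, since older segments_N files are normally
deleted and each failed open is a round trip.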

Any help will be appreciated.

Shai
