This is spooky -- it looks like SMB2 (which was introduced with Windows
Vista & Windows Server 2008) now does "aggressive" client-side
caching, such that the cache can be wrong about the current state of
the directory.

At least it sort of sounds like Microsoft considers it a real issue:

> Yes, this is a known product issue of SMB2.
>
> SMB2 does implicit attribute and directory metadata caching at all
> times, whereas SMB1 was much stricter about when it would do so. The
> caches are consistent when changes are made by the client, but if
> changes are made from another client they may not be reflected until
> the cache times out.

This will definitely cause problems for Lucene (like the exception
you're hitting).  These are exactly the same problems we had with NFS,
but there the readahead in SegmentInfos.FindSegmentsFile worked around
them.  It sounds like that readahead is not working for SMB2,
presumably because (unlike NFS) the client does not check back with
the server when it believes, based on its stale cache, that the file
does not exist.  Sigh.

SMB1 did not have this problem, in my experience.

I wonder if, from javaland, we have some way to force the cache to
become coherent.

One simple workaround at the app level is to retry when you hit an
errant "segments_N file not found" exception.
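
Roughly, that retry could look like the sketch below.  This is only an
illustration, not anything that exists in Lucene: the class name
StaleCacheRetry, the method retryOnFileNotFound, and the attempt count
and sleep time are all made up here.  The operation you pass in would
be whatever call is throwing for you (e.g. the isCurrent call from
your stack trace), and the sleep assumes the SMB2 cache really does
time out after a few seconds.

```java
import java.io.FileNotFoundException;
import java.util.concurrent.Callable;

// Hypothetical app-level helper; not part of Lucene.
final class StaleCacheRetry {
    private StaleCacheRetry() {}

    /**
     * Runs op and returns its result. If op throws FileNotFoundException,
     * sleeps sleepMillis and tries again, up to maxAttempts attempts in total.
     * Any other exception is rethrown immediately. If every attempt fails,
     * the last FileNotFoundException is rethrown.
     */
    static <T> T retryOnFileNotFound(Callable<T> op, int maxAttempts, long sleepMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        FileNotFoundException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (FileNotFoundException e) {
                last = e;
                // Give the (assumed time-based) client cache a chance to expire,
                // but do not sleep after the final failed attempt.
                if (attempt < maxAttempts) {
                    Thread.sleep(sleepMillis);
                }
            }
        }
        throw last;
    }
}
```

Because the number of attempts is bounded, the caller still sees the
exception if the cache never catches up, and can then fall back to the
reader it already has open.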

Mike

On Thu, Aug 13, 2009 at 8:50 AM, Shai Erera<ser...@gmail.com> wrote:
> Hi
>
> Has anyone experienced any problems w/ Lucene indexes on a shared SMB2
> network drive?
>
> We've hit a scenario where it seems the FS cache refuses to check for
> existence of files on the shared network drive. Specifically, we hit the
> following exception:
>
> java.io.FileNotFoundException: Z:\index\segments_p8 (The system cannot find the file specified.)
> at java.io.RandomAccessFile.open(Native Method)
> at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
> at org.apache.lucene.store.FSDirectory$FSIndexInput$Descriptor.<init>(FSDirectory.java:552)
> at org.apache.lucene.store.FSDirectory$FSIndexInput.<init>(FSDirectory.java:582)
> at org.apache.lucene.store.FSDirectory.openInput(FSDirectory.java:488)
> at org.apache.lucene.store.FSDirectory.openInput(FSDirectory.java:482)
> at org.apache.lucene.index.SegmentInfos$2.doBody(SegmentInfos.java:369)
> at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:653)
> at org.apache.lucene.index.SegmentInfos.readCurrentVersion(SegmentInfos.java:366)
> at org.apache.lucene.index.DirectoryIndexReader.isCurrent(DirectoryIndexReader.java:188)
> at org.apache.lucene.index.MultiReader.isCurrent(MultiReader.java:352)
>
> The environment:
> * 3 Windows Server 2008 machines
> ** Machine A - hosts the index
> ** Machine B - indexes and searches
> ** Machine C - searches only
> * Machines B and C map Machine A on drive Z.
> * The exception happens on Machine C only, i.e. on the machine that does
> just 'search'.
>
> According to my understanding, FindSegmentsFile attempts to find the latest
> segments file from segments.gen and the directory listing, and if there is a
> problem, it will do a gen-readahead until it succeeds or
> defaultGenLookaheadCount is exhausted.
>
> So by hitting this exception we thought of the following explanation: the FS
> cache 'decides' the file does not exist, due to a stale directory cache, and
> refuses to check whether the file actually exists on the remote machine.
>
> Does that sound reasonable?
>
> Some more information:
> * We use Lucene 2.4.0
> * Other runs are executed on those machines currently, and so it will take
> about a week until we can run the same scenario again. I thought that
> perhaps we can discuss this until then.
> * Unfortunately we weren't able to get an infoStream output before the
> machines started another run, so we hope to get it next time. Anyway, it's
> not easily reproduced.
> * There isn't any other process which touches this directory, such that it
> may remove index files.
>
> We know the same code runs well on NFS (4). We haven't checked yet if SMB
> 1.0 works ok. Some pointers we've found:
>
> A known issue on MS, w/ some C++ fixes:
> http://www.microsoft.com/communities/newsgroups/en-us/default.aspx?dg=microsoft.public.win32.programmer.networks&tid=69e63e38-7d91-4306-ab6e-a615e1c6afaa&cat=en_US_bc89adf4-f184-4d3d-aaee-122567385744&lang=en&cr=US&sloc=&p=1
>
> Info on how to disable SMB 2.0 on Windows:
> http://www.petri.co.il/how-to-disable-smb-2-on-windows-vista-or-server-2008.htm
>
> Currently, we plan to work around the problem by wrapping calls to isCurrent
> and reopen in a try-catch for FileNotFoundException and continuing to use the
> reader we have at hand. Later, we will attempt isCurrent again. Since SMB
> caching seems to be time-controlled, we expect the cache to be refreshed
> after several seconds, and those calls will then succeed.
> I wonder, though, whether this could leave us hitting the exception
> 'forever'. E.g., imagine a system which indexes at very high rates. Isn't it
> possible that we'll hit this exception every time we call isCurrent?
>
> I'm not sure if there is anything we can do in Lucene, besides sleeping in
> FindSegmentsFile for several seconds, which is not reasonable.
> Maybe a way out would be to have FindSegmentsFile try to read ahead and
> then backwards. At some point, we ought to find a segments file that's
> readable, even if it is an old one, no?
>
> Any help will be appreciated.
>
> Shai
>
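
For what it's worth, the "read ahead and then backwards" idea above
could be sketched like this.  It is a standalone illustration, not the
actual FindSegmentsFile logic: the class SegmentsGenProbe, the method
findReadableGen, and the lookahead/lookback limits are all
hypothetical, and the "readable" check is left to the caller (e.g. try
to open the file and report whether that worked).  The only Lucene
convention assumed is that segments_N encodes the generation in base
36, which matches segments_p8 in the trace.  Whether an older
generation is still safe to open is a separate question the sketch
does not address.

```java
import java.util.function.LongPredicate;

// Hypothetical helper illustrating a forward-then-backward generation probe.
final class SegmentsGenProbe {
    private SegmentsGenProbe() {}

    /** Builds the segments file name for a generation >= 1, e.g. 908 -> "segments_p8". */
    static String segmentsFileName(long gen) {
        return "segments_" + Long.toString(gen, Character.MAX_RADIX);
    }

    /**
     * Probes generations startGen .. startGen + lookahead in increasing order,
     * then startGen - 1 down to startGen - lookback (never below 1).
     * Returns the first generation the caller-supplied check accepts,
     * or -1 if none is accepted.
     */
    static long findReadableGen(long startGen, int lookahead, int lookback,
                                LongPredicate readable) {
        // Forward pass: the writer may already have committed newer generations.
        for (long gen = startGen; gen <= startGen + lookahead; gen++) {
            if (readable.test(gen)) {
                return gen;
            }
        }
        // Backward pass: fall back to an older generation the client can still see.
        long lowest = Math.max(1L, startGen - lookback);
        for (long gen = startGen - 1; gen >= lowest; gen--) {
            if (readable.test(gen)) {
                return gen;
            }
        }
        return -1L;
    }
}
```

A caller would pass something like "can I open segmentsFileName(gen)?"
as the check, and treat -1 as "keep using the reader at hand".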

