Hi,

Has anyone experienced problems with Lucene indexes on a shared SMB2 network drive?
We've hit a scenario where it seems the FS cache refuses to check for the existence of files on the shared network drive. Specifically, we hit the following exception:

java.io.FileNotFoundException: Z:\index\segments_p8 (The system cannot find the file specified.)
        at java.io.RandomAccessFile.open(Native Method)
        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
        at org.apache.lucene.store.FSDirectory$FSIndexInput$Descriptor.<init>(FSDirectory.java:552)
        at org.apache.lucene.store.FSDirectory$FSIndexInput.<init>(FSDirectory.java:582)
        at org.apache.lucene.store.FSDirectory.openInput(FSDirectory.java:488)
        at org.apache.lucene.store.FSDirectory.openInput(FSDirectory.java:482)
        at org.apache.lucene.index.SegmentInfos$2.doBody(SegmentInfos.java:369)
        at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:653)
        at org.apache.lucene.index.SegmentInfos.readCurrentVersion(SegmentInfos.java:366)
        at org.apache.lucene.index.DirectoryIndexReader.isCurrent(DirectoryIndexReader.java:188)
        at org.apache.lucene.index.MultiReader.isCurrent(MultiReader.java:352)

The environment:
* 3 Windows Server 2008 machines:
** Machine A - hosts the index
** Machine B - indexes and searches
** Machine C - just searches
* Machines B and C map Machine A's index share as drive Z.
* The exception happens on Machine C only, i.e. on the machine that does just 'search'.

According to my understanding, FindSegmentsFile attempts to determine the latest segments generation from segments.gen and from a directory listing, and if there is a problem it does a gen read-ahead until it succeeds or defaultGenLookaheadCount is exhausted. Since we hit this exception anyway, we thought of the following explanation: the FS cache 'decides' the file does not exist, due to a stale directory cache, and refuses to check whether the file actually exists on the remote machine. Does that sound reasonable?
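To make the suspected failure mode concrete, here is a standalone sketch of the gen read-ahead idea as I understand it. This is NOT Lucene's actual code: GEN_LOOKAHEAD_COUNT is an illustrative stand-in for defaultGenLookaheadCount, and the Set of file names simulates what the (possibly stale) FS cache reports as the directory contents.

```java
import java.io.FileNotFoundException;
import java.util.Set;

// Sketch of the gen read-ahead: starting from the generation we believe
// is current, probe successive segments_N files until one is visible or
// the lookahead budget is exhausted. Names here are illustrative.
public class GenLookaheadSketch {

    static final int GEN_LOOKAHEAD_COUNT = 10;  // stand-in for defaultGenLookaheadCount

    // visibleFiles simulates the directory listing the FS cache returns;
    // with a stale SMB2 cache it can lag behind the real remote directory.
    static String findSegmentsFile(Set<String> visibleFiles, long startGen)
            throws FileNotFoundException {
        for (long gen = startGen; gen < startGen + GEN_LOOKAHEAD_COUNT; gen++) {
            // Lucene encodes the generation in base 36, e.g. segments_p8
            String name = "segments_" + Long.toString(gen, 36);
            if (visibleFiles.contains(name)) {
                return name;  // this generation is visible, so open it
            }
            // not visible: read ahead to the next generation and retry
        }
        throw new FileNotFoundException(
                "no segments_N visible after " + GEN_LOOKAHEAD_COUNT + " attempts");
    }

    public static void main(String[] args) throws Exception {
        // If the stale cache shows nothing at gen 2 but gen 3 is visible,
        // the read-ahead recovers and finds segments_3.
        System.out.println(findSegmentsFile(Set.of("segments_3"), 2));
    }
}
```

The failure we see would correspond to the cache hiding every generation from startGen onward, so the loop exhausts its budget and the FileNotFoundException escapes.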
Some more information:
* We use Lucene 2.4.0.
* Other runs are currently executing on those machines, so it will take about a week until we can run the same scenario again. I thought that perhaps we can discuss this until then.
* Unfortunately we weren't able to get an infoStream output before the machines started another run, so we hope to get it next time. Anyway, it's not easily reproduced.
* There isn't any other process which touches this directory, such that it might remove index files.
* We know the same code runs well on NFS (v4). We haven't checked yet whether SMB 1.0 works OK.

Some pointers we've found:

A known issue on MS, with some C++ fixes:
http://www.microsoft.com/communities/newsgroups/en-us/default.aspx?dg=microsoft.public.win32.programmer.networks&tid=69e63e38-7d91-4306-ab6e-a615e1c6afaa&cat=en_US_bc89adf4-f184-4d3d-aaee-122567385744&lang=en&cr=US&sloc=&p=1

Info on how to disable SMB 2.0 on Windows:
http://www.petri.co.il/how-to-disable-smb-2-on-windows-vista-or-server-2008.htm

Currently we plan to bypass the problem by wrapping the calls to isCurrent and reopen with a try-catch on FileNotFoundException and using the reader we have at hand; later we will attempt isCurrent again. Since SMB caching seems to be time-controlled, we expect the cache to be refreshed after several seconds, at which point those calls will succeed.

I wonder, though, whether this couldn't leave us hitting the exception 'forever'. E.g., imagine a system which indexes at very high rates - isn't it possible that we'll hit this exception every time we call isCurrent? I'm not sure there is anything we can do in Lucene, besides sleeping in FindSegmentsFile for several seconds, which is not reasonable. Maybe a way out would be having FindSegmentsFile try to read ahead and then backwards; at some point we ought to find a segments file that's readable, even if an old one, no?

Any help will be appreciated.

Shai
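P.S. In code, the workaround we have in mind looks roughly like this. It's only a sketch: the Reader interface below is an illustrative stand-in for Lucene's IndexReader (isCurrent/reopen), so that the retry logic is visible without the rest of the index plumbing.

```java
import java.io.FileNotFoundException;
import java.io.IOException;

// Sketch of the proposed workaround: tolerate FileNotFoundException from
// isCurrent()/reopen() and keep serving searches from the reader we
// already hold, retrying on the next call once the SMB cache refreshes.
public class StaleCacheWorkaround {

    // Illustrative stand-in for IndexReader; not Lucene's API.
    interface Reader {
        boolean isCurrent() throws IOException;
        Reader reopen() throws IOException;
    }

    private Reader current;

    StaleCacheWorkaround(Reader initial) {
        this.current = initial;
    }

    // Try to refresh the reader; if the stale directory cache makes the
    // segments file invisible, swallow the FNFE and keep the old reader.
    Reader refresh() throws IOException {
        try {
            if (!current.isCurrent()) {
                current = current.reopen();
            }
        } catch (FileNotFoundException stale) {
            // Stale SMB2 cache: segments_N not visible yet. Fall back to
            // the reader at hand; a later refresh() will try again.
        }
        return current;
    }

    public static void main(String[] args) throws IOException {
        // Simulate a reader whose isCurrent() always hits the stale cache.
        Reader stuck = new Reader() {
            public boolean isCurrent() throws IOException {
                throw new FileNotFoundException("Z:\\index\\segments_p8");
            }
            public Reader reopen() { return this; }
        };
        StaleCacheWorkaround w = new StaleCacheWorkaround(stuck);
        System.out.println(w.refresh() == stuck);  // true: old reader kept
    }
}
```

The open question above still applies: if every refresh() lands while the cache is stale, this loop never advances past the old reader.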