[jira] Commented: (LUCENE-673) Exceptions when using Lucene over NFS

Michael McCandless (JIRA) Sat, 21 Oct 2006 06:19:41 -0700

    [ http://issues.apache.org/jira/browse/LUCENE-673?page=comments#action_12444041 ]

Michael McCandless commented on LUCENE-673:
-------------------------------------------


Yes, you are absolutely correct.

The current implementation of Lucene's "point in time" searching
capability (i.e., once an IndexSearcher is open, it searches the
"snapshot" of the index at that point in time, even as writers are
changing the index) relies directly on one specific piece of
filesystem semantics: what happens when a still-open file is deleted.

But, these semantics differ drastically across filesystems:

  * On WIN32 local filesystems you get "Access Denied" when trying to
    delete open files.  Lucene catches this & retries.

  * On UNIX local filesystems, the delete succeeds but the underlying
    file is still present & usable by open file handles ("delete on
    last close") until they are closed.

  * But on NFS there is no support for this at all.  The NFS server
    (until version 4) is stateless and so makes no effort to let you
    continue to access deleted files.
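
To see which of these behaviours a given directory shows, a small
probe along these lines may help.  It is only a sketch: the class and
method names are made up for illustration, and it uses plain
java.io.RandomAccessFile rather than anything from Lucene.  Note that
it deletes the file from the same machine that holds it open; NFS
clients typically mask that case, so it will not reproduce the
cross-machine failure where a writer on another host deletes the file.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical probe, not part of Lucene.
class DeleteWhileOpenProbe {

    /** Deletes a file that is still open and reports what happened. */
    static String probe(File dir) throws IOException {
        File file = File.createTempFile("probe", ".bin", dir);
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            raf.write(new byte[] {1, 2, 3, 4});
            if (!file.delete()) {
                // WIN32-style behaviour: the delete is refused.
                return "delete refused while open";
            }
            try {
                // UNIX-style behaviour: the open handle keeps working.
                raf.seek(0);
                raf.readFully(new byte[4]);
                return "still readable after delete";
            } catch (IOException e) {
                return "read failed after delete";
            }
        } finally {
            // Clean up if the delete was refused while the file was open.
            file.delete();
        }
    }

    public static void main(String[] args) throws IOException {
        File dir = new File(args.length > 0 ? args[0]
                : System.getProperty("java.io.tmpdir"));
        System.out.println(probe(dir));
    }
}
```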

This means that, at best for NFS (with the "lock-less commits" fixes,
still in progress), we can hope to reliably instantiate a reader (i.e.,
no more intermittent exceptions on loading the segments), but you will
not be able to use "point in time" searching.  That is, when running a
search you must expect to get a "stale NFS handle" IOException, and
re-open your index when that happens.
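
The "catch the IOException and re-open" pattern might look roughly
like the sketch below.  The Searcher and SearcherFactory interfaces
are hypothetical stand-ins, not Lucene classes, and the wrapper treats
every IOException as a reason to re-open because it makes no
assumption about how a stale handle is reported.

```java
import java.io.IOException;

// Hypothetical wrapper, not part of Lucene.
class ReopenOnStaleHandle {

    /** Stand-in for an open searcher; returns a hit count. */
    interface Searcher {
        int search(String query) throws IOException;
    }

    /** Stand-in for whatever opens a searcher on the current index. */
    interface SearcherFactory {
        Searcher open() throws IOException;
    }

    private final SearcherFactory factory;
    private final int maxAttempts;
    private Searcher current;

    ReopenOnStaleHandle(SearcherFactory factory, int maxAttempts)
            throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        this.factory = factory;
        this.maxAttempts = maxAttempts;
        this.current = factory.open();
    }

    /** Runs the search, re-opening the index after each IOException. */
    synchronized int search(String query) throws IOException {
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return current.search(query);
            } catch (IOException e) {
                last = e;
                // The old snapshot is gone; move up to the current index.
                current = factory.open();
            }
        }
        throw last;
    }
}
```

The cost of this approach is exactly the one described above: after a
re-open, the search runs against the current index, not against the
snapshot the searcher originally had.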

I think, in the future, it would make sense to change how Lucene
implements "point in time searching" so that it doesn't rely on
filesystem semantics at all (which are clearly quite different in this
area) and, instead, explicitly keeps segments_N files (and the
segments they reference) in the filesystem until "it's decided" (via
some policy, e.g., "keep the last N generations" or "keep the past N
days' worth") that they should be pruned.
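
As a rough illustration of the "keep the last N generations" policy,
here is a minimal sketch.  Nothing in it is an existing Lucene API:
the class, the method names, and the assumption that the generation
can be parsed from the suffix after "segments_" in a caller-supplied
radix are all made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Hypothetical policy sketch, not part of Lucene.
class KeepLastNGenerations {

    private static final String PREFIX = "segments_";

    /** Parses the generation from a name like "segments_7" (assumed format). */
    static long generationOf(String fileName, int radix) {
        if (!fileName.startsWith(PREFIX)) {
            throw new IllegalArgumentException("not a segments_N file: " + fileName);
        }
        return Long.parseLong(fileName.substring(PREFIX.length()), radix);
    }

    /** Returns the generations that may be pruned, oldest first. */
    static List<Long> toPrune(Collection<Long> generations, int keep) {
        if (keep < 1) {
            throw new IllegalArgumentException("keep must be >= 1");
        }
        List<Long> sorted = new ArrayList<>(generations);
        Collections.sort(sorted);
        // Everything except the newest 'keep' generations is prunable.
        int cut = Math.max(0, sorted.size() - keep);
        return new ArrayList<>(sorted.subList(0, cut));
    }

    public static void main(String[] args) {
        System.out.println(toPrune(Arrays.asList(5L, 3L, 4L, 1L, 2L), 2));
    }
}
```

A real policy would also have to prune the segment files that only
the removed generations reference, which this sketch leaves out.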

Note that such an explicit implementation would also resolve a
limitation of the current "point in time searching" which is: you
can't close your searcher and re-open it at that same point in time.
If your searcher crashes, or the JVM crashes, or whatever, you are
forced at that point to switch up to the current index.  You don't have the
freedom to re-open the snapshot you had been using.  An explicit
implementation would fix that.

The "lock-less commits" changes would make this quite straightforward
as a future change, but I'm not aiming to do that for starters --
"progress not perfection"!


> Exceptions when using Lucene over NFS
> -------------------------------------
>
>                 Key: LUCENE-673
>                 URL: http://issues.apache.org/jira/browse/LUCENE-673
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.0.0
>         Environment: NFS server/client
>            Reporter: Michael McCandless
>
> I'm opening this issue to track details on the known problems with
> Lucene over NFS.
> The summary is: if you have one machine writing to an index stored on
> an NFS mount, and other machine(s) reading (and periodically
> re-opening the index) then sometimes on re-opening the index the
> reader will hit a FileNotFound exception.
> This has hit many users because this is a natural way to "scale up"
> your searching (single writer, multiple readers) across machines.  The
> best current workaround (I think?) is to take the approach Solr takes
> (either by actually using Solr or copying/modifying its approach) to
> take snapshots of the index and then have the readers open the
> snapshots instead of the "live" index being written to.
> I've been working on two patches for Lucene:
>   * A locking (LockFactory) implementation using native OS locks
>   * Lock-less commits
> (I'll open separate issues with the details for those).
> I have a simple stress test where one machine is constantly adding
> docs to an index over NFS, and another machine is constantly
> re-opening the index searcher over NFS.
> These tests have revealed new details (at least for me!) about the
> root cause of our NFS problems:
>   * Even when using native locks over NFS, Lucene still hits these
>     exceptions!
>     I was surprised by this because I had always thought (assumed?)
>     the NFS problem was because the "simple" file-based locking was
>     not correct over NFS, and that switching to native OS filesystem
>     locking would resolve it, but it doesn't.
>     I can reproduce the "FileNotFound" exceptions even when using NFS
>     V4 (the latest NFS protocol), so this is not just a "your NFS
>     server is too old" issue.
>   * Then, when running the same stress test with the lock-less
>     changes, I don't hit any exceptions.  I've tested on NFS version
>     2, 3 and 4 (using the "nfsvers=N" mount option).
> I think this means that in fact (as Hoss at one point suggested I
> believe), the NFS problems are likely due to the cache coherence of
> the NFS file system (I think the "segments" file in particular)
> against the existence of the actual segment data files.
> In other words, even if you lock correctly, on the reader side it will
> sometimes see stale contents of the "segments" file which lead it to
> try to open a now deleted segment data file.
> So I think this is good news / bad news: the bad news is, native
> locking doesn't fix our problems with NFS (as at least I had expected
> it to).  But the good news is, it looks like (still need to do more
> thorough testing of this) the changes for lock-less commits do enable
> Lucene to work fine over NFS.
> [One quick side note in case it helps others: to get native locks
> working over NFS on Ubuntu/Debian Linux 6.06, I had to "apt-get
> install nfs-common" on the NFS client machines.  Before I did this I
> would hit "No locks available" IOExceptions on calling the "tryLock"
> method.  The default nfs server install on the server machine just
> worked because it runs in kernel mode and it starts a lockd process.]
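
For anyone wanting to check whether native locks work on a given
mount before involving Lucene at all, a probe along these lines may be
enough.  It is a sketch using only java.nio; the class name and the
default lock file name are made up.  On a mount where locking is not
available, the IOException mentioned in the quoted side note would be
expected to propagate out of tryLock.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical probe, not part of Lucene.
class NativeLockProbe {

    /**
     * Returns true if an exclusive native lock could be taken and released,
     * false if another process currently holds it.
     */
    static boolean canLock(Path lockFile) throws IOException {
        try (FileChannel ch = FileChannel.open(lockFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            FileLock lock = ch.tryLock();
            if (lock == null) {
                return false;
            }
            lock.release();
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass a path on the NFS mount to test; the default is illustrative.
        Path p = Paths.get(args.length > 0 ? args[0] : "probe.lock");
        System.out.println(canLock(p) ? "lock acquired" : "lock held elsewhere");
    }
}
```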


        
