[jira] Commented: (LUCENE-710) Implement "point in time" searching without relying on filesystem semantics

Michael McCandless (JIRA) Tue, 23 Jan 2007 14:02:12 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466840
 ]


Michael McCandless commented on LUCENE-710:
-------------------------------------------

> I don't really understand this interface and so I cannot see how you
> intend to rewrite the IndexFileDeleter as you describe, but I agree
> that if this can be done it is a better solution. So I am okay with
> waiting for this approach to mature into code.

The deletion policy is called on creation of a writer (onInit) and
once per commit (onCommit) and is given a List of existing commits (=
SegmentInfos instances) in the index.  The policy then decides which
commits should be removed and IndexFileDeleter translates that request
(using reference counting, because a given index file may still be
reference by commits that are not yet deleted) into which specific
files to remove.

For example, onCommit you would typically see a List of length 2: the
prior commit and the new one.  And the default policy
(KeepOnlyLastCommit) would at this point remove the prior one.

Realize that the "commit on close" mode (autoCommit=false) for
IndexWriter (that I'm doing as part of this issue) actually keeps 2
SegmentInfos alive at any given time: first is the segments_N file in
the index, and second is the "in memory" SegmentInfos that haven't yet
been committed to a segments_N file.  It's only on close when the
commit takes place that the deleter then deletes the previous
segments_N commit.

> (I would prefer the DeletionPolicy to be a pluggable *interface* and
> the IndexFileDeleter to be an internal *class*, so that at least we
> do not expose now something that would stand in our way in the
> future. But again, since I do not fully understand your solution
> maybe please bear with me if this is not making sense.)

Good point: I agree an interface here is cleaner.  I will use an
interface (not subclass) and make IndexFileDeleter entirely internal.
The deletion policy doesn't need to see any details of the
IndexFileDeleter class.


> Implement "point in time" searching without relying on filesystem semantics
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-710
>                 URL: https://issues.apache.org/jira/browse/LUCENE-710
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>
> This was touched on in recent discussion on dev list:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/41700#41700
> and then more recently on the user list:
>   http://www.gossamer-threads.com/lists/lucene/java-user/42088
> Lucene's "point in time" searching currently relies on how the
> underlying storage handles deletion files that are held open for
> reading.
> This is highly variable across filesystems.  For example, UNIX-like
> filesystems usually do "close on last delete", and Windows filesystem
> typically refuses to delete a file open for reading (so Lucene retries
> later).  But NFS just removes the file out from under the reader, and
> for that reason "point in time" searching doesn't work on NFS
> (see LUCENE-673 ).
> With the lockless commits changes (LUCENE-701 ), it's quite simple to
> re-implement "point in time searching" so as to not rely on filesystem
> semantics: we can just keep more than the last segments_N file (as
> well as all files they reference).
> This is also in keeping with the design goal of "rely on as little as
> possible from the filesystem".  EG with lockless we no longer re-use
> filenames (don't rely on filesystem cache being coherent) and we no
> longer use file renaming (because on Windows it can fails).  This
> would be another step of not relying on semantics of "deleting open
> files".  The less we require from filesystem the more portable Lucene
> will be!
> Where it gets interesting is what "policy" we would then use for
> removing segments_N files.  The policy now is "remove all but the last
> one".  I think we would keep this policy as the default.  Then you
> could imagine other policies:
>   * Keep past N day's worth
>   * Keep the last N
>   * Keep only those in active use by a reader somewhere (note: tricky
>     how to reliably figure this out when readers have crashed, etc.)
>   * Keep those "marked" as rollback points by some transaction, or
>     marked explicitly as a "snaphshot".
>   * Or, roll your own: the "policy" would be an interface or abstract
>     class and you could make your own implementation.
> I think for this issue we could just create the framework
> (interface/abstract class for "policy" and invoke it from
> IndexFileDeleter) and then implement the current policy (delete all
> but most recent segments_N) as the default policy.
> In separate issue(s) we could then create the above more interesting
> policies.
> I think there are some important advantages to doing this:
>   * "Point in time" searching would work on NFS (it doesn't now
>     because NFS doesn't do "delete on last close"; see LUCENE-673 )
>     and any other Directory implementations that don't work
>     currently.
>   * Transactional semantics become a possibility: you can set a
>     snapshot, do a bunch of stuff to your index, and then rollback to
>     the snapshot at a later time.
>   * If a reader crashes or machine gets rebooted, etc, it could choose
>     to re-open the snapshot it had previously been using, whereas now
>     the reader must always switch to the last commit point.
>   * Searchers could search the same snapshot for follow-on actions.
>     Meaning, user does search, then next page, drill down (Solr),
>     drill up, etc.  These are each separate trips to the server and if
>     searcher has been re-opened, user can get inconsistent results (=
>     lost trust).  But with, one series of search interactions could
>     explicitly stay on the snapshot it had started with.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

[jira] Commented: (LUCENE-710) Implement "point in time" searching without relying on filesystem semantics

Reply via email to