[ https://issues.apache.org/jira/browse/LUCENE-710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466126 ]

Doron Cohen commented on LUCENE-710:
------------------------------------

>   * Second, change how IndexFileDeleter works: have it keep track of
>     which commits are still live and which one is pending (as the
>     SegmentInfos in IndexWriter, not yet written to disk).
> 
>     Allow IndexFileDeleter to be subclassed to implement different
>     "deletion policies".
> 
>     The base IndexFileDeleter class will use ref counts to figure out
>     which individual index files are still referenced by one or more
>     "segments_N" commits or by the uncommitted "in-memory"
>     SegmentInfos.  Then the policy is invoked on commit (and also on
>     init) and can choose which commits (if any) to now remove.
> 
>     Add constructors to IndexWriter allowing you to pass in your own
>     deleter. The default policy would still be "delete all past
>     commits as soon as a new commit is written" (this is how deleting
>     happens today).
> 
>     For NFS we can then try different policies as discussed on those
>     threads above (there were at least 4 proposals).  They all have
>     different tradeoffs.  I would open separate issues for these
>     policies after this issue is resolved.
> 

This ties solving the NFS issue to an extendable file-deletion policy.
I am wondering whether this is the right way, or whether, perhaps, the
reference counting should be considered on its own, apart from the
deletion policy. (Would basing IndexFileDeleter on ref counts make it
simpler or harder to maintain?)

Also, IndexFileDeleter is doing delicate work - I'm not sure you want
applications to mess with it. Better to let applications control some
simple, well-defined behavior - much the same way a sorter allows
applications to provide a comparator, but keeps the sorting algorithm
to itself.
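To make the comparator analogy concrete, here is a minimal sketch of what such a split might look like: the application supplies only a small policy callback, while IndexFileDeleter keeps the deletion machinery to itself. The interface and class names here are purely illustrative, not an existing Lucene API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical policy callback: the application decides *which* commits
// may go; the deleter decides *how* files are actually removed.
interface CommitDeletionPolicy {
  // Given commit points ordered oldest to newest, return those safe to delete.
  List<String> selectForDeletion(List<String> commits);
}

// Default behavior, matching today's semantics: delete all past commits
// as soon as a new commit is written.
class KeepOnlyLastCommitPolicy implements CommitDeletionPolicy {
  public List<String> selectForDeletion(List<String> commits) {
    if (commits.size() <= 1) {
      return Collections.emptyList();
    }
    // Everything except the most recent commit is deletable.
    return new ArrayList<String>(commits.subList(0, commits.size() - 1));
  }
}
```

An application would implement CommitDeletionPolicy (e.g. "keep the last N") without ever touching the deleter's internals.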

Back to reference counting - how about the following approach:
- Add to Directory a FileReferenceCounter data member, with get()/set() etc.
- Add a class FileReferenceCounter with simple, general methods:
  void increment(String name)
  void decrement(String name)
  int getRefCount(String name)
- The default implementation would do nothing, i.e. would not record
  references, and would always return 0.
- IndexReader, upon opening a segment, would call increment(segName).
- IndexReader, upon closing a segment, would call decrement(segName).
- IndexFileDeleter, before removing a file belonging to a certain segment,
  would verify getRefCount(segName) == 0.
- Notice that the FileReferenceCounter is available from the Directory,
  so no new constructors need to be added to IndexWriter/IndexReader.
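The outline above could be sketched roughly as follows. This is only a sketch of the proposed (not existing) API; the no-op base class is the default, and the in-memory subclass shows what a real implementation might plug in via the Directory.

```java
import java.util.HashMap;
import java.util.Map;

// Proposed (hypothetical) utility held by Directory.  The base class is
// the default: it records nothing and always reports zero references,
// so IndexFileDeleter's current behavior is unchanged.
public class FileReferenceCounter {

  public void increment(String name) {}
  public void decrement(String name) {}
  public int getRefCount(String name) { return 0; }

  // One possible real implementation: an in-process reference table.
  public static class InMemory extends FileReferenceCounter {
    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    public synchronized void increment(String name) {
      Integer c = counts.get(name);
      counts.put(name, c == null ? 1 : c + 1);
    }

    @Override
    public synchronized void decrement(String name) {
      Integer c = counts.get(name);
      if (c == null) {
        return;               // nothing to decrement
      } else if (c <= 1) {
        counts.remove(name);  // last reference released
      } else {
        counts.put(name, c - 1);
      }
    }

    @Override
    public synchronized int getRefCount(String name) {
      Integer c = counts.get(name);
      return c == null ? 0 : c;
    }
  }
}
```

IndexFileDeleter would then simply skip any segment file for which getRefCount() > 0.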

So this adds to Directory a general file utility; no knowledge of index
structure is required in Directory. Also, IndexFileDeleter can remain as
it is today, and at some later point can be made more powerful with
various deletion policies - but those policies remain unrelated to the
NFS issue; they can focus on point-in-time concerns, which I think is
where the idea stemmed from.

An NFS-geared FileReferenceCounter would then be able to keep alive
"counter files", name those files based on the counted file name plus
process ID plus machine ID, base getRefCount() on a safety window since
the file was last touched, etc. All of this is kept out of the
point-in-time policies (how many points-in-time, and for how long, they
should be retained).
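Such an NFS-geared counter might look roughly like this. Everything here is an assumption for illustration: the counter-file naming scheme, the owner-ID format, and the safety window are all hypothetical, and real cross-machine coordination over NFS would need more care than this sketch shows.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical NFS-oriented variant: each live reference is a small
// "counter file" in the index directory, named after the counted file
// plus a per-process/per-machine owner ID.  A reference only counts if
// its counter file was touched within a safety window, so references
// left behind by crashed readers eventually expire.
public class NfsFileReferenceCounter {

  private final File dir;
  private final String ownerId;       // e.g. processID + "_" + machineID
  private final long safetyWindowMs;  // references older than this are stale

  public NfsFileReferenceCounter(File dir, String ownerId, long safetyWindowMs) {
    this.dir = dir;
    this.ownerId = ownerId;
    this.safetyWindowMs = safetyWindowMs;
  }

  private File counterFile(String name) {
    return new File(dir, name + ".ref." + ownerId);
  }

  public void increment(String name) throws IOException {
    File f = counterFile(name);
    f.createNewFile();                              // no-op if it already exists
    f.setLastModified(System.currentTimeMillis());  // keep the reference fresh
  }

  public void decrement(String name) {
    counterFile(name).delete();
  }

  // Count counter files for 'name' (from any process/machine) that were
  // touched recently enough to fall inside the safety window.
  public int getRefCount(String name) {
    long now = System.currentTimeMillis();
    int count = 0;
    File[] files = dir.listFiles();
    if (files != null) {
      for (File f : files) {
        if (f.getName().startsWith(name + ".ref.")
            && now - f.lastModified() <= safetyWindowMs) {
          count++;
        }
      }
    }
    return count;
  }
}
```

A long-running reader would periodically re-touch its counter files to stay inside the window.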

> Implement "point in time" searching without relying on filesystem semantics
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-710
>                 URL: https://issues.apache.org/jira/browse/LUCENE-710
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>
> This was touched on in recent discussion on dev list:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/41700#41700
> and then more recently on the user list:
>   http://www.gossamer-threads.com/lists/lucene/java-user/42088
> Lucene's "point in time" searching currently relies on how the
> underlying storage handles deletion of files that are held open for
> reading.
> This is highly variable across filesystems.  For example, UNIX-like
> filesystems usually do "delete on last close", and Windows filesystems
> typically refuse to delete a file open for reading (so Lucene retries
> later).  But NFS just removes the file out from under the reader, and
> for that reason "point in time" searching doesn't work on NFS
> (see LUCENE-673 ).
> With the lockless commits changes (LUCENE-701 ), it's quite simple to
> re-implement "point in time searching" so as to not rely on filesystem
> semantics: we can just keep more than the last segments_N file (as
> well as all files they reference).
> This is also in keeping with the design goal of "rely on as little as
> possible from the filesystem".  EG with lockless we no longer re-use
> filenames (don't rely on filesystem cache being coherent) and we no
> longer use file renaming (because on Windows it can fail).  This
> would be another step of not relying on semantics of "deleting open
> files".  The less we require from filesystem the more portable Lucene
> will be!
> Where it gets interesting is what "policy" we would then use for
> removing segments_N files.  The policy now is "remove all but the last
> one".  I think we would keep this policy as the default.  Then you
> could imagine other policies:
>   * Keep the past N days' worth
>   * Keep the last N
>   * Keep only those in active use by a reader somewhere (note: tricky
>     how to reliably figure this out when readers have crashed, etc.)
>   * Keep those "marked" as rollback points by some transaction, or
>     marked explicitly as a "snapshot".
>   * Or, roll your own: the "policy" would be an interface or abstract
>     class and you could make your own implementation.
> I think for this issue we could just create the framework
> (interface/abstract class for "policy" and invoke it from
> IndexFileDeleter) and then implement the current policy (delete all
> but most recent segments_N) as the default policy.
> In separate issue(s) we could then create the above more interesting
> policies.
> I think there are some important advantages to doing this:
>   * "Point in time" searching would work on NFS (it doesn't now
>     because NFS doesn't do "delete on last close"; see LUCENE-673 )
>     and any other Directory implementations that don't work
>     currently.
>   * Transactional semantics become a possibility: you can set a
>     snapshot, do a bunch of stuff to your index, and then rollback to
>     the snapshot at a later time.
>   * If a reader crashes or machine gets rebooted, etc, it could choose
>     to re-open the snapshot it had previously been using, whereas now
>     the reader must always switch to the last commit point.
>   * Searchers could search the same snapshot for follow-on actions.
>     Meaning, user does search, then next page, drill down (Solr),
>     drill up, etc.  These are each separate trips to the server, and if
>     the searcher has been re-opened, the user can get inconsistent
>     results (= lost trust).  But with this, one series of search
>     interactions could explicitly stay on the snapshot it started with.
