[jira] Updated: (LUCENE-710) Implement "point in time" searching without relying on filesystem semantics

Doron Cohen (JIRA) Thu, 15 Mar 2007 02:09:31 -0800

     [ https://issues.apache.org/jira/browse/LUCENE-710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Doron Cohen updated LUCENE-710:
-------------------------------

    Attachment: 710.review.diff

I was too slow in reviewing this, so while I was studying the new code it was 
committed... 

Anyhow I have a few comments and a question - I think JIRA LUCENE-710 is still 
the place for this discussion even though the issue is already resolved. 

The attached 710.review.diff implements a few suggested changes.

I like the definition and use of IndexDeletePolicy and CommitPoint - this is 
very flexible and clear, and would indeed allow implementing NFS-suited logic. 
These two concepts are central to implementing such logic, and I thought their 
Javadocs should be enhanced (included in the attached).
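
To make the idea of a pluggable deletion policy concrete, here is a minimal 
sketch of a "keep the last N commits" policy. The interfaces CommitPoint and 
DeletePolicy below are hypothetical stand-ins written for this example; they 
are not the classes from the committed patch, and the method names and the 
oldest-first ordering of the commit list are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a commit point (one segments_N file plus the
// files it references). Not the actual class from the patch.
interface CommitPoint {
    String getSegmentsFileName();
    void delete(); // request deletion of this commit
}

// Hypothetical stand-in for the policy callback interface.
interface DeletePolicy {
    // Assumption: commits are passed oldest first, newest last.
    void onCommit(List<? extends CommitPoint> commits);
}

// Keeps the newest `keep` commits and requests deletion of all older ones.
final class KeepLastNPolicy implements DeletePolicy {
    private final int keep;

    KeepLastNPolicy(int keep) {
        // Refuse keep < 1 so the most recent commit is never deleted.
        if (keep < 1) {
            throw new IllegalArgumentException("keep must be >= 1");
        }
        this.keep = keep;
    }

    @Override
    public void onCommit(List<? extends CommitPoint> commits) {
        int toDelete = commits.size() - keep; // may be <= 0: nothing to do
        for (int i = 0; i < toDelete; i++) {
            commits.get(i).delete();
        }
    }
}

// In-memory commit used only to exercise the policy in this example.
final class FakeCommit implements CommitPoint {
    private final String name;
    boolean deleted = false;

    FakeCommit(String name) { this.name = name; }

    @Override
    public String getSegmentsFileName() { return name; }

    @Override
    public void delete() { deleted = true; }
}

class KeepLastNDemo {
    public static void main(String[] args) {
        List<FakeCommit> commits = new ArrayList<>();
        for (int i = 1; i <= 4; i++) {
            commits.add(new FakeCommit("segments_" + i));
        }
        new KeepLastNPolicy(2).onCommit(commits);
        for (FakeCommit c : commits) {
            System.out.println(c.getSegmentsFileName()
                    + (c.deleted ? " deleted" : " kept"));
        }
    }
}
```

With keep = 1 this sketch behaves like the current default of "remove all but 
the last one"; a larger value leaves older commits in place, so a reader on 
NFS could keep using the files of the commit it opened.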

IndexFileDeleter - it is nice that this became non-public and somewhat simpler. 
I added some internal documentation (not javadocs) in that file as I learned 
how it works. I think this would be useful for others diving into this code. I 
also modified some variable names for clarity (in the attached). 

I don't yet understand why we allow a deletion policy to delete *all* commits 
(including the most recent). TestDeletionPolicy explains this as: "This is 
useful for adding to a big index w/ autoCommit =false when you know readers are 
not using it." So, would I risk losing the big index should the uncommitted 
additions fail? What does one gain by this? At first I thought we should 
prevent deleting the most recent commit (by throwing an exception), but I must 
be missing something - could you elaborate on this?

checkpoint() is another - more internal - new concept in this code. As I write 
this I don't fully understand it. IndexWriter has its own checkpoint() method, 
but it also calls IndexFileDeleter.checkpoint(). IndexReader only calls 
IndexFileDeleter.checkpoint() - it does not have a checkpoint() of its own. 
...mmm... For IndexReader this makes sense, since it only ever commits at 
close() or on explicit calls to commit(). Perhaps I understand it better 
now... OK, I added some documentation for this in IndexWriter; I think it would 
also help others (in the attached).

This issue also introduced constants for file names. hasSingleNorms (i.e. nrm) 
and SINGLE_NORMS_EXTENSION (.fN) were confusing/colliding, so I renamed the 
latter to PLAIN_NORMS_EXTENSION.

This issue moved some file logic into SegmentInfo. The -1/1/0 logic, especially 
for norms, is confusing, and I at least have to re-read the code carefully each 
time to be convinced that it is correct. It would be nice if we could get rid 
of some of the backward compatibility cases here. Anyhow, I added some 
documentation and also replaced the -1/1/0 values with constants; I think this 
makes it easier to understand.
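
As an illustration of replacing bare -1/1/0 values with named constants, here 
is a small sketch. The class name FileFlag, the constant names, and the meaning 
assigned to each value are assumptions made for this example; they are not 
taken from the attached diff.

```java
// Illustrative only: names and meanings are assumptions for this sketch.
final class FileFlag {
    static final int NO = -1;        // the file is known not to exist
    static final int YES = 1;        // the file is known to exist
    static final int CHECK_DIR = 0;  // unknown: must look in the directory

    private FileFlag() {}

    // Resolves a tri-state flag to a boolean, consulting the directory
    // lookup result only when the flag itself is undecided.
    static boolean resolve(int flag, boolean existsInDirectory) {
        switch (flag) {
            case YES:
                return true;
            case NO:
                return false;
            case CHECK_DIR:
                return existsInDirectory;
            default:
                throw new IllegalArgumentException("unknown flag: " + flag);
        }
    }
}

class FileFlagDemo {
    public static void main(String[] args) {
        System.out.println(FileFlag.resolve(FileFlag.YES, false));
        System.out.println(FileFlag.resolve(FileFlag.CHECK_DIR, true));
    }
}
```

A comparison such as flag == FileFlag.CHECK_DIR states its intent directly, 
whereas flag == 0 forces the reader to recall what 0 stands for.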

Regards,
Doron


> Implement "point in time" searching without relying on filesystem semantics
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-710
>                 URL: https://issues.apache.org/jira/browse/LUCENE-710
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>         Attachments: 710.review.diff, LUCENE-710.patch, 
> LUCENE-710.take2.patch, LUCENE-710.take3.patch
>
>
> This was touched on in recent discussion on dev list:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/41700#41700
> and then more recently on the user list:
>   http://www.gossamer-threads.com/lists/lucene/java-user/42088
> Lucene's "point in time" searching currently relies on how the
> underlying storage handles deletion of files that are held open for
> reading.
> This is highly variable across filesystems.  For example, UNIX-like
> filesystems usually do "delete on last close", and Windows filesystems
> typically refuse to delete a file open for reading (so Lucene retries
> later).  But NFS just removes the file out from under the reader, and
> for that reason "point in time" searching doesn't work on NFS
> (see LUCENE-673).
> With the lockless commits changes (LUCENE-701), it's quite simple to
> re-implement "point in time searching" so as to not rely on filesystem
> semantics: we can just keep more than the last segments_N file (as
> well as all files they reference).
> This is also in keeping with the design goal of "rely on as little as
> possible from the filesystem".  E.g., with lockless commits we no longer
> re-use filenames (don't rely on filesystem cache being coherent) and we no
> longer use file renaming (because on Windows it can fail).  This
> would be another step of not relying on semantics of "deleting open
> files".  The less we require from filesystem the more portable Lucene
> will be!
> Where it gets interesting is what "policy" we would then use for
> removing segments_N files.  The policy now is "remove all but the last
> one".  I think we would keep this policy as the default.  Then you
> could imagine other policies:
>   * Keep the past N days' worth
>   * Keep the last N
>   * Keep only those in active use by a reader somewhere (note: tricky
>     how to reliably figure this out when readers have crashed, etc.)
>   * Keep those "marked" as rollback points by some transaction, or
>     marked explicitly as a "snapshot".
>   * Or, roll your own: the "policy" would be an interface or abstract
>     class and you could make your own implementation.
> I think for this issue we could just create the framework
> (interface/abstract class for "policy" and invoke it from
> IndexFileDeleter) and then implement the current policy (delete all
> but most recent segments_N) as the default policy.
> In separate issue(s) we could then create the above more interesting
> policies.
> I think there are some important advantages to doing this:
>   * "Point in time" searching would work on NFS (it doesn't now
>     because NFS doesn't do "delete on last close"; see LUCENE-673)
>     and any other Directory implementations that don't work
>     currently.
>   * Transactional semantics become a possibility: you can set a
>     snapshot, do a bunch of stuff to your index, and then rollback to
>     the snapshot at a later time.
>   * If a reader crashes or machine gets rebooted, etc, it could choose
>     to re-open the snapshot it had previously been using, whereas now
>     the reader must always switch to the last commit point.
>   * Searchers could search the same snapshot for follow-on actions.
>     Meaning, user does search, then next page, drill down (Solr),
>     drill up, etc.  These are each separate trips to the server and if
>     searcher has been re-opened, user can get inconsistent results (=
>     lost trust).  But with this, one series of search interactions could
>     explicitly stay on the snapshot it had started with.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


