Re: Lucene 2.1, soon

Chuck Williams Thu, 18 Jan 2007 22:54:46 -0800

I need to support NFS and would not want to rely on the reader
refreshing in X minutes.  Setting X too small risks a query failure and
setting X too large wastes disk space.  X would need to be set for 100%
reader availability, implying a large value and a lot of disk space waste.


I like the idea of customizable delete policies in IndexFileDeleter.  My
current application does not have the need for multiple processes
accessing the same index, only many threads in a single process.  There
are multiple processes cooperating, but each has its own piece of the
index stored separately.  So, an in-memory reference count scheme would
work best.

The point is that different applications have different needs.  This
could be addressed well by ensuring that IndexFileDeleter is nicely
customizable and has a few common policies available.  The policies,
each with the kind of deployment that might use it:

  - delete immediately (current):  Linux or Windows app with a local
    file system
  - delete after obsolete for X minutes:  multiple processes sharing an
    index on NFS
  - keep in-memory reference counts:  single process with an index on
    NFS, or a more efficient strategy for a single process on Windows
  - keep persistent reference counts:  alternative solution for multiple
    processes with an index on NFS
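
A minimal sketch of how such pluggable policies might look, in Java.
Every name here (CommitInfo, DeletionPolicy, DeleteImmediatelyPolicy,
ExpireAfterObsoletePolicy) is hypothetical and is not Lucene API.  It
only illustrates the first two policies, with the expiry clock started
when a commit is obsoleted by a newer one rather than when it was
created.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical view of one commit point (a segments_N file); not a Lucene class. */
final class CommitInfo {
    final String segmentsFileName;
    /** Time (ms) at which a newer commit superseded this one; negative while it is the latest. */
    final long obsoletedAtMillis;

    CommitInfo(String segmentsFileName, long obsoletedAtMillis) {
        this.segmentsFileName = segmentsFileName;
        this.obsoletedAtMillis = obsoletedAtMillis;
    }

    boolean isLatest() {
        return obsoletedAtMillis < 0;
    }
}

/** Hypothetical policy hook: given all live commits, pick the ones to delete now. */
interface DeletionPolicy {
    List<CommitInfo> selectForDeletion(List<CommitInfo> commits, long nowMillis);
}

/** "Delete immediately": every commit except the latest is selected. */
final class DeleteImmediatelyPolicy implements DeletionPolicy {
    @Override
    public List<CommitInfo> selectForDeletion(List<CommitInfo> commits, long nowMillis) {
        List<CommitInfo> doomed = new ArrayList<>();
        for (CommitInfo c : commits) {
            if (!c.isLatest()) {
                doomed.add(c);
            }
        }
        return doomed;
    }
}

/** "Delete after obsolete for X minutes": select commits obsoleted longer ago than the limit. */
final class ExpireAfterObsoletePolicy implements DeletionPolicy {
    private final long maxObsoleteMillis;

    ExpireAfterObsoletePolicy(long maxObsoleteMillis) {
        this.maxObsoleteMillis = maxObsoleteMillis;
    }

    @Override
    public List<CommitInfo> selectForDeletion(List<CommitInfo> commits, long nowMillis) {
        List<CommitInfo> doomed = new ArrayList<>();
        for (CommitInfo c : commits) {
            // The clock starts when the commit was superseded, not when it was written.
            if (!c.isLatest() && nowMillis - c.obsoletedAtMillis > maxObsoleteMillis) {
                doomed.add(c);
            }
        }
        return doomed;
    }
}
```

Under this sketch the deleter itself would still own the per-file
bookkeeping; a policy only answers "which commits may go now", so
swapping policies would not touch the file-level logic.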

Reference count schemes might best be done at the Directory level,
analogous to what Linux does for open files (delete on last close).  So
long as all readers and the writer use the same Directory, it is easy to
keep reference counts.
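
A rough sketch of the in-memory variant, assuming a single process in
which every reader and the writer share one counter object.  The class
FileRefCounts and its methods are hypothetical, not an existing Lucene
or Directory API.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory reference counter for index files; not part of Lucene. */
final class FileRefCounts {
    private final Map<String, Integer> counts = new HashMap<>();

    /** Called when a reader (or the current commit) starts using a file. */
    synchronized void incRef(String fileName) {
        counts.merge(fileName, 1, Integer::sum);
    }

    /**
     * Called when a user of the file is done with it.
     * Returns true when the count reaches zero, i.e. the file may now be deleted.
     */
    synchronized boolean decRef(String fileName) {
        Integer n = counts.get(fileName);
        if (n == null) {
            throw new IllegalStateException("no references held on " + fileName);
        }
        if (n == 1) {
            counts.remove(fileName);
            return true;
        }
        counts.put(fileName, n - 1);
        return false;
    }

    /** Current count for a file; zero if nobody holds it. */
    synchronized int refCount(String fileName) {
        return counts.getOrDefault(fileName, 0);
    }
}
```

A reader would call incRef on each file of the commit it opens and
decRef on close; the writer would do the same for the commit it keeps
live, and physically delete a file only when decRef returns true.  The
counts live only in memory, so this does not help separate processes.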

Perhaps IndexFileDeleter should be integrated into Directory?

Of course one might complain that this is throwing in the towel,
implementing a bunch of options instead of one elegant solution.

Chuck


Michael McCandless wrote on 01/18/2007 03:37 PM:
> Doron Cohen wrote:
>> I am not happy with complicating the readers like this, conceptually
>> adding back commit locks (for deletion), this time with a keep-alive
>> thread, and again making readers not read-only.
>>
>> To my understanding the only remaining issue with NFS is: a reader
>> might get an IO exception if the writer removes an old file that
>> the reader is using.
>>
>> We are not trying to solve a possible corruption, right?
>>
>> For that, I think it is not worth adding that stuff again.
>>
>> A writer's "two steps" policy - delete only files that "would not
>> have been in use unless a reader did not refresh for X minutes" - is
>> "fair enough" I think.
>>
>> By "two steps" I mean: start measuring time not from when the segment
>> to be deleted was created, but rather from when its "next generation"
>> was created.
>
> Right, this was my original proposed deletion policy (below) for
> things to work on NFS.
>
> It does assume/require that your application can refresh readers
> within the specified time period.  A commit (and any segments whose
> ref counts then drop to zero) is deleted after it has been "obsoleted"
> for more than X minutes.
>
> Even though it's not perfect (progress not perfection!), I like it the
> best of the three options discussed on this thread so far because 1)
> it leaves the readers read only, and 2) it should work on all versions
> of NFS.
>
> This would just be a different deletion policy, and it wouldn't be the
> default one.  We would leave the default as "keep only last commit
> and delete old one immediately", for backwards compatibility.
>
> Finally, an application can always make its own deletion policy
> (subclass IndexFileDeleter) if it needs to.
>
> Mike
>
>> Michael McCandless <[EMAIL PROTECTED]> wrote on 18/01/2007 14:24:16:
>>
>>> Marvin Humphrey wrote:
>>>> On Jan 17, 2007, at 1:16 PM, Michael McCandless wrote:
>>>>
>>>>> This is the solution I have in mind for LUCENE-710: change the
>>>>> IndexFileDeleter so that instead of always immediately deleting the
>>>>> last commit when a new commit happens, allow some time before doing
>>>>> so.  This way readers have a chance to refresh.  The actual time
>>>>> would be settable by the developer.  So if you set it to 6 hours,
>>>>> then, a commit would remain usable for at least 6 hours after it
>>>>> had been obsoleted by a new commit.  This means if you can ensure
>>>>> your readers refresh within 6 hours of a new commit happening, then
>>>>> the writer will never delete an "in-use" commit.
>>>> I've been mulling this over.  If you set the interval to 6 hours, and
>>>> there's a lot of churn (e.g. if you optimize frequently), you'll
>>>> end up with a lot of wasted disk space.  On the flip side, the user
>>>> has to set up some sort of trigger for refreshing the IndexReaders
>>>> anyway.  It's still not user-friendly by default, and we'd be
>>>> polluting the API with a hateful workaround.
>>> Well, 6 hours would be a long time for such a high turnover site.
>>> They would presumably set the time to something like 10 minutes
>>> instead.
>>>
>>> I think we should decouple the deletion policy from commits.  This way
>>> developers could subclass and make their own deletion policy that
>>> suits their application.  The IndexFileDeleter base class would do all
>>> the legwork to keep ref counts to all specific index files based on
>>> all segments_N commits that are still "live".  Then the deletion
>>> policy just decides which commits should be deleted, when.  (This is
>>> roughly what's outlined in LUCENE-710).
>>>
>>> The current policy is to delete all prior commits after a new commit
>>> and that would remain the default.
>>>
>>> Chuck's idea (reference counting via filesystem) would be another
>>> policy.  My proposal (delete by time after being obsoleted) would be
>>> another policy, etc.
>>>
>>>> The real problem is NFS.  For background, see
>>>> <http://nfs.sourceforge.net/#section_d>, item D2, which deals with NFS
>>>> and "delete on last close".
>>>>
>>>> Now I wonder.  Version 4 of the NFS protocol introduces state, so it's
>>>> possible to implement file locking.  Can we lock a segments file, then
>>>> have IndexFileDeleter detect which segments are locked that way?  And
>>>> if that's the case, can we detect whether the locking mechanism is
>>>> failing and throw an exception if someone tries to use an earlier
>>>> version of NFS?
>>> Locking and NFS makes me very nervous :)
>>>
>>>> I'd be cool with making it impossible to put an index on an NFS volume
>>>> prior to version 4.  That puts the blame where it belongs.
>>> Well, most times users have no control over which NFS server and/or
>>> client version is in use, so I think taking this approach of "pinning
>>> the blame" can only hurt our users.  I would rather find a solution
>>> that's more portable, if we can (like the ref counting idea Chuck
>>> brought up).
>>>
>>> Mike
>>>


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
