robert engels <[EMAIL PROTECTED]> wrote on 15/01/2007 16:37:35:

> I did a cursory review of the discussion.
>
> The problem I see is that in the checkpoint tx files you need a
> 'delete file' for every segment where a deletion SHOULD occur when it
> is committed, but if you have multiple open transactions, then as
> soon as one is applied (committed), the deletions being tracked in
> the other txs are no longer valid. This would imply that only a
> single tx can be active, and if that is the case, there are easier
> methods.

My example (scenario A) can in fact be a non-database application (the
other characteristics remain).

As a database application, to my understanding the (newly suggested)
transaction support in Lucene is single-tx. I can't see how multiple txs
could be done within Lucene (and I don't think they should be). Even if it
were possible, I think indexing would become very inefficient. I think the
motivation for adding (some) tx support is different, and that tx support
would be minimal - definitely not multiple txs.

>
> Simple example:
>
> Consider the index with documents A (doc 0), and B (doc 1) stored in
> a single segment (S1).
>
> User 1 opens a tx to modify I, deletes A and inserts C. The tx needs
> to track that doc 0 is deleted and C is added.
> User 2 opens a tx to modify I, deletes B and inserts D. The tx needs
> to track that doc 1 is deleted and D is added.
>
> tx1 commits and optimizes, creating a new segment S2 containing B
> (doc 0) and C (doc 1).
>
> Now tx2 commits:
>
> It cannot apply its changes to S2, since if doc 1 is deleted there,
> document C will be removed from the index.
>
> It can't figure out what doc numbers in S2 correspond to doc numbers
> in S1 either (since there is no unique key).
>
> How do you propose to solve this? Am I missing something here?
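The conflict can be reproduced without Lucene at all. Here is a minimal
Python sketch (not Lucene code - the `optimize` function and all names are
illustrative) that tracks deletions by internal doc number, the way the
proposed tx files would, and shows tx2 deleting the wrong document after
tx1's merge renumbers everything:

```python
# Simulate a segment as an ordered list of documents; doc numbers are
# positions in the list, analogous to Lucene's internal doc ids.

def optimize(segment, deleted, added):
    """Apply deletions (by doc number), append new docs, and renumber."""
    survivors = [doc for i, doc in enumerate(segment) if i not in deleted]
    return survivors + added

# Initial segment S1: A is doc 0, B is doc 1.
s1 = ["A", "B"]

# tx1: delete A (doc 0), add C.  tx2: delete B (doc 1), add D.
tx1 = {"deleted": {0}, "added": ["C"]}
tx2 = {"deleted": {1}, "added": ["D"]}

# tx1 commits and optimizes: new segment S2 = [B, C]; B is now doc 0.
s2 = optimize(s1, tx1["deleted"], tx1["added"])
print(s2)  # ['B', 'C']

# tx2 now commits, still believing B is doc 1.  Applying its tracked
# doc number to S2 deletes C instead of B - the wrong document.
s3 = optimize(s2, tx2["deleted"], tx2["added"])
print(s3)  # ['B', 'D'] - C is gone, B survives: tx2 corrupted the index
```

Without a unique key stored per document there is no way for tx2 to remap
its tracked doc numbers from S1 to S2.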
>
>
> On Jan 15, 2007, at 6:01 PM, Doron Cohen wrote:
>
> >
> > Note: discussion started originally in
> > http://www.nabble.com/adding-%22explicit-commits%22-to-Lucene--t3011270.html
> >
> >
> > robert engels <[EMAIL PROTECTED]> wrote on 15/01/2007 13:23:14:
> >
> >> I think that you will find a much larger performance decrease in
> >> doing things this way - if the external resource is a db, or any
> >> network-accessed resource.
> >
> > That's possible - let's see - I had two scenarios in mind:
> >
> > Scenario A:
> > (1) documents are maintained in a database;
> > (2) each document is not small - it may have lots of text;
> > (3) some parsing may be required before tokenizing;
> > (4) there are many documents;
> > (5) there is a specific document property <P> which is used during
> > search,
> > for, say - filtering.
> > (6) <P> may change often, and re-indexing documents just because of
> > that is
> > too expensive.
> >
> > Scenario B:
> > (1) to (5) as above.
> > (6) <P> is computable from (say) a database, but is unsteady and
> > cannot be
> > indexed - this is the case that came out in
> > http://www.gossamer-threads.com/lists/lucene/java-user/44122
> > (7) <P> cannot be used for post filtering, because search results
> > without
> > that filtering might return too many false results.
> > (8) there may be (frequent) document updates.
> >
> > If we accept the assumption that maintaining the changing value in the
> > index is either impossible or inefficient, the application would
> > strive to
> > maintain these values outside of Lucene. Then:
> >
> > Scenario A:
> > - Two 'arrays' can be maintained - <P>, and <B> (== isDeleted).
> > - At search time, <P> and <B> are read and used to construct a
> > filter, possibly as an (appropriate) IndexReader implementation,
> > and used with e.g. ParallelReader.
> > - When calling optimize, the two arrays are used to create the
> > 'next' two arrays.
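The Scenario A bookkeeping above can be sketched in a few lines of plain
Python (a simulation under stated assumptions, not the IndexReader/
ParallelReader integration itself; `make_filter`, `filtered_hits`, and
`remap_after_optimize` are hypothetical names):

```python
# Doc-number-indexed side arrays maintained outside the index:
# P[doc] is the frequently changing property, B[doc] the isDeleted flag.
P = [10, 25, 7, 42]
B = [False, True, False, False]

def make_filter(p_min):
    """At search time, combine the external arrays into a per-doc filter:
    a doc passes if it is not externally deleted and <P> meets the cut."""
    return [not b and p >= p_min for p, b in zip(P, B)]

def filtered_hits(raw_hits, flt):
    """Drop raw hits (internal doc numbers) that the filter rejects."""
    return [d for d in raw_hits if flt[d]]

def remap_after_optimize(arr, deleted_docs):
    """At optimize, deleted docs vanish and survivors are renumbered, so
    the side arrays must be compacted identically to stay aligned."""
    return [v for i, v in enumerate(arr) if i not in deleted_docs]

flt = make_filter(p_min=10)
print(filtered_hits([0, 1, 2, 3], flt))  # [0, 3]
print(remap_after_optimize(P, {1}))      # [10, 7, 42]
```

The key invariant is that the side arrays are compacted with exactly the
same deletions, in the same order, as the index itself - otherwise every
entry after the first mismatch points at the wrong document.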
> >
> > Scenario B:
> > - After thinking more about this, I agree with you for this scenario -
> > updating the mapping after optimize is not trivial. (I first
> > thought that
> > this should be easy since we know which docs were updated, but I
> > now see
> > that this would be very expensive.)
> >
> > So, this leaves one scenario that would benefit from this. If
> > Lucene had in-place updates this would not be needed, but that does
> > not seem likely in the near future.
> >
> > Okay, this is quite a strained example, but a real one.
> >
> > If people think this may indeed be useful, I can go on and see what
> > it means to implement something like this.
> >
> >>
> >> When even just a single document is changed in the Lucene index you
> >> could have MILLIONS of changes to internal doc ids (if say an early
> >> document was deleted).
> >>
> >> Seems far better to store the external id in the Lucene index.
> >>
> >> You will find that the performance penalty of looking up the Lucene
> >> document by the external id (vs. the internal doc #) is far less
> >> than the performance penalty of updating every document in the
> >> external index when the Lucene index is merged.
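That trade-off is easy to see in a simulation. In this Python sketch (again
illustrative, not Lucene API - a real implementation would do a single term
lookup on the stored id field), the stable external id survives a merge
while the internal doc number does not:

```python
# Each document carries its stable external id (as Lucene would store it
# in an indexed field); internal doc numbers are just list positions.
docs = ["id-A", "id-B", "id-C"]  # doc 0, 1, 2

def lookup(docs, external_id):
    """Find the current internal doc number by external id - the analogue
    of one term lookup on the id field."""
    return docs.index(external_id)

print(lookup(docs, "id-C"))  # 2

# A merge deletes doc 0 and renumbers everything that follows...
docs = [d for i, d in enumerate(docs) if i != 0]

# ...but the external-id lookup still finds the right document, with no
# updates needed in any external store.
print(lookup(docs, "id-C"))  # 1
```

One cheap lookup per access versus rewriting an externally held doc-number
mapping after every merge is the comparison being made here.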
> >>
> >> The only case where I can see this being of any benefit is if the
> >> Lucene index RARELY if EVER changes - anything else, and you will
> >> have big problems.
> >>
> >> Now, if Lucene is changed to support point-in-time searching
> >> (basically never deleting any index files), you might be able to do
> >> what you describe. Just create a Directory exposing only the
> >> segments up to that time.
> >>
> >> Sounds VERY messy to me.
> >>

