robert engels <[EMAIL PROTECTED]> wrote on 15/01/2007 16:37:35:

> I did a cursory review of the discussion.
>
> The problem I see is that in the checkpoint tx files you need a
> 'delete file' for every segment where a deletion SHOULD occur when it
> is committed, but if you have multiple open transactions being
> created, as soon as one is applied (committed), the deletions being
> tracked in the other tx are no longer valid. This would imply that
> only a single tx can be active, and if that is the case, there are
> easier methods.
My example (scenario A) can in fact be a non-database application (the
other characteristics remain). As a database application, to my
understanding the (newly suggested) transaction support in Lucene is
single-tx. I can't see how multiple tx can be done within Lucene (and I
don't think it should be done). Even if it were possible, I think indexing
would become very inefficient. I think the motivation for adding (some) tx
support is different, and tx support would be minimal, definitely not
multiple tx.

> Simple example:
>
> Consider the index with documents A (doc 0), and B (doc 1) stored in
> a single segment (S1).
>
> User 1 opens a tx to modify I, deletes A and inserts C. tx needs to
> track that 0 is deleted, C is added.
> User 2 opens a tx to modify I, deletes B and inserts D. tx needs to
> track that 1 is deleted, D is added.
>
> tx1 commits, and optimizes, creating a new segment S2, containing B
> (doc 0), C (doc 1).
>
> Now tx2 commits:
>
> It cannot apply its changes to S2, since if doc 1 is deleted,
> document C will be removed from the index.
>
> It can't figure out what doc numbers in S2 correspond to doc numbers
> in S1 either (since there is no unique key).
>
> How do you propose to solve this? Am I missing something here?
>
> On Jan 15, 2007, at 6:01 PM, Doron Cohen wrote:
>
> > Note: discussion started originally in
> > http://www.nabble.com/adding-%22explicit-commits%22-to-Lucene--t3011270.html
> >
> > robert engels <[EMAIL PROTECTED]> wrote on 15/01/2007 13:23:14:
> >
> >> I think that you will find a much larger performance decrease in
> >> doing things this way - if the external resource is a db, or any
> >> networked accessed resource.
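Robert's S1/S2 example can be sketched in a few lines of plain Java (a toy simulation, not Lucene code - segments are modeled as string lists and internal doc numbers as list positions):

```java
import java.util.ArrayList;
import java.util.List;

public class StaleDocIds {
    // A segment is modeled as a list of documents; a doc number is a position.
    static String docAt(List<String> segment, int docNum) {
        return segment.get(docNum);
    }

    public static void main(String[] args) {
        // Segment S1: doc 0 = A, doc 1 = B.
        List<String> s1 = new ArrayList<>(List.of("A", "B"));

        // tx2 records its intent by internal doc number: "delete doc 1" (B).
        int tx2Deletion = 1;

        // tx1 commits first: deletes A, adds C, then optimizes. The merge
        // compacts out the deletion, renumbering the surviving documents.
        List<String> s2 = new ArrayList<>(s1);
        s2.remove("A");
        s2.add("C"); // S2: doc 0 = B, doc 1 = C

        // tx2's recorded doc number now points at a different document.
        System.out.println("tx2 meant to delete B, would actually delete "
                + docAt(s2, tx2Deletion));
    }
}
```

Running this prints that tx2 would delete C rather than B - exactly the conflict described above, and with no unique key stored there is nothing to remap the stale number with.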
> >
> > That's possible - let's see - I had two scenarios in mind:
> >
> > Scenario A:
> > (1) documents are maintained in a database;
> > (2) each document is not small - it may have lots of text;
> > (3) some parsing may be required before tokenizing;
> > (4) there are many documents;
> > (5) there is a specific document property <P> which is used during
> > search, for, say - filtering;
> > (6) <P> may change often, and re-indexing documents just because of
> > that is too expensive.
> >
> > Scenario B:
> > (1) to (5) as above.
> > (6) <P> is computable from (say) a database, but is unsteady and
> > cannot be indexed - this is the case that came out in
> > http://www.gossamer-threads.com/lists/lucene/java-user/44122
> > (7) <P> cannot be used for post-filtering, because search results
> > without that filtering might return too many false results.
> > (8) there may be (frequent) document updates.
> >
> > If we accept the assumption that maintaining the changing value in the
> > index is either impossible or inefficient, the application would
> > strive to maintain these values outside of Lucene. Then:
> >
> > Scenario A:
> > - Two 'arrays' can be maintained - <P>, and <B> (== isDeleted).
> > - At search, <P> and <B> are read and used to construct a filter,
> > possibly as an (appropriate) IndexReader implementation, and used
> > with e.g. ParallelReader.
> > - When calling optimize, the two arrays are used for creating the
> > 'next' two arrays.
> >
> > Scenario B:
> > - After thinking more about this, I agree with you for this scenario -
> > updating the mapping after optimize is not trivial. (I first thought
> > that this should be easy since we know which docs were updated, but I
> > now see that this would be very expensive.)
> >
> > So, this leaves one scenario that would benefit from this. If Lucene
> > had in-place update that would not be needed, but this does not seem
> > likely in the near future.
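The Scenario A bookkeeping can be sketched without touching the index at all. This is a minimal illustration, not Lucene code: the method and array names are made up, and the idea of producing a BitSet of passing docs assumes a filter layer like Lucene's Filter class of that era (whose bits(IndexReader) returned a java.util.BitSet):

```java
import java.util.BitSet;

public class ExternalPropertyFilter {
    // p[doc]  : the frequently-changing property <P>, kept outside the index
    // deleted : the externally maintained isDeleted bits <B>
    // wanted  : the <P> value the current query filters on
    static BitSet buildBits(int[] p, BitSet deleted, int wanted) {
        BitSet bits = new BitSet(p.length);
        for (int doc = 0; doc < p.length; doc++) {
            // A doc passes only if it is externally live and matches <P>.
            if (!deleted.get(doc) && p[doc] == wanted) {
                bits.set(doc);
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        int[] p = {7, 3, 7, 7};  // <P> values for docs 0..3
        BitSet deleted = new BitSet();
        deleted.set(2);          // doc 2 was deleted externally
        System.out.println(buildBits(p, deleted, 7)); // docs 0 and 3 pass
    }
}
```

The "at optimize, create the 'next' two arrays" step would then be a second pass that copies p and deleted into new arrays indexed by the post-merge doc numbers.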
> >
> > Okay, this is quite a stressed example, but a real one.
> >
> > If people think that this may indeed be useful, I can go on and see
> > what it means to implement something like this.
> >
> >> When even just a single document is changed in the Lucene index you
> >> could have MILLIONS of changes to internal doc ids (if, say, an
> >> early document was deleted).
> >>
> >> Seems far better to store the external id in the Lucene index.
> >>
> >> You will find that the performance penalty of looking up the Lucene
> >> document by the external id (vs. the internal doc #) is far less
> >> than the performance penalty of updating every document in the
> >> external index when the Lucene index is merged.
> >>
> >> The only case where I can see this being of any benefit is if the
> >> Lucene index RARELY if EVER changes - anything else, and you will
> >> have big problems.
> >>
> >> Now, if Lucene is changed to support point-in-time searching
> >> (basically never deleting any index files), you might be able to do
> >> what you want. Just create a Directory containing only the segments
> >> up to that time.
> >>
> >> Sounds VERY messy to me.
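Robert's alternative - store the external id in the index and resolve it to the current internal doc number at search time - can be sketched as follows. This is a toy map-based stand-in: in real Lucene the resolution would be a term lookup on an indexed id field (e.g. via IndexReader.termDocs), and the point is only that addressing by external id survives renumbering:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ExternalIdLookup {
    // Build the external-id -> internal-doc-number map from the current
    // segment contents. Re-derived after each merge, so lookups stay valid
    // even though internal doc numbers have shifted.
    static Map<String, Integer> rebuild(List<String> segmentIds) {
        Map<String, Integer> idToDoc = new HashMap<>();
        for (int doc = 0; doc < segmentIds.size(); doc++) {
            idToDoc.put(segmentIds.get(doc), doc);
        }
        return idToDoc;
    }

    public static void main(String[] args) {
        // Before the merge: S1 holds ids A (doc 0) and B (doc 1).
        Map<String, Integer> before = rebuild(List.of("A", "B"));
        // After tx1's commit + optimize: S2 holds B (doc 0) and C (doc 1).
        Map<String, Integer> after = rebuild(List.of("B", "C"));

        // B's internal number changed, but its external id did not.
        System.out.println("B was doc " + before.get("B")
                + ", is now doc " + after.get("B"));
    }
}
```

This is the trade-off named above: an extra per-document lookup at search time, versus rewriting the entire external mapping every time a merge renumbers the index.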