I uploaded a patch to LUCENE-3424 which implements sequence ids for IW. Add, update and delete each return a long seqID for every operation, and commit returns the largest committed seq id.
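
To make that contract concrete, here is a minimal sketch. It is not the code from the patch: `SeqIdWriterSketch` and its method signatures are hypothetical stand-ins, and the sketch ignores flushing and everything else a real IW does. It only shows that every operation gets a monotonically increasing long and that commit reports the largest id it covers.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for a writer that hands out sequence ids.
// NOT the LUCENE-3424 patch and NOT IndexWriter's real API.
class SeqIdWriterSketch {
    // Next sequence id to hand out; shared by all indexing threads.
    private final AtomicLong nextSeqId = new AtomicLong(0);

    // Each operation returns the sequence id assigned to it.
    long addDocument(String doc) {
        return nextSeqId.getAndIncrement();
    }

    long updateDocument(String id, String doc) {
        return nextSeqId.getAndIncrement();
    }

    long deleteDocuments(String id) {
        return nextSeqId.getAndIncrement();
    }

    // Returns the largest sequence id handed out so far, or -1 if there
    // were no operations. A real writer would have to make this atomic
    // with the actual flush/commit; the sketch skips that entirely.
    synchronized long commit() {
        return nextSeqId.get() - 1;
    }
}
```

An app could then treat every log entry with a seq id less than or equal to the value returned by commit as durable, under the assumption that the real commit gives that guarantee.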
When writing transaction logs or a journal (however you wanna call it), the biggest problem is that in a multithreaded environment operations on the IW don't return in order, so you basically have two options: 1. build up a barrier and synchronize the operations as they arrive, or 2. somehow sort the logs once they need to be applied.

The first option seems like a total waste and kills concurrency entirely. Some apps might be able to tell if two events are independent and guarantee the order of dependent events on top of IW (like ES does). Yet for Lucene in general this is not always true since we don't have a fixed primary key.

The second option provides nice concurrency and optimizes for the non-failure case. Unless you need to replay the logs, the concurrency overhead stays minimal. If logs are replayed, it needs to be done in two or more steps (1. re-sort the seq ids & offsets, 2. read the entries in order based on 1.). The biggest issue I see here is that you cannot read the logs sequentially from disk, which is almost certainly a perf hit. In a real-world system there could even be a background process that reorders / compacts the logs.

When we replicate documents to another machine with a leader per shard, which seems the way Solr goes (and ES is doing too?), sequence ids can be used to disambiguate documents with the same ID if you keep track of the ids you indexed in your current session. For instance, if you update doc X with seq id N but you already saw doc X with seq id N+1, you can simply drop it.

I would be interested in feedback, especially on the transaction log ordering.

simon

On Mon, Sep 12, 2011 at 1:04 AM, Michael McCandless <[email protected]> wrote:
> I agree: we should figure out just how an app would effectively make
> use of this seq ID, in order to understand if this really is gonna
> "work" end to end.  Else we shouldn't change Lucene's core APIs.
>
> EG: could ES remove its lock array if Lucene returned a seq ID?
> How "bad" is it that ES/Solr/this-new-module would have to order their
> transaction log according to Lucene's seq ID?  Or maybe it would not
> re-order, but rather write the seqID+document in each entry; then on
> playback (but also on RT get) it'd have to re-order?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Sat, Sep 10, 2011 at 1:45 PM, Simon Willnauer
> <[email protected]> wrote:
>> On Thu, Sep 8, 2011 at 5:35 PM, Yonik Seeley <[email protected]>
>> wrote:
>>> On Thu, Sep 8, 2011 at 11:26 AM, Michael McCandless
>>> <[email protected]> wrote:
>>>> Returning a long seqID seems the least invasive change to make this
>>>> total ordering possible?  Especially since the DWDQ already computes
>>>> this order...
>>>
>>> +1
>>> This seems like the most powerful option.
>>
>> I still wonder how we make efficient use of this. If you are ordering
>> the logs based on the returned sequence Ids you have to effectively
>> delay writing to the log since documents ie. their threads come back
>> async and out of order. Even worse if some thread picks up a flush it
>> might block for a reasonable amount of time. I am not saying its
>> impossible but before we jump on it and get into the DWPT hassle we
>> should at least sketch out how to make use of this feature (lemme tell
>> you this is not trivial to implement and requires a fair bit of
>> refactoring). If somebody has thought about this I'd be happy if you
>> could share you ideas here!
>>
>> simon
>>>
>>> -Yonik
>>> http://www.lucene-eurocon.com - The Lucene/Solr User Conference
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
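
The two-step replay and the replica-side drop rule from the message at the top of this thread can be sketched roughly as below. Everything here is an assumption for illustration: `LogIndexEntry`, `LogReplaySketch` and `StaleUpdateFilterSketch` are hypothetical names, the log is modelled as an in-memory list of (seq id, file offset) pairs, and no actual disk I/O or Lucene call is shown.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical index record: the seq id an operation received and the
// offset of its entry in the log file (entries are written out of order).
final class LogIndexEntry {
    final long seqId;
    final long fileOffset;

    LogIndexEntry(long seqId, long fileOffset) {
        this.seqId = seqId;
        this.fileOffset = fileOffset;
    }
}

final class LogReplaySketch {
    // Step 1: re-sort the (seq id, offset) pairs by seq id.
    // The returned offsets are the order in which step 2 would read the
    // entries from disk, which is no longer a sequential scan.
    static List<Long> replayOffsets(List<LogIndexEntry> entriesInWriteOrder) {
        List<LogIndexEntry> sorted = new ArrayList<>(entriesInWriteOrder);
        sorted.sort(Comparator.comparingLong((LogIndexEntry e) -> e.seqId));
        List<Long> offsets = new ArrayList<>(sorted.size());
        for (LogIndexEntry e : sorted) {
            offsets.add(e.fileOffset);
        }
        return offsets;
    }
}

// Replica side: remember the highest seq id seen per document id in the
// current session and drop anything older.
final class StaleUpdateFilterSketch {
    private final Map<String, Long> latestSeqIdByDoc = new HashMap<>();

    // Returns true if the update should be applied, false if a newer
    // (or equal) seq id for the same document was already seen.
    synchronized boolean shouldApply(String docId, long seqId) {
        Long seen = latestSeqIdByDoc.get(docId);
        if (seen != null && seen >= seqId) {
            return false;
        }
        latestSeqIdByDoc.put(docId, seqId);
        return true;
    }
}
```

The sort in step 1 only touches the small index of ids and offsets, so the cost is paid on replay rather than on the indexing path; the random reads in step 2 are the perf hit mentioned above.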
