Jake, I wonder how often people need reliable transactions for realtime search? Maybe MySQL's t-log could be used sans the database part?
The created_at column for near realtime seems like it could hurt the
database due to excessive polling? Has anyone tried it yet?

> I wrote up a simple file-based indexing event log in an afternoon

Right, however it's probably a long, perilous leap from this to a t-log
that's production ready. I'm waiting for someone to dive in and mess
with BookKeeper http://wiki.apache.org/hadoop/BookKeeper and report
back!

Jason

On Tue, Jan 12, 2010 at 9:49 PM, Jake Mannix <jake.man...@gmail.com> wrote:
> On Tue, Jan 12, 2010 at 8:55 PM, Jason Rutherglen
> <jason.rutherg...@gmail.com> wrote:
>>
>> > Zoie keeps track of an "index version" on disk alongside the Lucene
>> > index which it uses to decide where it must reindex from to "catch up"
>> > if there have been incoming indexing events while the server was out
>> > of commission.
>>
>> This begs a little more clarity... Sounds like a transaction log. Oh
>> right, with Zoie there's the assumption of an external transaction log
>> however it doesn't provide one out of the box?
>
> The index versioning scheme Zoie uses is independent of what mechanism you
> use to implement it. If your indexing technique is to talk to a database
> directly, you don't need a transaction log; something as simple as a
> "created_at" column will suffice in many situations. I gave a short talk to
> demo zoie yesterday, and for it I wrote up a simple file-based indexing
> event log in an afternoon. Similarly, if you listen on a JMS queue or
> basically any other message-queue based system that is not "push only",
> you'll have some notion of "replay since [timestamp / version /
> incrementing counter]", but they're all vendor dependent.
> It's not the kind of thing you can just provide out of the box due to this
> vendor dependence.
> On the other hand, if someone came along and said they
> wanted to use zoie with RabbitMQ or whatever, we'd certainly accept a patch
> for a StreamDataProvider implementation which does that (and maybe one of
> the zoie committers would even write it themselves if it seemed like a
> common enough use case).
> -jake
>
>>
>> On Tue, Jan 12, 2010 at 8:43 PM, Jake Mannix <jake.man...@gmail.com>
>> wrote:
>> > On Tue, Jan 12, 2010 at 8:15 PM, Otis Gospodnetic
>> > <otis_gospodne...@yahoo.com> wrote:
>> >>
>> >> John, you should have a look at Zoie. I just finished adding
>> >> LinkedIn's case study about Zoie to Lucene in Action 2, so this is
>> >> fresh in my mind.
>> >>
>> >> :)
>> >
>> > Yep, Zoie ( http://zoie.googlecode.com ) will handle the server restart
>> > part, in that while yes, you lose what is in RAM, Zoie keeps track of an
>> > "index version" on disk alongside the Lucene index which it uses to
>> > decide where it must reindex from to "catch up" if there have been
>> > incoming indexing events while the server was out of commission.
>> > Zoie does not support multiple servers using the same index, because
>> > each zoie instance has IndexWriter instances, and you'll get locking
>> > problems trying to do that. You could have one Zoie instance effectively
>> > as the "master/writer/realtime reader", and a bunch of raw Lucene
>> > "slaves" which could read off of that index, but as you say, could not
>> > get access to the RAMDirectory information until it was flushed to disk.
>> > Why do you need a "cluster" of servers hitting the same index? Are they
>> > different applications (with different search logic, so they need to be
>> > different instances), or is it just to try and utilize your hardware
>> > efficiently?
>> > If it's for performance reasons, you might find you get better
>> > use of your CPU cores by just sharding your one index into smaller ones,
>> > each having their own Zoie instance, and putting a "broker" on top of
>> > them searching across all and mergesorting the results. Often even this
>> > isn't necessary, because Zoie will be opening the disk-backed
>> > IndexReader in readonly mode, and thus all the synchronized blocks are
>> > gone, and one single Zoie instance will easily saturate your CPU cores
>> > by simple multi-threading by your appserver.
>> > If you really needed to do many different kinds of writes (from
>> > different applications) and also have applications not involved in the
>> > writing also seeing (in real-time) these writes, then you could still do
>> > it with Zoie, but it would take some interesting architectural juggling
>> > (write your own StreamDataProvider class which takes input from a
>> > variety of sources and merges them together to feed to one Zoie
>> > instance, then a broker on top of zoie which serves out IndexReaders to
>> > different applications living on top which can wrap them up in their own
>> > business logic as they saw fit... as long as it was ok to have all the
>> > applications in the same JVM, of course).
>> > -jake
>> >
>> >>
>> >> Otis
>> >> --
>> >> Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
>> >>
>> >>
>> >>
>> >> ----- Original Message ----
>> >> > From: jchang <jchangkihat...@gmail.com>
>> >> > To: java-dev@lucene.apache.org
>> >> > Sent: Tue, January 12, 2010 6:10:56 PM
>> >> > Subject: Lucene 2.9.0 Near Real Time Indexing and Service
>> >> > Crashes/restarts
>> >> >
>> >> >
>> >> > Lucene 2.9.0 has near real time indexing, writing to a RAMDir which
>> >> > gets flushed to disk when you do a search.
>> >> >
>> >> > Does anybody know how this works out with service restarts (both
>> >> > orderly shutdown and a crash)?
>> >> > If the service goes down while indexed items are in RAMDir but not
>> >> > on disk, are they lost? Or is there some kind of log recovery?
>> >> >
>> >> > Also, does anybody know the impact of this with clustered lucene
>> >> > servers? If you have numerous servers running off one index, I
>> >> > assume there is no way for the other services to pick up the newly
>> >> > indexed items until they are flushed to disk, correct? I'd be happy
>> >> > if that is not so, but I suspect it is so.
>> >> >
>> >> > Thanks,
>> >> > John
>> >> > --
>> >> > View this message in context:
>> >> > http://old.nabble.com/Lucene-2.9.0-Near-Real-Time-Indexing-and-Service-Crashes-restarts-tp27136539p27136539.html
>> >> > Sent from the Lucene - Java Developer mailing list archive at
>> >> > Nabble.com.
>> >> >
>> >> > ---------------------------------------------------------------------
>> >> > To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>> >> > For additional commands, e-mail: java-dev-h...@lucene.apache.org
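
For readers following the "index version" and "replay since [version]" idea discussed in this thread, here is a rough sketch of how the pieces might fit together. It is written in Python for brevity and is not Zoie's API: the names FileEventLog, VersionedIndex, replay_since and catch_up, and the file names index.version and index.docs, are all hypothetical. It assumes a single writer, ignores torn writes at the tail of the log, and uses a plain file as a stand-in for the Lucene index.

```python
import json
import os
import tempfile


class FileEventLog:
    """Hypothetical append-only event log: one JSON object per line,
    each tagged with a monotonically increasing version number."""

    def __init__(self, path):
        self.path = path
        # Resume numbering after the last version already in the file.
        self._next_version = self._last_version() + 1

    def _last_version(self):
        last = 0
        if os.path.exists(self.path):
            with open(self.path, encoding="utf-8") as f:
                for line in f:
                    if line.strip():
                        last = json.loads(line)["version"]
        return last

    def append(self, doc):
        version = self._next_version
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps({"version": version, "doc": doc}) + "\n")
            f.flush()
            os.fsync(f.fileno())  # make the event durable before returning
        self._next_version += 1
        return version

    def replay_since(self, version):
        """Yield every event whose version is strictly greater than `version`."""
        if not os.path.exists(self.path):
            return
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue
                event = json.loads(line)
                if event["version"] > version:
                    yield event


class VersionedIndex:
    """Stand-in for an on-disk index that stores its version next to its data."""

    def __init__(self, directory):
        self.version_path = os.path.join(directory, "index.version")
        self.docs_path = os.path.join(directory, "index.docs")

    def committed_version(self):
        if not os.path.exists(self.version_path):
            return 0
        with open(self.version_path, encoding="utf-8") as f:
            return int(f.read().strip() or 0)

    def commit(self, events):
        """Write the documents, then record the highest version applied.
        A crash between the two steps would replay some events again, so a
        real index would need idempotent updates (assumption, not shown)."""
        events = list(events)
        if not events:
            return self.committed_version()
        with open(self.docs_path, "a", encoding="utf-8") as f:
            for event in events:
                f.write(json.dumps(event["doc"]) + "\n")
        tmp = self.version_path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            f.write(str(events[-1]["version"]))
        os.replace(tmp, self.version_path)  # atomic rename of the version file
        return events[-1]["version"]

    def catch_up(self, log):
        """Reindex everything the log received after the committed version."""
        return self.commit(log.replay_since(self.committed_version()))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        log = FileEventLog(os.path.join(d, "events.log"))
        VersionedIndex(d).catch_up(log)          # nothing to do yet
        log.append({"id": 1, "text": "first"})
        VersionedIndex(d).catch_up(log)          # index is at version 1
        log.append({"id": 2, "text": "second"})  # arrives while the server is down
        log.append({"id": 3, "text": "third"})
        restarted = VersionedIndex(d)            # simulated restart
        print(restarted.catch_up(log))           # prints 3
```

The same shape applies if the log is a database table polled on a "created_at" column or a replayable message queue: only replay_since changes, which is the vendor-dependent part mentioned above.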