Re: Lucene 2.9.0 Near Real Time Indexing and Service Crashes/restarts

Jake Mannix Tue, 12 Jan 2010 21:50:06 -0800

On Tue, Jan 12, 2010 at 8:55 PM, Jason Rutherglen <
[email protected]> wrote:


> > Zoie keeps track of an "index version" on disk alongside the Lucene index
> > which it uses to decide where it must reindex from to "catch up" if there
> > have been incoming indexing events while the server was out of commission.
>
> This begs a little more clarity... Sounds like a transaction log.  Oh
> right, with Zoie there's the assumption of an external transaction log
> however it doesn't provide one out of the box?
>

The index versioning scheme Zoie uses is independent of what mechanism you
use to implement it.  If your indexing technique is to talk to a database
directly, you don't need a transaction log; something as simple as a
"created_at" column will suffice in many situations.  I gave a short talk to
demo Zoie yesterday, and for it I wrote up a simple file-based indexing
event log in an afternoon.  Similarly, if you listen on a JMS queue or
basically any other message-queue based system that is not "push only",
you'll have some notion of "replay since [timestamp / version / incrementing
counter]", but they're all vendor dependent.
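
To make the "replay since [version]" idea concrete, here is a minimal sketch
of the catch-up step on restart.  It is not Zoie code: the class, the method
names, the in-memory event list and the plain-text version file are all
assumptions made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names come from Zoie.
public class CatchUpSketch {

    // One indexing event, tagged with a monotonically increasing version.
    static final class Event {
        final long version;
        final String payload;

        Event(long version, String payload) {
            this.version = version;
            this.payload = payload;
        }
    }

    // Reads the last persisted version; 0 means "nothing indexed yet".
    static long readVersion(Path versionFile) throws IOException {
        if (!Files.exists(versionFile)) {
            return 0L;
        }
        String text = new String(Files.readAllBytes(versionFile), StandardCharsets.UTF_8).trim();
        return text.isEmpty() ? 0L : Long.parseLong(text);
    }

    // Persists the version next to the index (assumed to be a small text file).
    static void writeVersion(Path versionFile, long version) throws IOException {
        Files.write(versionFile, Long.toString(version).getBytes(StandardCharsets.UTF_8));
    }

    // Returns the events newer than the given version, in log order.
    static List<Event> eventsSince(List<Event> log, long version) {
        List<Event> newer = new ArrayList<>();
        for (Event e : log) {
            if (e.version > version) {
                newer.add(e);
            }
        }
        return newer;
    }

    // Replays every event the index has not seen, then records the new version.
    static List<String> catchUp(List<Event> log, Path versionFile) throws IOException {
        long last = readVersion(versionFile);
        List<String> reindexed = new ArrayList<>();
        for (Event e : eventsSince(log, last)) {
            reindexed.add(e.payload); // a real system would index the document here
            last = e.version;
        }
        writeVersion(versionFile, last);
        return reindexed;
    }
}
```

In a real deployment the version would have to be written together with the
index commit, so that a crash between the two cannot leave them out of step;
the sketch ignores that.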

It's not the kind of thing you can just provide out of the box due to this
vendor dependence.  On the other hand, if someone came along and said they
wanted to use Zoie with RabbitMQ or whatever, we'd certainly accept a patch
for a StreamDataProvider implementation which does that (and maybe one of
the Zoie committers would even write it themselves if it seemed like a common
enough use case).
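
One way to picture the vendor dependence is as a small interface that each
queue or database adapter would implement in its own way.  The sketch below
is only an illustration: ReplayableSource and InMemorySource are hypothetical
names, and this is not the StreamDataProvider API.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class ReplaySourceSketch {

    // Hypothetical abstraction, not Zoie's StreamDataProvider API.
    // A vendor-specific adapter (JMS, a database table, a log file) would
    // implement this however its backend supports replay.
    interface ReplayableSource<T> {
        // Returns the events recorded after the given version, oldest first.
        Iterator<T> replaySince(long version);
    }

    // In-memory stand-in for a real adapter; the version of an event is
    // simply its 1-based position in the list.
    static final class InMemorySource implements ReplayableSource<String> {
        private final List<String> events = new ArrayList<>();

        // Appends an event and returns the version assigned to it.
        long append(String event) {
            events.add(event);
            return events.size();
        }

        @Override
        public Iterator<String> replaySince(long version) {
            // Clamp so that negative or too-large versions are handled safely.
            int from = (int) Math.max(0, Math.min(version, events.size()));
            return new ArrayList<>(events.subList(from, events.size())).iterator();
        }
    }
}
```

Only the adapter knows whether "version" means a timestamp, a sequence
number or an offset; the consumer just stores the last value it has seen.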

  -jake


>
> On Tue, Jan 12, 2010 at 8:43 PM, Jake Mannix <[email protected]>
> wrote:
> > On Tue, Jan 12, 2010 at 8:15 PM, Otis Gospodnetic
> > <[email protected]> wrote:
> >>
> >> John, you should have a look at Zoie.  I just finished adding LinkedIn's
> >> case study about Zoie to Lucene in Action 2, so this is fresh in my mind.
> >>
> >> :)
> >
> > Yep, Zoie ( http://zoie.googlecode.com ) will handle the server restart
> > part, in that while yes, you lose what is in RAM, Zoie keeps track of an
> > "index version" on disk alongside the Lucene index which it uses to decide
> > where it must reindex from to "catch up" if there have been incoming
> > indexing events while the server was out of commission.
> >
> > Zoie does not support multiple servers using the same index, because each
> > Zoie instance has IndexWriter instances, and you'll get locking problems
> > trying to do that.  You could have one Zoie instance effectively as the
> > "master/writer/realtime reader", and a bunch of raw Lucene "slaves" which
> > could read off of that index, but as you say, could not get access to the
> > RAMDirectory information until it was flushed to disk.
> >
> > Why do you need a "cluster" of servers hitting the same index?  Are they
> > different applications (with different search logic, so they need to be
> > different instances), or is it just to try and utilize your hardware
> > efficiently?  If it's for performance reasons, you might find you get
> > better use of your CPU cores by just sharding your one index into smaller
> > ones, each having their own Zoie instance, and putting a "broker" on top
> > of them searching across all and mergesorting the results.  Often even
> > this isn't necessary, because Zoie will be opening the disk-backed
> > IndexReader in readonly mode, and thus all the synchronized blocks are
> > gone, and one single Zoie instance will easily saturate your CPU cores by
> > simple multi-threading by your appserver.
> >
> > If you really needed to do many different kinds of writes (from different
> > applications) and also have applications not involved in the writing also
> > seeing (in real-time) these writes, then you could still do it with Zoie,
> > but it would take some interesting architectural juggling (write your own
> > StreamDataProvider class which takes input from a variety of sources and
> > merges them together to feed to one Zoie instance, then a broker on top of
> > Zoie which serves out IndexReaders to different applications living on top
> > which can wrap them up in their own business logic as they saw fit... as
> > long as it was ok to have all the applications in the same JVM, of
> > course).
> >   -jake
> >
> >>
> >>  Otis
> >> --
> >> Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
> >>
> >>
> >>
> >> ----- Original Message ----
> >> > From: jchang <[email protected]>
> >> > To: [email protected]
> >> > Sent: Tue, January 12, 2010 6:10:56 PM
> >> > Subject: Lucene 2.9.0 Near Real Time Indexing and Service
> >> > Crashes/restarts
> >> >
> >> >
> >> > Lucene 2.9.0 has near real time indexing, writing to a RAMDir which gets
> >> > flushed to disk when you do a search.
> >> >
> >> > Does anybody know how this works out with service restarts (both orderly
> >> > shutdown and a crash)?  If the service goes down while indexed items are
> >> > in RAMDir but not on disk, are they lost?  Or is there some kind of log
> >> > recovery?
> >> >
> >> > Also, does anybody know the impact of this with clustered Lucene
> >> > servers?  If you have numerous servers running off one index, I assume
> >> > there is no way for the other services to pick up the newly indexed items
> >> > until they are flushed to disk, correct?  I'd be happy if that is not so,
> >> > but I suspect it is so.
> >> >
> >> > Thanks,
> >> > John
> >> > --
> >> > View this message in context:
> >> >
> >> >
> http://old.nabble.com/Lucene-2.9.0-Near-Real-Time-Indexing-and-Service-Crashes-restarts-tp27136539p27136539.html
> >> > Sent from the Lucene - Java Developer mailing list archive at
> >> > Nabble.com.
> >> >
> >> >
> >> > ---------------------------------------------------------------------
> >> > To unsubscribe, e-mail: [email protected]
> >> > For additional commands, e-mail: [email protected]
> >>
> >>
> >>
> >
> >
>
>
>
