Re: Lucene 2.9.0 Near Real Time Indexing and Service Crashes/restarts

Jason Rutherglen Tue, 12 Jan 2010 22:15:17 -0800

Jake,

I wonder how often people need reliable transactions for
realtime search? Maybe MySQL's t-log could be used sans the
database part?


Using the created_at column for near realtime seems like it could hurt
the database due to excessive polling. Has anyone tried it yet?

> I wrote up a simple file-based indexing event log in an afternoon

Right, however it's probably a long perilous leap from this to a t-log
that's production ready.
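
For reference, here is a minimal sketch of what such a file-based indexing
event log could look like, assuming each event gets an incrementing version
and the index remembers the last version it flushed. FileEventLog, append and
replaySince are hypothetical names, not Zoie or Lucene API. The sketch does no
fsync, rotation or corruption handling, which is a large part of the gap to a
production-ready t-log.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical append-only indexing event log; not part of Zoie or Lucene. */
public class FileEventLog {
    private final Path logFile;
    private long nextVersion;

    public FileEventLog(Path logFile) throws IOException {
        this.logFile = logFile;
        // Resume numbering after the last event already on disk.
        long last = 0;
        if (Files.exists(logFile)) {
            for (String line : Files.readAllLines(logFile, StandardCharsets.UTF_8)) {
                if (!line.isEmpty()) last = parseVersion(line);
            }
        }
        this.nextVersion = last + 1;
    }

    /** Appends one event and returns the version assigned to it. */
    public synchronized long append(String payload) throws IOException {
        long version = nextVersion++;
        // One event per line: "<version>\t<payload>".
        String line = version + "\t" + payload.replace("\n", " ") + "\n";
        Files.write(logFile, line.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        return version;
    }

    /** Returns the payloads of all events with a version greater than sinceVersion. */
    public synchronized List<String> replaySince(long sinceVersion) throws IOException {
        List<String> events = new ArrayList<>();
        if (!Files.exists(logFile)) return events;
        for (String line : Files.readAllLines(logFile, StandardCharsets.UTF_8)) {
            if (line.isEmpty()) continue;
            if (parseVersion(line) > sinceVersion) {
                events.add(line.substring(line.indexOf('\t') + 1));
            }
        }
        return events;
    }

    private static long parseVersion(String line) {
        return Long.parseLong(line.substring(0, line.indexOf('\t')));
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("index-events", ".log");
        FileEventLog log = new FileEventLog(file);
        log.append("doc1");
        long committed = log.append("doc2"); // pretend the index was flushed here
        log.append("doc3");
        log.append("doc4");
        // After a restart, reopen the log and replay only what the index lacks.
        FileEventLog reopened = new FileEventLog(file);
        System.out.println(reopened.replaySince(committed)); // prints [doc3, doc4]
    }
}
```

The "committed" value plays the role of the index version stored next to the
index: on startup the indexer would ask the log for everything after it.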

I'm waiting for someone to dive in and mess with Bookkeeper
http://wiki.apache.org/hadoop/BookKeeper and report back!

Jason

On Tue, Jan 12, 2010 at 9:49 PM, Jake Mannix <jake.man...@gmail.com> wrote:
> On Tue, Jan 12, 2010 at 8:55 PM, Jason Rutherglen
> <jason.rutherg...@gmail.com> wrote:
>>
>> > Zoie keeps track of an "index version" on disk alongside the Lucene
>> > index which it uses to decide where it must reindex from to "catch up"
>> > if there have been incoming indexing events while the server was out of
>> > commission.
>>
>> This begs for a little more clarity... Sounds like a transaction log.  Oh
>> right, with Zoie there's the assumption of an external transaction log,
>> but it doesn't provide one out of the box?
>
> The index versioning scheme Zoie uses is independent of what mechanism you
> use to implement it.  If your indexing technique is to talk to a database
> directly, you don't need a transaction log; something as simple as a
> "created_at" column will suffice in many situations.  I gave a short talk to
> demo zoie yesterday, and for it I wrote up a simple file-based indexing
> event log in an afternoon.  Similarly if you listen on a JMS queue or
> basically any other message-queue based system that is not "push only", you'll
> have some notion of "replay since [timestamp / version / incrementing
> counter]", but they're all vendor dependent.
> It's not the kind of thing you can just provide out of the box due to this
> vendor dependence.  On the other hand, if someone came along and said they
> wanted to use zoie with RabbitMQ or whatever, we'd certainly accept a patch
> for a StreamDataProvider implementation which does that (and maybe one of
> the zoie committers would even write it themselves if it seemed like a common
> enough use case).
>   -jake
>
>>
>> On Tue, Jan 12, 2010 at 8:43 PM, Jake Mannix <jake.man...@gmail.com>
>> wrote:
>> > On Tue, Jan 12, 2010 at 8:15 PM, Otis Gospodnetic
>> > <otis_gospodne...@yahoo.com> wrote:
>> >>
>> >> John, you should have a look at Zoie.  I just finished adding LinkedIn's
>> >> case study about Zoie to Lucene in Action 2, so this is fresh in my mind.
>> >>
>> >> :)
>> >
>> > Yep, Zoie ( http://zoie.googlecode.com ) will handle the server restart
>> > part, in that while yes, you lose what is in RAM, Zoie keeps track of an
>> > "index version" on disk alongside the Lucene index which it uses to decide
>> > where it must reindex from to "catch up" if there have been incoming
>> > indexing events while the server was out of commission.
>> >
>> > Zoie does not support multiple servers using the same index, because each
>> > zoie instance has IndexWriter instances, and you'll get locking problems
>> > trying to do that.  You could have one Zoie instance effectively as the
>> > "master/writer/realtime reader", and a bunch of raw Lucene "slaves" which
>> > could read off of that index, but as you say, could not get access to the
>> > RAMDirectory information until it was flushed to disk.
>> >
>> > Why do you need a "cluster" of servers hitting the same index?  Are they
>> > different applications (with different search logic, so they need to be
>> > different instances), or is it just to try and utilize your hardware
>> > efficiently?  If it's for performance reasons, you might find you get better
>> > use of your CPU cores by just sharding your one index into smaller ones,
>> > each having their own Zoie instance, and putting a "broker" on top of them
>> > searching across all and mergesorting the results.  Often even this isn't
>> > necessary, because Zoie will be opening the disk-backed IndexReader in
>> > readonly mode, and thus all the synchronized blocks are gone, and one single
>> > Zoie instance will easily saturate your CPU cores by simple multi-threading
>> > by your appserver.
>> >
>> > If you really needed to do many different kinds of writes (from different
>> > applications) and also have applications not involved in the writing also
>> > seeing (in real-time) these writes, then you could still do it with Zoie,
>> > but it would take some interesting architectural juggling (write your own
>> > StreamDataProvider class which takes input from a variety of sources and
>> > merges them together to feed to one Zoie instance, then a broker on top of
>> > zoie which serves out IndexReaders to different applications living on top
>> > which can wrap them up in their own business logic as they saw fit... as
>> > long as it was ok to have all the applications in the same JVM, of course).
>> >   -jake
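
The "broker" idea in the message above (search every shard, then mergesort the
results) can be sketched roughly as follows, assuming each shard returns its
hits already sorted by descending score. ShardBroker, Hit and mergeTopK are
hypothetical names for illustration only, not Zoie or Lucene classes, and the
per-shard search itself is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical broker-side merge of per-shard results; not Zoie or Lucene API. */
public class ShardBroker {
    public static final class Hit {
        public final String docId;
        public final float score;
        public Hit(String docId, float score) { this.docId = docId; this.score = score; }
        @Override public String toString() { return docId + ":" + score; }
    }

    /** Cursor into one shard's result list. */
    private static final class Cursor {
        final List<Hit> hits;
        int pos;
        Cursor(List<Hit> hits) { this.hits = hits; }
        Hit current() { return hits.get(pos); }
    }

    /** Merges hit lists (each sorted by descending score) into the global top k. */
    public static List<Hit> mergeTopK(List<List<Hit>> shardResults, int k) {
        // The queue always exposes the shard whose next unread hit scores highest.
        PriorityQueue<Cursor> queue = new PriorityQueue<>(
                (a, b) -> Float.compare(b.current().score, a.current().score));
        for (List<Hit> hits : shardResults) {
            if (!hits.isEmpty()) queue.add(new Cursor(hits));
        }
        List<Hit> merged = new ArrayList<>();
        while (merged.size() < k && !queue.isEmpty()) {
            Cursor best = queue.poll();
            merged.add(best.current());
            best.pos++;
            // Re-queue the shard only if it still has unread hits.
            if (best.pos < best.hits.size()) queue.add(best);
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Hit> shardA = Arrays.asList(new Hit("a1", 0.9f), new Hit("a2", 0.4f));
        List<Hit> shardB = Arrays.asList(new Hit("b1", 0.7f), new Hit("b2", 0.6f));
        System.out.println(mergeTopK(Arrays.asList(shardA, shardB), 3));
        // prints [a1:0.9, b1:0.7, b2:0.6]
    }
}
```

Comparing raw scores across shards is itself an assumption here; it only makes
sense if the shards score documents in a comparable way.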
>> >
>> >>
>> >>  Otis
>> >> --
>> >> Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
>> >>
>> >>
>> >>
>> >> ----- Original Message ----
>> >> > From: jchang <jchangkihat...@gmail.com>
>> >> > To: java-dev@lucene.apache.org
>> >> > Sent: Tue, January 12, 2010 6:10:56 PM
>> >> > Subject: Lucene 2.9.0 Near Real Time Indexing and Service Crashes/restarts
>> >> >
>> >> >
>> >> > Lucene 2.9.0 has near real time indexing, writing to a RAMDir which gets
>> >> > flushed to disk when you do a search.
>> >> >
>> >> > Does anybody know how this works out with service restarts (both orderly
>> >> > shutdown and a crash)?  If the service goes down while indexed items are
>> >> > in RAMDir but not on disk, are they lost?  Or is there some kind of log
>> >> > recovery?
>> >> >
>> >> > Also, does anybody know the impact of this with clustered Lucene servers?
>> >> > If you have numerous servers running off one index, I assume there is no
>> >> > way for the other services to pick up the newly indexed items until they
>> >> > are flushed to disk, correct?  I'd be happy if that is not so, but I
>> >> > suspect it is so.
>> >> >
>> >> > Thanks,
>> >> > John
>> >> > --
>> >> > View this message in context:
>> >> > http://old.nabble.com/Lucene-2.9.0-Near-Real-Time-Indexing-and-Service-Crashes-restarts-tp27136539p27136539.html
>> >> > Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
>> >> >
>> >> >
>> >> > ---------------------------------------------------------------------
>> >> > To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>> >> > For additional commands, e-mail: java-dev-h...@lucene.apache.org
>> >>
>> >>
>> >>
>> >
>> >
>>
>>
>
>
