Re: Realtime & distributed

John Wang Sun, 11 Oct 2009 14:33:13 -0700

Eric:

   For more specific Zoie questions, let's move them to the Zoie discussion
group instead.


Thanks

-John

On Sun, Oct 11, 2009 at 2:31 PM, John Wang <john.w...@gmail.com> wrote:

> Hi Eric:
>
> I regret the direction the thread has taken and partly take responsibility
> for it...
>
> As to your question:
>
> We have 2 nodes per commodity server, each holding 5 million docs (although
> given the numbers we are seeing, we think we were a bit too conservative,
> and may increase to 10 million). Each partition indexes in realtime. We have
> about 12 partitions in total, so 6 machines, with about 4 - 5 replicas.
>
> RamDir only holds the transient index; once it is flushed to the disk index,
> RamDir is emptied. So yes, it is the second part of your question. The trick
> is the synchronization logic as well as the handling of deletes and updates
> between the ram and disk indexes.
>
> I am not sure I can disclose what HW we are using, but Zoie is designed to
> run on commodity HW.
>
> I think it is always a good idea to archive your data. With our setup,
> each replica holds its own copy of the index, so there is already
> redundancy. Having a set of offline nodes doing just indexing is
> therefore not necessary.
>
> Yes, we are working hard to make Zoie compatible with Lucene 2.9. As Jake
> has previously mentioned, Lucene 2.9 has changed a lot internally
>
> (I personally think these changes are awesome and really allow
> applications the flexibility to unleash the power of Lucene. Plus these
> changes are very performance oriented for incremental indexing, which is
> important to us. Much kudos to the Lucene team and the contributors)
>
> so to fully take advantage of this work while maintaining backward
> compatibility is not a trivial project.
>
> Expect to see another maintenance release of zoie before 2.9. We hope to
> have 2.9 work done soon, but in terms of timing, lucene 3.0 (particularly
> looking forward to custom indexing) is also coming out, so we are deciding
> whether to wait and include 3.0.
>
> Hope this helps.
>
> -John
>
>
> On Sun, Oct 11, 2009 at 1:51 PM, Angel, Eric <ean...@business.com> wrote:
>
>> Man, this thread really went south.  Anyhow, I have a few questions about
>> Zoie:
>>
>> * How many nodes are you using to support the speeds you desire at LI?
>> * Am I wrong to assume that the RAMDir holds the entire index - just as
>> the FSDir?  Or does RAMDir only hold a portion of the index that hasn't yet
>> been flushed to disk?
>> * Katta is supposed to be able to run on commodity hardware -
>> is that the same case for Zoie?
>> * Would you agree that it's a good idea to build an "offline" index
>> parallel to the online index in case there is a crash on the online index
>> and data is lost?
>> * I see that there are plans to have Zoie use Lucene 2.9.  How long would
>> you say before it's available?
>>
>> Thanks,
>>
>> E
>>
>> -----Original Message-----
>> From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com]
>> Sent: Sat 10/10/2009 12:16 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: Realtime & distributed
>>
>> John,
>>
>> Actually everyone is entitled to their technical opinion and
>> none of the comments were misleading. Jake and yourself
>> validated that they are true in your comments. I'm simply trying
>> to create better technology as is everyone on here. The process
>> takes time and coordination between many parties of many
>> backgrounds around the globe. Sometimes there are differences of
>> opinion, however those are easily ironed out over time (and quite
>> frankly in this case benchmarks).
>>
>> However I am very concerned about your ignorant disregard of some of the
>> most basic human rights in existence.
>>
>> -J
>>
>> On Thu, Oct 8, 2009 at 10:26 PM, John Wang <john.w...@gmail.com> wrote:
>> > Jason:
>> >        I would really appreciate it if you would stop making false
>> > statements and spreading misinformation. Everyone is entitled to
>> > his/her opinions on technologies, but deliberately posting misleading
>> > and false information on such a distribution list is just unethical,
>> > and you'll end up just discrediting yourself.
>> >
>> >        Making unsubstantiated comments while not willing to put in any
>> > effort is the primary reason you are no longer working at Linkedin and
>> > on Zoie.
>> >
>> > "The problem
>> > with this is, merging in the background becomes really tricky
>> > unless it's performed inside of IndexWriter" - *what does this really
>> > mean? Merging happens regardless in an incremental indexing system.
>> > Especially with high indexing load, segments are created often, so
>> > merging is crucial.*
>> > "There is the Zoie system which uses the RAMDir
>> > solution, however it's implemented using a customized deleted
>> > doc set based on a bloomfilter backed by an inefficient RB tree
>> > which slows down queries" - *if you had ever spent the time to read
>> > the code (even when you were working on it), you would see it is just
>> > not true. We did have an RB set for deleted docs, quite a few releases
>> > ago, and we changed to a special type of bloomfilter set backed by a
>> > hash int set. You knew this and were part of the discussion on it, and
>> > now saying such a thing is just plain disappointing.*
>> >
>> >        Thanks Jake for the clarification, and Eric, let me know if you
>> > want to know more in detail about how we are dealing with realtime
>> > indexing/search with Zoie here at linkedin in a production environment
>> > powering a real internet company with real traffic.
>> >
>> > -John
>> >
>> > On Thu, Oct 8, 2009 at 7:56 PM, Jason Rutherglen
>> > <jason.rutherg...@gmail.com> wrote:
>> >
>> >> Eric,
>> >>
>> >> Katta doesn't require HDFS which would be slow to search on,
>> >> though Katta can be used to copy indexes out of HDFS onto local
>> >> servers. The best bet is hardware that uses SSDs because merges
>> >> and update latency will greatly decrease and there won't be a
>> >> synchronous IO issue as there is with hard drives. Also, IO
>> >> caches get flushed as large merges occur, which means subsequent
>> >> queries may hit the HD and slow down. With SSDs this is much
>> >> less of an issue.
>> >>
>> >> Today near realtime search (with or without SSDs) comes at a
>> >> price, that is reduced indexing speed due to continued in-RAM
>> >> merging. People typically hack something together where indexes
>> >> are held in a RAMDir until being flushed to disk. The problem
>> >> with this is, merging in the background becomes really tricky
>> >> unless it's performed inside of IndexWriter (see LUCENE-1313 and
>> >> IW.getReader). There is the Zoie system which uses the RAMDir
>> >> solution, however it's implemented using a customized deleted
>> >> doc set based on a bloomfilter backed by an inefficient RB tree
>> >> which slows down queries. There's always a trade off when trying
>> >> to build an NRT system, currently.
>> >>
>> >> Also, there isn't a clear way to replicate segments in realtime
>> >> so people usually end up analyzing documents on each replicated
>> >> node, which is redundant. A long term solution here could be a
>> >> distributed transaction log where encoded segments are stored
>> >> and replicated to N nodes.
>> >>
>> >> Deletes can pile up in segments so the
>> >> BalancedSegmentMergePolicy could be used to remove those faster
>> >> than LogMergePolicy, however I haven't tested it, and it may be
>> >> trying to not do large segment merges altogether which IMO
>> >> is less than ideal because query performance soon degrades
>> >> (similar to an unoptimized index).
>> >>
>> >> Hopefully in the future we can offer searching over
>> >> IndexWriter's RAM buffer where indexing and search speed would
>> >> be roughly what it is today. That combined with a way to ensure
>> >> segments don't get flushed out of the IO cache during large
>> >> segment merges would mean really efficient NRT, even on systems
>> >> with HDs. In the interim, you'd need to play around and see what
>> >> works for your requirements.
>> >>
>> >> -J
>> >>
>> >> On Thu, Oct 8, 2009 at 7:00 PM, Angel, Eric <ean...@business.com> wrote:
>> >> >
>> >> > Does anyone have any recommendations?  I've looked at Katta, but it
>> >> > doesn't seem to support realtime searching.  It also uses HDFS, which
>> >> > I've heard can be slow.  I'm looking to serve 40 GB of indexes and
>> >> > support about 1 million updates per day.
>> >> >
>> >> >
>> >> > Thx
>> >> >
>> >> > ---------------------------------------------------------------------
>> >> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> >> > For additional commands, e-mail: java-user-h...@lucene.apache.org
>> >> >
>> >> >
>> >>
>> >
>>
>
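
The RamDir behaviour John describes in the quoted reply (a transient RAM index that is emptied on flush, with deletes and updates coordinated between the RAM and disk indexes) can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Zoie's or Lucene's actual code: the class TwoTierIndex and all of its methods are hypothetical, plain maps stand in for the RAM and disk indexes, and a single read/write lock stands in for the synchronization logic.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical toy model: a transient RAM tier in front of a "disk" tier. */
final class TwoTierIndex {
    private final Map<Integer, String> ram = new HashMap<>();   // docs not yet flushed
    private final Map<Integer, String> disk = new HashMap<>();  // stand-in for the disk index
    private final Set<Integer> maskedOnDisk = new HashSet<>();  // disk docs deleted or superseded
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    /** Add or update: the new version is visible from RAM at once; a disk copy is masked. */
    void index(int uid, String text) {
        lock.writeLock().lock();
        try {
            if (disk.containsKey(uid)) maskedOnDisk.add(uid);
            ram.put(uid, text);
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Delete: drop any RAM copy and mask any disk copy until the next flush. */
    void delete(int uid) {
        lock.writeLock().lock();
        try {
            ram.remove(uid);
            if (disk.containsKey(uid)) maskedOnDisk.add(uid);
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Flush: apply pending deletes to disk, move RAM docs over, then empty RAM. */
    void flush() {
        lock.writeLock().lock();
        try {
            disk.keySet().removeAll(maskedOnDisk);
            disk.putAll(ram);
            maskedOnDisk.clear();
            ram.clear();
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Lookup: RAM wins over disk, and masked disk docs are invisible. */
    String get(int uid) {
        lock.readLock().lock();
        try {
            String fromRam = ram.get(uid);
            if (fromRam != null) return fromRam;
            return maskedOnDisk.contains(uid) ? null : disk.get(uid);
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Number of docs currently held only in the transient RAM tier. */
    int ramSize() {
        lock.readLock().lock();
        try {
            return ram.size();
        } finally {
            lock.readLock().unlock();
        }
    }

    public static void main(String[] args) {
        TwoTierIndex idx = new TwoTierIndex();
        idx.index(1, "first version");
        idx.flush();                       // doc 1 now lives on "disk" only
        idx.index(1, "second version");    // update: RAM copy shadows the disk copy
        System.out.println(idx.get(1));
        idx.delete(1);                     // removes RAM copy, masks disk copy
        System.out.println(idx.get(1));
    }
}
```

The point of the sketch is the masking set: an update or delete that arrives while the old version is already on disk must hide that version from searches immediately, and the flush is the moment at which the hidden versions are physically dropped and the RAM tier is emptied.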
