Man, this thread really went south. Anyhow, I have a few questions about Zoie:
* How many nodes are you using to support the speeds you desire at LI?
* Am I wrong to assume that the RAMDir holds the entire index, just as the FSDir does? Or does the RAMDir only hold the portion of the index that hasn't yet been flushed to disk?
* Katta is supposed to be able to run on commodity hardware - is that also the case for Zoie?
* Would you agree that it's a good idea to build an "offline" index in parallel to the online index, in case the online index crashes and data is lost?
* I see that there are plans to have Zoie use Lucene 2.9. How long would you say before that's available?

Thanks,
E

-----Original Message-----
From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com]
Sent: Sat 10/10/2009 12:16 PM
To: java-user@lucene.apache.org
Subject: Re: Realtime & distributed

John,

Actually, everyone is entitled to their technical opinion, and none of the comments were misleading - Jake and you validated in your own comments that they are true. I'm simply trying to create better technology, as is everyone on here. The process takes time and coordination between many parties of many backgrounds around the globe. Sometimes there are differences of opinion, but those are easily ironed out over time (and, quite frankly, in this case by benchmarks). However, I am very concerned about your ignorant disregard of some of the most basic human rights in existence.

-J

On Thu, Oct 8, 2009 at 10:26 PM, John Wang <john.w...@gmail.com> wrote:
> Jason:
>
> I would really appreciate it if you would stop making false statements and spreading misinformation. Everyone is entitled to his/her opinions on technologies, but deliberately spreading misleading and false information on such a distribution list is just unethical, and you'll only end up discrediting yourself.
>
> Making unsubstantiated comments while being unwilling to put in any effort is the primary reason you are no longer working at LinkedIn and on Zoie.
>
> "The problem with this is, merging in the background becomes really tricky unless it's performed inside of IndexWriter" - *what does this really mean? Merging happens regardless in an incremental indexing system. Especially under a high indexing load, segments are created often and merging is crucial.*
>
> "There is the Zoie system which uses the RAMDir solution, however it's implemented using a customized deleted doc set based on a bloomfilter backed by an inefficient RB tree which slows down queries" - *if you had ever spent the time to read the code (even when you were working on it), you would know this is just not true. We did have an RB set for deleted docs quite a few releases ago, but we changed it to a special kind of bloomfilter set backed by a hash int set. You knew this and were part of the discussion on it, so saying such a thing now is just plain disappointing.*
>
> Thanks, Jake, for the clarification. And Eric, let me know if you want more detail on how we are dealing with realtime indexing/search with Zoie here at LinkedIn, in a production environment powering a real internet company with real traffic.
>
> -John
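(For anyone trying to picture the "bloomfilter set backed by a hash int set" John describes: below is a minimal conceptual sketch of that general technique in Java. It assumes nothing about Zoie's actual code -- the BloomBackedDeletedSet class, the hash mixers, and the sizing heuristic are all invented for illustration.)

import java.util.HashSet;
import java.util.Set;

// Conceptual sketch only -- NOT Zoie's actual implementation. The class name,
// hash mixers, and sizing heuristic are invented for illustration.
public class BloomBackedDeletedSet {

  private final long[] bits;              // Bloom filter bit array
  private final int numBits;              // power of two, ~10-20 bits per expected delete
  private final Set<Integer> exact = new HashSet<Integer>();  // exact membership

  public BloomBackedDeletedSet(int expectedDeletes) {
    // Round ~10 bits per expected delete up to the next power of two.
    this.numBits = Integer.highestOneBit(Math.max(64, expectedDeletes * 10)) << 1;
    this.bits = new long[numBits >>> 6];
  }

  public void delete(int docId) {
    setBit(hash1(docId) % numBits);
    setBit(hash2(docId) % numBits);
    exact.add(docId);
  }

  public boolean isDeleted(int docId) {
    // Fast negative: a Bloom filter has no false negatives, so if either bit
    // is clear the doc was never deleted and the hash set is never touched.
    if (!getBit(hash1(docId) % numBits) || !getBit(hash2(docId) % numBits)) {
      return false;
    }
    // Possible false positive -- confirm against the exact set.
    return exact.contains(docId);
  }

  private void setBit(int i)    { bits[i >>> 6] |= 1L << (i & 63); }
  private boolean getBit(int i) { return (bits[i >>> 6] & (1L << (i & 63))) != 0; }

  // Simple integer hash mixers, kept non-negative for the modulo above.
  private int hash1(int v) { v *= 0x9E3779B1; return (v ^ (v >>> 16)) & 0x7FFFFFFF; }
  private int hash2(int v) { v *= 0x85EBCA6B; return (v ^ (v >>> 13)) & 0x7FFFFFFF; }
}

The point of the design is that the common "not deleted" case is settled by a couple of bit probes, and the exact hash set is only consulted when the filter reports a possible hit.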
> On Thu, Oct 8, 2009 at 7:56 PM, Jason Rutherglen <jason.rutherg...@gmail.com> wrote:
>> Eric,
>>
>> Katta doesn't require HDFS, which would be slow to search on, though Katta can be used to copy indexes out of HDFS onto local servers. The best bet is hardware that uses SSDs, because merge and update latency will decrease greatly and there won't be the synchronous IO issue you get with hard drives. Also, IO caches get flushed as large merges occur, which means subsequent queries may hit the HD and slow down. With SSDs this is much less of an issue.
>>
>> Today, near-realtime search (with or without SSDs) comes at a price: reduced indexing speed due to continual in-RAM merging. People typically hack something together where indexes are held in a RAMDir until being flushed to disk. The problem with this is that merging in the background becomes really tricky unless it's performed inside of IndexWriter (see LUCENE-1313 and IW.getReader). There is the Zoie system, which uses the RAMDir solution; however, it's implemented using a customized deleted-doc set based on a bloomfilter backed by an inefficient RB tree, which slows down queries. Currently there's always a trade-off when trying to build an NRT system.
>>
>> Also, there isn't a clear way to replicate segments in realtime, so people usually end up analyzing documents on each replicated node, which is redundant. A long-term solution here could be a distributed transaction log where encoded segments are stored and replicated to N nodes.
>>
>> Deletes can pile up in segments, so the BalancedSegmentMergePolicy could be used to remove them faster than LogMergePolicy. However, I haven't tested it, and it may avoid large segment merges altogether, which IMO is less than ideal because query performance soon degrades (similar to an unoptimized index).
>>
>> Hopefully in the future we can offer searching over IndexWriter's RAM buffer, where indexing and search speed would be roughly what they are today. That, combined with a way to ensure segments don't get flushed out of the IO cache during large segment merges, would mean really efficient NRT, even on systems with HDs. In the interim, you'd need to play around and see what works for your requirements.
>>
>> -J
>>
>> On Thu, Oct 8, 2009 at 7:00 PM, Angel, Eric <ean...@business.com> wrote:
>> > Does anyone have any recommendations? I've looked at Katta, but it doesn't seem to support realtime searching. It also uses HDFS, which I've heard can be slow. I'm looking to serve 40 GB of indexes and support about 1 million updates per day.
>> >
>> > Thx
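(Since IW.getReader and Lucene 2.9 both come up above: below is a minimal, self-contained sketch of near-realtime search with the Lucene 2.9 API -- IndexWriter.getReader() to search buffered documents, IndexReader.reopen() to refresh. The NrtSketch class, the /tmp/nrt-index path, and the field names are made up for illustration; treat this as an API sketch, not a production recipe.)

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class NrtSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical index location; any Directory implementation works here.
    Directory dir = FSDirectory.open(new File("/tmp/nrt-index"));
    IndexWriter writer = new IndexWriter(dir,
        new StandardAnalyzer(Version.LUCENE_29),
        IndexWriter.MaxFieldLength.UNLIMITED);

    Document doc = new Document();
    doc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.add(new Field("body", "realtime search example",
        Field.Store.NO, Field.Index.ANALYZED));
    writer.addDocument(doc);

    // NRT reader: sees the document above even though nothing has been committed yet.
    IndexReader reader = writer.getReader();
    IndexSearcher searcher = new IndexSearcher(reader);
    TopDocs hits = searcher.search(new TermQuery(new Term("body", "realtime")), 10);
    System.out.println("hits: " + hits.totalHits);
    searcher.close();

    // After further updates, refresh cheaply with reopen() instead of opening a new reader.
    writer.updateDocument(new Term("id", "1"), doc);
    IndexReader newReader = reader.reopen();
    if (newReader != reader) {
      reader.close();
      reader = newReader;
    }

    reader.close();
    writer.close();
    dir.close();
  }
}

The reader from getReader() sees documents still in the writer's buffer without a commit, which is what makes it "near" realtime; reopen() is the cheap way to pick up later updates rather than opening a fresh reader from the Directory.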