Eric, Katta doesn't require HDFS which would be slow to search on, though Katta can be used to copy indexes out of HDFS onto local servers. The best bet is hardware that uses SSDs because merges and update latency will greatly decrease and there won't be a synchronous IO issue as there is with hard drives. Also, IO caches get flushed as large merges occur, which means subsequent queries may hit the HD and slow down. With SSDs this is much less of an issue.
Today near realtime search (with or without SSDs) comes at a price, that is reduced indexing speed due to continued in RAM merging. People typically hack something together where indexes are held in a RAMDir until being flushed to disk. The problem with this is, merging in the background becomes really tricky unless it's performed inside of IndexWriter (see LUCENE-1313 and IW.getReader). There is the Zoie system which uses the RAMDir solution, however it's implemented using a customized deleted doc set based on a bloomfilter backed by an inefficient RB tree which slows down queries. There's always a trade off when trying to build an NRT system, currently. Also, there isn't a clear way to replicate segments in realtime so people usually end up analyzing documents on each replicated node, which is redundant. A long term solution here could be a distributed transaction log where encoded segments are stored and replicated to N nodes. Deletes can pile up in segments so the BalancedSegmentMergePolicy could be used to remove those faster than LogMergePolicy, however I haven't tested it, and it may be trying to not do large segment merges altogether which IMO is less than ideal because query performance soon degrades (similar to an unoptimized index). Hopefully in the future we can offer searching over IndexWriter's RAM buffer where indexing and search speed would be roughly what it is today. That combined with a way to insure segments don't get flushed out of the IO cache during large segment merges would mean really efficient NRT, even on systems with HDs. In the interim, you'd need to play around and see what works for your requirements. -J On Thu, Oct 8, 2009 at 7:00 PM, Angel, Eric <[email protected]> wrote: > > Does anyone have any recommendations? I've looked at Katta, but it doesn't > seem to support realtime searching. It also uses hdfs, which I've heard can > be slow. I'm looking to serve 40gb of indexes and support about 1 million > updates per day. > > Thx > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [email protected] > For additional commands, e-mail: [email protected] > > --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
