Tim Robertson wrote:
Can someone please help me with the best way to build SOLR indexes
from data held in HBase that is too large to sit on a single machine
(hundreds of millions of rows)?
I am assuming that on a 20-node Hadoop cluster I should build a
20-shard index and use SOLR's distributed search?
You could do your sharding by server, but what happens if an HBase node
crashes during your indexing job? The regions that were on server 20
will be distributed among the remaining 19. If server 20 comes back,
rebalancing may put regions other than its original ones on it.
The natural 'unit' in HBase is the region, so you might shard by region
instead. There are table input formats that split a table by region;
they could serve as input to your MapReduce indexing job (rough sketch
below). See our mapred package: there is a MapReduce job in there that
builds a full-text index of a table's contents, as an example.
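FYI, the rough shape of such a job, written against the newer
org.apache.hadoop.hbase.mapreduce API rather than the mapred package
above, would be something like the below. This is an untested sketch;
the table name, column names, shard count and output path are all
made up:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IndexTableJob {

  // One map task per region (the table input format makes one split per
  // region).  Emits (shard id, row content); a reducer per shard would
  // feed its rows into a Lucene/SOLR index.
  static class IndexMapper extends TableMapper<IntWritable, Text> {
    static final int NUM_SHARDS = 20;  // made-up shard count

    @Override
    protected void map(ImmutableBytesWritable row, Result values, Context ctx)
        throws IOException, InterruptedException {
      // Hash the row key into a shard.  A region-based scheme would instead
      // derive the shard from the split this map task is reading.
      int shard = (Bytes.toString(row.get()).hashCode() & Integer.MAX_VALUE)
          % NUM_SHARDS;
      byte[] value = values.getValue(Bytes.toBytes("content"), Bytes.toBytes("text"));
      if (value != null) {
        ctx.write(new IntWritable(shard), new Text(Bytes.toString(value)));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "index-mytable");
    job.setJarByClass(IndexTableJob.class);
    Scan scan = new Scan();
    scan.setCaching(500);        // bigger scanner caching for MR scans
    scan.setCacheBlocks(false);  // don't churn the region server block cache
    TableMapReduceUtil.initTableMapperJob("mytable", scan, IndexMapper.class,
        IntWritable.class, Text.class, job);
    job.setNumReduceTasks(IndexMapper.NUM_SHARDS);
    FileOutputFormat.setOutputPath(job, new Path("/tmp/index-out"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Each reduce task then sees all the rows for one shard; the last sketch
near the end of this mail shows what it might do with them.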
If you wanted to do it by server, study TableInputFormat and organize
the splits by region server address (second sketch below).
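For example (again an untested sketch against the newer
org.apache.hadoop.hbase.mapreduce classes, assuming a running cluster
and a made-up table name), you can ask TableInputFormat for its splits
and bucket them by the hosting region server:

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.mapreduce.TableSplit;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;

public class SplitsByServer {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    conf.set(TableInputFormat.INPUT_TABLE, "mytable");  // made-up table name
    Job job = Job.getInstance(conf);
    TableInputFormat tif = new TableInputFormat();
    tif.setConf(job.getConfiguration());
    // One split per region; each TableSplit knows the server hosting it.
    Map<String, Integer> regionsPerServer = new HashMap<String, Integer>();
    for (InputSplit split : tif.getSplits(job)) {
      String server = ((TableSplit) split).getRegionLocation();
      Integer count = regionsPerServer.get(server);
      regionsPerServer.put(server, count == null ? 1 : count + 1);
    }
    // regionsPerServer now maps region server address -> region count;
    // a server-sharded job would group its splits by these addresses.
    System.out.println(regionsPerServer);
  }
}

But note the caveat above: the bucketing is only as stable as region
assignment is while the job runs.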
Will your HBase instance be changing while the index job runs?
How do you make a SOLR shard? Is it a special Lucene index format with
required fields, or does SOLR not care and will it serve any Lucene index?
What is the best way to build each shard, please? Use HBase as the
input source to MapReduce and push into the local node's index in the
map/reduce operation?
Would katta help, http://katta.wiki.sourceforge.net/? Invoke it after
your MR indexing job finishes to push the shards out to the serving
nodes' local disks?
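As for building a single shard, each reduce task (one per shard) would
do roughly the below with plain Lucene against the rows it receives,
writing to its node's local disk; katta, or a plain copy out to the
SOLR hosts, would take it from there. This is an untested sketch
against a modern Lucene; field names and the output path are made up,
and whether SOLR will serve such an index as-is goes back to my
question above about what a SOLR shard needs to contain.

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class ShardBuilder {
  public static void main(String[] args) throws Exception {
    // One index directory per shard on the node's local disk.
    IndexWriter writer = new IndexWriter(
        FSDirectory.open(Paths.get("/data/index/shard-00")),
        new IndexWriterConfig(new StandardAnalyzer()));
    // For each (row key, content) pair routed to this shard...
    Document doc = new Document();
    doc.add(new StringField("id", "rowkey-0001", Field.Store.YES));
    doc.add(new TextField("content", "text pulled out of the hbase row",
        Field.Store.YES));
    writer.addDocument(doc);
    writer.close();
  }
}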
St.Ack