tim robertson wrote:
> Can someone please help me with the best way to build up SOLR indexes
> from data held in HBase that will be too large to sit on a single
> machine (100s of millions of rows)?
> I am assuming that in a 20-node Hadoop cluster I should build a 20-shard
> index and use SOLR's distributed search?
You could do your sharding by server, but what happens if an HBase node crashes during your indexing job? The regions that were on server 20 will be redistributed among the remaining 19, and if server 20 comes back, rebalancing may hand it regions other than the ones it originally held.

The natural 'unit' in HBase is the region, so you might shard by region. There are table input formats that split a table by region; they could serve as input to your MapReduce indexing job. See our mapred package -- as an example, it includes a MapReduce job that builds a full-text index of a table's contents.
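
Roughly, the job setup looks like the sketch below (written against the org.apache.hadoop.hbase.mapreduce API; the table name "mytable", the "content:body" column, and the class names are made up for illustration). Each region becomes one input split, so each map task scans exactly one region; the part that actually writes the Lucene/SOLR shard is left out:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BuildShards {

  // TableInputFormat makes one split per region, so each mapper
  // scans exactly one region's rows.
  static class IndexMapper extends TableMapper<ImmutableBytesWritable, Text> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
      // "content:body" is a placeholder column; pull whatever fields
      // you want full-text indexed.
      byte[] body = value.getValue(Bytes.toBytes("content"), Bytes.toBytes("body"));
      if (body != null) {
        context.write(row, new Text(Bytes.toString(body)));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "build-solr-shards");
    job.setJarByClass(BuildShards.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // fewer RPCs on a full-table scan
    scan.setCacheBlocks(false);  // don't churn the block cache from MR

    // Wires up TableInputFormat under the covers: one split per region.
    TableMapReduceUtil.initTableMapperJob(
        "mytable", scan, IndexMapper.class,
        ImmutableBytesWritable.class, Text.class, job);

    // Map-only here; a real job would instead feed a reducer (or a
    // map-side IndexWriter) that writes the per-shard Lucene index.
    job.setNumReduceTasks(0);
    FileOutputFormat.setOutputPath(job, new Path(args[0]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}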

If you wanted to do it by server, study TableInputFormat and organize the splits by region server address.
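
For instance (a rough sketch only; the class name is hypothetical), you could ask the stock TableInputFormat for its per-region splits and bucket them by hosting region server:

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.mapreduce.TableSplit;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;

// Hypothetical helper: bucket the stock per-region splits by the region
// server currently hosting them, so all regions on one server end up in
// the same shard.  The grouping is only as good as the region assignment
// at the moment getSplits() runs.
public class ByServerGrouping extends TableInputFormat {

  public Map<String, List<InputSplit>> splitsByServer(JobContext context)
      throws IOException, InterruptedException {
    Map<String, List<InputSplit>> byServer = new HashMap<String, List<InputSplit>>();
    for (InputSplit split : super.getSplits(context)) {
      String host = ((TableSplit) split).getRegionLocation();
      List<InputSplit> bucket = byServer.get(host);
      if (bucket == null) {
        bucket = new ArrayList<InputSplit>();
        byServer.put(host, bucket);
      }
      bucket.add(split);
    }
    return byServer;
  }
}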

Will your HBase instance be changing while the index job runs?

How do you make a SOLR shard? Is it a special Lucene index format with required fields, or does SOLR not care and will it serve any Lucene index?

> What is the best way to build each shard, please? Use HBase as the input
> source to MapReduce and push into the local node's index in a
> Map/Reduce operation?
Would Katta help (http://katta.wiki.sourceforge.net/)? Invoke it after your MR indexing job finishes to push the shards out to the serving nodes' local disks?


St.Ack
