Tim Robertson wrote:
Can someone please help me with the best way to build SOLR indexes
from data held in HBase that is too large to sit on a single machine
(hundreds of millions of rows)?
I am assuming that on a 20-node Hadoop cluster I should build a
20-shard index and use SOLR's distributed search?
You could do your sharding by server, but what happens if an HBase node
crashes during your indexing job? The regions that were on server 20
will be distributed among the remaining 19. If server 20 comes back,
rebalancing may put regions other than its original ones on it.
The natural 'unit' in HBase is the region, so you might shard by region
instead. There are table input formats that split a table by region;
they could serve as input to your MapReduce indexing job (rough sketch
below). See our mapred package: there is a MapReduce job in there that
builds a full-text index of a table's contents, as an example.
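FYI, the rough shape of such a job, written against the newer
org.apache.hadoop.hbase.mapreduce API rather than the mapred package
above, would be something like the below. This is an untested sketch;
the table name, column names, shard count and output path are all
made up:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IndexTableJob {

  // One map task per region (the table input format makes one split per
  // region).  Emits (shard id, row content); a reducer per shard would
  // feed its rows into a Lucene/SOLR index.
  static class IndexMapper extends TableMapper<IntWritable, Text> {
    static final int NUM_SHARDS = 20;  // made-up shard count

    @Override
    protected void map(ImmutableBytesWritable row, Result values, Context ctx)
        throws IOException, InterruptedException {
      // Hash the row key into a shard.  A region-based scheme would instead
      // derive the shard from the split this map task is reading.
      int shard = (Bytes.toString(row.get()).hashCode() & Integer.MAX_VALUE)
          % NUM_SHARDS;
      byte[] value = values.getValue(Bytes.toBytes("content"), Bytes.toBytes("text"));
      if (value != null) {
        ctx.write(new IntWritable(shard), new Text(Bytes.toString(value)));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "index-mytable");
    job.setJarByClass(IndexTableJob.class);
    Scan scan = new Scan();
    scan.setCaching(500);        // bigger scanner caching for MR scans
    scan.setCacheBlocks(false);  // don't churn the region server block cache
    TableMapReduceUtil.initTableMapperJob("mytable", scan, IndexMapper.class,
        IntWritable.class, Text.class, job);
    job.setNumReduceTasks(IndexMapper.NUM_SHARDS);
    FileOutputFormat.setOutputPath(job, new Path("/tmp/index-out"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Each reduce task then sees all the rows for one shard; the last sketch
near the end of this mail shows what it might do with them.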
If you wanted to do it by server, study TableInputFormat and organize
the splits by region server address (second sketch below).
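For example (again an untested sketch against the newer
org.apache.hadoop.hbase.mapreduce classes, assuming a running cluster
and a made-up table name), you can ask TableInputFormat for its splits
and bucket them by the hosting region server:

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.mapreduce.TableSplit;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;

public class SplitsByServer {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    conf.set(TableInputFormat.INPUT_TABLE, "mytable");  // made-up table name
    Job job = Job.getInstance(conf);
    TableInputFormat tif = new TableInputFormat();
    tif.setConf(job.getConfiguration());
    // One split per region; each TableSplit knows the server hosting it.
    Map<String, Integer> regionsPerServer = new HashMap<String, Integer>();
    for (InputSplit split : tif.getSplits(job)) {
      String server = ((TableSplit) split).getRegionLocation();
      Integer count = regionsPerServer.get(server);
      regionsPerServer.put(server, count == null ? 1 : count + 1);
    }
    // regionsPerServer now maps region server address -> region count;
    // a server-sharded job would group its splits by these addresses.
    System.out.println(regionsPerServer);
  }
}

But note the caveat above: the bucketing is only as stable as region
assignment is while the job runs.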
Will your HBase instance be changing while the index job runs?
How do you make a SOLR shard? Is it a special Lucene index format with
required fields, or does SOLR not care and will it serve any Lucene index?
What is the best way to build each shard, please? Use HBase as the
input source to MapReduce and push into the local node's index in the
map/reduce operation?
Would katta help, http://katta.wiki.sourceforge.net/? Invoke it after
your MR indexing job finishes to push the shards out to the serving
nodes' local disks?
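As for building a single shard, each reduce task (one per shard) would
do roughly the below with plain Lucene against the rows it receives,
writing to its node's local disk; katta, or a plain copy out to the
SOLR hosts, would take it from there. This is an untested sketch
against a modern Lucene; field names and the output path are made up,
and whether SOLR will serve such an index as-is goes back to my
question above about what a SOLR shard needs to contain.

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class ShardBuilder {
  public static void main(String[] args) throws Exception {
    // One index directory per shard on the node's local disk.
    IndexWriter writer = new IndexWriter(
        FSDirectory.open(Paths.get("/data/index/shard-00")),
        new IndexWriterConfig(new StandardAnalyzer()));
    // For each (row key, content) pair routed to this shard...
    Document doc = new Document();
    doc.add(new StringField("id", "rowkey-0001", Field.Store.YES));
    doc.add(new TextField("content", "text pulled out of the hbase row",
        Field.Store.YES));
    writer.addDocument(doc);
    writer.close();
  }
}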
St.Ack