Thanks Michael for your reply and help,

> You could do your sharding by server but what happens if an hbase node
> crashes during your indexing job? The regions that were on server 20 will
> be distributed among the remaining 19. If 20 comes back, balancing may put
> other than original regions on 20th server.

Ok, understood. I presume for deployment then it would make sense to separate HDFS machines and the machines with the lucene index also? (Right now I am not looking for capacity. I will worry about replicated indexes for hits per second later, which I presume is fairly easy with more hardware...)

> Natural 'unit' in hbase is the region. You might shard by region. If so,
> there are table input formats that split tables by region. Could serve as
> input to your mapreduce indexing job. See in our mapred package. There is
> a mapreduce job that makes a full-text index of a tables' contents as an
> example.
> If you wanted to do it by server, studying the TableInputFormat and organize
> splits by region address.

I've just read up on what a region is and this sounds like a good start for shard strategy. I'll get some tests running on the TableIndexFormat and look at the code behind it.

> Will your hbase instance be changing while the index job runs?

Not intentionally...

> How do you make a SOLR shard? Is it a special lucene index format with
> required fields or does SOLR not care and will serve any lucene index?

Good questions and highlighting my newness to this, including lucene! So far I have generated my SOLR indexes from a big tab file, into a single index which proved too big for one machine. SOLR does not manage shards during writing, and you must do the sharding yourself, so I just split my tab file into 2 and loaded one into each. I was under the impression lucene could not do structured searches (a column value between 10-20 and date after 01/01/2008 kind of stuff), hence looking straight to SOLR. I will get into them more and find out to answer these questions. Too many technologies to learn...

> Would katta help, http://katta.wiki.sourceforge.net/? Invoke it after your
> MR indexing job finishes to push the shards out to serving local disks?

Thanks for the pointer. I will look into it.

Thanks again,
Tim
