On Mon, Sep 10, 2012 at 6:20 PM, Matt Corgan <mcor...@hotpads.com> wrote:
> ... snipping lots of helpful use cases...

It seems like portions of what you discussed would probably be nominally impacted by indexes while others would be heavily impacted. It also seems like compound-qualifier indexing would potentially be of interest to you (although I'm not sure how much it would buy you). Are you going to be at the powwow tomorrow?

> Seems like there are 3 categories of sparseness:
> 1) sparse indexes (like ipAddress) where a per-table approach is more
> efficient for reads
> 2) dense indexes (like eventType) where there are likely values of every
> index key on each region
> 3) very dense indexes (like male/female) where you should just be doing a
> table scan anyway
>

Yes. I probably shouldn't have used the male/female example, since you're right that a table scan is probably the best option in that case. For category one, I was imagining a situation of more extreme sparseness, such as one target row in a large number of regions. This is where the all-region checking of the region-based approach is most egregious. I'd probably put anything that was in a small percentage of regions in the second case. (I also wonder if, in the single-row scenario, a judicious use of Bloom filters might provide satisfactory performance even if you do need to hit all regions. One of the things we've used as a guiding principle for our search stuff is that if you're trying to hit realtime, you can actually eat the most latency on the smallest scan, since you have so little data to move around... depends on allowable memory usage, I suppose.)

> Why is the per-region
> approach more beneficial than the per-table? Is it because it's easier to
> plug into hbase's existing per-region MapReduce splitter?
>

Part of it has to do with a bunch of non-HBase work I've been doing over the past few years. That's why I really hope people share as many use cases as possible...
so that the conclusions that come out of our work are representative of everyone's needs (as much as possible). What makes me lean towards region-level for a lot of use cases is the following (I hadn't even really thought about the existing MR splitter):

- How to maintain consistency (maybe this is unimportant?)
- How to avoid a network bottleneck as the cluster expands. (In the per-table approach, you're going to have to pass primary keys around constantly, except when the only value you want is the indexed value and you saved that entire value in the index table.)
- How to maximize scale. (In the per-table case, a particular set of indexed values will probably be colocated on a fraction of all nodes. Any kind of parallel/MR job will then be constrained by these nodes.)
- How to minimize the long-term storage cost of indexes. (If we have region-level relationships, we can get more tightly coupled over time and use more efficient, compact approaches like the store file position approach I tossed out in one of my other emails.)

I spent some time in the Cassandra community doing a review of various indexing use cases. I should go take another look to see what they do and how it works for them...

>>

Thanks for starting the important discussion. Lots to talk about. Lots to potentially do. It will be interesting to see who has time to put against this, as that will probably substantially constrain all of our great ideas :)

Jacques
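
To make the sparse single-row case above a bit more concrete, here is a rough sketch of the two lookup paths. It is a toy in-memory model only, not HBase code: `BloomFilter`, `Region`, `per_region_lookup`, `build_table_index`, and `per_table_lookup` are all hypothetical names made up for illustration, and the filter sizes are arbitrary assumptions. The per-region path asks every region but lets a per-region Bloom filter rule most of them out cheaply; the per-table path does one index lookup and then has to carry primary keys to whichever regions own the rows.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits      # arbitrary size, chosen for the sketch
        self.num_hashes = num_hashes
        self.bits = 0                 # a Python int used as a bit array

    def _positions(self, value):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


class Region:
    """Hypothetical region: rows plus a region-local index and Bloom filter."""

    def __init__(self, rows):
        self.rows = rows              # row key -> indexed value
        self.local_index = {}         # indexed value -> [row keys]
        self.bloom = BloomFilter()
        for key, value in rows.items():
            self.local_index.setdefault(value, []).append(key)
            self.bloom.add(value)


def per_region_lookup(regions, value):
    """Check every region, but only probe the index where the filter says maybe."""
    hits, probed = [], 0
    for region in regions:
        if not region.bloom.might_contain(value):
            continue                  # definitely absent, skip the index probe
        probed += 1
        hits.extend(region.local_index.get(value, []))
    return sorted(hits), probed


def build_table_index(regions):
    """Per-table index: indexed value -> [(region id, primary key)]."""
    index = {}
    for region_id, region in enumerate(regions):
        for key, value in region.rows.items():
            index.setdefault(value, []).append((region_id, key))
    return index


def per_table_lookup(table_index, value):
    """One index lookup, then primary keys must travel to the owning regions."""
    entries = table_index.get(value, [])
    keys = sorted(key for _, key in entries)
    regions_touched = len({region_id for region_id, _ in entries})
    return keys, regions_touched
```

With, say, 50 regions of 20 rows each and a single matching row, `per_table_lookup` touches exactly one region, while `per_region_lookup` still consults all 50 filters but probes only the region that holds the row plus any false positives. Whether that is good enough depends on how much memory the filters are allowed to use.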