Re: HBase Developer's Pow-wow.

Jacques Mon, 10 Sep 2012 21:04:37 -0700

On Mon, Sep 10, 2012 at 6:20 PM, Matt Corgan <mcor...@hotpads.com> wrote:


> ... snipping lots of helpful use cases...


It seems like portions of what you discussed would probably be only nominally
impacted by indexes while others would be very impacted.  It also seems like
compound-qualifier indexing would potentially be of interest to you
(although I'm not sure how much it would buy you). Are you going to be at
the powwow tomorrow?


> Seems like there are 3 categories of sparseness:
> 1) sparse indexes (like ipAddress) where a per-table approach is more
> efficient for reads
> 2) dense indexes (like eventType) where there are likely values of every
> index key on each region
> 3) very dense indexes (like male/female) where you should just be doing a
> table scan anyway
>


Yes.  I probably shouldn't have used the male/female example since you're
right that a table scan is probably the best option in that case.  For
category one, I was imagining a situation of more extreme sparseness, such
as one target row in a large number of regions.  This is the place where
the all-region checking of the region-based approach is the most egregious.
I'd probably put anything that was in at least a small percentage of regions
as the second case. (I also wonder if, in the single-row scenario, a judicious
use of bloomfilters might provide satisfactory performance even if you do
need to hit all regions. One of the things we've used as a guiding
principle for our search stuff is that if you're trying to hit realtime,
you can actually eat the most latency on the smallest scan since you have
so little data to move around... depends on allowable memory usage I
suppose.)
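
To make the bloomfilter idea concrete, here is a minimal sketch, assuming
each region keeps a small in-memory filter over the values in its local
index.  The Region and BloomFilter classes and the find_rows helper are
hypothetical names for illustration only, not HBase APIs; the filter sizes
are arbitrary.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: num_hashes bit positions in a num_bits bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as the bit array

    def _positions(self, value):
        # Derive num_hashes positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


class Region:
    """Hypothetical region: a local index on one column plus its filter."""

    def __init__(self, name):
        self.name = name
        self.local_index = {}  # indexed value -> list of row keys
        self.bloom = BloomFilter()

    def put(self, row_key, indexed_value):
        self.local_index.setdefault(indexed_value, []).append(row_key)
        self.bloom.add(indexed_value)

    def lookup(self, indexed_value):
        return self.local_index.get(indexed_value, [])


def find_rows(regions, indexed_value):
    """Visit every region, but only probe the index where the filter says maybe."""
    rows, probed = [], []
    for region in regions:
        if region.bloom.might_contain(indexed_value):
            probed.append(region.name)
            rows.extend(region.lookup(indexed_value))
    return rows, probed


# One row per region, each with a distinct ipAddress-style value.
regions = [Region(f"region-{i}") for i in range(100)]
for i, region in enumerate(regions):
    region.put(f"row-{i}", f"10.0.0.{i}")

rows, probed = find_rows(regions, "10.0.0.42")
```

Every region is still contacted, but the expensive index probe only happens
in regions whose filter answers "maybe".  The memory cost is the bit array
per region, which is the tradeoff mentioned above.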


> Why is the per-region
> approach more beneficial than the per-table?  Is it because it's easier to
> plug into hbase's existing per-region MapReduce splitter?
>

Part of it has to do with a bunch of non-HBase work I've been doing over
the past few years.  That's why I really hope people share as many use
cases as possible... so that the conclusions that come out of our work are
representative of everyone's needs (as much as possible).  What makes me
lean towards region-level for a lot of use cases are the following:  (I
hadn't even really thought about the existing MR splitter.)
- How to maintain consistency (maybe this is unimportant?)
- How to avoid network bottleneck as the cluster expands (in the case of a
per-table approach, you're going to have to pass primary keys around
constantly except in the case that the only value you want is the indexed
value and you saved that entire value in the index table.)
- How to maximize scale.  (In the per-table case, a particular set of
indexed values will probably be colocated among a fraction of all nodes.
Any kind of parallel/MR job will then be constrained by these nodes.)
- How to minimize long term storage cost of indexes.  (If we have
region-level relationships, we can get more tightly coupled over time and
use more efficient compact approaches like the store file position approach
I tossed out in one of my other emails.)
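
A toy model of the network point above, as a rough sketch rather than a
measurement.  It assumes integer row keys range-partitioned into fixed-size
regions; build, per_table_lookup and per_region_lookup are hypothetical
helpers, and the "rpcs" and "keys_shipped" counters are just a simple way to
tally hops and primary keys crossing the network.

```python
ROWS_PER_REGION = 4  # arbitrary region size for the toy model


def build(rows):
    """rows: list of (int row_key, value). Returns (regions, global_index)."""
    num_regions = max(key for key, _ in rows) // ROWS_PER_REGION + 1
    regions = [{"rows": {}, "index": {}} for _ in range(num_regions)]
    global_index = {}
    for key, value in rows:
        region = regions[key // ROWS_PER_REGION]
        region["rows"][key] = value
        region["index"].setdefault(value, []).append(key)  # region-local index
        global_index.setdefault(value, []).append(key)     # per-table index
    return regions, global_index


def per_table_lookup(regions, global_index, value):
    """One hop to the index table, then one hop per region owning a match."""
    keys = global_index.get(value, [])
    owners = {key // ROWS_PER_REGION for key in keys}
    rows = [(key, regions[key // ROWS_PER_REGION]["rows"][key]) for key in keys]
    # Every matching primary key is shipped from the index to a data region.
    return rows, {"rpcs": 1 + len(owners), "keys_shipped": len(keys)}


def per_region_lookup(regions, value):
    """Fan out to all regions; each resolves its own keys locally."""
    rows = []
    for region in regions:
        for key in region["index"].get(value, []):
            rows.append((key, region["rows"][key]))
    return rows, {"rpcs": len(regions), "keys_shipped": 0}


# 16 rows in 4 regions: one sparse value ("rare"), two dense ones.
data = [(i, "rare" if i == 5 else ("click" if i % 2 == 0 else "view"))
        for i in range(16)]
regions, global_index = build(data)

sparse_table = per_table_lookup(regions, global_index, "rare")[1]
sparse_region = per_region_lookup(regions, "rare")[1]
dense_table = per_table_lookup(regions, global_index, "click")[1]
dense_region = per_region_lookup(regions, "click")[1]
```

In this model the per-table index wins for the sparse value (2 hops versus
one per region), while for the dense value it touches every region anyway
and also ships every matching key across the network.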

I spent some time in the Cassandra community doing a review of various
indexing use cases.  I should go take another look to see what they do and
how it works for them...


>> Thanks for starting the important discussion.

Lots to talk about. Lots to potentially do.  It will be interesting to see
who has time to put against this as that will probably substantially
constrain all of our great ideas :)

Jacques
