If this is going to be a CP, then other CPs need an easy way to use the output stats. If a subsequent proposal from core requires statistics from this CP, does that then mandate that it must itself be a CP? What if that can't work?
Putting the stats into a table addresses the first concern. For the second, it is an issue that comes up, I think, when building a generally useful shared function as a CP. Please consider inserting my earlier comments about OSGi here, in that we trend toward a real module system if we're not careful (unless that is the aim).

On Tue, Feb 26, 2013 at 2:31 PM, Jesse Yates <[email protected]> wrote:

> TL;DR Making it part of the UI and ensuring that you don't load things the wrong way seem to be the only reasons for making this part of core - certainly not bad reasons. They are fairly easy to handle as a CP though, so maybe it's not necessary immediately.
>
> I ended up writing a simple stats framework last week (ok, it's like 6 classes) that makes it easy to create your own stats for a table. It's all coprocessor based and, as Lars suggested, hooks up to the major compactions to let you build per-column-per-region stats and writes them to a 'system' table = "_stats_".
>
> With the framework you could easily write your own custom stats, from simple things like min/max keys to things like fixed-width or fixed-depth histograms, or even more complicated. There has been some internal discussion around how to make this available to the community (as part of Phoenix, core in HBase, an independent github project, ...?).
>
> The biggest issue around having it all CP based is that you need to be really careful to ensure that it comes _after_ all the other compaction coprocessors. This way you know exactly what keys come out and have correct statistics (for that point in time). Not a huge issue - you just need to be careful. Baking the stats framework into HBase is really nice in that we can be sure we never mess this up.
>
> Building it into the core of HBase isn't going to get us per-region statistics without a whole bunch of pain - compactions per store make this a pain to actualize; there isn't a real advantage here, as I'd like to keep it per CF, if only not to change all the things.
>
> Further, this would be a great first use case for real system tables. Mixing this data with .META. is going to be a bit of a mess, especially for doing clean scans, etc. to read the stats. Also, I'd be gravely concerned to muck with such important state, especially if we make a 'statistic' a pluggable element (so people can easily expand their own).
>
> And sure, we could make it make pretty graphs on the UI, no harm in it and very little overhead :)
>
> -------------------
> Jesse Yates
> @jesse_yates
> jyates.github.com
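To make the table shape concrete, here is a minimal sketch of the kind of per-store row such a framework might write after a major compaction and that any other client or CP could read back. Only the "_stats_" table name comes from Jesse's note above; the row-key layout, the "stat" family, the qualifiers, and the helper class are invented for illustration, and the writing side would really live in the compaction hook rather than a standalone client call.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class StatsTableSketch {
  // Hypothetical layout: one row per store (user table + region + column family),
  // one column per statistic, all under a single "stat" family.
  private static final byte[] STATS_TABLE = Bytes.toBytes("_stats_");
  private static final byte[] STAT_CF = Bytes.toBytes("stat");

  /** Written by the compaction hook once a major compaction finishes. */
  public static void writeStoreStats(Configuration conf, String userTable, String region,
      String family, byte[] minKey, byte[] maxKey, long rowCount) throws Exception {
    HTable stats = new HTable(conf, STATS_TABLE);
    try {
      // Row key = user table + region + column family, so a scan bounded by the
      // user table name returns all stats for that table.
      byte[] row = Bytes.toBytes(userTable + "," + region + "," + family);
      Put p = new Put(row);
      p.add(STAT_CF, Bytes.toBytes("min_key"), minKey);
      p.add(STAT_CF, Bytes.toBytes("max_key"), maxKey);
      p.add(STAT_CF, Bytes.toBytes("row_count"), Bytes.toBytes(rowCount));
      stats.put(p);
    } finally {
      stats.close();
    }
  }

  /** Any other coprocessor, Phoenix, or the UI can read the stats back with a plain Get. */
  public static long readRowCount(Configuration conf, String userTable, String region,
      String family) throws Exception {
    HTable stats = new HTable(conf, STATS_TABLE);
    try {
      Result r = stats.get(new Get(Bytes.toBytes(userTable + "," + region + "," + family)));
      byte[] v = r.getValue(STAT_CF, Bytes.toBytes("row_count"));
      return v == null ? 0L : Bytes.toLong(v);
    } finally {
      stats.close();
    }
  }
}

Because consumers only need an ordinary Get or Scan against that table, the first concern above (other CPs getting easy access to the output stats) is handled without any compile-time dependency on the producing coprocessor.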
> On Tue, Feb 26, 2013 at 2:08 PM, Stack <[email protected]> wrote:
>
> > On Fri, Feb 22, 2013 at 10:40 PM, lars hofhansl <[email protected]> wrote:
> >
> > > This topic comes up now and then (see recent discussion about translating multi Gets into Scan+Filter).
> > >
> > > It's not that hard to keep statistics as part of compactions. I envision two knobs:
> > > 1. Max number of distinct values to track directly. If a column has less than this # of values, keep track of their occurrences explicitly.
> > > 2. Number of (equal width) histogram partitions to maintain.
> > >
> > > Statistics would be kept per store (i.e. per region per column family) and stored into an HBase table (one row per store). Initially we could just support major compactions that atomically insert a new version of the statistics for the store.
> >
> > Sounds great.
> >
> > In .META. add columns for each cf on each region row? Or another table?
> >
> > What kind of stats would you keep? Would they be useful for operators? Or just for stuff like say Phoenix making decisions?
> >
> > > A simple implementation (not knowing ahead of time how many values it will see during the compaction) could start by keeping track of individual values for columns. If it gets past the max # of distinct values to track, start with equal width histograms (using the distinct values picked up so far to estimate an initial partition width).
> > > If the number of partitions gets larger than what was configured, it would increase the width and merge the previous counts into the new width (which means the new partition width must be a multiple of the previous size). There's probably a lot of other fanciness that could be used here (haven't spent a lot of time thinking about details).
> > >
> > > Is this something that should be in core HBase or rather be implemented as a coprocessor?
> >
> > I think it could go in core if it generated pretty pictures.
> >
> > St.Ack

--
Best regards,
- Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
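As a rough illustration of the adaptive scheme Lars describes above (track distinct values exactly up to knob 1, then switch to an equal-width histogram estimated from what has been seen so far, and widen the partitions, keeping the new width a multiple of the old, whenever the bucket count exceeds knob 2), here is a minimal, self-contained sketch. The class and method names are invented, and values are reduced to non-negative longs rather than raw byte[] column values to keep it short.

import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Toy per-column statistics collector, fed one value at a time during a compaction. */
public class AdaptiveColumnStats {
  private final int maxDistinctValues;   // knob 1: track exact counts up to this many values
  private final int maxPartitions;       // knob 2: max number of equal-width histogram buckets

  private TreeMap<Long, Long> exactCounts = new TreeMap<Long, Long>();
  private Map<Long, Long> histogram;     // bucket index -> count, once exact tracking overflows
  private long partitionWidth;

  public AdaptiveColumnStats(int maxDistinctValues, int maxPartitions) {
    this.maxDistinctValues = maxDistinctValues;
    this.maxPartitions = maxPartitions;
  }

  /** Assumes non-negative values for brevity. */
  public void add(long value) {
    if (histogram == null) {
      Long c = exactCounts.get(value);
      exactCounts.put(value, c == null ? 1L : c + 1L);
      if (exactCounts.size() <= maxDistinctValues) {
        return;
      }
      switchToHistogram();   // too many distinct values: fall back to equal-width buckets
    } else {
      long bucket = value / partitionWidth;
      Long c = histogram.get(bucket);
      histogram.put(bucket, c == null ? 1L : c + 1L);
    }
    while (histogram.size() > maxPartitions) {
      widen();               // new width is always a multiple of the previous one
    }
  }

  /** Use the distinct values picked up so far to estimate an initial partition width. */
  private void switchToHistogram() {
    long span = exactCounts.lastKey() - exactCounts.firstKey() + 1;
    partitionWidth = Math.max(1L, span / maxPartitions);
    histogram = new HashMap<Long, Long>();
    for (Map.Entry<Long, Long> e : exactCounts.entrySet()) {
      long bucket = e.getKey() / partitionWidth;
      Long c = histogram.get(bucket);
      histogram.put(bucket, c == null ? e.getValue() : c + e.getValue());
    }
    exactCounts = null;
  }

  /** Double the partition width and fold each pair of old buckets into one new bucket. */
  private void widen() {
    partitionWidth *= 2;
    Map<Long, Long> merged = new HashMap<Long, Long>();
    for (Map.Entry<Long, Long> e : histogram.entrySet()) {
      long bucket = e.getKey() / 2;
      Long c = merged.get(bucket);
      merged.put(bucket, c == null ? e.getValue() : c + e.getValue());
    }
    histogram = merged;
  }
}

A real implementation would bucket on the actual column bytes and flush the finished structure to the stats table (one row per store) at the end of the major compaction, as discussed above.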
