+1 for core. I can see that histograms might help us in automatic splits and merges as well.
On Tue, Feb 26, 2013 at 3:27 PM, Andrew Purtell <[email protected]> wrote:

> If this is going to be a CP then other CPs need an easy way to use the
> output stats. If a subsequent proposal from core requires statistics from
> this CP, does that then mandate that it must itself be a CP? What if that
> can't work?
>
> Putting the stats into a table addresses the first concern.
>
> For the second, it is an issue that comes up, I think, when building a
> generally useful shared function as a CP. Please consider inserting my
> earlier comments about OSGi here, in that we trend toward a real module
> system if we're not careful (unless that is the aim).
>
> On Tue, Feb 26, 2013 at 2:31 PM, Jesse Yates <[email protected]> wrote:
>
> > TL;DR: Making it part of the UI and ensuring that you don't load things
> > the wrong way seem to be the only reasons for making this part of core -
> > certainly not bad reasons. They are fairly easy to handle as a CP,
> > though, so maybe it's not necessary immediately.
> >
> > I ended up writing a simple stats framework last week (OK, it's only
> > about six classes) that makes it easy to create your own stats for a
> > table. It's all coprocessor based and, as Lars suggested, hooks into
> > major compactions to let you build per-column-per-region stats, writing
> > them to a 'system' table named "_stats_".
> >
> > With the framework you could easily write your own custom stats, from
> > simple things like min/max keys to fixed-width or fixed-depth
> > histograms, or even more complicated ones. There has been some internal
> > discussion around how to make this available to the community (as part
> > of Phoenix, core in HBase, an independent GitHub project, ...?).
> >
> > The biggest issue with having it all CP based is that you need to be
> > really careful to ensure that it comes _after_ all the other compaction
> > coprocessors. That way you know exactly which keys come out and have
> > correct statistics (for that point in time).
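[Editor's note: the pluggable-statistic idea described above can be sketched roughly as below. This is a minimal illustration, not the actual framework from the thread; the names StatisticTracker and MinMaxKeyTracker and the interface shape are assumptions.]

```java
import java.util.Arrays;

// Hypothetical shape of a pluggable per-compaction statistic: the compaction
// hook would call update() for every key it emits, then persist current()
// to the stats table. Names are illustrative only.
interface StatisticTracker {
    void update(byte[] key);   // fed each key emitted by the compaction
    byte[][] current();        // snapshot of the statistic, ready to store
}

// Example statistic: track the min and max row key seen.
class MinMaxKeyTracker implements StatisticTracker {
    private byte[] min, max;

    // Unsigned lexicographic comparison, the order HBase uses for row keys.
    private static int compare(byte[] a, byte[] b) {
        int len = Math.min(a.length, b.length);
        for (int i = 0; i < len; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    @Override
    public void update(byte[] key) {
        if (min == null || compare(key, min) < 0) min = Arrays.copyOf(key, key.length);
        if (max == null || compare(key, max) > 0) max = Arrays.copyOf(key, key.length);
    }

    @Override
    public byte[][] current() {
        return new byte[][] { min, max };
    }
}
```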
> > Not a huge issue - you just need to be careful. Baking the stats
> > framework into HBase is really nice in that we can be sure we never
> > mess this up.
> >
> > Building it into the core of HBase isn't going to get us per-region
> > statistics without a whole bunch of pain - per-store compactions make
> > this a pain to actualize; there isn't a real advantage here, as I'd
> > like to keep it per CF, if only not to change all the things.
> >
> > Further, this would be a great first use case for real system tables.
> > Mixing this data with .META. is going to be a bit of a mess, especially
> > for doing clean scans, etc. to read the stats. Also, I'd be gravely
> > concerned about mucking with such important state, especially if we
> > make a 'statistic' a pluggable element (so people can easily add their
> > own).
> >
> > And sure, we could make it draw pretty graphs on the UI - no harm in
> > it and very little overhead :)
> >
> > -------------------
> > Jesse Yates
> > @jesse_yates
> > jyates.github.com
> >
> > On Tue, Feb 26, 2013 at 2:08 PM, Stack <[email protected]> wrote:
> >
> > > On Fri, Feb 22, 2013 at 10:40 PM, lars hofhansl <[email protected]>
> > > wrote:
> > >
> > > > This topic comes up now and then (see the recent discussion about
> > > > translating multi Gets into Scan+Filter).
> > > >
> > > > It's not that hard to keep statistics as part of compactions.
> > > > I envision two knobs:
> > > > 1. Max number of distinct values to track directly. If a column
> > > > has fewer than this # of values, keep track of their occurrences
> > > > explicitly.
> > > > 2. Number of (equal-width) histogram partitions to maintain.
> > > >
> > > > Statistics would be kept per store (i.e. per region per column
> > > > family) and stored in an HBase table (one row per store). Initially
> > > > we could just support major compactions that atomically insert a
> > > > new version of the statistics for the store.
> > >
> > > Sounds great.
> > >
> > > In .META., add columns for each CF on each region row? Or another
> > > table?
> > >
> > > What kind of stats would you keep? Would they be useful for
> > > operators? Or just for stuff like, say, Phoenix making decisions?
> > >
> > > > A simple implementation (not knowing ahead of time how many values
> > > > it will see during the compaction) could start by keeping track of
> > > > individual values for columns. If it gets past the max # of
> > > > distinct values to track, it could switch to equal-width histograms
> > > > (using the distinct values picked up so far to estimate an initial
> > > > partition width).
> > > > If the number of partitions gets larger than what was configured,
> > > > it would increase the width and merge the previous counts into the
> > > > new width (which means the new partition width must be a multiple
> > > > of the previous size).
> > > > There's probably a lot of other fanciness that could be used here
> > > > (I haven't spent a lot of time thinking about the details).
> > > >
> > > > Is this something that should be in core HBase, or should it
> > > > rather be implemented as a coprocessor?
> > >
> > > I think it could go in core if it generated pretty pictures.
> > >
> > > St.Ack

> --
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
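[Editor's note: Lars's widen-and-merge scheme can be sketched as below. Using a doubling factor keeps the new partition width a multiple of the old one, so old buckets fold cleanly into new ones. The class name, doubling factor, and fixed-origin design are assumptions for illustration, not part of any proposal in the thread.]

```java
// Sketch of the equal-width histogram described above: when a value falls
// past the last partition, double the partition width and merge adjacent
// counts (old buckets 2i and 2i+1 fold into new bucket i). Assumes all
// values are >= origin.
class EqualWidthHistogram {
    private final long origin;  // value mapped to bucket 0
    private long width;         // current partition width
    private long[] counts;

    EqualWidthHistogram(long origin, long initialWidth, int maxPartitions) {
        this.origin = origin;
        this.width = initialWidth;
        this.counts = new long[maxPartitions];
    }

    void add(long value) {
        long bucket = (value - origin) / width;
        while (bucket >= counts.length) {
            widen();
            bucket = (value - origin) / width;
        }
        counts[(int) bucket]++;
    }

    // Double the width and merge counts pairwise; the array length stays
    // fixed, so merged counts occupy the lower half and the upper half is
    // freed up for larger values.
    private void widen() {
        width *= 2;
        long[] merged = new long[counts.length];
        for (int i = 0; i < counts.length; i++) {
            merged[i / 2] += counts[i];
        }
        counts = merged;
    }

    long[] counts() { return counts.clone(); }
    long width()    { return width; }
}
```

For example, with origin 0, initial width 1, and 4 partitions, adding 0..3 fills every bucket; adding 5 then triggers one widen, leaving width 2 and counts [2, 2, 1, 0].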
