If this is going to be a CP, then other CPs need an easy way to use the output stats. If a subsequent proposal from core requires statistics from this CP, does that then mandate that it must itself be a CP? What if that can't work?
Putting the stats into a table addresses the first concern. For the second, it is an issue that comes up, I think, when building a generally useful shared function as a CP. Please consider inserting my earlier comments about OSGi here, in that we trend toward a real module system if we're not careful (unless that is the aim).

On Tue, Feb 26, 2013 at 2:31 PM, Jesse Yates <[email protected]> wrote:

> TL;DR Making it part of the UI and ensuring that you don't load things the wrong way seem to be the only reasons for making this part of core - certainly not bad reasons. They are fairly easy to handle as a CP though, so maybe it's not necessary immediately.
>
> I ended up writing a simple stats framework last week (ok, it's like 6 classes) that makes it easy to create your own stats for a table. It's all coprocessor based and, as Lars suggested, hooks up to the major compactions to let you build per-column-per-region stats and writes them to a 'system' table = "_stats_".
>
> With the framework you could easily write your own custom stats, from simple things like min/max keys to things like fixed-width or fixed-depth histograms, or even more complicated. There has been some internal discussion around how to make this available to the community (as part of Phoenix, core in HBase, an independent github project, ...?).
>
> The biggest issue around having it all CP based is that you need to be really careful to ensure that it comes _after_ all the other compaction coprocessors. This way you know exactly what keys come out and have correct statistics (for that point in time). Not a huge issue - you just need to be careful. Baking the stats framework into HBase is really nice in that we can be sure we never mess this up.
>
> Building it into the core of HBase isn't going to get us per-region statistics without a whole bunch of pain - compactions per store make this a pain to actualize; there isn't a real advantage here, as I'd like to keep it per CF, if only not to change all the things.
>
> Further, this would be a great first use case for real system tables. Mixing this data with .META. is going to be a bit of a mess, especially for doing clean scans, etc. to read the stats. Also, I'd be gravely concerned to muck with such important state, especially if we make a 'statistic' a pluggable element (so people can easily expand their own).
>
> And sure, we could make it make pretty graphs on the UI, no harm in it and very little overhead :)
>
> -------------------
> Jesse Yates
> @jesse_yates
> jyates.github.com
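To make the table shape concrete, here is a minimal sketch of the kind of per-store row such a framework might write after a major compaction and that any other client or CP could read back. Only the "_stats_" table name comes from Jesse's note above; the row-key layout, the "stat" family, the qualifiers, and the helper class are invented for illustration, and the writing side would really live in the compaction hook rather than a standalone client call.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class StatsTableSketch {
  // Hypothetical layout: one row per store (user table + region + column family),
  // one column per statistic, all under a single "stat" family.
  private static final byte[] STATS_TABLE = Bytes.toBytes("_stats_");
  private static final byte[] STAT_CF = Bytes.toBytes("stat");

  /** Written by the compaction hook once a major compaction finishes. */
  public static void writeStoreStats(Configuration conf, String userTable, String region,
      String family, byte[] minKey, byte[] maxKey, long rowCount) throws Exception {
    HTable stats = new HTable(conf, STATS_TABLE);
    try {
      // Row key = user table + region + column family, so a scan bounded by the
      // user table name returns all stats for that table.
      byte[] row = Bytes.toBytes(userTable + "," + region + "," + family);
      Put p = new Put(row);
      p.add(STAT_CF, Bytes.toBytes("min_key"), minKey);
      p.add(STAT_CF, Bytes.toBytes("max_key"), maxKey);
      p.add(STAT_CF, Bytes.toBytes("row_count"), Bytes.toBytes(rowCount));
      stats.put(p);
    } finally {
      stats.close();
    }
  }

  /** Any other coprocessor, Phoenix, or the UI can read the stats back with a plain Get. */
  public static long readRowCount(Configuration conf, String userTable, String region,
      String family) throws Exception {
    HTable stats = new HTable(conf, STATS_TABLE);
    try {
      Result r = stats.get(new Get(Bytes.toBytes(userTable + "," + region + "," + family)));
      byte[] v = r.getValue(STAT_CF, Bytes.toBytes("row_count"));
      return v == null ? 0L : Bytes.toLong(v);
    } finally {
      stats.close();
    }
  }
}

Because consumers only need an ordinary Get or Scan against that table, the first concern above (other CPs getting easy access to the output stats) is handled without any compile-time dependency on the producing coprocessor.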
> On Tue, Feb 26, 2013 at 2:08 PM, Stack <[email protected]> wrote:
>
> > On Fri, Feb 22, 2013 at 10:40 PM, lars hofhansl <[email protected]> wrote:
> >
> > > This topic comes up now and then (see recent discussion about translating multi Gets into Scan+Filter).
> > >
> > > It's not that hard to keep statistics as part of compactions. I envision two knobs:
> > > 1. Max number of distinct values to track directly. If a column has less than this # of values, keep track of their occurrences explicitly.
> > > 2. Number of (equal width) histogram partitions to maintain.
> > >
> > > Statistics would be kept per store (i.e. per region per column family) and stored into an HBase table (one row per store). Initially we could just support major compactions that atomically insert a new version of the statistics for the store.
> >
> > Sounds great.
> >
> > In .META. add columns for each cf on each region row? Or another table?
> >
> > What kind of stats would you keep? Would they be useful for operators? Or just for stuff like say Phoenix making decisions?
> >
> > > A simple implementation (not knowing ahead of time how many values it will see during the compaction) could start by keeping track of individual values for columns. If it gets past the max # of distinct values to track, start with equal width histograms (using the distinct values picked up so far to estimate an initial partition width).
> > > If the number of partitions gets larger than what was configured, it would increase the width and merge the previous counts into the new width (which means the new partition width must be a multiple of the previous size). There's probably a lot of other fanciness that could be used here (haven't spent a lot of time thinking about details).
> > >
> > > Is this something that should be in core HBase or rather be implemented as a coprocessor?
> >
> > I think it could go in core if it generated pretty pictures.
> >
> > St.Ack

--
Best regards,
- Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
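As a rough illustration of the adaptive scheme Lars describes above (track distinct values exactly up to knob 1, then switch to an equal-width histogram estimated from what has been seen so far, and widen the partitions, keeping the new width a multiple of the old, whenever the bucket count exceeds knob 2), here is a minimal, self-contained sketch. The class and method names are invented, and values are reduced to non-negative longs rather than raw byte[] column values to keep it short.

import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Toy per-column statistics collector, fed one value at a time during a compaction. */
public class AdaptiveColumnStats {
  private final int maxDistinctValues;   // knob 1: track exact counts up to this many values
  private final int maxPartitions;       // knob 2: max number of equal-width histogram buckets

  private TreeMap<Long, Long> exactCounts = new TreeMap<Long, Long>();
  private Map<Long, Long> histogram;     // bucket index -> count, once exact tracking overflows
  private long partitionWidth;

  public AdaptiveColumnStats(int maxDistinctValues, int maxPartitions) {
    this.maxDistinctValues = maxDistinctValues;
    this.maxPartitions = maxPartitions;
  }

  /** Assumes non-negative values for brevity. */
  public void add(long value) {
    if (histogram == null) {
      Long c = exactCounts.get(value);
      exactCounts.put(value, c == null ? 1L : c + 1L);
      if (exactCounts.size() <= maxDistinctValues) {
        return;
      }
      switchToHistogram();   // too many distinct values: fall back to equal-width buckets
    } else {
      long bucket = value / partitionWidth;
      Long c = histogram.get(bucket);
      histogram.put(bucket, c == null ? 1L : c + 1L);
    }
    while (histogram.size() > maxPartitions) {
      widen();               // new width is always a multiple of the previous one
    }
  }

  /** Use the distinct values picked up so far to estimate an initial partition width. */
  private void switchToHistogram() {
    long span = exactCounts.lastKey() - exactCounts.firstKey() + 1;
    partitionWidth = Math.max(1L, span / maxPartitions);
    histogram = new HashMap<Long, Long>();
    for (Map.Entry<Long, Long> e : exactCounts.entrySet()) {
      long bucket = e.getKey() / partitionWidth;
      Long c = histogram.get(bucket);
      histogram.put(bucket, c == null ? e.getValue() : c + e.getValue());
    }
    exactCounts = null;
  }

  /** Double the partition width and fold each pair of old buckets into one new bucket. */
  private void widen() {
    partitionWidth *= 2;
    Map<Long, Long> merged = new HashMap<Long, Long>();
    for (Map.Entry<Long, Long> e : histogram.entrySet()) {
      long bucket = e.getKey() / 2;
      Long c = merged.get(bucket);
      merged.put(bucket, c == null ? e.getValue() : c + e.getValue());
    }
    histogram = merged;
  }
}

A real implementation would bucket on the actual column bytes and flush the finished structure to the stats table (one row per store) at the end of the major compaction, as discussed above.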
