+1 for core. I can see that histograms might help us in automatic splits and merges as well.
On Tue, Feb 26, 2013 at 3:27 PM, Andrew Purtell <[email protected]> wrote:

> If this is going to be a CP then other CPs need an easy way to use the
> output stats. If a subsequent proposal from core requires statistics from
> this CP, does that then mandate that it must itself be a CP? What if that
> can't work?
>
> Putting the stats into a table addresses the first concern.
>
> For the second, it is an issue that comes up, I think, when building a
> generally useful shared function as a CP. Please consider inserting my
> earlier comments about OSGi here, in that we trend toward a real module
> system if we're not careful (unless that is the aim).
>
> On Tue, Feb 26, 2013 at 2:31 PM, Jesse Yates <[email protected]> wrote:
>
> > TL;DR: Making it part of the UI and ensuring that you don't load things
> > the wrong way seem to be the only reasons for making this part of core -
> > certainly not bad reasons. They are fairly easy to handle as a CP,
> > though, so maybe it's not necessary immediately.
> >
> > I ended up writing a simple stats framework last week (OK, it's only
> > about six classes) that makes it easy to create your own stats for a
> > table. It's all coprocessor based and, as Lars suggested, hooks into
> > major compactions to let you build per-column-per-region stats, writing
> > them to a 'system' table named "_stats_".
> >
> > With the framework you could easily write your own custom stats, from
> > simple things like min/max keys to fixed-width or fixed-depth
> > histograms, or even more complicated ones. There has been some internal
> > discussion around how to make this available to the community (as part
> > of Phoenix, core in HBase, an independent GitHub project, ...?).
> >
> > The biggest issue with having it all CP based is that you need to be
> > really careful to ensure that it comes _after_ all the other compaction
> > coprocessors. That way you know exactly which keys come out and have
> > correct statistics (for that point in time).
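[Editor's note: the pluggable-statistic idea described above can be sketched roughly as below. This is a minimal illustration, not the actual framework from the thread; the names StatisticTracker and MinMaxKeyTracker and the interface shape are assumptions.]

```java
import java.util.Arrays;

// Hypothetical shape of a pluggable per-compaction statistic: the compaction
// hook would call update() for every key it emits, then persist current()
// to the stats table. Names are illustrative only.
interface StatisticTracker {
    void update(byte[] key);   // fed each key emitted by the compaction
    byte[][] current();        // snapshot of the statistic, ready to store
}

// Example statistic: track the min and max row key seen.
class MinMaxKeyTracker implements StatisticTracker {
    private byte[] min, max;

    // Unsigned lexicographic comparison, the order HBase uses for row keys.
    private static int compare(byte[] a, byte[] b) {
        int len = Math.min(a.length, b.length);
        for (int i = 0; i < len; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    @Override
    public void update(byte[] key) {
        if (min == null || compare(key, min) < 0) min = Arrays.copyOf(key, key.length);
        if (max == null || compare(key, max) > 0) max = Arrays.copyOf(key, key.length);
    }

    @Override
    public byte[][] current() {
        return new byte[][] { min, max };
    }
}
```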
> > Not a huge issue - you just need to be careful. Baking the stats
> > framework into HBase is really nice in that we can be sure we never
> > mess this up.
> >
> > Building it into the core of HBase isn't going to get us per-region
> > statistics without a whole bunch of pain - per-store compactions make
> > this a pain to actualize; there isn't a real advantage here, as I'd
> > like to keep it per CF, if only not to change all the things.
> >
> > Further, this would be a great first use case for real system tables.
> > Mixing this data with .META. is going to be a bit of a mess, especially
> > for doing clean scans, etc. to read the stats. Also, I'd be gravely
> > concerned about mucking with such important state, especially if we
> > make a 'statistic' a pluggable element (so people can easily add their
> > own).
> >
> > And sure, we could make it draw pretty graphs on the UI - no harm in
> > it and very little overhead :)
> >
> > -------------------
> > Jesse Yates
> > @jesse_yates
> > jyates.github.com
> >
> > On Tue, Feb 26, 2013 at 2:08 PM, Stack <[email protected]> wrote:
> >
> > > On Fri, Feb 22, 2013 at 10:40 PM, lars hofhansl <[email protected]>
> > > wrote:
> > >
> > > > This topic comes up now and then (see the recent discussion about
> > > > translating multi Gets into Scan+Filter).
> > > >
> > > > It's not that hard to keep statistics as part of compactions.
> > > > I envision two knobs:
> > > > 1. Max number of distinct values to track directly. If a column
> > > > has fewer than this # of values, keep track of their occurrences
> > > > explicitly.
> > > > 2. Number of (equal-width) histogram partitions to maintain.
> > > >
> > > > Statistics would be kept per store (i.e. per region per column
> > > > family) and stored in an HBase table (one row per store). Initially
> > > > we could just support major compactions that atomically insert a
> > > > new version of the statistics for the store.
> > >
> > > Sounds great.
> > >
> > > In .META., add columns for each CF on each region row? Or another
> > > table?
> > >
> > > What kind of stats would you keep? Would they be useful for
> > > operators? Or just for stuff like, say, Phoenix making decisions?
> > >
> > > > A simple implementation (not knowing ahead of time how many values
> > > > it will see during the compaction) could start by keeping track of
> > > > individual values for columns. If it gets past the max # of
> > > > distinct values to track, it could switch to equal-width histograms
> > > > (using the distinct values picked up so far to estimate an initial
> > > > partition width).
> > > > If the number of partitions gets larger than what was configured,
> > > > it would increase the width and merge the previous counts into the
> > > > new width (which means the new partition width must be a multiple
> > > > of the previous size).
> > > > There's probably a lot of other fanciness that could be used here
> > > > (I haven't spent a lot of time thinking about the details).
> > > >
> > > > Is this something that should be in core HBase, or should it
> > > > rather be implemented as a coprocessor?
> > >
> > > I think it could go in core if it generated pretty pictures.
> > >
> > > St.Ack

> --
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
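[Editor's note: Lars's widen-and-merge scheme can be sketched as below. Using a doubling factor keeps the new partition width a multiple of the old one, so old buckets fold cleanly into new ones. The class name, doubling factor, and fixed-origin design are assumptions for illustration, not part of any proposal in the thread.]

```java
// Sketch of the equal-width histogram described above: when a value falls
// past the last partition, double the partition width and merge adjacent
// counts (old buckets 2i and 2i+1 fold into new bucket i). Assumes all
// values are >= origin.
class EqualWidthHistogram {
    private final long origin;  // value mapped to bucket 0
    private long width;         // current partition width
    private long[] counts;

    EqualWidthHistogram(long origin, long initialWidth, int maxPartitions) {
        this.origin = origin;
        this.width = initialWidth;
        this.counts = new long[maxPartitions];
    }

    void add(long value) {
        long bucket = (value - origin) / width;
        while (bucket >= counts.length) {
            widen();
            bucket = (value - origin) / width;
        }
        counts[(int) bucket]++;
    }

    // Double the width and merge counts pairwise; the array length stays
    // fixed, so merged counts occupy the lower half and the upper half is
    // freed up for larger values.
    private void widen() {
        width *= 2;
        long[] merged = new long[counts.length];
        for (int i = 0; i < counts.length; i++) {
            merged[i / 2] += counts[i];
        }
        counts = merged;
    }

    long[] counts() { return counts.clone(); }
    long width()    { return width; }
}
```

For example, with origin 0, initial width 1, and 4 partitions, adding 0..3 fills every bucket; adding 5 then triggers one widen, leaving width 2 and counts [2, 2, 1, 0].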
