Equal-width histograms lend themselves relatively nicely to incremental 
updates, so we could extend this to in-place updates later.
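Roughly what I mean by incremental updates (just a sketch, all names made 
up): folding a new value in is O(1), no rebuild needed.

// Equal-width histogram that can absorb single values in place.
public class EqualWidthHistogram {
  private final long min;      // lower bound of the tracked range
  private final long width;    // width of each partition
  private final long[] counts; // one counter per partition

  public EqualWidthHistogram(long min, long width, int partitions) {
    this.min = min;
    this.width = width;
    this.counts = new long[partitions];
  }

  public void add(long value) {
    int bucket = (int) ((value - min) / width);
    // Clamp out-of-range values into the edge partitions.
    bucket = Math.max(0, Math.min(counts.length - 1, bucket));
    counts[bucket]++;
  }
}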

As for the lightweight RPC: yeah, or we could just do a Put without retries. 
I think that if we fail to update the statistics it should not be considered 
a failure. We could also keep track of when the statistics were last updated. 
We might also want a facility to update the stats without any compaction 
(maybe an M/R job).
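Something like this, for example (sketch only; the table and column names 
are made up, though hbase.client.retries.number is a real client setting):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class StatsWriter {
  /** Best-effort write of serialized stats, one row per store. */
  public static void writeBestEffort(byte[] storeRow, byte[] stats) {
    Configuration conf = HBaseConfiguration.create();
    conf.setInt("hbase.client.retries.number", 1); // single attempt
    try {
      HTable table = new HTable(conf, "_stats_"); // made-up table name
      try {
        Put put = new Put(storeRow);
        put.add(Bytes.toBytes("s"), Bytes.toBytes("data"), stats);
        table.put(put);
      } finally {
        table.close();
      }
    } catch (IOException e) {
      // A failed stats update is not a failure of the compaction;
      // just note it and move on.
    }
  }
}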

In addition to histograms it might also be nice to keep track of each 
region's min/max values for each column and for the keys. We might also need 
some way to indicate which columns we want to track like this.
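Tracking min/max during the compaction scan would be cheap, e.g. (sketch, 
names made up):

import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.hbase.util.Bytes;

// Running min/max per tracked column; values compare as unsigned bytes.
public class MinMaxTracker {
  private final Map<String, byte[]> mins = new HashMap<String, byte[]>();
  private final Map<String, byte[]> maxs = new HashMap<String, byte[]>();

  public void update(String qualifier, byte[] value) {
    byte[] lo = mins.get(qualifier);
    if (lo == null || Bytes.compareTo(value, lo) < 0) {
      mins.put(qualifier, value);
    }
    byte[] hi = maxs.get(qualifier);
    if (hi == null || Bytes.compareTo(value, hi) > 0) {
      maxs.put(qualifier, value);
    }
  }
}

The keys come out of the compaction scan sorted anyway, so for keys the 
min/max are simply the first and last key seen.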

-- Lars



________________________________
 From: Andrew Purtell <[email protected]>
To: "[email protected]" <[email protected]>; lars hofhansl 
<[email protected]> 
Sent: Saturday, February 23, 2013 9:41 AM
Subject: Re: Simple statistics per region
 

> Statistics would be kept per store (i.e. per region per column family) and 
>stored into an HBase table (one row per store). Initially we could just support 
>major compactions that atomically insert a new version of the statistics for 
>the store.


Will we drop updates to the statistics table if regions of it are in 
transition? (I think that would be ok.)

Should we have a lightweight RPC for server-to-server communication that does 
not block or retry? 


The above two considerations would avoid a repeat of the region historian 
trouble... ancient history.

Can we expect demand, pretty quickly, for more than just statistics gathered 
at major compactions? Those would be fine for characterizing the data itself, 
but they don't provide any information about access patterns to the data, 
like I mentioned in the other email.



On Fri, Feb 22, 2013 at 10:40 PM, lars hofhansl <[email protected]> wrote:

>This topic comes up now and then (see the recent discussion about translating 
>multi Gets into Scan+Filter).
>
>It's not that hard to keep statistics as part of compactions.
>I envision two knobs:
>1. Max number of distinct values to track directly. If a column has fewer 
>than this # of values, keep track of their occurrences explicitly.
>2. Number of (equal width) histogram partitions to maintain.
>
>Statistics would be kept per store (i.e. per region per column family) and 
>stored into an HBase table (one row per store). Initially we could just support 
>major compactions that atomically insert a new version of the statistics for 
>the store.
>
>A simple implementation (not knowing ahead of time how many values it will 
>see during the compaction) could start by keeping track of individual values 
>for columns. If it gets past the max # of distinct values to track, it would 
>switch to equal-width histograms (using the distinct values picked up so far 
>to estimate an initial partition width).
>If the number of partitions gets larger than what was configured, it would 
>increase the width and merge the previous counts into the new partitions 
>(which means the new partition width must be a multiple of the previous one).
>There's probably a lot of other fanciness that could be used here (I haven't 
>spent a lot of time thinking about details); see the rough sketch below.
>
>
>Is this something that should be in core HBase, or should it rather be 
>implemented as a coprocessor?
>
>
>-- Lars
>
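
(Rough sketch of the scheme above, with the two knobs as constructor 
arguments; all names are made up.)

import java.util.Map;
import java.util.TreeMap;

// Track distinct values exactly up to maxDistinct (knob 1), then fall
// back to an equal-width histogram that doubles its partition width
// whenever it would outgrow maxPartitions (knob 2).
public class AdaptiveStats {
  private final int maxDistinct;
  private final int maxPartitions;
  private TreeMap<Long, Long> exact = new TreeMap<Long, Long>();
  private long[] counts; // null while still in exact mode
  private long base, width;

  public AdaptiveStats(int maxDistinct, int maxPartitions) {
    this.maxDistinct = maxDistinct;
    this.maxPartitions = maxPartitions;
  }

  public void add(long value) {
    if (counts == null) {
      Long c = exact.get(value);
      exact.put(value, c == null ? 1 : c + 1);
      if (exact.size() > maxDistinct) {
        switchToHistogram();
      }
      return;
    }
    // Simplification: values below the current base land in bucket 0.
    int bucket = value < base ? 0 : (int) ((value - base) / width);
    while (bucket >= maxPartitions) {
      doubleWidth();
      bucket = (int) ((value - base) / width);
    }
    counts[bucket]++;
  }

  // Use the distinct values seen so far to pick an initial width.
  private void switchToHistogram() {
    base = exact.firstKey();
    long range = exact.lastKey() - base + 1;
    width = Math.max(1, (range + maxPartitions - 1) / maxPartitions);
    counts = new long[maxPartitions];
    for (Map.Entry<Long, Long> e : exact.entrySet()) {
      counts[(int) ((e.getKey() - base) / width)] += e.getValue();
    }
    exact = null;
  }

  // The new width is a multiple (2x) of the old one, so old partitions
  // merge pairwise into the coarser ones.
  private void doubleWidth() {
    long[] merged = new long[maxPartitions];
    for (int i = 0; i < maxPartitions; i++) {
      merged[i / 2] += counts[i];
    }
    counts = merged;
    width *= 2;
  }
}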


-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via 
Tom White)
