[
https://issues.apache.org/jira/browse/HBASE-7958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13589869#comment-13589869
]
Jesse Yates commented on HBASE-7958:
------------------------------------
Good call, Todd.
The original intention was to enable histograms over the KeyValues in a region.
They are pretty simple to implement and get people really far in many cases.
The histograms support things like determining parallelization of scan within a
region (should I use 1, 5, or 100 threads to scan this region) as well as
key/value cardinality (helpful for using non-covered indexes).
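To make the parallelization case concrete, here is a rough sketch of how a client might turn a per-region histogram into a thread count and sub-scan boundaries. Everything in it is an assumption for illustration: the class RegionHistogramSketch, the Bucket shape (a start row key plus a KeyValue count), and the keyValuesPerThread knob are hypothetical, and none of it is an existing HBase API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch, not part of HBase: sizing a region scan from a histogram. */
final class RegionHistogramSketch {

    /** One histogram bucket: the first row key it covers and how many KeyValues it holds. */
    static final class Bucket {
        final String startKey;
        final long keyValueCount;

        Bucket(String startKey, long keyValueCount) {
            this.startKey = startKey;
            this.keyValueCount = keyValueCount;
        }
    }

    /** Sums the KeyValue counts of all buckets. */
    static long total(List<Bucket> histogram) {
        long sum = 0;
        for (Bucket b : histogram) {
            sum += b.keyValueCount;
        }
        return sum;
    }

    /** Picks a thread count so each thread scans about keyValuesPerThread KeyValues. */
    static int chooseScanThreads(List<Bucket> histogram, long keyValuesPerThread, int maxThreads) {
        long wanted = (total(histogram) + keyValuesPerThread - 1) / keyValuesPerThread; // ceiling
        return (int) Math.max(1, Math.min(maxThreads, wanted));
    }

    /** Returns the bucket start keys at which a new sub-scan should begin (threads - 1 at most). */
    static List<String> splitPoints(List<Bucket> histogram, int threads) {
        long total = total(histogram);
        List<String> splits = new ArrayList<>();
        long seen = 0;      // KeyValues in the buckets before the current one
        int nextSplit = 1;  // index of the next boundary still to be placed
        for (Bucket b : histogram) {
            if (seen > 0 && nextSplit < threads && seen >= (total * nextSplit) / threads) {
                splits.add(b.startKey);
                // Skip any boundaries this bucket has already passed.
                while (nextSplit < threads && seen >= (total * nextSplit) / threads) {
                    nextSplit++;
                }
            }
            seen += b.keyValueCount;
        }
        return splits;
    }
}
```

With four equally sized buckets starting at "a", "f", "m" and "t", asking for two threads yields a single split at "m". Boundaries can only fall on bucket edges, so finer buckets give more even sub-scans.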
Without getting too far into the implementation details: we could easily use a
compound key structure in the stats table. That would support a large variety
of stats going forward while adding almost no complication to the initial
histogram case.
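As one possible reading of the compound key idea, the sketch below builds a stats-table row key from several components joined by a zero byte. The component order (table, region, column family, stat type), the separator, and the StatsRowKeySketch class are all assumptions made for illustration; the actual layout has not been decided here. It also assumes no component contains a zero byte, for example by using the encoded region name.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch, not part of HBase: a compound row key for a stats table. */
final class StatsRowKeySketch {

    /** Assumed separator; components must not contain this byte. */
    private static final byte SEPARATOR = 0;

    /** Joins the given components, in order, with the separator byte between them. */
    static byte[] rowKey(String... parts) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                out.write(SEPARATOR);
            }
            byte[] bytes = parts[i].getBytes(StandardCharsets.UTF_8);
            out.write(bytes, 0, bytes.length);
        }
        return out.toByteArray();
    }

    /** True if key begins with prefix, e.g. to select every stat stored for one region. */
    static boolean hasPrefix(byte[] key, byte[] prefix) {
        if (prefix.length > key.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (key[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Under this layout the histogram is just one value of the trailing stat-type component, so a new statistic becomes a new row under the same prefix rather than a schema change.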
> Statistics per-column family per-region
> ---------------------------------------
>
> Key: HBASE-7958
> URL: https://issues.apache.org/jira/browse/HBASE-7958
> Project: HBase
> Issue Type: New Feature
> Affects Versions: 0.96.0
> Reporter: Jesse Yates
> Fix For: 0.96.0
>
>
> Originating from this discussion on the dev list:
> http://search-hadoop.com/m/coDKU1urovS/Simple+stastics+per+region/v=plain
> Essentially, we should have built-in statistics gathering for HBase tables.
> This allows clients to have a better understanding of the distribution of
> keys within a table and a given region. We could also surface this
> information via the UI.
> There are a couple of different proposals from the email; the overview is this:
> We add in something on compactions that gathers stats about the keys that are
> written and then we surface them to a table.
> The possible proposals include:
> *How to implement it?*
> # Coprocessors -
> ** advantage - it easily plugs in and people could pretty easily add their
> own statistics.
> ** disadvantage - UI elements would also require this, so we get into dependent
> loading, which leads down the OSGi path. Also, these CPs need to be installed
> _after_ all the other CPs on compaction to ensure they see exactly what gets
> written (doable, but a pain)
> # Built into HBase as a custom scanner
> ** advantage - always goes in the right place and no need to muck about with
> loading CPs etc.
> ** disadvantage - less pluggable, at least for the initial cut
> *Where do we store data?*
> # .META.
> ** advantage - it's an existing table, so we can jam it into another CF there
> ** disadvantage - this would make META much larger, possibly leading to
> splits AND will make it much harder for other processes to read the info
> # A new stats table
> ** advantage - cleanly separates out the information from META
> ** disadvantage - should use a 'system table' idea to prevent accidental
> deletion, manipulation by arbitrary clients, but still allow clients to read
> it.
> Once we have this framework, we can then move to an actual implementation of
> various statistics.