[ 
https://issues.apache.org/jira/browse/HBASE-14864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15019180#comment-15019180
 ] 

Lars George commented on HBASE-14864:
-------------------------------------

Yes. Say you have an epoch timestamp like 

{noformat}
$ date -j -f %s 1433763479
Mon Jun  8 04:37:59 PDT 2015
{noformat}

or the date it resolves to, written as {{20150608043759}}. Instead of sending 
each such key to a _random_, hashed location, it would be good to be able to 
_round_ the time to five-minute intervals. That way all keys from, say, 
{{201506080435}} up to {{201506080440}} arrive in the same bucket. When you 
read the data back, you can fetch it as one block per server, so I/O is more 
efficient (per server).
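
To make the rounding concrete, here is a minimal sketch of the arithmetic only. The class and method names ({{TimeBucket}}, {{bucketStart}}, {{bucketLabel}}) are made up for illustration and are not part of any HBase API; the time zone is passed in explicitly because the example above is printed in PDT.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not HBase API: rounds an epoch down to a fixed-size bucket.
final class TimeBucket {
    private static final DateTimeFormatter MINUTE_FORMAT =
        DateTimeFormatter.ofPattern("yyyyMMddHHmm");

    // Start of the bucket (in epoch seconds) that contains epochSeconds.
    // floorDiv keeps the rounding direction consistent for negative epochs too.
    static long bucketStart(long epochSeconds, long bucketSeconds) {
        return Math.floorDiv(epochSeconds, bucketSeconds) * bucketSeconds;
    }

    // Bucket start rendered as yyyyMMddHHmm in the given zone.
    static String bucketLabel(long epochSeconds, long bucketSeconds, ZoneId zone) {
        long start = bucketStart(epochSeconds, bucketSeconds);
        return MINUTE_FORMAT.format(Instant.ofEpochSecond(start).atZone(zone));
    }

    public static void main(String[] args) {
        ZoneId pacific = ZoneId.of("America/Los_Angeles");
        // 1433763479 is 04:37:59 PDT, so it falls into the 04:35 bucket.
        System.out.println(bucketLabel(1433763479L, 300L, pacific)); // 201506080435
    }
}
```

Every epoch from 1433763300 through 1433763599 maps to the same label, so those rows would sort next to each other under one bucket.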

> Add support for bucketing of keys into client library
> -----------------------------------------------------
>
>                 Key: HBASE-14864
>                 URL: https://issues.apache.org/jira/browse/HBASE-14864
>             Project: HBase
>          Issue Type: New Feature
>          Components: Client
>            Reporter: Lars George
>
> This has been discussed and taught so many times that I believe it is time to 
> support it properly. The idea is to be able to assign an optional _bucketing_ 
> strategy to a table, which translates the user-given row keys into a bucketed 
> version. This is done either by a simple bucket count or by parts of the key. 
> Possibly some simple functionality should help _compute_ bucket keys. 
> For example, given a key {{<service>\-<epoch>\-<subgroup>-...}} you could 
> imagine that a rule can be defined that takes the _epoch_ part and chunks it 
> into, for example, 5-minute buckets. This allows small time series to be 
> stored together and makes reading (especially over many servers) much more 
> efficient.
> The client also supports the scan logic needed to fan a scan out over the 
> buckets as required. There may be an executor service (implicitly or 
> explicitly provided) that is used to fetch the original data with 
> user-visible ordering from the distributed buckets. 
> Note that this has been attempted a few times to various extents out in the 
> field, but then withered away. This is an essential feature that, when present 
> in the API, will make users consider bucketing earlier, instead of when it is 
> too late (for example, when hot spotting occurs).
> The selected bucketing strategy and settings could be stored in the table 
> descriptor key/value pairs. This will allow any client to observe the 
> strategy transparently. If not set, the behaviour is the same as today, so the 
> new feature does not touch any critical path in terms of code, and is fully 
> client side. (It could also be considered for, say, UI support if needed.)
> The strategies are pluggable using classes, but a few default implementations 
> are supplied.
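
As a rough illustration of the pluggable, count-based strategy described in the issue, here is a minimal sketch. {{BucketingStrategy}} and {{CountBucketing}} are hypothetical names, not existing HBase client classes. Plain strings stand in for row keys, and the two-digit prefix format and the use of {{String.hashCode()}} are assumptions made to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical plug-in point: maps user keys to bucketed keys and back.
interface BucketingStrategy {
    String toBucketedKey(String userKey);
    String toUserKey(String bucketedKey);
    // One prefix per bucket; a client would issue one scan per prefix.
    List<String> scanPrefixes();
}

// Assumed default: bucket = hash(key) mod N, stored as a two-digit prefix.
final class CountBucketing implements BucketingStrategy {
    private final int buckets;

    CountBucketing(int buckets) {
        if (buckets < 1 || buckets > 100) {
            throw new IllegalArgumentException("buckets must be in 1..100");
        }
        this.buckets = buckets;
    }

    private static String prefix(int bucket) {
        return String.format(Locale.ROOT, "%02d-", bucket);
    }

    @Override
    public String toBucketedKey(String userKey) {
        // floorMod keeps the bucket non-negative for negative hash codes.
        int bucket = Math.floorMod(userKey.hashCode(), buckets);
        return prefix(bucket) + userKey;
    }

    @Override
    public String toUserKey(String bucketedKey) {
        // The prefix is always three characters: two digits and a dash.
        return bucketedKey.substring(3);
    }

    @Override
    public List<String> scanPrefixes() {
        List<String> prefixes = new ArrayList<>();
        for (int b = 0; b < buckets; b++) {
            prefixes.add(prefix(b));
        }
        return prefixes;
    }
}
```

A scan over the user-visible key space would then run once per entry in {{scanPrefixes()}}, strip the prefixes with {{toUserKey}}, and merge the per-bucket results back into key order, which is where the executor service mentioned in the description would come in.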



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
