[
https://issues.apache.org/jira/browse/HBASE-14864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15019180#comment-15019180
]
Lars George commented on HBASE-14864:
-------------------------------------
Yes, say you have an epoch like
{noformat}
$ date -j -f %s 1433763479
Mon Jun 8 04:37:59 PDT 2015
{noformat}
or the resulting date written as {{20150608043759}}. Instead of sending each
timestamp to a _random_, hashed location, it would be better to _round_ it
down to a five-minute boundary. That way all keys from, say,
{{201506080435}} up to (but not including) {{201506080440}} land in the same
bucket. When you read the data back, you can fetch it as one block per server,
which makes the IO more efficient (per server).
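
A minimal Java sketch of that rounding, for illustration only. The class and method names ({{FiveMinuteBucket}}, {{bucketStart}}, {{bucketKey}}) are hypothetical and not part of any HBase API, and the time zone is assumed to match the PDT output above.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of HBase: maps an epoch to its five-minute bucket.
class FiveMinuteBucket {
    static final long BUCKET_SECONDS = 5 * 60;

    /** Rounds an epoch (in seconds) down to the start of its five-minute bucket. */
    static long bucketStart(long epochSeconds) {
        return Math.floorDiv(epochSeconds, BUCKET_SECONDS) * BUCKET_SECONDS;
    }

    /** Formats the bucket start as yyyyMMddHHmm in the given time zone. */
    static String bucketKey(long epochSeconds, ZoneId zone) {
        DateTimeFormatter fmt =
            DateTimeFormatter.ofPattern("yyyyMMddHHmm").withZone(zone);
        return fmt.format(Instant.ofEpochSecond(bucketStart(epochSeconds)));
    }

    public static void main(String[] args) {
        // Assumed zone, chosen to match the PDT timestamp in the example above.
        ZoneId zone = ZoneId.of("America/Los_Angeles");
        System.out.println(bucketKey(1433763479L, zone)); // prints 201506080435
    }
}
```

Every epoch from 04:35:00 through 04:39:59 yields the same bucket key, so those rows sort next to each other.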
> Add support for bucketing of keys into client library
> -----------------------------------------------------
>
> Key: HBASE-14864
> URL: https://issues.apache.org/jira/browse/HBASE-14864
> Project: HBase
> Issue Type: New Feature
> Components: Client
> Reporter: Lars George
>
> This has been discussed and taught so many times, I believe it is time to
> support it properly. The idea is to be able to assign an optional _bucketing_
> strategy to a table, which translates the user-given row keys into a bucketed
> version. This is done either by a simple count or by parts of the key.
> Possibly some simple functionality should help _compute_ bucket keys.
> For example, given a key {{<service>\-<epoch>\-<subgroup>-...}} you could
> imagine that a rule can be defined that takes the _epoch_ part and chunks it
> into, for example, 5-minute buckets. This allows storing small time series
> together and makes reading (especially over many servers) much more efficient.
> The client also supports the proper scan logic to fan a scan over the buckets
> as needed. There may be an executor service (implicitly or explicitly
> provided) that is used to fetch the original data with user visible ordering
> from the distributed buckets.
> Note that this has been attempted a few times to various extents out in the
> field, but then withered away. This is an essential feature that when present
> in the API will make users consider this earlier, instead of when it is too
> late (when hot spotting occurs for example).
> The selected bucketing strategy and settings could be stored in the table
> descriptor key/value pairs. This will allow any client to observe the
> strategy transparently. If not set, the behaviour is the same as today, so the
> new feature does not touch any critical path in terms of code and is fully
> client side. (It could also be considered for, say, UI support if needed.)
> The strategies are pluggable using classes, but a few default implementations
> are supplied.
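
To make the description above more concrete, here is a rough Java sketch of a pluggable strategy with a "simple count" implementation, plus a scan that fans out over the buckets through an executor service and restores user-visible ordering. Everything in it is an assumption for illustration: {{BucketingStrategy}}, {{CountBucketing}}, {{toBucketedKey}} and {{fanOutScan}} are hypothetical names, the two-digit prefix format is invented, and a {{TreeMap}} stands in for the table instead of a real HBase client.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch; none of these types exist in the HBase client.
class BucketingSketch {
    /** Assumed plug-in point: maps a user key to one of N buckets. */
    interface BucketingStrategy {
        int bucketCount();
        int bucketOf(String userKey);
    }

    /** "Simple count" strategy: hash of the key modulo the bucket count. */
    static final class CountBucketing implements BucketingStrategy {
        private final int buckets;
        CountBucketing(int buckets) { this.buckets = buckets; }
        public int bucketCount() { return buckets; }
        public int bucketOf(String userKey) {
            byte[] bytes = userKey.getBytes(StandardCharsets.UTF_8);
            return Math.floorMod(Arrays.hashCode(bytes), buckets);
        }
    }

    /** Assumed on-disk prefix format: two digits and a dash, e.g. "03-". */
    static String prefix(int bucket) {
        return String.format(Locale.ROOT, "%02d-", bucket);
    }

    static String toBucketedKey(BucketingStrategy s, String userKey) {
        return prefix(s.bucketOf(userKey)) + userKey;
    }

    /** Scans [startKey, stopKey) in every bucket in parallel, then merges by user key. */
    static List<String> fanOutScan(NavigableMap<String, String> table, BucketingStrategy s,
                                   String startKey, String stopKey, ExecutorService pool) {
        List<Future<List<String>>> parts = new ArrayList<>();
        for (int b = 0; b < s.bucketCount(); b++) {
            final String p = prefix(b);
            Callable<List<String>> task = () -> {
                List<String> keys = new ArrayList<>();
                for (String k : table.subMap(p + startKey, true, p + stopKey, false).keySet()) {
                    keys.add(k.substring(p.length())); // strip the bucket prefix again
                }
                return keys;
            };
            parts.add(pool.submit(task));
        }
        List<String> merged = new ArrayList<>();
        try {
            for (Future<List<String>> f : parts) merged.addAll(f.get());
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("bucket scan failed", e);
        }
        Collections.sort(merged); // restore the ordering the user expects
        return merged;
    }

    public static void main(String[] args) {
        BucketingStrategy strategy = new CountBucketing(4);
        NavigableMap<String, String> table = new TreeMap<>();
        for (String k : new String[] {"a", "b", "c", "d", "e"}) {
            table.put(toBucketedKey(strategy, k), "value-" + k);
        }
        ExecutorService pool = Executors.newFixedThreadPool(strategy.bucketCount());
        System.out.println(fanOutScan(table, strategy, "b", "e", pool));
        pool.shutdown();
    }
}
```

A full sort after collecting all partial results keeps the sketch short; since each bucket already returns sorted keys, a streaming k-way merge would be the more natural choice for large scans.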