[
https://issues.apache.org/jira/browse/HBASE-11682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14101308#comment-14101308
]
Jonathan Hsieh commented on HBASE-11682:
----------------------------------------
{code}
+ <para>Salting in this sense has nothing to do with cryptography, but
refers to adding random
+ data to the start of a row key. In this case, salting refers to adding
a prefix to the row
+ key to cause it to sort differently than it otherwise would. Salting
can be helpful if you
+ have a few keys that come up over and over, along with other rows that
don't fit those keys.
+ In that case, the regions holding rows with the "hot" keys would be
overloaded, compared to
+ the other regions. Salting completely removes ordering, so is often a
poorer choice than
+ hashing. Using totally random row keys for data which is accessed
sequentially would remove
+ the benefit of HBase's row-sorting algorithm and cause very poor
performance, as each get or
+ scan would need to query all regions.</para>
{code}
I don't think this salting example is correct about the ramifications. Both
Nick and I agree that salting means putting some random value in front of the
actual value. This means that instead of one sorted list of entries, we'd have
n sorted lists of entries if the cardinality of the salt is n.
Example: naively we have rowkeys like this:
foo0001
foo0002
foo0003
foo0004
If we use a 4-way salt (a, b, c, d), we could end up with data re-sorted like this:
a-foo0003
b-foo0001
c-foo0004
d-foo0002
Let's say we add some new values to row foo0003. They could get salted with a
new salt, let's say 'c'.
a-foo0003
b-foo0001
*c-foo0003*
c-foo0004
d-foo0002
To read, we could still get the rows back in the original order, but we'd need
a reader starting from each salt in parallel. (We would also likely need to
coalesce the a-foo0003 and c-foo0003 rows back into one.) The effect in this
situation is that we could write with 4x the throughput, since the writes would
be spread over 4 different machines (assuming that a, b, c, and d are balanced
onto different machines).
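A rough sketch of the random-salt scheme described above, in Python, with a plain dict standing in for a table. The names SALTS, write, and scan_in_original_order are made up for illustration and are not HBase APIs; the merge-and-coalesce read is just one way the parallel readers could be combined.

```python
import heapq
import random

SALTS = ["a", "b", "c", "d"]  # hypothetical 4-way salt


def salted_key(row_key, rng):
    """Prefix the row key with a randomly chosen salt."""
    return f"{rng.choice(SALTS)}-{row_key}"


def write(table, row_key, columns, rng):
    """Store columns under a randomly salted key (table is a plain dict)."""
    table.setdefault(salted_key(row_key, rng), {}).update(columns)


def scan_in_original_order(table):
    """Read one sorted stream per salt, merge by unsalted key, coalesce duplicates."""
    readers = []
    for salt in SALTS:
        prefix = salt + "-"
        rows = [(k[len(prefix):], v) for k, v in table.items() if k.startswith(prefix)]
        readers.append(sorted(rows, key=lambda r: r[0]))
    merged = {}
    # heapq.merge yields rows in unsalted-key order across all salt streams.
    for key, columns in heapq.merge(*readers, key=lambda r: r[0]):
        merged.setdefault(key, {}).update(columns)
    return list(merged.items())
```

With the table from the example (a-foo0003, b-foo0001, c-foo0003, c-foo0004, d-foo0002), the scan returns foo0001 through foo0004 in order, with the two foo0003 entries combined into one row.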
Nick's point of view (please correct me if I am wrong) is that you could
"salt" the original row key with a one-way hash so that foo0003 would always
get salted with 'a'. This would spread rowkeys that are lexicographically
close (foo0001 and foo0002) across different machines, which could help reduce
contention and increase overall throughput, but would never allow a single
row to have 4x the throughput like the other approach.
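A minimal sketch of that deterministic variant, in Python. The choice of SHA-256, the bucket list SALTS, and the function name hashed_prefix_key are assumptions for illustration; any stable hash would do. The point is only that the prefix is a function of the key, so a given row always lands under the same prefix.

```python
import hashlib

SALTS = ["a", "b", "c", "d"]  # hypothetical 4-way prefix set


def hashed_prefix_key(row_key):
    """Derive the prefix from a hash of the key, so it is stable per row."""
    digest = hashlib.sha256(row_key.encode("utf-8")).digest()
    salt = SALTS[digest[0] % len(SALTS)]  # first digest byte picks the bucket
    return f"{salt}-{row_key}"
```

Because the prefix can be recomputed from the key, a point read needs to look in only one place, and all writes to one row go to the same machine.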
{code}
+ <para>Hashing refers to applying a random one-way function to the row
key, such that a
+ particular row always gets the same arbitrary value applied. This
preserves the sort order
+ so that scans are effective, but spreads out load across a region. One
example where hashing
+ is the right strategy would be if for some reason, a large proportion
of rows started with
+ the same letter. Normally, these would all be sorted into the same
region. You can apply a
+ hash to artificially differentiate them and spread them out.</para>
{code}
Hashing actually totally trashes the sort order. In fact, the goal of hashing
is to disperse entries that are near each other lexicographically as evenly as
possible.
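To illustrate the point about sort order, here is a small Python sketch that replaces each row key with a hash of the key and then lists the keys in the order the hashed forms would sort. SHA-256 and the names hashed_key and storage_order are assumptions for illustration only.

```python
import hashlib


def hashed_key(row_key):
    """Replace the row key with the hex SHA-256 digest of the key."""
    return hashlib.sha256(row_key.encode("utf-8")).hexdigest()


def storage_order(row_keys):
    """Return the original keys in the order their hashed forms would sort."""
    return sorted(row_keys, key=hashed_key)
```

Running storage_order over foo0001, foo0002, and so on returns the same keys in an order unrelated to their lexicographic order, so a range scan over neighboring original keys no longer maps to a contiguous range.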
> Explain hotspotting
> -------------------
>
> Key: HBASE-11682
> URL: https://issues.apache.org/jira/browse/HBASE-11682
> Project: HBase
> Issue Type: Task
> Components: documentation
> Reporter: Misty Stanley-Jones
> Assignee: Misty Stanley-Jones
> Attachments: HBASE-11682-1.patch, HBASE-11682.patch, HBASE-11682.patch
>
>
--
This message was sent by Atlassian JIRA
(v6.2#6252)