[
https://issues.apache.org/jira/browse/HBASE-2600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13481772#comment-13481772
]
Jesse Yates commented on HBASE-2600:
------------------------------------
We've been doing a lot of thinking over here at Salesforce about this issue and
were thinking about picking up work on this, if Alex is busy. The current
approach is pretty good, and has a lot of merits. We also discussed the option
of using the multi-row transaction stuff (which will be another reason why we
couldn't split META). I did a full write-up/analysis of the options (see
https://dl.dropbox.com/u/6147077/Proposal-HBASE-2600.docx).
What I ended up coming up with is a little bit crazy, but I think it works.
(I'm not dealing with tablenames as hashes, but that is pretty trivial). What
I'm looking to solve are:
(1) replacing start keys with end keys
(2) ensuring correct sorting
(3) ensuring correct split behavior to avoid META holes
(4) moving the compound key components to their own family/qualifier
There seem to be a couple of pieces we can put together to ensure we meet all the
above goals.
First, row keys are encoded as:
For all non-terminal regions:
{code}
<table>0x00<endkey>
{code}
For the terminal region:
{code}
<table>0x01
{code}
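A minimal sketch of the two encodings above, in plain Java with no HBase dependency. The class and method names (MetaRowKeys, nonTerminal, terminal, compare) are hypothetical and only for illustration. It assumes table names never contain the bytes 0x00 or 0x01, and it compares keys as unsigned bytes in lexicographic order, which is how HBase orders row keys.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of HBase. */
final class MetaRowKeys {
    static final byte NON_TERMINAL = 0x00;
    static final byte TERMINAL = 0x01;

    /** Row key for a region that has an end key: <table>0x00<endkey>. */
    static byte[] nonTerminal(String table, byte[] endKey) {
        byte[] t = table.getBytes(StandardCharsets.UTF_8);
        byte[] row = new byte[t.length + 1 + endKey.length];
        System.arraycopy(t, 0, row, 0, t.length);
        row[t.length] = NON_TERMINAL;
        System.arraycopy(endKey, 0, row, t.length + 1, endKey.length);
        return row;
    }

    /** Row key for the last region of a table: <table>0x01. */
    static byte[] terminal(String table) {
        byte[] t = table.getBytes(StandardCharsets.UTF_8);
        byte[] row = Arrays.copyOf(t, t.length + 1);
        row[t.length] = TERMINAL;
        return row;
    }

    /** Unsigned lexicographic byte comparison. */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }
}
```

Because 0x01 sorts after 0x00, the terminal row sorts after every non-terminal row of the same table, whatever bytes the end keys contain.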
Then we can move the encoded name into its own cell, under the
“info:encodedname” column. Next, the regionid is moved to the timestamp and
used for all updates to the region in META (this includes offlining and marking
the parent as split). Since regionids are already timestamps by convention,
this doesn't stray that far afield.
META then looks something like:
{code}
<table>0x00<endkey> | info | encodedname      | <regionid>  | <md5 hash>
                    |      | regioninfo       | <regionid>  | <hri – 1>
                    |      | server           | <regionid>  | <server:port>
                    |      | server.startcode | <regionid>  | <startcode>
                    |      | splitA           | <regionid>  | <hri – 3>
                    |      | splitB           | <regionid>  | <hri – 4>
<table>0x01         | info | encodedname      | <regionid2> | <hri-4>
                    |      | ...              | <regionid2> | ...
{code}
Obviously there are some serious implications for how lookups and splits work.
Splits need to take the opposite approach with respect to putting children in
META. Currently, we write the bottom and then the top child, counting on the
htable to retry when it finds an offlined region. Now, we just flip the
ordering: (1) offline the parent, (2) put the 'top' child, and then (3)
insert the bottom child.
The problem lies in making sure that the bottom child sorts before the parent.
In the previous scheme we ensured that sorting by putting a regionid in the row
key. With the above scheme, the 'top' child will always sort before the parent
because it has a lower endkey. The 'bottom' child actually has _exactly the same
row key_ as the parent. However, the bottom child still sorts first because it
has a larger regionid (which is also already baked into the code).
We also must do a check of the timestamp vs. the expected regionid to ensure
that we can get the correct region, but that is a minor overhead.
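The version ordering and the timestamp check can be sketched roughly as follows. This models a single META cell as a map from regionid (used as the timestamp) to value, newest first. VersionedCell and its methods are hypothetical names for illustration, not HBase classes.

```java
import java.util.Comparator;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Toy model of one META cell with several versions; not HBase code. */
final class VersionedCell {
    // regionid (used as the cell timestamp) -> value, largest regionid first
    private final NavigableMap<Long, String> versions =
        new TreeMap<Long, String>(Comparator.<Long>reverseOrder());

    void put(long regionId, String value) {
        versions.put(regionId, value);
    }

    /** Value with the largest regionid, or null if the cell is empty. */
    String latest() {
        Map.Entry<Long, String> e = versions.firstEntry();
        return e == null ? null : e.getValue();
    }

    /** Value stored under exactly the expected regionid, or null if absent. */
    String getIfRegionId(long expectedRegionId) {
        return versions.get(expectedRegionId);
    }
}
```

A caller that already knows which regionid it expects can use the second method to detect that the row now holds a newer region than the one it asked for.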
NOTE: this also gives us provenance of regions, at least until the catalog
janitor cleans up parent regions.
For lookups, you would query for the first region that matches (similar to the
current mechanism):
{code}
<table>0x00<desired key>999999……
{code}
which still finds the correct (bottom) child because its regionid must be
greater than its parent causing it to sort _before_ its parent in the same row.
This gives us correct sorting, an easily readable META, and no holes. Oh, and we
can remove all the backwards scanning.
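To make the lookup and the flipped split ordering concrete, here is a small in-memory sketch. Everything in it is an assumption for illustration: MetaLookupSketch is a hypothetical class, META is modeled as a sorted map of row key to versions, plain String order stands in for byte order (so ASCII keys only), and the probe is "first row strictly after <table>0x00<key>" rather than the literal suffix shown in the lookup key above.

```java
import java.util.Comparator;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Toy in-memory META for illustration only; not HBase code. */
final class MetaLookupSketch {
    // row key -> (regionid, newest first) -> region description
    private final NavigableMap<String, NavigableMap<Long, String>> meta =
        new TreeMap<String, NavigableMap<Long, String>>();

    /** <table>0x00<endkey> */
    static String row(String table, String endKey) {
        return table + (char) 0x00 + endKey;
    }

    /** <table>0x01 */
    static String terminalRow(String table) {
        return table + (char) 0x01;
    }

    void put(String row, long regionId, String region) {
        NavigableMap<Long, String> versions = meta.get(row);
        if (versions == null) {
            versions = new TreeMap<Long, String>(Comparator.<Long>reverseOrder());
            meta.put(row, versions);
        }
        versions.put(regionId, region);
    }

    /** Newest entry of the first row after <table>0x00<key>, or null if none. */
    String locate(String table, String key) {
        Map.Entry<String, NavigableMap<Long, String>> e =
            meta.higherEntry(row(table, key));
        if (e == null) {
            return null;
        }
        String found = e.getKey();
        boolean sameTable = found.startsWith(table + (char) 0x00)
            || found.equals(terminalRow(table));
        return sameTable ? e.getValue().firstEntry().getValue() : null;
    }

    public static void main(String[] args) {
        MetaLookupSketch m = new MetaLookupSketch();
        m.put(row("t", "z"), 100L, "parent");
        m.put(terminalRow("t"), 100L, "last");

        m.put(row("t", "z"), 100L, "parent (offline)"); // (1) offline the parent
        m.put(row("t", "m"), 200L, "top child");        // (2) lower end key, new row
        System.out.println(m.locate("t", "p"));         // still the offlined parent
        m.put(row("t", "z"), 200L, "bottom child");     // (3) same row, newer regionid
        System.out.println(m.locate("t", "c"));
        System.out.println(m.locate("t", "p"));
    }
}
```

In this model every key resolves to some row at each step of the split: between steps (2) and (3) a key in the upper half lands on the offlined parent, which is the case where the client is expected to retry.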
> Change how we do meta tables; from tablename+STARTROW+randomid to instead,
> tablename+ENDROW+randomid
> ----------------------------------------------------------------------------------------------------
>
> Key: HBASE-2600
> URL: https://issues.apache.org/jira/browse/HBASE-2600
> Project: HBase
> Issue Type: Bug
> Reporter: stack
> Assignee: Alex Newman
> Attachments:
> 0001-Changed-regioninfo-format-to-use-endKey-instead-of-s.patch,
> 0001-HBASE-2600.-Change-how-we-do-meta-tables-from-tablen.patch,
> 0001-HBASE-2600.-Change-how-we-do-meta-tables-from-tablen-v2.patch,
> 0001-HBASE-2600.-Change-how-we-do-meta-tables-from-tablen-v4.patch,
> 0001-HBASE-2600.-Change-how-we-do-meta-tables-from-tablen-v6.patch,
> 0001-HBASE-2600.-Change-how-we-do-meta-tables-from-tablen-v7.2.patch,
> 0001-HBASE-2600.-Change-how-we-do-meta-tables-from-tablen-v8,
> 0001-HBASE-2600.-Change-how-we-do-meta-tables-from-tablen-v8.1,
> 0001-HBASE-2600.-Change-how-we-do-meta-tables-from-tablen-v9.patch,
> 0001-HBASE-2600.v10.patch, 0001-HBASE-2600-v11.patch, 2600-trunk-01-17.txt,
> HBASE-2600+5217-Sun-Mar-25-2012-v3.patch,
> HBASE-2600+5217-Sun-Mar-25-2012-v4.patch, hbase-2600-root.dir.tgz, jenkins.pdf
>
>
> This is an idea that Ryan and I have been kicking around on and off for a
> while now.
> If regionnames were made of tablename+endrow instead of tablename+startrow,
> then in the metatables, doing a search for the region that contains the
> wanted row, we'd just have to open a scanner using passed row and the first
> row found by the scan would be that of the region we need (If offlined
> parent, we'd have to scan to the next row).
> If we redid the meta tables in this format, we'd be using an access that is
> natural to hbase, a scan as opposed to the perverse, expensive
> getClosestRowBefore we currently have that has to walk backward in meta
> finding a containing region.
> This issue is about changing the way we name regions.
> If we were using scans, prewarming client cache would be near costless (as
> opposed to what we'll currently have to do which is first a
> getClosestRowBefore and then a scan from the closestrowbefore forward).
> Converting to the new method, we'd have to run a migration on startup
> changing the content in meta.
> Up to this, the randomid component of a region name has been the timestamp of
> region creation. HBASE-2531 "32-bit encoding of regionnames waaaaaaayyyyy
> too susceptible to hash clashes" proposes changing the randomid so that it
> contains actual name of the directory in the filesystem that hosts the
> region. If we had this in place, I think it would help with the migration to
> this new way of doing the meta because as is, the region name in fs is a hash
> of regionname... changing the format of the regionname would mean we generate
> a different hash... so we'd need hbase-2531 to be in place before we could do
> this change.