[ https://issues.apache.org/jira/browse/HBASE-5504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13246885#comment-13246885 ]

Eric Newton commented on HBASE-5504:
------------------------------------

Hi,

I work on Accumulo, and I implemented the merge/deleteRange feature.

Our primary use case was time-based row keys, where data was deleted as it 
aged.  Over time, this left empty splits and resulted in poor balancing.  The 
same motivation applies to deleteRange: it is now very efficient to tell 
Accumulo to delete all rows that fall before "20111231".
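
For reference, the client-side calls look roughly like this (Accumulo's 
TableOperations API; the connector setup is assumed and omitted):

{code:java}
import org.apache.accumulo.core.client.Connector;
import org.apache.hadoop.io.Text;

public class AgeOff {
  // Drop all rows up to "20111231", then merge the now-empty tablets away.
  public static void ageOff(Connector conn, String table) throws Exception {
    // Null start row means "from the beginning of the table".
    conn.tableOperations().deleteRows(table, null, new Text("20111231"));

    // Collapse the tablets in that range back into a single tablet so the
    // empty splits do not skew balancing.
    conn.tableOperations().merge(table, null, new Text("20111231"));
  }
}
{code}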

I know that stack and Keith Turner have commented on other tickets and 
referenced the FATE architecture.  It really simplified the many steps in merge 
(see the outline in this ticket).  FATE is not very big and would be easy to 
borrow/emulate.
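
For anyone unfamiliar with it, FATE persists a multi-step operation as a chain 
of idempotent steps, so a restarted master can resume or roll back the 
operation.  The interface each step implements looks roughly like this (names 
approximate; see Accumulo's fate package for the real thing):

{code:java}
import java.io.Serializable;

// Rough shape of a FATE-style repeatable operation step (an approximation,
// not the exact Accumulo interface).
public interface Repo<T> extends Serializable {
  // Return 0 if this step can run now, otherwise a back-off delay in ms.
  long isReady(long txid, T environment) throws Exception;

  // Do the step's work and return the next step, or null when done.
  Repo<T> call(long txid, T environment) throws Exception;

  // Roll this step back if the overall operation is aborted.
  void undo(long txid, T environment) throws Exception;
}
{code}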

We also added the concept of read and write locks on tables.  Merge grabs a 
write lock on a table, and so does bulk import.  This reduces the number of 
assumptions each of those processes has to make about the others.
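
In practice these are ZooKeeper-backed distributed locks; the in-process 
sketch below only illustrates the scheme (TableLocks and withWriteLock are 
illustrative names, not an existing HBase or Accumulo API):

{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class TableLocks {
  private final ConcurrentMap<String, ReadWriteLock> locks = new ConcurrentHashMap<>();

  private ReadWriteLock lockFor(String table) {
    return locks.computeIfAbsent(table, t -> new ReentrantReadWriteLock());
  }

  // Merge, bulk import, disable, etc. each run under the table's write lock,
  // so none of them has to reason about the others running concurrently.
  public void withWriteLock(String table, Runnable op) {
    ReadWriteLock lock = lockFor(table);
    lock.writeLock().lock();
    try {
      op.run();
    } finally {
      lock.writeLock().unlock();
    }
  }
}
{code}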

Accumulo has a file garbage collector, so a file can survive many splits.  But 
this makes merge more complex: after a split, a tablet may reference a shared 
file while only using part of it, and data in the unused part may since have 
been deleted.  A merge would inadvertently bring that data back.  We had at 
least two implementation options:

 1. keep range information with files in the metadata tablet
 2. "chop" files, or compact a file down to the tablet's range before the merge

We chose option 2.  This might appeal to the HBase community, too, since it 
avoids double-indirection and file-deletion issues.  But I regret not using 
option 1 because it would have made merge as fast as split.
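
A minimal sketch of what option 2 amounts to (the Entry/KeyValueFile types 
below are hypothetical placeholders for the store-file reader and writer, not 
HBase or Accumulo classes):

{code:java}
import java.io.IOException;

// Illustrative "chop" compaction: before a merge, rewrite a shared file so it
// holds only the rows owned by this tablet.
public class ChopCompaction {

  interface Entry { byte[] row(); }

  interface KeyValueFile extends Iterable<Entry> {
    void append(Entry e) throws IOException;
  }

  static void chop(KeyValueFile in, KeyValueFile out,
                   byte[] startRow, byte[] endRow) throws IOException {
    for (Entry e : in) {
      // Keep only (startRow, endRow]; anything outside belongs to another
      // tablet (or was already deleted) and must not reappear after the merge.
      if (compare(e.row(), startRow) > 0 && compare(e.row(), endRow) <= 0) {
        out.append(e);
      }
    }
  }

  // Unsigned lexicographic comparison, the order row keys sort in the store.
  static int compare(byte[] a, byte[] b) {
    for (int i = 0; i < Math.min(a.length, b.length); i++) {
      int d = (a[i] & 0xff) - (b[i] & 0xff);
      if (d != 0) return d;
    }
    return a.length - b.length;
  }
}
{code}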

Our merge works on a range, but we also have tools that let a user merge based 
on a target tablet size.  That only works on one range at a time, and it 
requires the tablet to be online to perform the chop operation.  If a user 
decides to change their tablet size from 256M to 1G on a 1PB table, they will 
wait a very long time.  In the future we will recognize this case, chop the 
tablets in parallel, and take the table offline to re-write the metadata 
table.

We found a lot of places in Accumulo where we assumed that tablets would only 
split, and it took a lot of testing to discover those assumptions.  
Fortunately, merge and split are inverses and can be used to test each other.  
DeleteRange, while very similar to merge, does not have this property and is 
harder to test at scale.
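
That testing relationship is just a round-trip property; a sketch, where 
TabletAdmin is a hypothetical stand-in for the admin API:

{code:java}
import java.util.List;

public class MergeSplitRoundTrip {

  interface TabletAdmin {
    List<String> listSplitPoints(String table) throws Exception;
    void split(String table, String splitRow) throws Exception;
    void merge(String table, String startRow, String endRow) throws Exception;
  }

  static void roundTrip(TabletAdmin admin, String table,
                        String start, String mid, String end) throws Exception {
    List<String> before = admin.listSplitPoints(table);

    admin.split(table, mid);         // split somewhere inside the range ...
    admin.merge(table, start, end);  // ... then merge the pieces back together

    // If split and merge really are inverses, the table layout is unchanged.
    if (!before.equals(admin.listSplitPoints(table))) {
      throw new AssertionError("split/merge round trip changed the split points");
    }
  }
}
{code}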

                
> Online Merge
> ------------
>
>                 Key: HBASE-5504
>                 URL: https://issues.apache.org/jira/browse/HBASE-5504
>             Project: HBase
>          Issue Type: Brainstorming
>          Components: client, master, shell, zookeeper
>    Affects Versions: 0.94.0
>            Reporter: Mubarak Seyed
>             Fix For: 0.96.0
>
>
> As discussed, please refer to the discussion at 
> [HBASE-4991|https://issues.apache.org/jira/browse/HBASE-4991]
> Design suggestion from Stack:
> {quote}
> I suggest a design below. It has some prerequisites: some general 
> functionality that this feature (and others) could use. The prereqs, if you 
> think them good, could be done outside of this JIRA.
> Here's a suggested rough outline of how I think this feature should run. The 
> feature I'm describing below is merge and deleteRegion, for I see them as in 
> essence the same thing.
> (C) Client, (M) Master, RS (Region server), ZK (ZooKeeper)
> 1. Client calls merge or deleteRegion API. API is a range of rows. (C)
> 2. Master gets call. (M)
> 3. Master obtains a write lock on the table so it can't be disabled from under 
> us. The write lock will also disable splitting. This is one of the prereqs, I 
> think; it's HBASE-5494. (Or we could just do something simpler where we have a 
> flag up in zk that splitRegion checks, but that's less useful I think; OR we do 
> the dynamic configs issue and set splits to off via a config change.) 
> There'd be a timer for how long we wait on the table lock. (M -> ZK)
> 4. If we get the lock, write intent to merge a range up into zk. It also 
> hoists into zk whether it's a pure merge or a merge that drops the region data 
> (a deleteRegion call). (M)
> 5. Return to the client either our failed attempt at locking the table or an 
> id of some sort identifying this running operation; the client can use it to 
> query status. (M -> C)
> 6. Turn off the balancer. TODO/prereq: do it in a way that is persisted. The 
> balancer switch is currently in memory only, so if the master crashes, the new 
> master will come up in balancing mode. (If we had dynamic config we could hoist 
> up to zk a config that disables the balancer rather than have a 
> balancer-specific flag/znode, OR if a write lock is outstanding on a table, 
> then the balancer does not balance regions in the locked table - this latter 
> might be the easiest to do.) (M)
> 7. Write into zk that we just turned off the balancer (if it was on). (M -> ZK)
> 8. Get regions that are involved in the span (M)
> 9. Hoist the list up into zk. (M -> ZK)
> 10. Create region to span the range. (M)
> 11. Write that we did this up into zk. (M -> ZK)
> 12. Close regions in parallel. Confirm close in parallel. (M -> RS)
> 13. Write up into zk that the regions are closed (this might not be necessary 
> since we can ask if a region is open). (M -> ZK)
> 14. If this is a merge and not a deleteRegion, move files under the new 
> region. Might multithread this (moves should go pretty fast). If a 
> deleteRegion, we skip this step. (M)
> 15. On completion mark zk (though this may not be necessary since it's easy to 
> look in the fs to see the state of the move). (M -> ZK)
> 16. Edit .META. (M)
> 17. Confirm edits went in. (M)
> 18. Move old regions to the hbase trash folder. TODO: there is no trash 
> folder under /hbase currently; we should add one. (M)
> 19. Enable balancer (if it was off) (M)
> 20. Unlock table (M)
> {quote}
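
Read as a chain of FATE-style persisted steps (see Eric's comment above), the 
outline compresses to roughly the following; the enum is purely illustrative, 
not proposed code:

{code:java}
// One constant per step of the outline; each would be persisted (e.g. in zk)
// before executing, so a restarted master can resume or roll back.
public enum MergeStep {
  ACQUIRE_TABLE_WRITE_LOCK,   // 3: also blocks disable and split for the duration
  RECORD_INTENT_IN_ZK,        // 4: range, plus merge vs. deleteRegion flag
  RETURN_OPERATION_ID,        // 5: client polls status with this id
  DISABLE_BALANCER,           // 6-7: persisted so a new master honors it
  LIST_SPANNED_REGIONS,       // 8-9: hoisted into zk
  CREATE_SPANNING_REGION,     // 10-11
  CLOSE_OLD_REGIONS,          // 12-13: in parallel, confirm closed
  MOVE_FILES_IF_MERGE,        // 14-15: skipped for deleteRegion
  EDIT_META,                  // 16-17: point .META. at the new region
  TRASH_OLD_REGIONS,          // 18
  ENABLE_BALANCER,            // 19: only if it was on before
  UNLOCK_TABLE                // 20
}
{code}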
