[ 
https://issues.apache.org/jira/browse/HBASE-19893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16546111#comment-16546111
 ] 

Toshihiro Suzuki commented on HBASE-19893:
------------------------------------------

I analyzed the test failure of TestRestoreSnapshotFromClientWithRegionReplicas:
https://issues.apache.org/jira/browse/HBASE-19893?focusedCommentId=16468496&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16468496

As the QA files were already deleted, I just uploaded the log of the failed 
test 
[^org.apache.hadoop.hbase.client.TestRestoreSnapshotFromClientWithRegionReplicas-output.txt].

The test failed because the snapshot validation failed. The expected number of 
snapshotted regions was 8 but the actual number was 7.
{code}
2018-05-09 05:39:58,418 ERROR [MASTER_TABLE_OPERATIONS-master/dbf4832ee95b:0-0] 
snapshot.TakeSnapshotHandler(215): Failed taking snapshot { 
ss=snaptb1-1525844378682 
table=testOnlineSnapshotAfterSplittingRegions-1525844378682 type=FLUSH } due to 
exception:Regions moved during the snapshot '{ ss=snaptb1-1525844378682 
table=testOnlineSnapshotAfterSplittingRegions-1525844378682 type=FLUSH }'. 
expected=8 snapshotted=7.
org.apache.hadoop.hbase.snapshot.CorruptedSnapshotException: Regions moved 
during the snapshot '{ ss=snaptb1-1525844378682 
table=testOnlineSnapshotAfterSplittingRegions-1525844378682 type=FLUSH }'. 
expected=8 snapshotted=7.
        at 
org.apache.hadoop.hbase.master.snapshot.MasterSnapshotVerifier.verifyRegions(MasterSnapshotVerifier.java:205)
        at 
org.apache.hadoop.hbase.master.snapshot.MasterSnapshotVerifier.verifySnapshot(MasterSnapshotVerifier.java:119)
        at 
org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.process(TakeSnapshotHandler.java:202)
        at 
org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
{code}
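
To make the mismatch concrete, here is a minimal sketch that pulls the two counts out of the exception message above. The helper name `snapshot_counts` is hypothetical and not part of HBase; it only assumes the `expected=N snapshotted=M` wording shown in the log.

```ruby
# Hypothetical helper (not an HBase API): extract the region counts from a
# CorruptedSnapshotException message of the form "expected=N snapshotted=M".
def snapshot_counts(message)
  match = message.match(/expected=(\d+) snapshotted=(\d+)/)
  return nil unless match

  expected = match[1].to_i
  snapshotted = match[2].to_i
  # "missing" is the number of regions the verifier could not find.
  { expected: expected, snapshotted: snapshotted, missing: expected - snapshotted }
end

# Message text copied from the log above.
message = "Regions moved during the snapshot '{ ss=snaptb1-1525844378682 " \
          "table=testOnlineSnapshotAfterSplittingRegions-1525844378682 type=FLUSH }'. " \
          "expected=8 snapshotted=7."

counts = snapshot_counts(message)
puts counts.inspect
```

For this message the helper reports one missing region, which matches the single split parent region discussed next.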


It looks like snapshotting the split parent region 
(9d446fdf76181457d16ec2690b932729) failed for some reason.
{code}
2018-05-09 05:39:58,418 ERROR [MASTER_TABLE_OPERATIONS-master/dbf4832ee95b:0-0] 
snapshot.MasterSnapshotVerifier(197):  No snapshot region directory found for 
region:{ENCODED => 9d446fdf76181457d16ec2690b932729, NAME => 
'testOnlineSnapshotAfterSplittingRegions-1525844378682,,1525844378727.9d446fdf76181457d16ec2690b932729.',
 STARTKEY => '', ENDKEY => '1'}
{code}
The log above also shows that the RegionInfo of the split parent region had 
neither "OFFLINE => true" nor "SPLIT => true", meaning the offline flag and 
the split flag were not updated even though splitting the region completed.
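
As a rough illustration of that check, the sketch below looks for the two flags in the logged RegionInfo string. The helper name `region_flags` is hypothetical, and it assumes the flags would appear literally as "OFFLINE => true" and "SPLIT => true" in the string, as described above.

```ruby
# Hypothetical helper (not an HBase API): report whether a logged RegionInfo
# string carries the offline and split flags.
def region_flags(region_info)
  {
    offline: region_info.include?("OFFLINE => true"),
    split: region_info.include?("SPLIT => true")
  }
end

# RegionInfo string copied from the MasterSnapshotVerifier log line above.
logged = "{ENCODED => 9d446fdf76181457d16ec2690b932729, NAME => " \
         "'testOnlineSnapshotAfterSplittingRegions-1525844378682,,1525844378727." \
         "9d446fdf76181457d16ec2690b932729.', STARTKEY => '', ENDKEY => '1'}"

flags = region_flags(logged)
puts flags.inspect
```

Both flags come back false for the logged string, which is what suggests the parent region was never marked as offline and split.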

I think that's why snapshotting the split parent region failed. I haven't 
figured out the root cause of this test failure yet, but I don't think it's 
related to the patch in this Jira. I'll file this test failure in another 
Jira, and I think we can commit the patch. What do you think? 
[[email protected]]

Thanks.

> restore_snapshot is broken in master branch when region splits
> --------------------------------------------------------------
>
>                 Key: HBASE-19893
>                 URL: https://issues.apache.org/jira/browse/HBASE-19893
>             Project: HBase
>          Issue Type: Bug
>          Components: snapshots
>            Reporter: Toshihiro Suzuki
>            Assignee: Toshihiro Suzuki
>            Priority: Critical
>         Attachments: 19893.master.004.patch, 19893.master.004.patch, 
> 19893.master.004.patch, HBASE-19893.master.001.patch, 
> HBASE-19893.master.002.patch, HBASE-19893.master.003.patch, 
> HBASE-19893.master.003.patch, HBASE-19893.master.004.patch, 
> HBASE-19893.master.005.patch, 
> org.apache.hadoop.hbase.client.TestRestoreSnapshotFromClientWithRegionReplicas-output.txt
>
>
> When I was investigating HBASE-19850, I found restore_snapshot didn't work in 
> the master branch.
>  
> Steps to reproduce are as follows:
> 1. Create a table
> {code:java}
> create "test", "cf"
> {code}
> 2. Load data (2000 rows) to the table
> {code:java}
> (0...2000).each{|i| put "test", "row#{i}", "cf:col", "val"}
> {code}
> 3. Split the table
> {code:java}
> split "test"
> {code}
> 4. Take a snapshot
> {code:java}
> snapshot "test", "snap"
> {code}
> 5. Load more data (2000 rows) to the table and split the table again
> {code:java}
> (2000...4000).each{|i| put "test", "row#{i}", "cf:col", "val"}
> split "test"
> {code}
> 6. Restore the table from the snapshot 
> {code:java}
> disable "test"
> restore_snapshot "snap"
> enable "test"
> {code}
> 7. Scan the table
> {code:java}
> scan "test"
> {code}
> However, this scan returns only 244 rows (it should return 2000 rows) like 
> the following:
> {code:java}
> hbase(main):038:0> scan "test"
> ROW COLUMN+CELL
>  row78 column=cf:col, timestamp=1517298307049, value=val
> ....
>   row999 column=cf:col, timestamp=1517298307608, value=val
> 244 row(s)
> Took 0.1500 seconds
> {code}
>  
> Also, the restored table should have 2 online regions but it has 3 online 
> regions.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
