[
https://issues.apache.org/jira/browse/HBASE-19893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16546111#comment-16546111
]
Toshihiro Suzuki commented on HBASE-19893:
------------------------------------------
I analyzed the test failure of TestRestoreSnapshotFromClientWithRegionReplicas:
https://issues.apache.org/jira/browse/HBASE-19893?focusedCommentId=16468496&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16468496
As the QA files were already deleted, I just uploaded the log of the failed
test
[^org.apache.hadoop.hbase.client.TestRestoreSnapshotFromClientWithRegionReplicas-output.txt].
The test failed because the snapshot validation failed. The expected number of
snapshotted regions was 8 but the actual number was 7.
{code}
2018-05-09 05:39:58,418 ERROR [MASTER_TABLE_OPERATIONS-master/dbf4832ee95b:0-0]
snapshot.TakeSnapshotHandler(215): Failed taking snapshot {
ss=snaptb1-1525844378682
table=testOnlineSnapshotAfterSplittingRegions-1525844378682 type=FLUSH } due to
exception:Regions moved during the snapshot '{ ss=snaptb1-1525844378682
table=testOnlineSnapshotAfterSplittingRegions-1525844378682 type=FLUSH }'.
expected=8 snapshotted=7.
org.apache.hadoop.hbase.snapshot.CorruptedSnapshotException: Regions moved
during the snapshot '{ ss=snaptb1-1525844378682
table=testOnlineSnapshotAfterSplittingRegions-1525844378682 type=FLUSH }'.
expected=8 snapshotted=7.
at org.apache.hadoop.hbase.master.snapshot.MasterSnapshotVerifier.verifyRegions(MasterSnapshotVerifier.java:205)
at org.apache.hadoop.hbase.master.snapshot.MasterSnapshotVerifier.verifySnapshot(MasterSnapshotVerifier.java:119)
at org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.process(TakeSnapshotHandler.java:202)
at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{code}
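To make the "expected=8 snapshotted=7" part of the message concrete, here is a simplified model of a region-count check. It is only a sketch and not the actual MasterSnapshotVerifier code: the error class, the method name and the region names are hypothetical placeholders, and the real verifier presumably checks more than the count.

```ruby
# Simplified model of a region-count check. NOT the HBase implementation;
# every name in this block is a hypothetical placeholder.
class CorruptedSnapshotError < StandardError; end

# Returns true when the counts match, otherwise raises with a message
# shaped like the one in the log above.
def verify_region_count(snapshot_desc, expected_regions, snapshotted_regions)
  expected = expected_regions.size
  actual = snapshotted_regions.size
  return true if expected == actual

  raise CorruptedSnapshotError,
        "Regions moved during the snapshot '#{snapshot_desc}'. " \
        "expected=#{expected} snapshotted=#{actual}."
end

expected = (1..8).map { |i| "region-#{i}" } # placeholder region names
snapshotted = expected - ["region-1"]       # one region missing from the snapshot

begin
  verify_region_count("{ ss=snaptb1 type=FLUSH }", expected, snapshotted)
rescue CorruptedSnapshotError => e
  puts e.message
end
```

In this model a single region missing from the snapshot is enough to fail the whole verification, which matches the 8 versus 7 mismatch reported above.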
It looks like snapshotting the split parent region
(9d446fdf76181457d16ec2690b932729) failed for some reason.
{code}
2018-05-09 05:39:58,418 ERROR [MASTER_TABLE_OPERATIONS-master/dbf4832ee95b:0-0]
snapshot.MasterSnapshotVerifier(197): No snapshot region directory found for
region:{ENCODED => 9d446fdf76181457d16ec2690b932729, NAME =>
'testOnlineSnapshotAfterSplittingRegions-1525844378682,,1525844378727.9d446fdf76181457d16ec2690b932729.',
STARTKEY => '', ENDKEY => '1'}
{code}
The log above also shows that the RegionInfo of the split parent region has
neither "OFFLINE => true" nor "SPLIT => true", meaning that the offline and
split flags were not updated even though the split itself completed.
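The flag check can be reproduced on the logged text itself. The sketch below is plain Ruby string matching, not an HBase API; the helper name is hypothetical, and it assumes that a correctly updated split parent would print both flags in its RegionInfo string.

```ruby
# Region description copied from the log above.
logged = "{ENCODED => 9d446fdf76181457d16ec2690b932729, NAME => " \
         "'testOnlineSnapshotAfterSplittingRegions-1525844378682,,1525844378727.9d446fdf76181457d16ec2690b932729.', " \
         "STARTKEY => '', ENDKEY => '1'}"

# Hypothetical helper: true only if both flags appear in the printed RegionInfo.
def marked_as_split_parent?(region_text)
  region_text.include?("OFFLINE => true") && region_text.include?("SPLIT => true")
end

puts marked_as_split_parent?(logged) # prints false
```

Neither flag appears in the logged string, so the check is false for this region.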
I think that's why snapshotting the split parent region failed. I haven't
figured out the root cause of this test failure yet, but I don't think it's
related to the patch in this Jira. I'll file a separate Jira for the test
failure, and I think we can commit the patch. What do you think?
[[email protected]]
Thanks.
> restore_snapshot is broken in master branch when region splits
> --------------------------------------------------------------
>
> Key: HBASE-19893
> URL: https://issues.apache.org/jira/browse/HBASE-19893
> Project: HBase
> Issue Type: Bug
> Components: snapshots
> Reporter: Toshihiro Suzuki
> Assignee: Toshihiro Suzuki
> Priority: Critical
> Attachments: 19893.master.004.patch, 19893.master.004.patch,
> 19893.master.004.patch, HBASE-19893.master.001.patch,
> HBASE-19893.master.002.patch, HBASE-19893.master.003.patch,
> HBASE-19893.master.003.patch, HBASE-19893.master.004.patch,
> HBASE-19893.master.005.patch,
> org.apache.hadoop.hbase.client.TestRestoreSnapshotFromClientWithRegionReplicas-output.txt
>
>
> When I was investigating HBASE-19850, I found restore_snapshot didn't work in
> master branch.
>
> Steps to reproduce are as follows:
> 1. Create a table
> {code:java}
> create "test", "cf"
> {code}
> 2. Load data (2000 rows) to the table
> {code:java}
> (0...2000).each{|i| put "test", "row#{i}", "cf:col", "val"}
> {code}
> 3. Split the table
> {code:java}
> split "test"
> {code}
> 4. Take a snapshot
> {code:java}
> snapshot "test", "snap"
> {code}
> 5. Load more data (2000 rows) to the table and split the table again
> {code:java}
> (2000...4000).each{|i| put "test", "row#{i}", "cf:col", "val"}
> split "test"
> {code}
> 6. Restore the table from the snapshot
> {code:java}
> disable "test"
> restore_snapshot "snap"
> enable "test"
> {code}
> 7. Scan the table
> {code:java}
> scan "test"
> {code}
> However, this scan returns only 244 rows (it should return 2000 rows) like
> the following:
> {code:java}
> hbase(main):038:0> scan "test"
> ROW COLUMN+CELL
> row78 column=cf:col, timestamp=1517298307049, value=val
> ....
> row999 column=cf:col, timestamp=1517298307608, value=val
> 244 row(s)
> Took 0.1500 seconds
> {code}
>
> Also, the restored table should have 2 online regions but it has 3 online
> regions.
>
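
An observation that is not part of the original report: row keys sort lexicographically, and exactly 244 of the 2000 keys loaded in step 2 sort at or after "row78". The sketch below only checks that arithmetic in plain Ruby. Reading "row78" as a region boundary is an assumption, since the report does not state the split points.

```ruby
# Keys loaded in step 2 of the reproduction: row0 .. row1999.
keys = (0...2000).map { |i| "row#{i}" }

# Keys that sort at or after "row78" in byte-wise string order,
# e.g. "row8" and "row780" qualify, "row1999" does not.
tail = keys.select { |k| k >= "row78" }.sort

puts tail.size  # 244
puts tail.first # row78
puts tail.last  # row999
```

The count and both endpoints match the scan output above, which would be consistent with the scan returning only the rows from "row78" onward.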
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)