Ankit Singhal created HBASE-22105:
-------------------------------------

             Summary: Snapshot of the table fails when region is transitioning 
                 Key: HBASE-22105
                 URL: https://issues.apache.org/jira/browse/HBASE-22105
             Project: HBase
          Issue Type: Bug
          Components: snapshots
    Affects Versions: 2.1
            Reporter: Ankit Singhal
            Assignee: Ankit Singhal


Region is moving(MoveRegionProcedure)
Unassign Procedure completed on source node:-
{code:java}
2019-03-18 06:56:12,027 INFO org.apache.hadoop.hbase.master.HMaster: balance 
hri=819809c43ee3f276be024a9e0a1bcce5, source=<ip>,22101,1552916141124, 
destination=<ip>,22101,1552916140986
2019-03-18 06:56:14,334 INFO 
org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Took xlock 
for pid=44, state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure 
hri=819809c43ee3f276be024a9e0a1bcce5, source=<ip>,22101,1552916141124, 
destination=<ip>,22101,1552916140986
2019-03-18 06:56:14,338 INFO 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Initialized 
subprocedures=[{pid=51, ppid=44, state=RUNNABLE:REGION_TRANSITION_DISPATCH; 
UnassignProcedure table=t11, region=819809c43ee3f276be024a9e0a1bcce5, 
override=true, server=<ip>,22101,1552916141124}]
2019-03-18 06:56:14,344 INFO 
org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Took xlock 
for pid=51, ppid=44, state=RUNNABLE:REGION_TRANSITION_DISPATCH; 
UnassignProcedure table=t11, region=819809c43ee3f276be024a9e0a1bcce5, 
override=true, server=<ip>,22101,1552916141124
2019-03-18 06:56:14,349 INFO 
org.apache.hadoop.hbase.master.assignment.RegionStateStore: pid=51 updating 
hbase:meta row=819809c43ee3f276be024a9e0a1bcce5, regionState=CLOSING, 
regionLocation=<ip>,22101,1552916141124
2019-03-18 06:56:14,354 INFO 
org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure: Dispatch 
pid=51, ppid=44, state=RUNNABLE:REGION_TRANSITION_DISPATCH, locked=true; 
UnassignProcedure table=t11, region=819809c43ee3f276be024a9e0a1bcce5, 
override=true, server=<ip>,22101,1552916141124
2019-03-18 06:56:14,529 INFO 
org.apache.hadoop.hbase.master.assignment.RegionStateStore: pid=51 updating 
hbase:meta row=819809c43ee3f276be024a9e0a1bcce5, regionState=CLOSED
{code}
Assign procedure on destination started:-
{code:java}
2019-03-18 06:56:14,545 INFO 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished subprocedure 
pid=51, resume processing parent pid=44, state=RUNNABLE:MOVE_REGION_ASSIGN, 
locked=true; MoveRegionProcedure hri=819809c43ee3f276be024a9e0a1bcce5, 
source=<ip>,22101,1552916141124, destination=<ip>,22101,1552916140986
2019-03-18 06:56:14,545 INFO 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Initialized 
subprocedures=[{pid=53, ppid=44, state=RUNNABLE:REGION_TRANSITION_QUEUE; 
AssignProcedure table=t11, region=819809c43ee3f276be024a9e0a1bcce5, 
target=<ip>,22101,1552916140986}]
2019-03-18 06:56:14,545 INFO 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=51, ppid=44, 
state=SUCCESS; UnassignProcedure table=t11, 
region=819809c43ee3f276be024a9e0a1bcce5, override=true, 
server=<ip>,22101,1552916141124 in 195msec, unfinishedSiblingCount=1
2019-03-18 06:56:14,550 INFO 
org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Took xlock 
for pid=53, ppid=44, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=t11, region=819809c43ee3f276be024a9e0a1bcce5, 
target=<ip>,22101,1552916140986
2019-03-18 06:56:14,555 INFO 
org.apache.hadoop.hbase.master.assignment.AssignProcedure: Starting pid=53, 
ppid=44, state=RUNNABLE:REGION_TRANSITION_QUEUE, locked=true; AssignProcedure 
table=t11, region=819809c43ee3f276be024a9e0a1bcce5, 
target=<ip>,22101,1552916140986; rit=OFFLINE, 
location=<ip>,22101,1552916140986; forceNewPlan=false, retain=false
2019-03-18 06:56:14,706 INFO 
org.apache.hadoop.hbase.master.assignment.RegionStateStore: pid=53 updating 
hbase:meta row=819809c43ee3f276be024a9e0a1bcce5, regionState=OPENING, 
regionLocation=<ip>,22101,1552916140986
{code}
Before Assign completes, snapshot procedure started and failed:-
{code:java}
2019-03-18 06:56:14,788 INFO org.apache.hadoop.hbase.procedure.Procedure: 
Starting procedure 
'cm-auto-6d40d404-5b14-4bd5-be8b-e67b374d6c22_HOURLY_2019-03-18_06-56-08_t11'
2019-03-18 06:56:14,830 INFO org.apache.hadoop.hbase.procedure.Procedure: 
Procedure 
'cm-auto-6d40d404-5b14-4bd5-be8b-e67b374d6c22_HOURLY_2019-03-18_06-56-08_t11' 
execution completed
2019-03-18 06:56:14,830 INFO org.apache.hadoop.hbase.procedure.ZKProcedureUtil: 
Clearing all znodes for procedure 
cm-auto-6d40d404-5b14-4bd5-be8b-e67b374d6c22_HOURLY_2019-03-18_06-56-08_t11including
 nodes /hbase/online-snapshot/acquired /hbase/online-snapshot/reached 
/hbase/online-snapshot/abort
2019-03-18 06:56:14,849 INFO 
org.apache.hadoop.hbase.master.snapshot.EnabledTableSnapshotHandler: Done 
waiting - online snapshot for 
cm-auto-6d40d404-5b14-4bd5-be8b-e67b374d6c22_HOURLY_2019-03-18_06-56-08_t11
2019-03-18 06:56:14,962 ERROR 
org.apache.hadoop.hbase.master.snapshot.MasterSnapshotVerifier: Snapshot { 
ss=cm-auto-6d40d404-5b14-4bd5-be8b-e67b374d6c22_HOURLY_2019-03-18_06-56-08_t11 
table=t11 type=FLUSH } looks empty
2019-03-18 06:56:14,963 ERROR 
org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler: Failed taking 
snapshot { 
ss=cm-auto-6d40d404-5b14-4bd5-be8b-e67b374d6c22_HOURLY_2019-03-18_06-56-08_t11 
table=t11 type=FLUSH } due to exception:Snapshot { 
ss=cm-auto-6d40d404-5b14-4bd5-be8b-e67b374d6c22_HOURLY_2019-03-18_06-56-08_t11 
table=t11 type=FLUSH } looks empty
org.apache.hadoop.hbase.snapshot.CorruptedSnapshotException: Snapshot { 
ss=cm-auto-6d40d404-5b14-4bd5-be8b-e67b374d6c22_HOURLY_2019-03-18_06-56-08_t11 
table=t11 type=FLUSH } looks empty
        at 
org.apache.hadoop.hbase.master.snapshot.MasterSnapshotVerifier.verifyRegions(MasterSnapshotVerifier.java:172)
        at 
org.apache.hadoop.hbase.master.snapshot.MasterSnapshotVerifier.verifySnapshot(MasterSnapshotVerifier.java:119)
        at 
org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.process(TakeSnapshotHandler.java:209)
        at 
org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2019-03-18 06:56:14,973 INFO 
org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler: Stop taking 
snapshot={ 
ss=cm-auto-6d40d404-5b14-4bd5-be8b-e67b374d6c22_HOURLY_2019-03-18_06-56-08_t11 
table=t11 type=FLUSH } because: Failed taking snapshot { 
ss=cm-auto-6d40d404-5b14-4bd5-be8b-e67b374d6c22_HOURLY_2019-03-18_06-56-08_t11 
table=t11 type=FLUSH } due to exception:Snapshot { 
ss=cm-auto-6d40d404-5b14-4bd5-be8b-e67b374d6c22_HOURLY_2019-03-18_06-56-08_t11 
table=t11 type=FLUSH } looks empty

{code}
Assign just completes after it:-
{code:java}
2019-03-18 06:56:14,981 INFO 
org.apache.hadoop.hbase.master.assignment.RegionStateStore: pid=53 updating 
hbase:meta row=819809c43ee3f276be024a9e0a1bcce5, regionState=OPEN, 
openSeqNum=5, regionLocation=<ip>,22101,1552916140986
2019-03-18 06:56:15,094 INFO 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=50, 
state=SUCCESS; org.apache.hadoop.hbase.master.locking.LockProcedure, 
tableName=t11, type=SHARED in 758msec
{code}
Reproducible steps:-
{code:java}
 

hbase(main):003:0> create 't1','f1'
Created table t1

hbase(main):029:0> get 
'hbase:meta','t1,,1553548584756.a53f0cf0e3824955618974942463efd3.','info:state'
COLUMN CELL
 info:state timestamp=1553548585347, value=OPEN

hbase(main):006:0> unassign 
't1,,1553548584756.a53f0cf0e3824955618974942463efd3.'

hbase(main):029:0> get 
'hbase:meta','t1,,1553548584756.a53f0cf0e3824955618974942463efd3.','info:state'
COLUMN CELL
 info:state timestamp=1553548638199, value=CLOSED

hbase(main):031:0> put 
'hbase:meta','t1,,1553548584756.a53f0cf0e3824955618974942463efd3.','info:state','OPENING'
Took 0.0830 seconds

hbase(main):032:0> get 
'hbase:meta','t1,,1553548584756.a53f0cf0e3824955618974942463efd3.','info:state'
COLUMN CELL
 info:state timestamp=1553548863451, value=OPENING


hbase(main):033:0> snapshot 't1','t1_a'

ERROR: org.apache.hadoop.hbase.snapshot.HBaseSnapshotException: Snapshot \{ 
ss=t1_a table=t1 type=FLUSH } had an error. t1_a not found in proclist []
 at 
org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:380)
 at 
org.apache.hadoop.hbase.master.MasterRpcServices.isSnapshotDone(MasterRpcServices.java:1136)
 at 
org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java)
 at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
 at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:131)
 at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
 at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
Caused by: org.apache.hadoop.hbase.snapshot.CorruptedSnapshotException via 
Failed taking snapshot \{ ss=t1_a table=t1 type=FLUSH } due to 
exception:Snapshot \{ ss=t1_a table=t1 type=FLUSH } looks 
empty:org.apache.hadoop.hbase.snapshot.CorruptedSnapshotException: Snapshot \{ 
ss=t1_a table=t1 type=FLUSH } looks empty
 at 
org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:82)
 at 
org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.rethrowExceptionIfFailed(TakeSnapshotHandler.java:306)
 at 
org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:369)
 ... 6 more
Caused by: org.apache.hadoop.hbase.snapshot.CorruptedSnapshotException: 
Snapshot \{ ss=t1_a table=t1 type=FLUSH } looks empty
 at 
org.apache.hadoop.hbase.master.snapshot.MasterSnapshotVerifier.verifyRegions(MasterSnapshotVerifier.java:172)
 at 
org.apache.hadoop.hbase.master.snapshot.MasterSnapshotVerifier.verifySnapshot(MasterSnapshotVerifier.java:119)
 at 
org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.process(TakeSnapshotHandler.java:202)
 at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
{code}
 

We have some handling for OFFLINE region, probably we can do the same for other 
states like CLOSED/OPENING.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to