[
https://issues.apache.org/jira/browse/HBASE-13895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14584954#comment-14584954
]
Enis Soztutar commented on HBASE-13895:
---------------------------------------
The saga continues.
We have HBASE-13877 patch v1 running on our test rig, which surfaces another
problem with assignment and causes data loss. We are moving a region (unassign
and assign) off one of the region servers while it is aborting. [~stack] Maybe
this is the root cause of your ITBLL failure as well. It is not due to double
assignment, but to assignment happening before SSH finishes.
While the RS is aborting, the master is trying to close the region. The region
unassign incorrectly assumes that WAL replay is not needed, and thus leaves
data behind: the region is opened before the WAL split is finished (see the
sketch below).
I think this is an existing bug, not related to the HBASE-13877 and
HBASE-13605 patches, but with the v1 patch of HBASE-13877 it surfaces in CM
runs because we started aborting the region server if a snapshot request
aborts (note that the v4 patch handles snapshot interruption as well as
flush-table interruption, so it does not surface this problem since RSs do not
abort). CM never aborts a region server, so this existing condition was not
tested much. It is not easy to trigger even with the HBASE-13877 v1 patch,
since it needs a region move while the RS is aborting.
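To make the ordering problem concrete, here is a minimal sketch of the guard
that is effectively missing. All names are hypothetical; this is not the
actual AssignmentManager / ServerShutdownHandler code, just an illustration of
the invariant:
{code}
// Hypothetical names only: illustrates the ordering the master needs to
// enforce -- never re-assign a region off a dead or aborting server until
// that server's WALs have been split.
public class AssignGuardSketch {

  /** Hypothetical view of the master's dead-server bookkeeping. */
  interface DeadServerTracker {
    boolean isDeadOrAborting(String serverName);
    boolean isShutdownProcessed(String serverName); // has SSH finished WAL split?
  }

  static boolean safeToAssign(String regionEncodedName, String lastHost,
                              DeadServerTracker tracker) {
    if (tracker.isDeadOrAborting(lastHost)
        && !tracker.isShutdownProcessed(lastHost)) {
      // The old host's WAL may still hold unflushed edits for this region.
      // Assigning now opens the region with a stale seqId, and whatever WAL
      // split recovers later is silently dropped (as the logs below show).
      return false;
    }
    return true;
  }
}
{code}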
Here are the logs.
The region server reports aborting because of a DroppedSnapshotException
coming from HBASE-13877 v1 (which, unlike v4, does not fix snapshot flushes)
and thus correctly aborts the RS:
{code}
2015-06-13 03:43:41,706 ERROR [PriorityRpcServer.handler=15,queue=1,port=16000]
master.MasterRpcServices: Region server
ip-172-31-45-46.ec2.internal,16020,1434166589660 reported a fatal error:
ABORTING region server ip-172-31-45-46.ec2.internal,16020,1434166589660: Replay
of WAL required. Forcing server shutdown
Cause:
org.apache.hadoop.hbase.DroppedSnapshotException: region:
IntegrationTestLoadAndVerify,\xCA\x1A\xF2\x86\xBC\xA1\xAF\x0C,1434165321232.8667147ea6a13b3410715dbc90393642.
{code}
While the abort is going on, CM decides to move the region:
{code}
2015-06-13 03:43:44,186 INFO
[B.defaultRpcServer.handler=18,queue=0,port=16000] master.RegionStates:
Transition {fca3c6621486f4de764601ecdcfec665 state=OPEN, ts=1434166908561,
server=ip-172-31-45-46.ec2.internal,16020,1434166589660} to
{fca3c6621486f4de764601ecdcfec665 state=PENDING_CLOSE, ts=1434167024186,
server=ip-172-31-45-46.ec2.internal,16020,1434166589660}
2015-06-13 03:43:44,200 INFO
[B.defaultRpcServer.handler=18,queue=0,port=16000] master.RegionStates:
Transition {fca3c6621486f4de764601ecdcfec665 state=PENDING_CLOSE,
ts=1434167024186, server=ip-172-31-45-46.ec2.internal,16020,1434166589660} to
{fca3c6621486f4de764601ecdcfec665 state=OFFLINE, ts=1434167024200,
server=ip-172-31-45-46.ec2.internal,16020,1434166589660}
2015-06-13 03:43:44,208 INFO
[B.defaultRpcServer.handler=18,queue=0,port=16000] master.AssignmentManager:
Assigning
IntegrationTestLoadAndVerify,\x945\xE5\x0DyC^<,1434165321232.fca3c6621486f4de764601ecdcfec665.
to ip-172-31-45-43.ec2.internal,16020,1434166192348
2015-06-13 03:43:44,778 INFO [AM.ZK.Worker-pool2-t114] master.RegionStates:
Transition {fca3c6621486f4de764601ecdcfec665 state=OPENING, ts=1434167024312,
server=ip-172-31-45-43.ec2.internal,16020,1434166192348} to
{fca3c6621486f4de764601ecdcfec665 state=OPEN, ts=1434167024778,
server=ip-172-31-45-43.ec2.internal,16020,1434166192348}
{code}
As you can see, the region went from OPEN -> PENDING_CLOSE -> OFFLINE ->
PENDING_OPEN -> OPEN, but the region did not do a proper close and flush on
the previous RS, since that RS was aborting (due to the DSE from another
region).
The new RS opened the region with a smaller seqId (21848), although the
previous RS had written up to 22334:
{code}
2015-06-13 03:43:44,310 INFO [RS_OPEN_REGION-ip-172-31-45-43:16020-0]
regionserver.HRegion: Onlined fca3c6621486f4de764601ecdcfec665; next
sequenceid=21848
{code}
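Put differently, the sequence-id range the new RS now hands out overlaps what
was still sitting in the old RS's WAL. A back-of-the-envelope illustration
using the numbers from the logs (a rough count, assuming everything below the
open point had already been flushed):
{code}
// Numbers taken from the logs in this comment; the count is only a rough
// upper bound on the edits that can no longer be safely replayed.
public class SeqIdOverlapSketch {
  public static void main(String[] args) {
    long nextSeqIdAtOpen = 21848L;   // "next sequenceid" logged by the new RS
    long maxSeqIdInOldWal = 22334L;  // highest edit recovered from the old WAL

    // Every old edit with seqId in [nextSeqIdAtOpen, maxSeqIdInOldWal] now
    // collides with ids the new RS is assigning to fresh writes.
    long editsAtRisk = maxSeqIdInOldWal - nextSeqIdAtOpen + 1;
    System.out.println("Edits at risk of being lost: " + editsAtRisk);
  }
}
{code}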
The RS only got expired after the region had already been opened elsewhere
(notice it is about 1 second later):
{code}
2015-06-13 03:43:45,704 INFO [main-EventThread] zookeeper.RegionServerTracker:
RegionServer ephemeral node deleted, processing expiration
[ip-172-31-45-46.ec2.internal,16020,1434166589660]
{code}
We still do WAL splitting, but the recovered edits are effectively lost
because the region is now open elsewhere with a smaller seqId:
{code}
2015-06-13 03:44:03,742 INFO
[MASTER_SERVER_OPERATIONS-ip-172-31-45-40:16000-4] master.SplitLogManager:
finished splitting (more than or equal to) 429923436 bytes in 4 log files in
[hdfs://ip-172-31-45-41.ec2.internal:8020/apps/hbase/data/WALs/ip-172-31-45-46.ec2.internal,16020,1434166589660-splitting]
in 17967ms
2015-06-13 03:44:12,826 INFO
[MASTER_SERVER_OPERATIONS-ip-172-31-45-40:16000-4]
handler.ServerShutdownHandler: Finished processing of shutdown of
ip-172-31-45-46.ec2.internal,16020,1434166589660
..
2015-06-13 03:43:50,006 INFO [split-log-closeStream-3] wal.WALSplitter: Rename
hdfs://ip-172-31-45-41.ec2.internal:8020/apps/hbase/data/data/default/IntegrationTestLoadAndVerify/fca3c6621486f4de764601ecdcfec665/recovered.edits/0000000000000022138.temp
to
hdfs://ip-172-31-45-41.ec2.internal:8020/apps/hbase/data/data/default/IntegrationTestLoadAndVerify/fca3c6621486f4de764601ecdcfec665/recovered.edits/0000000000000022334
{code}
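A simple sanity check against these logs (a hypothetical helper, not an
existing HBase tool), assuming the recovered.edits file is named after the
highest sequence id it contains, as the rename above suggests:
{code}
// Hypothetical check: compare the max seqId recovered by WAL split (encoded
// in the recovered.edits file name above) with the seqId the region was
// opened at on the new RS.
public class RecoveredEditsCheckSketch {
  public static void main(String[] args) {
    String recoveredEditsName = "0000000000000022334"; // from the rename above
    long openSeqId = 21848L;                           // from the "Onlined" line

    long maxRecoveredSeqId = Long.parseLong(recoveredEditsName);
    if (maxRecoveredSeqId >= openSeqId) {
      System.out.println("Region opened at " + openSeqId
          + " but WAL split recovered edits up to " + maxRecoveredSeqId
          + ": edits above the open point will never be replayed.");
    }
  }
}
{code}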
> DATALOSS: Double-assignment ITBLL'ing
> -------------------------------------
>
> Key: HBASE-13895
> URL: https://issues.apache.org/jira/browse/HBASE-13895
> Project: HBase
> Issue Type: Bug
> Affects Versions: 1.2.0
> Reporter: stack
> Assignee: stack
> Priority: Critical
> Fix For: 1.2.0
>
>
> Opening a placeholder till I finish analysis.
> I have dataloss running ITBLL at 3B (testing HBASE-13877). The most obvious
> culprit that I can see is double-assignment.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)