[
https://issues.apache.org/jira/browse/HBASE-13895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14584954#comment-14584954
]
Enis Soztutar commented on HBASE-13895:
---------------------------------------
The saga continues.
We have HBASE-13877 patch v1 running on our test rig, which surfaces another
problem with assignment and causes data loss. We are moving a region (unassign
and assign) off one of the region servers while it is aborting. [~stack] Maybe
this is the root cause of your ITBLL failure as well. It is not due to double
assignment, but to assignment happening before SSH finishes.
While the RS is aborting, the master is trying to close the region. The region
unassign incorrectly assumes that WAL replay is not needed, and thus leaves
data behind: the region is opened before the WAL split is finished (see the
sketch below).
I think this is an existing bug, not related to the HBASE-13877 and
HBASE-13605 patches, but with the v1 patch of HBASE-13877 it surfaces in CM
runs because we started aborting the region server if a snapshot request
aborts (note that the v4 patch handles snapshot interruption as well as
flush-table interruption, so it does not surface this problem since RSs do not
abort). CM never aborts a region server, so this existing condition was not
tested much. It is not easy to trigger even with the HBASE-13877 v1 patch,
since it needs a region move while the RS is aborting.
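To make the ordering problem concrete, here is a minimal sketch of the guard
that is effectively missing. All names are hypothetical; this is not the
actual AssignmentManager / ServerShutdownHandler code, just an illustration of
the invariant:
{code}
// Hypothetical names only: illustrates the ordering the master needs to
// enforce -- never re-assign a region off a dead or aborting server until
// that server's WALs have been split.
public class AssignGuardSketch {

  /** Hypothetical view of the master's dead-server bookkeeping. */
  interface DeadServerTracker {
    boolean isDeadOrAborting(String serverName);
    boolean isShutdownProcessed(String serverName); // has SSH finished WAL split?
  }

  static boolean safeToAssign(String regionEncodedName, String lastHost,
                              DeadServerTracker tracker) {
    if (tracker.isDeadOrAborting(lastHost)
        && !tracker.isShutdownProcessed(lastHost)) {
      // The old host's WAL may still hold unflushed edits for this region.
      // Assigning now opens the region with a stale seqId, and whatever WAL
      // split recovers later is silently dropped (as the logs below show).
      return false;
    }
    return true;
  }
}
{code}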
Here are the logs.
The region server reports aborting because of a DroppedSnapshotException
coming from HBASE-13877 v1 (which, unlike v4, does not fix snapshot flushes)
and thus correctly aborts the RS:
{code}
2015-06-13 03:43:41,706 ERROR [PriorityRpcServer.handler=15,queue=1,port=16000]
master.MasterRpcServices: Region server
ip-172-31-45-46.ec2.internal,16020,1434166589660 reported a fatal error:
ABORTING region server ip-172-31-45-46.ec2.internal,16020,1434166589660: Replay
of WAL required. Forcing server shutdown
Cause:
org.apache.hadoop.hbase.DroppedSnapshotException: region:
IntegrationTestLoadAndVerify,\xCA\x1A\xF2\x86\xBC\xA1\xAF\x0C,1434165321232.8667147ea6a13b3410715dbc90393642.
{code}
While the abort is going on, CM decides to move the region:
{code}
2015-06-13 03:43:44,186 INFO
[B.defaultRpcServer.handler=18,queue=0,port=16000] master.RegionStates:
Transition {fca3c6621486f4de764601ecdcfec665 state=OPEN, ts=1434166908561,
server=ip-172-31-45-46.ec2.internal,16020,1434166589660} to
{fca3c6621486f4de764601ecdcfec665 state=PENDING_CLOSE, ts=1434167024186,
server=ip-172-31-45-46.ec2.internal,16020,1434166589660}
2015-06-13 03:43:44,200 INFO
[B.defaultRpcServer.handler=18,queue=0,port=16000] master.RegionStates:
Transition {fca3c6621486f4de764601ecdcfec665 state=PENDING_CLOSE,
ts=1434167024186, server=ip-172-31-45-46.ec2.internal,16020,1434166589660} to
{fca3c6621486f4de764601ecdcfec665 state=OFFLINE, ts=1434167024200,
server=ip-172-31-45-46.ec2.internal,16020,1434166589660}
2015-06-13 03:43:44,208 INFO
[B.defaultRpcServer.handler=18,queue=0,port=16000] master.AssignmentManager:
Assigning
IntegrationTestLoadAndVerify,\x945\xE5\x0DyC^<,1434165321232.fca3c6621486f4de764601ecdcfec665.
to ip-172-31-45-43.ec2.internal,16020,1434166192348
2015-06-13 03:43:44,778 INFO [AM.ZK.Worker-pool2-t114] master.RegionStates:
Transition {fca3c6621486f4de764601ecdcfec665 state=OPENING, ts=1434167024312,
server=ip-172-31-45-43.ec2.internal,16020,1434166192348} to
{fca3c6621486f4de764601ecdcfec665 state=OPEN, ts=1434167024778,
server=ip-172-31-45-43.ec2.internal,16020,1434166192348}
{code}
As you can see, the region went from OPEN -> PENDING_CLOSE -> OFFLINE ->
PENDING_OPEN -> OPEN, but the region did not do a proper close and flush on
the previous RS, since that RS was aborting (due to the DSE from another
region).
The new RS opened the region with a smaller seqId (21848), although the
previous RS had written up to 22334:
{code}
2015-06-13 03:43:44,310 INFO [RS_OPEN_REGION-ip-172-31-45-43:16020-0]
regionserver.HRegion: Onlined fca3c6621486f4de764601ecdcfec665; next
sequenceid=21848
{code}
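Put differently, the sequence-id range the new RS now hands out overlaps what
was still sitting in the old RS's WAL. A back-of-the-envelope illustration
using the numbers from the logs (a rough count, assuming everything below the
open point had already been flushed):
{code}
// Numbers taken from the logs in this comment; the count is only a rough
// upper bound on the edits that can no longer be safely replayed.
public class SeqIdOverlapSketch {
  public static void main(String[] args) {
    long nextSeqIdAtOpen = 21848L;   // "next sequenceid" logged by the new RS
    long maxSeqIdInOldWal = 22334L;  // highest edit recovered from the old WAL

    // Every old edit with seqId in [nextSeqIdAtOpen, maxSeqIdInOldWal] now
    // collides with ids the new RS is assigning to fresh writes.
    long editsAtRisk = maxSeqIdInOldWal - nextSeqIdAtOpen + 1;
    System.out.println("Edits at risk of being lost: " + editsAtRisk);
  }
}
{code}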
The RS only got expired after the region had already been opened elsewhere
(notice it is about 1 second later):
{code}
2015-06-13 03:43:45,704 INFO [main-EventThread] zookeeper.RegionServerTracker:
RegionServer ephemeral node deleted, processing expiration
[ip-172-31-45-46.ec2.internal,16020,1434166589660]
{code}
We still do WAL splitting, but the recovered edits are effectively lost
because the region is now open elsewhere with a smaller seqId:
{code}
2015-06-13 03:44:03,742 INFO
[MASTER_SERVER_OPERATIONS-ip-172-31-45-40:16000-4] master.SplitLogManager:
finished splitting (more than or equal to) 429923436 bytes in 4 log files in
[hdfs://ip-172-31-45-41.ec2.internal:8020/apps/hbase/data/WALs/ip-172-31-45-46.ec2.internal,16020,1434166589660-splitting]
in 17967ms
2015-06-13 03:44:12,826 INFO
[MASTER_SERVER_OPERATIONS-ip-172-31-45-40:16000-4]
handler.ServerShutdownHandler: Finished processing of shutdown of
ip-172-31-45-46.ec2.internal,16020,1434166589660
..
2015-06-13 03:43:50,006 INFO [split-log-closeStream-3] wal.WALSplitter: Rename
hdfs://ip-172-31-45-41.ec2.internal:8020/apps/hbase/data/data/default/IntegrationTestLoadAndVerify/fca3c6621486f4de764601ecdcfec665/recovered.edits/0000000000000022138.temp
to
hdfs://ip-172-31-45-41.ec2.internal:8020/apps/hbase/data/data/default/IntegrationTestLoadAndVerify/fca3c6621486f4de764601ecdcfec665/recovered.edits/0000000000000022334
{code}
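A simple sanity check against these logs (a hypothetical helper, not an
existing HBase tool), assuming the recovered.edits file is named after the
highest sequence id it contains, as the rename above suggests:
{code}
// Hypothetical check: compare the max seqId recovered by WAL split (encoded
// in the recovered.edits file name above) with the seqId the region was
// opened at on the new RS.
public class RecoveredEditsCheckSketch {
  public static void main(String[] args) {
    String recoveredEditsName = "0000000000000022334"; // from the rename above
    long openSeqId = 21848L;                           // from the "Onlined" line

    long maxRecoveredSeqId = Long.parseLong(recoveredEditsName);
    if (maxRecoveredSeqId >= openSeqId) {
      System.out.println("Region opened at " + openSeqId
          + " but WAL split recovered edits up to " + maxRecoveredSeqId
          + ": edits above the open point will never be replayed.");
    }
  }
}
{code}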
> DATALOSS: Double-assignment ITBLL'ing
> -------------------------------------
>
> Key: HBASE-13895
> URL: https://issues.apache.org/jira/browse/HBASE-13895
> Project: HBase
> Issue Type: Bug
> Affects Versions: 1.2.0
> Reporter: stack
> Assignee: stack
> Priority: Critical
> Fix For: 1.2.0
>
>
> Opening a placeholder till I finish analysis.
> I have dataloss running ITBLL at 3B (testing HBASE-13877). The most obvious
> culprit that I can see is double-assignment.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)