[
https://issues.apache.org/jira/browse/HBASE-9338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13767217#comment-13767217
]
Enis Soztutar commented on HBASE-9338:
--------------------------------------
We have confirmed that the data loss is due to the log splitting + moving
region problem:
CM kills an RS (hor8n10) and then moves a region
(d353d3dc87562957ef5c74559ccb48ca)
{code}
13/09/13 03:39:54 INFO actions.Action: Performing action: Restart random region
server
13/09/13 03:39:54 INFO actions.Action: Killing region
server:hor8n10,60020,1379043329928
13/09/13 03:39:56 INFO actions.Action: Killed region
server:hor8n10,60020,1379043329928. Reported num of rs:8
13/09/13 03:39:56 INFO actions.Action: Sleeping for:60000
13/09/13 03:40:05 INFO actions.Action: Performing action: Move random region of
table IntegrationTestBigLinkedList
{code}
The master schedules log splitting and carries it on:
{code}
2013-09-13 03:39:57,718 INFO [MASTER_SERVER_OPERATIONS-hor8n02:60000-1]
master.SplitLogManager: started splitting 6 logs in
[hdfs://hor8n01:8020/apps/hbase/data/WALs/hor8n10,60020,1379043329928-splitting]
{code}
In the mean time, CM action for move region kicks in and causes to unassign the
region, and assign it to hor8n01
{code}
2013-09-13 03:40:06,030 DEBUG [RpcServer.handler=44,port=60000] master.HMaster:
Client=hrt_qa//68.142.245.212 unassign
IntegrationTestBigLinkedList,&\x07\x86\xC5;\xE6\xD9\x18&\xCB\x0B\x89\xF9\xD3\xE6\xC4,1379043123821.d353d3dc87562957ef5c74559ccb48ca.
in current location if it is online and reassign.force=false
2013-09-13 03:40:06,030 DEBUG [RpcServer.handler=44,port=60000]
master.AssignmentManager: Starting unassign of
IntegrationTestBigLinkedList,&\x07\x86\xC5;\xE6\xD9\x18&\xCB\x0B\x89\xF9\xD3\xE6\xC4,1379043123821.d353d3dc87562957ef5c74559ccb48ca.
(offlining)
2013-09-13 03:40:06,030 DEBUG [RpcServer.handler=44,port=60000]
zookeeper.ZKAssign: master:60000-0x140e5bf86760799 Creating unassigned node
d353d3dc87562957ef5c74559ccb48ca in a CLOSING state
2013-09-13 03:40:06,044 INFO [RpcServer.handler=44,port=60000]
master.RegionStates: Transitioned from {d353d3dc87562957ef5c74559ccb48ca
state=OPEN, ts=1379043386591, server=hor8n10,60020,1379043329928} to
{d353d3dc87562957ef5c74559ccb48ca state=PENDING_CLOSE, ts=1379043606044,
server=hor8n10,60020,1379043329928}
2013-09-13 03:40:06,044 DEBUG [RpcServer.handler=44,port=60000]
zookeeper.ZKAssign: master:60000-0x140e5bf86760799 Deleting existing unassigned
node d353d3dc87562957ef5c74559ccb48ca in expected state M_ZK_REGION_CLOSING
2013-09-13 03:40:06,063 DEBUG [RpcServer.handler=44,port=60000]
zookeeper.ZKAssign: master:60000-0x140e5bf86760799 Deleted unassigned node
d353d3dc87562957ef5c74559ccb48ca in expected state M_ZK_REGION_CLOSING
2013-09-13 03:40:06,064 DEBUG [RpcServer.handler=44,port=60000]
master.RegionStates: Offline d353d3dc87562957ef5c74559ccb48ca with current
state=PENDING_CLOSE, expected state=OFFLINE/SPLITTING/MERGING
2013-09-13 03:40:06,064 INFO [RpcServer.handler=44,port=60000]
master.RegionStates: Transitioned from {d353d3dc87562957ef5c74559ccb48ca
state=PENDING_CLOSE, ts=1379043606044, server=hor8n10,60020,1379043329928} to
{d353d3dc87562957ef5c74559ccb48ca state=OFFLINE, ts=1379043606064, server=null}
2013-09-13 03:40:06,064 INFO [RpcServer.handler=44,port=60000]
master.RegionStates: Offlined d353d3dc87562957ef5c74559ccb48ca from
hor8n10,60020,1379043329928
2013-09-13 03:40:06,064 DEBUG [RpcServer.handler=44,port=60000] master.HMaster:
Region
IntegrationTestBigLinkedList,&\x07\x86\xC5;\xE6\xD9\x18&\xCB\x0B\x89\xF9\xD3\xE6\xC4,1379043123821.d353d3dc87562957ef5c74559ccb48ca.
is not online on any region server, reassigning it.
2013-09-13 03:40:06,064 DEBUG [RpcServer.handler=44,port=60000]
master.AssignmentManager: No previous transition plan was found (or we are
ignoring an existing plan) for
IntegrationTestBigLinkedList,&\x07\x86\xC5;\xE6\xD9\x18&\xCB\x0B\x89\xF9\xD3\xE6\xC4,1379043123821.d353d3dc87562957ef5c74559ccb48ca.
so generated a random one;
hri=IntegrationTestBigLinkedList,&\x07\x86\xC5;\xE6\xD9\x18&\xCB\x0B\x89\xF9\xD3\xE6\xC4,1379043123821.d353d3dc87562957ef5c74559ccb48ca.,
src=, dest=hor8n01,60020,1379043340943; 7 (online=7, available=7) available
servers, forceNewPlan=false
2013-09-13 03:40:06,064 DEBUG [RpcServer.handler=44,port=60000]
zookeeper.ZKAssign: master:60000-0x140e5bf86760799 Creating (or updating)
unassigned node d353d3dc87562957ef5c74559ccb48ca with OFFLINE state
2013-09-13 03:40:06,089 INFO [RpcServer.handler=44,port=60000]
master.AssignmentManager: Assigning
IntegrationTestBigLinkedList,&\x07\x86\xC5;\xE6\xD9\x18&\xCB\x0B\x89\xF9\xD3\xE6\xC4,1379043123821.d353d3dc87562957ef5c74559ccb48ca.
to hor8n01,60020,1379043340943
2013-09-13 03:40:06,089 INFO [RpcServer.handler=44,port=60000]
master.RegionStates: Transitioned from {d353d3dc87562957ef5c74559ccb48ca
state=OFFLINE, ts=1379043606064, server=null} to
{d353d3dc87562957ef5c74559ccb48ca state=PENDING_OPEN, ts=1379043606089,
server=hor8n01,60020,1379043340943}
2013-09-13 03:40:06,121 DEBUG [AM.ZK.Worker-pool2-t71]
master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING,
server=hor8n01,60020,1379043340943, region=d353d3dc87562957ef5c74559ccb48ca,
current state from region state map ={d353d3dc87562957ef5c74559ccb48ca
state=PENDING_OPEN, ts=1379043606089, server=hor8n01,60020,1379043340943}
2013-09-13 03:40:06,121 INFO [AM.ZK.Worker-pool2-t71] master.RegionStates:
Transitioned from {d353d3dc87562957ef5c74559ccb48ca state=PENDING_OPEN,
ts=1379043606089, server=hor8n01,60020,1379043340943} to
{d353d3dc87562957ef5c74559ccb48ca state=OPENING, ts=1379043606121,
server=hor8n01,60020,1379043340943}
{code}
However, you can see that that at this time, the log splitting is not finished
yet:
{code}
2013-09-13 03:40:08,022 INFO
[hor8n02,60000,1379042853257.splitLogManagerTimeoutMonitor]
master.SplitLogManager: total tasks = 2 unassigned = 0
tasks={/hbase/splitWAL/WALs%2Fhor8n10%2C60020%2C1379043329928-splitting%2Fhor8n10%252C60020%252C1379043329928.1379043472554=last_update
= 1379043597820 last_version = 2 cur_worker_name = hor8n08,60020,1379043383930
status = in_progress incarnation = 0 resubmits = 0 batch = installed = 6 done =
4 error = 0,
/hbase/splitWAL/WALs%2Fhor8n10%2C60020%2C1379043329928-splitting%2Fhor8n10%252C60020%252C1379043329928.1379043582933=last_update
= 1379043597824 last_version = 2 cur_worker_name = hor8n02,60020,1379043313923
status = in_progress incarnation = 0 resubmits = 0 batch = installed = 6 done =
4 error = 0}
{code}
And indeed, you can see the log splits finishing up after the region is opened:
{code}
./hbase-hbase-regionserver-hor8n01:2013-09-13 03:40:06,786 DEBUG
[RS_OPEN_REGION-hor8n01:60020-1] regionserver.HRegion: Applied 47550, skipped
0, firstSequenceidInLog=30260132, maxSequenceidInLog=30335308,
path=hdfs://hor8n01:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/d353d3dc87562957ef5c74559ccb48ca/recovered.edits/0000000000030335308
./hbase-hbase-regionserver-hor8n08:2013-09-13 03:40:07,899 DEBUG
[split-log-closeStream-1] wal.HLogSplitter: Closing
hdfs://hor8n01:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/d353d3dc87562957ef5c74559ccb48ca/recovered.edits/0000000000030002080.temp
./hbase-hbase-regionserver-hor8n08:2013-09-13 03:40:07,971 INFO
[split-log-closeStream-1] wal.HLogSplitter: Closed wap
hdfs://hor8n01:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/d353d3dc87562957ef5c74559ccb48ca/recovered.edits/0000000000030002080.temp
(wrote 2788 edits in 136ms)
./hbase-hbase-regionserver-hor8n08:2013-09-13 03:40:08,189 DEBUG
[split-log-closeStream-1] wal.HLogSplitter: Rename
hdfs://hor8n01:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/d353d3dc87562957ef5c74559ccb48ca/recovered.edits/0000000000030002080.temp
to
hdfs://hor8n01:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/d353d3dc87562957ef5c74559ccb48ca/recovered.edits/0000000000030104001
./hbase-hbase-regionserver-hor8n01:2013-09-13 03:40:08,571 DEBUG
[RS_OPEN_REGION-hor8n01:60020-1] regionserver.HRegion: Deleted recovered.edits
file=hdfs://hor8n01:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/d353d3dc87562957ef5c74559ccb48ca/recovered.edits/0000000000029347901
./hbase-hbase-regionserver-hor8n01:2013-09-13 03:40:08,732 DEBUG
[RS_OPEN_REGION-hor8n01:60020-1] regionserver.HRegion: Deleted recovered.edits
file=hdfs://hor8n01:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/d353d3dc87562957ef5c74559ccb48ca/recovered.edits/0000000000029358999
./hbase-hbase-regionserver-hor8n01:2013-09-13 03:40:08,965 DEBUG
[RS_OPEN_REGION-hor8n01:60020-1] regionserver.HRegion: Deleted recovered.edits
file=hdfs://hor8n01:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/d353d3dc87562957ef5c74559ccb48ca/recovered.edits/0000000000030001566
./hbase-hbase-regionserver-hor8n01:2013-09-13 03:40:09,140 DEBUG
[RS_OPEN_REGION-hor8n01:60020-1] regionserver.HRegion: Deleted recovered.edits
file=hdfs://hor8n01:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/d353d3dc87562957ef5c74559ccb48ca/recovered.edits/0000000000030197753
./hbase-hbase-regionserver-hor8n01:2013-09-13 03:40:09,348 DEBUG
[RS_OPEN_REGION-hor8n01:60020-1] regionserver.HRegion: Deleted recovered.edits
file=hdfs://hor8n01:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/d353d3dc87562957ef5c74559ccb48ca/recovered.edits/0000000000030260110
./hbase-hbase-regionserver-hor8n01:2013-09-13 03:40:09,514 DEBUG
[RS_OPEN_REGION-hor8n01:60020-1] regionserver.HRegion: Deleted recovered.edits
file=hdfs://hor8n01:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/d353d3dc87562957ef5c74559ccb48ca/recovered.edits/0000000000030335308
./hbase-hbase-regionserver-hor8n02:2013-09-13 03:40:09,962 DEBUG
[split-log-closeStream-2] wal.HLogSplitter: Closing
hdfs://hor8n01:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/d353d3dc87562957ef5c74559ccb48ca/recovered.edits/0000000000030335361.temp
./hbase-hbase-regionserver-hor8n02:2013-09-13 03:40:09,992 INFO
[split-log-closeStream-2] wal.HLogSplitter: Closed wap
hdfs://hor8n01:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/d353d3dc87562957ef5c74559ccb48ca/recovered.edits/0000000000030335361.temp
(wrote 808 edits in 195ms)
./hbase-hbase-regionserver-hor8n02:2013-09-13 03:40:10,057 DEBUG
[split-log-closeStream-2] wal.HLogSplitter: Rename
hdfs://hor8n01:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/d353d3dc87562957ef5c74559ccb48ca/recovered.edits/0000000000030335361.temp
to
hdfs://hor8n01:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/d353d3dc87562957ef5c74559ccb48ca/recovered.edits/0000000000030364775
{code}
Notice that the region is already opened (because it is deleting the recovered
edits files, but 2 log splitting tasks finish after this)
> Test Big Linked List fails on Hadoop 2.1.0
> ------------------------------------------
>
> Key: HBASE-9338
> URL: https://issues.apache.org/jira/browse/HBASE-9338
> Project: HBase
> Issue Type: Bug
> Components: test
> Affects Versions: 0.96.0
> Reporter: Elliott Clark
> Assignee: Elliott Clark
> Priority: Blocker
> Fix For: 0.98.0, 0.96.0
>
> Attachments: HBASE-9338-0.patch, HBASE-9338-1.patch,
> HBASE-9338-TESTING-2.patch, HBASE-9338-TESTING-3.patch
>
>
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira