[ 
https://issues.apache.org/jira/browse/HBASE-9338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13767217#comment-13767217
 ] 

Enis Soztutar commented on HBASE-9338:
--------------------------------------

We have confirmed that the data loss is due to the log splitting + moving 
region problem: 
CM kills an RS (hor8n10) and then moves a region 
(d353d3dc87562957ef5c74559ccb48ca)
{code}
13/09/13 03:39:54 INFO actions.Action: Performing action: Restart random region 
server
13/09/13 03:39:54 INFO actions.Action: Killing region 
server:hor8n10,60020,1379043329928
13/09/13 03:39:56 INFO actions.Action: Killed region 
server:hor8n10,60020,1379043329928. Reported num of rs:8
13/09/13 03:39:56 INFO actions.Action: Sleeping for:60000
13/09/13 03:40:05 INFO actions.Action: Performing action: Move random region of 
table IntegrationTestBigLinkedList
{code}

The master schedules log splitting and carries it on:
{code}
2013-09-13 03:39:57,718 INFO  [MASTER_SERVER_OPERATIONS-hor8n02:60000-1] 
master.SplitLogManager: started splitting 6 logs in 
[hdfs://hor8n01:8020/apps/hbase/data/WALs/hor8n10,60020,1379043329928-splitting]
{code}

In the mean time, CM action for move region kicks in and causes to unassign the 
region, and assign it to hor8n01
{code}
2013-09-13 03:40:06,030 DEBUG [RpcServer.handler=44,port=60000] master.HMaster: 
Client=hrt_qa//68.142.245.212 unassign 
IntegrationTestBigLinkedList,&\x07\x86\xC5;\xE6\xD9\x18&\xCB\x0B\x89\xF9\xD3\xE6\xC4,1379043123821.d353d3dc87562957ef5c74559ccb48ca.
 in current location if it is online and reassign.force=false
2013-09-13 03:40:06,030 DEBUG [RpcServer.handler=44,port=60000] 
master.AssignmentManager: Starting unassign of 
IntegrationTestBigLinkedList,&\x07\x86\xC5;\xE6\xD9\x18&\xCB\x0B\x89\xF9\xD3\xE6\xC4,1379043123821.d353d3dc87562957ef5c74559ccb48ca.
 (offlining)
2013-09-13 03:40:06,030 DEBUG [RpcServer.handler=44,port=60000] 
zookeeper.ZKAssign: master:60000-0x140e5bf86760799 Creating unassigned node 
d353d3dc87562957ef5c74559ccb48ca in a CLOSING state
2013-09-13 03:40:06,044 INFO  [RpcServer.handler=44,port=60000] 
master.RegionStates: Transitioned from {d353d3dc87562957ef5c74559ccb48ca 
state=OPEN, ts=1379043386591, server=hor8n10,60020,1379043329928} to 
{d353d3dc87562957ef5c74559ccb48ca state=PENDING_CLOSE, ts=1379043606044, 
server=hor8n10,60020,1379043329928}
2013-09-13 03:40:06,044 DEBUG [RpcServer.handler=44,port=60000] 
zookeeper.ZKAssign: master:60000-0x140e5bf86760799 Deleting existing unassigned 
node d353d3dc87562957ef5c74559ccb48ca in expected state M_ZK_REGION_CLOSING
2013-09-13 03:40:06,063 DEBUG [RpcServer.handler=44,port=60000] 
zookeeper.ZKAssign: master:60000-0x140e5bf86760799 Deleted unassigned node 
d353d3dc87562957ef5c74559ccb48ca in expected state M_ZK_REGION_CLOSING
2013-09-13 03:40:06,064 DEBUG [RpcServer.handler=44,port=60000] 
master.RegionStates: Offline d353d3dc87562957ef5c74559ccb48ca with current 
state=PENDING_CLOSE, expected state=OFFLINE/SPLITTING/MERGING
2013-09-13 03:40:06,064 INFO  [RpcServer.handler=44,port=60000] 
master.RegionStates: Transitioned from {d353d3dc87562957ef5c74559ccb48ca 
state=PENDING_CLOSE, ts=1379043606044, server=hor8n10,60020,1379043329928} to 
{d353d3dc87562957ef5c74559ccb48ca state=OFFLINE, ts=1379043606064, server=null}
2013-09-13 03:40:06,064 INFO  [RpcServer.handler=44,port=60000] 
master.RegionStates: Offlined d353d3dc87562957ef5c74559ccb48ca from 
hor8n10,60020,1379043329928
2013-09-13 03:40:06,064 DEBUG [RpcServer.handler=44,port=60000] master.HMaster: 
Region 
IntegrationTestBigLinkedList,&\x07\x86\xC5;\xE6\xD9\x18&\xCB\x0B\x89\xF9\xD3\xE6\xC4,1379043123821.d353d3dc87562957ef5c74559ccb48ca.
 is not online on any region server, reassigning it.
2013-09-13 03:40:06,064 DEBUG [RpcServer.handler=44,port=60000] 
master.AssignmentManager: No previous transition plan was found (or we are 
ignoring an existing plan) for 
IntegrationTestBigLinkedList,&\x07\x86\xC5;\xE6\xD9\x18&\xCB\x0B\x89\xF9\xD3\xE6\xC4,1379043123821.d353d3dc87562957ef5c74559ccb48ca.
 so generated a random one; 
hri=IntegrationTestBigLinkedList,&\x07\x86\xC5;\xE6\xD9\x18&\xCB\x0B\x89\xF9\xD3\xE6\xC4,1379043123821.d353d3dc87562957ef5c74559ccb48ca.,
 src=, dest=hor8n01,60020,1379043340943; 7 (online=7, available=7) available 
servers, forceNewPlan=false
2013-09-13 03:40:06,064 DEBUG [RpcServer.handler=44,port=60000] 
zookeeper.ZKAssign: master:60000-0x140e5bf86760799 Creating (or updating) 
unassigned node d353d3dc87562957ef5c74559ccb48ca with OFFLINE state
2013-09-13 03:40:06,089 INFO  [RpcServer.handler=44,port=60000] 
master.AssignmentManager: Assigning 
IntegrationTestBigLinkedList,&\x07\x86\xC5;\xE6\xD9\x18&\xCB\x0B\x89\xF9\xD3\xE6\xC4,1379043123821.d353d3dc87562957ef5c74559ccb48ca.
 to hor8n01,60020,1379043340943
2013-09-13 03:40:06,089 INFO  [RpcServer.handler=44,port=60000] 
master.RegionStates: Transitioned from {d353d3dc87562957ef5c74559ccb48ca 
state=OFFLINE, ts=1379043606064, server=null} to 
{d353d3dc87562957ef5c74559ccb48ca state=PENDING_OPEN, ts=1379043606089, 
server=hor8n01,60020,1379043340943}
2013-09-13 03:40:06,121 DEBUG [AM.ZK.Worker-pool2-t71] 
master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, 
server=hor8n01,60020,1379043340943, region=d353d3dc87562957ef5c74559ccb48ca, 
current state from region state map ={d353d3dc87562957ef5c74559ccb48ca 
state=PENDING_OPEN, ts=1379043606089, server=hor8n01,60020,1379043340943}
2013-09-13 03:40:06,121 INFO  [AM.ZK.Worker-pool2-t71] master.RegionStates: 
Transitioned from {d353d3dc87562957ef5c74559ccb48ca state=PENDING_OPEN, 
ts=1379043606089, server=hor8n01,60020,1379043340943} to 
{d353d3dc87562957ef5c74559ccb48ca state=OPENING, ts=1379043606121, 
server=hor8n01,60020,1379043340943}
{code}

However, you can see that that at this time, the log splitting is not finished 
yet: 
{code}
2013-09-13 03:40:08,022 INFO  
[hor8n02,60000,1379042853257.splitLogManagerTimeoutMonitor] 
master.SplitLogManager: total tasks = 2 unassigned = 0 
tasks={/hbase/splitWAL/WALs%2Fhor8n10%2C60020%2C1379043329928-splitting%2Fhor8n10%252C60020%252C1379043329928.1379043472554=last_update
 = 1379043597820 last_version = 2 cur_worker_name = hor8n08,60020,1379043383930 
status = in_progress incarnation = 0 resubmits = 0 batch = installed = 6 done = 
4 error = 0, 
/hbase/splitWAL/WALs%2Fhor8n10%2C60020%2C1379043329928-splitting%2Fhor8n10%252C60020%252C1379043329928.1379043582933=last_update
 = 1379043597824 last_version = 2 cur_worker_name = hor8n02,60020,1379043313923 
status = in_progress incarnation = 0 resubmits = 0 batch = installed = 6 done = 
4 error = 0}
{code}

And indeed, you can see the log splits finishing up after the region is opened:
{code}
./hbase-hbase-regionserver-hor8n01:2013-09-13 03:40:06,786 DEBUG 
[RS_OPEN_REGION-hor8n01:60020-1] regionserver.HRegion: Applied 47550, skipped 
0, firstSequenceidInLog=30260132, maxSequenceidInLog=30335308, 
path=hdfs://hor8n01:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/d353d3dc87562957ef5c74559ccb48ca/recovered.edits/0000000000030335308
./hbase-hbase-regionserver-hor8n08:2013-09-13 03:40:07,899 DEBUG 
[split-log-closeStream-1] wal.HLogSplitter: Closing 
hdfs://hor8n01:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/d353d3dc87562957ef5c74559ccb48ca/recovered.edits/0000000000030002080.temp
./hbase-hbase-regionserver-hor8n08:2013-09-13 03:40:07,971 INFO  
[split-log-closeStream-1] wal.HLogSplitter: Closed wap 
hdfs://hor8n01:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/d353d3dc87562957ef5c74559ccb48ca/recovered.edits/0000000000030002080.temp
 (wrote 2788 edits in 136ms)
./hbase-hbase-regionserver-hor8n08:2013-09-13 03:40:08,189 DEBUG 
[split-log-closeStream-1] wal.HLogSplitter: Rename 
hdfs://hor8n01:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/d353d3dc87562957ef5c74559ccb48ca/recovered.edits/0000000000030002080.temp
 to 
hdfs://hor8n01:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/d353d3dc87562957ef5c74559ccb48ca/recovered.edits/0000000000030104001
./hbase-hbase-regionserver-hor8n01:2013-09-13 03:40:08,571 DEBUG 
[RS_OPEN_REGION-hor8n01:60020-1] regionserver.HRegion: Deleted recovered.edits 
file=hdfs://hor8n01:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/d353d3dc87562957ef5c74559ccb48ca/recovered.edits/0000000000029347901
./hbase-hbase-regionserver-hor8n01:2013-09-13 03:40:08,732 DEBUG 
[RS_OPEN_REGION-hor8n01:60020-1] regionserver.HRegion: Deleted recovered.edits 
file=hdfs://hor8n01:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/d353d3dc87562957ef5c74559ccb48ca/recovered.edits/0000000000029358999
./hbase-hbase-regionserver-hor8n01:2013-09-13 03:40:08,965 DEBUG 
[RS_OPEN_REGION-hor8n01:60020-1] regionserver.HRegion: Deleted recovered.edits 
file=hdfs://hor8n01:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/d353d3dc87562957ef5c74559ccb48ca/recovered.edits/0000000000030001566
./hbase-hbase-regionserver-hor8n01:2013-09-13 03:40:09,140 DEBUG 
[RS_OPEN_REGION-hor8n01:60020-1] regionserver.HRegion: Deleted recovered.edits 
file=hdfs://hor8n01:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/d353d3dc87562957ef5c74559ccb48ca/recovered.edits/0000000000030197753
./hbase-hbase-regionserver-hor8n01:2013-09-13 03:40:09,348 DEBUG 
[RS_OPEN_REGION-hor8n01:60020-1] regionserver.HRegion: Deleted recovered.edits 
file=hdfs://hor8n01:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/d353d3dc87562957ef5c74559ccb48ca/recovered.edits/0000000000030260110
./hbase-hbase-regionserver-hor8n01:2013-09-13 03:40:09,514 DEBUG 
[RS_OPEN_REGION-hor8n01:60020-1] regionserver.HRegion: Deleted recovered.edits 
file=hdfs://hor8n01:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/d353d3dc87562957ef5c74559ccb48ca/recovered.edits/0000000000030335308
./hbase-hbase-regionserver-hor8n02:2013-09-13 03:40:09,962 DEBUG 
[split-log-closeStream-2] wal.HLogSplitter: Closing 
hdfs://hor8n01:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/d353d3dc87562957ef5c74559ccb48ca/recovered.edits/0000000000030335361.temp
./hbase-hbase-regionserver-hor8n02:2013-09-13 03:40:09,992 INFO  
[split-log-closeStream-2] wal.HLogSplitter: Closed wap 
hdfs://hor8n01:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/d353d3dc87562957ef5c74559ccb48ca/recovered.edits/0000000000030335361.temp
 (wrote 808 edits in 195ms)
./hbase-hbase-regionserver-hor8n02:2013-09-13 03:40:10,057 DEBUG 
[split-log-closeStream-2] wal.HLogSplitter: Rename 
hdfs://hor8n01:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/d353d3dc87562957ef5c74559ccb48ca/recovered.edits/0000000000030335361.temp
 to 
hdfs://hor8n01:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/d353d3dc87562957ef5c74559ccb48ca/recovered.edits/0000000000030364775
{code}
Notice that the region is already opened (because it is deleting the recovered 
edits files, but 2 log splitting tasks finish after this)

                
> Test Big Linked List fails on Hadoop 2.1.0
> ------------------------------------------
>
>                 Key: HBASE-9338
>                 URL: https://issues.apache.org/jira/browse/HBASE-9338
>             Project: HBase
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 0.96.0
>            Reporter: Elliott Clark
>            Assignee: Elliott Clark
>            Priority: Blocker
>             Fix For: 0.98.0, 0.96.0
>
>         Attachments: HBASE-9338-0.patch, HBASE-9338-1.patch, 
> HBASE-9338-TESTING-2.patch, HBASE-9338-TESTING-3.patch
>
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to