[ https://issues.apache.org/jira/browse/HBASE-13535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-13535:
--------------------------
    Assignee: stack

> Regions go unassigned when meta-carrying RS is killed
> -----------------------------------------------------
>
>                 Key: HBASE-13535
>                 URL: https://issues.apache.org/jira/browse/HBASE-13535
>             Project: HBase
>          Issue Type: Bug
>            Reporter: stack
>            Assignee: stack
>
> hbase-1.1 will be the first release with DLR (distributed log replay) on by
> default. I've been running ITBLLs (IntegrationTestBigLinkedList) on a cluster
> trying to find issues with DLR. My first few runs went nicely, but the current
> run failed complaining that regions are not online, and indeed recovery is
> stuck, making no progress.
> On examination, it looks to be an assignment issue rather than a DLR issue.
> The server carrying meta has its meta log replayed first, but we then
> seemingly fail to assign regions once meta is back online.
> Meantime, my regionserver logs are filling with complaints that regions are
> not online (we should dampen our logging of a region not being online...) and
> the split log workers are stuck:
> {code}
> Thread 13206 (RS_LOG_REPLAY_OPS-c2021:16020-1-Writer-2):
>   State: TIMED_WAITING
>   Blocked count: 45
>   Waited count: 59
>   Stack:
>     java.lang.Thread.sleep(Native Method)
>     org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.waitUntilRegionOnline(WALSplitter.java:1959)
>     org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.locateRegionAndRefreshLastFlushedSequenceId(WALSplitter.java:1857)
>     org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.groupEditsByServer(WALSplitter.java:1761)
>     org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.append(WALSplitter.java:1674)
>     org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.writeBuffer(WALSplitter.java:1104)
>     org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.doRun(WALSplitter.java:1096)
>     org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.run(WALSplitter.java:1066)
> Thread 13205 (RS_LOG_REPLAY_OPS-c2021:16020-1-Writer-1):
>   State: TIMED_WAITING
>   Blocked count: 45
>   Waited count: 59
>   Stack:
>     java.lang.Thread.sleep(Native Method)
>     org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.waitUntilRegionOnline(WALSplitter.java:1959)
>     org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.locateRegionAndRefreshLastFlushedSequenceId(WALSplitter.java:1857)
>     org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.groupEditsByServer(WALSplitter.java:1761)
>     org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.append(WALSplitter.java:1674)
>     org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.writeBuffer(WALSplitter.java:1104)
>     org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.doRun(WALSplitter.java:1096)
>     org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.run(WALSplitter.java:1066)
> Thread 13204 (RS_LOG_REPLAY_OPS-c2021:16020-1-Writer-0):
>   State: TIMED_WAITING
>   Blocked count: 50
>   Waited count: 63
>   Stack:
>     java.lang.Thread.sleep(Native Method)
>     org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.waitUntilRegionOnline(WALSplitter.java:1959)
>     org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.locateRegionAndRefreshLastFlushedSequenceId(WALSplitter.java:1857)
>     org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.groupEditsByServer(WALSplitter.java:1761)
>     org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.append(WALSplitter.java:1674)
>     org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.writeBuffer(WALSplitter.java:1104)
>     org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.doRun(WALSplitter.java:1096)
>     org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.run(WALSplitter.java:1066)
> {code}
> ...complaining that:
> 2015-04-22 21:28:02,746 DEBUG [RS_LOG_REPLAY_OPS-c2021:16020-1] wal.WALSplitter: Used 134248328 bytes of buffered edits, waiting for IO threads...
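> For context, those TIMED_WAITING writer stacks amount to a poll-and-sleep on the region coming back online, roughly like the sketch below. This is illustrative only, not the actual WALSplitter code; the probe, the names, and the timeout are made up:
> {code}
> // Illustrative stand-in for the wait the replay writers are parked in
> // (LogReplayOutputSink.waitUntilRegionOnline). Probe, names and timeout are
> // assumptions, not HBase code.
> import java.io.IOException;
> import java.util.function.BooleanSupplier;
> 
> public class WaitForRegionSketch {
>   static void waitUntilRegionOnline(BooleanSupplier regionOnline, long timeoutMs)
>       throws IOException, InterruptedException {
>     long deadline = System.currentTimeMillis() + timeoutMs;
>     while (System.currentTimeMillis() < deadline) {
>       if (regionOnline.getAsBoolean()) {
>         return; // region is back; replay of its edits can proceed
>       }
>       // This is the java.lang.Thread.sleep(Native Method) frame in the dumps above.
>       Thread.sleep(200);
>     }
>     throw new IOException("Timed out waiting for region to come online");
>   }
> 
>   public static void main(String[] args) throws InterruptedException {
>     try {
>       // With the region stuck OFFLINE and never reassigned, the probe never
>       // flips to true, so every writer just cycles here.
>       waitUntilRegionOnline(() -> false, 2000);
>     } catch (IOException e) {
>       System.out.println(e.getMessage());
>     }
>   }
> }
> {code}
> Because the regions never get assigned, the probe never flips, all three writers sit in this loop, the edit buffers never drain, and DLR makes no progress.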
> The accounting seems off around here in SSH (ServerShutdownHandler), where it
> moves the regions that were on the dead server to OFFLINE but reports that
> there are no regions to assign (see the sketch after the log below):
> {code}
> 2015-04-21 17:05:07,571 INFO  [MASTER_SERVER_OPERATIONS-c2020:16000-0] handler.ServerShutdownHandler: Mark regions in recovery for crashed server c2024.halxg.cloudera.com,16020,1429660802192 before assignment; regions=[]
> 2015-04-21 17:05:07,572 DEBUG [MASTER_SERVER_OPERATIONS-c2020:16000-0] master.RegionStates: Adding to processed servers c2024.halxg.cloudera.com,16020,1429660802192
> 2015-04-21 17:05:07,575 INFO  [MASTER_SERVER_OPERATIONS-c2020:16000-0] master.RegionStates: Transition {8d63312bc39a39727afea627bb20fee4 state=OPEN, ts=1429660996054, server=c2024.halxg.cloudera.com,16020,1429660802192} to {8d63312bc39a39727afea627bb20fee4 state=OFFLINE, ts=1429661107575, server=c2024.halxg.cloudera.com,16020,1429660802192}
> 2015-04-21 17:05:07,575 INFO  [MASTER_SERVER_OPERATIONS-c2020:16000-0] master.RegionStates: Transition {64c97bf39441e09977332c02e628a8c2 state=OPEN, ts=1429660996045, server=c2024.halxg.cloudera.com,16020,1429660802192} to {64c97bf39441e09977332c02e628a8c2 state=OFFLINE, ts=1429661107575, server=c2024.halxg.cloudera.com,16020,1429660802192}
> 2015-04-21 17:05:07,575 INFO  [MASTER_SERVER_OPERATIONS-c2020:16000-0] master.RegionStates: Transition {3f4ea5ea14653cee6006f13c7d06d10b state=OPEN, ts=1429660996066, server=c2024.halxg.cloudera.com,16020,1429660802192} to {3f4ea5ea14653cee6006f13c7d06d10b state=OFFLINE, ts=1429661107575, server=c2024.halxg.cloudera.com,16020,1429660802192}
> 2015-04-21 17:05:07,575 INFO  [MASTER_SERVER_OPERATIONS-c2020:16000-0] master.RegionStates: Transition {6eaf51e55c9c23356a697f286f473db8 state=OPEN, ts=1429660996051, server=c2024.halxg.cloudera.com,16020,1429660802192} to {6eaf51e55c9c23356a697f286f473db8 state=OFFLINE, ts=1429661107575, server=c2024.halxg.cloudera.com,16020,1429660802192}
> 2015-04-21 17:05:07,575 INFO  [MASTER_SERVER_OPERATIONS-c2020:16000-0] master.RegionStates: Transition {1cbe63cd29c7709209adbf4305ebc746 state=OPEN, ts=1429660996062, server=c2024.halxg.cloudera.com,16020,1429660802192} to {1cbe63cd29c7709209adbf4305ebc746 state=OFFLINE, ts=1429661107575, server=c2024.halxg.cloudera.com,16020,1429660802192}
> 2015-04-21 17:05:07,575 INFO  [MASTER_SERVER_OPERATIONS-c2020:16000-0] master.RegionStates: Transition {3a49d17e189b2a26eb73e1d43cf2d0ac state=OPEN, ts=1429660996053, server=c2024.halxg.cloudera.com,16020,1429660802192} to {3a49d17e189b2a26eb73e1d43cf2d0ac state=OFFLINE, ts=1429661107575, server=c2024.halxg.cloudera.com,16020,1429660802192}
> 2015-04-21 17:05:07,576 INFO  [MASTER_SERVER_OPERATIONS-c2020:16000-0] handler.ServerShutdownHandler: Reassigning 0 region(s) that c2024.halxg.cloudera.com,16020,1429660802192 was carrying (and 0 regions(s) that were opening on this server)
> {code}
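> To make the suspected accounting gap concrete: if the "regions to reassign" list is taken from a server-holdings view that has already been emptied, while RegionStates is still walked region by region, you get exactly the pattern above: the OPEN to OFFLINE transitions followed by "Reassigning 0 region(s)". Below is a minimal sketch of that ordering problem with simplified types; it is my guess at the shape of it, not the real ServerShutdownHandler/RegionStates code:
> {code}
> import java.util.*;
> 
> public class ShutdownAccountingSketch {
>   enum State { OPEN, OFFLINE }
> 
>   public static void main(String[] args) {
>     String deadServer = "c2024.halxg.cloudera.com,16020,1429660802192";
>     Map<String, State> regionStates = new LinkedHashMap<>();
>     Map<String, List<String>> serverHoldings = new HashMap<>();
> 
>     for (String hri : Arrays.asList("8d63312bc39a39727afea627bb20fee4",
>         "64c97bf39441e09977332c02e628a8c2", "3f4ea5ea14653cee6006f13c7d06d10b")) {
>       regionStates.put(hri, State.OPEN);
>       serverHoldings.computeIfAbsent(deadServer, s -> new ArrayList<>()).add(hri);
>     }
> 
>     // Suppose something earlier in recovery (cf. "Mark regions in recovery ...
>     // regions=[]" above) has already emptied the dead server's holdings entry.
>     serverHoldings.remove(deadServer);
> 
>     // The handler snapshots "regions to reassign" from the now-empty view...
>     List<String> toAssign =
>         new ArrayList<>(serverHoldings.getOrDefault(deadServer, Collections.emptyList()));
> 
>     // ...yet still walks the region states and moves each region OPEN -> OFFLINE,
>     // which is what the Transition lines above show.
>     for (Map.Entry<String, State> e : regionStates.entrySet()) {
>       e.setValue(State.OFFLINE);
>     }
> 
>     // Net result: "Reassigning 0 region(s)" while every region sits in OFFLINE.
>     System.out.println("Reassigning " + toAssign.size() + " region(s) that "
>         + deadServer + " was carrying");
>   }
> }
> {code}
> If that is the shape of it, the reassign set needs to come from the same snapshot that drives the OPEN to OFFLINE transitions.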
> ... and indeed these regions are never assigned.  I see DLR on occasion 
> timing out like this....
> 2015-04-22 21:27:31,327 ERROR [RS_LOG_REPLAY_OPS-c2021:16020-1-Writer-0] wal.WALSplitter: Exiting thread
> ....
> Caused by:
> Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.NotServingRegionException): org.apache.hadoop.hbase.NotServingRegionException: Region IntegrationTestBigLinkedList,\xB3333333(,1429660019160.8d63312bc39a39727afea627bb20fee4. is not online on c2024.halxg.cloudera.com,16020,1429661908290
> ... which is one of the regions that was moved to OFFLINE but never assigned.
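> That is the other half of the stalemate: the writer keeps shipping edits at the region's supposed host, gets NotServingRegionException back every time because nobody ever opened the region, and once retries run out the thread errors out ("Exiting thread" above) and the replay fails. A sketch of that shape, using a stand-in exception and a made-up retry budget rather than the actual replay/RPC code:
> {code}
> public class ReplayWriterSketch {
>   // Stand-in for org.apache.hadoop.hbase.NotServingRegionException so the
>   // sketch compiles on its own.
>   static class NotServingRegionException extends Exception {
>     NotServingRegionException(String msg) { super(msg); }
>   }
> 
>   // Stand-in for shipping a batch of recovered edits to the region's supposed
>   // host. The region is OFFLINE and was never reopened anywhere, so the
>   // answer is always "not online".
>   static void replayEdits(String region, String server) throws NotServingRegionException {
>     throw new NotServingRegionException("Region " + region + " is not online on " + server);
>   }
> 
>   public static void main(String[] args) {
>     String region = "8d63312bc39a39727afea627bb20fee4";
>     String server = "c2024.halxg.cloudera.com,16020,1429661908290";
>     int maxRetries = 3; // made-up retry budget for the sketch
>     for (int attempt = 1; attempt <= maxRetries; attempt++) {
>       try {
>         replayEdits(region, server);
>         return; // would mean the edits landed
>       } catch (NotServingRegionException e) {
>         System.out.println("attempt " + attempt + ": " + e.getMessage());
>       }
>     }
>     System.out.println("Exiting thread"); // cf. the ERROR above; replay fails
>   }
> }
> {code}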
> Digging....


