[
https://issues.apache.org/jira/browse/HBASE-13535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
stack updated HBASE-13535:
--------------------------
Assignee: stack
> Regions go unassigned when meta-carrying RS is killed
> -----------------------------------------------------
>
> Key: HBASE-13535
> URL: https://issues.apache.org/jira/browse/HBASE-13535
> Project: HBase
> Issue Type: Bug
> Reporter: stack
> Assignee: stack
>
> hbase-1.1 will be the first release with DLR on by default. I've been running
> ITBLLs on a cluster trying to find issues with DLR. My first few runs went
> nicely... but the current run failed complaining that regions are not online, and
> indeed recovery is stuck making no progress.
> Upon examination, it looks to be an assignment issue rather than a DLR issue. A
> server carrying meta has its meta log replayed first, but we are seemingly
> failing to assign regions after meta is back online.
> Meantime, my regionserver logs are filling with complaints that regions are
> not online (we should dampen our logging when a region is not online... )
> and then the split log workers are stuck:
> {code}
> Thread 13206 (RS_LOG_REPLAY_OPS-c2021:16020-1-Writer-2):
>   State: TIMED_WAITING
>   Blocked count: 45
>   Waited count: 59
>   Stack:
>     java.lang.Thread.sleep(Native Method)
>     org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.waitUntilRegionOnline(WALSplitter.java:1959)
>     org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.locateRegionAndRefreshLastFlushedSequenceId(WALSplitter.java:1857)
>     org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.groupEditsByServer(WALSplitter.java:1761)
>     org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.append(WALSplitter.java:1674)
>     org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.writeBuffer(WALSplitter.java:1104)
>     org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.doRun(WALSplitter.java:1096)
>     org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.run(WALSplitter.java:1066)
> Thread 13205 (RS_LOG_REPLAY_OPS-c2021:16020-1-Writer-1):
>   State: TIMED_WAITING
>   Blocked count: 45
>   Waited count: 59
>   Stack:
>     java.lang.Thread.sleep(Native Method)
>     org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.waitUntilRegionOnline(WALSplitter.java:1959)
>     org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.locateRegionAndRefreshLastFlushedSequenceId(WALSplitter.java:1857)
>     org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.groupEditsByServer(WALSplitter.java:1761)
>     org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.append(WALSplitter.java:1674)
>     org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.writeBuffer(WALSplitter.java:1104)
>     org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.doRun(WALSplitter.java:1096)
>     org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.run(WALSplitter.java:1066)
> Thread 13204 (RS_LOG_REPLAY_OPS-c2021:16020-1-Writer-0):
>   State: TIMED_WAITING
>   Blocked count: 50
>   Waited count: 63
>   Stack:
>     java.lang.Thread.sleep(Native Method)
>     org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.waitUntilRegionOnline(WALSplitter.java:1959)
>     org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.locateRegionAndRefreshLastFlushedSequenceId(WALSplitter.java:1857)
>     org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.groupEditsByServer(WALSplitter.java:1761)
>     org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.append(WALSplitter.java:1674)
>     org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.writeBuffer(WALSplitter.java:1104)
>     org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.doRun(WALSplitter.java:1096)
>     org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.run(WALSplitter.java:1066)
> {code}
> ...complaining that:
> {code}
> 2015-04-22 21:28:02,746 DEBUG [RS_LOG_REPLAY_OPS-c2021:16020-1] wal.WALSplitter: Used 134248328 bytes of buffered edits, waiting for IO threads...
> {code}
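>
> For context, those writer threads are parked in a bounded "wait until the region is online" loop of roughly this shape. This is only a minimal, self-contained sketch to illustrate the stuck state, not the actual WALSplitter code; {{regionIsOnline}}, the timeout, and the backoff values are stand-ins:
> {code}
> // Minimal sketch of a "wait until region online" loop; NOT the actual
> // WALSplitter code. regionIsOnline() and the timeout are illustrative stand-ins.
> public class WaitUntilOnlineSketch {
>   // Pretend lookup: in the real system this would ask where the region is
>   // open; here it always answers "not online" to show the stuck case.
>   static boolean regionIsOnline(String encodedRegionName) {
>     return false;
>   }
>
>   static void waitUntilRegionOnline(String encodedRegionName, long timeoutMs)
>       throws InterruptedException {
>     long deadline = System.currentTimeMillis() + timeoutMs;
>     long pause = 100;
>     while (System.currentTimeMillis() < deadline) {
>       if (regionIsOnline(encodedRegionName)) {
>         return; // region assigned somewhere, replay can proceed
>       }
>       // This sleep is where the RS_LOG_REPLAY_OPS writer threads show up as
>       // TIMED_WAITING in the thread dump above.
>       Thread.sleep(pause);
>       pause = Math.min(pause * 2, 5000);
>     }
>     // If assignment never happens, every edit destined for this region ends
>     // up here and replay makes no progress.
>     throw new IllegalStateException(
>         "Region " + encodedRegionName + " not online after " + timeoutMs + "ms");
>   }
>
>   public static void main(String[] args) throws InterruptedException {
>     try {
>       waitUntilRegionOnline("8d63312bc39a39727afea627bb20fee4", 3000);
>     } catch (IllegalStateException e) {
>       System.out.println(e.getMessage());
>     }
>   }
> }
> {code}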
> The accounting seems off around here in SSH, where it is moving the regions that
> were on the dead server to OFFLINE but is reporting no regions to assign:
> {code}
> 143320 2015-04-21 17:05:07,571 INFO [MASTER_SERVER_OPERATIONS-c2020:16000-0] handler.ServerShutdownHandler: Mark regions in recovery for crashed server c2024.halxg.cloudera.com,16020,1429660802192 before assignment; regions=[]
> 143321 2015-04-21 17:05:07,572 DEBUG [MASTER_SERVER_OPERATIONS-c2020:16000-0] master.RegionStates: Adding to processed servers c2024.halxg.cloudera.com,16020,1429660802192
> 143322 2015-04-21 17:05:07,575 INFO [MASTER_SERVER_OPERATIONS-c2020:16000-0] master.RegionStates: Transition {8d63312bc39a39727afea627bb20fee4 state=OPEN, ts=1429660996054, server=c2024.halxg.cloudera.com,16020,1429660802192} to {8d63312bc39a39727afea627bb20fee4 state=OFFLINE, ts=1429661107575, server=c2024.halxg.cloudera.com,16020,1429660802192}
> 143323 2015-04-21 17:05:07,575 INFO [MASTER_SERVER_OPERATIONS-c2020:16000-0] master.RegionStates: Transition {64c97bf39441e09977332c02e628a8c2 state=OPEN, ts=1429660996045, server=c2024.halxg.cloudera.com,16020,1429660802192} to {64c97bf39441e09977332c02e628a8c2 state=OFFLINE, ts=1429661107575, server=c2024.halxg.cloudera.com,16020,1429660802192}
> 143324 2015-04-21 17:05:07,575 INFO [MASTER_SERVER_OPERATIONS-c2020:16000-0] master.RegionStates: Transition {3f4ea5ea14653cee6006f13c7d06d10b state=OPEN, ts=1429660996066, server=c2024.halxg.cloudera.com,16020,1429660802192} to {3f4ea5ea14653cee6006f13c7d06d10b state=OFFLINE, ts=1429661107575, server=c2024.halxg.cloudera.com,16020,1429660802192}
> 143325 2015-04-21 17:05:07,575 INFO [MASTER_SERVER_OPERATIONS-c2020:16000-0] master.RegionStates: Transition {6eaf51e55c9c23356a697f286f473db8 state=OPEN, ts=1429660996051, server=c2024.halxg.cloudera.com,16020,1429660802192} to {6eaf51e55c9c23356a697f286f473db8 state=OFFLINE, ts=1429661107575, server=c2024.halxg.cloudera.com,16020,1429660802192}
> 143326 2015-04-21 17:05:07,575 INFO [MASTER_SERVER_OPERATIONS-c2020:16000-0] master.RegionStates: Transition {1cbe63cd29c7709209adbf4305ebc746 state=OPEN, ts=1429660996062, server=c2024.halxg.cloudera.com,16020,1429660802192} to {1cbe63cd29c7709209adbf4305ebc746 state=OFFLINE, ts=1429661107575, server=c2024.halxg.cloudera.com,16020,1429660802192}
> 143327 2015-04-21 17:05:07,575 INFO [MASTER_SERVER_OPERATIONS-c2020:16000-0] master.RegionStates: Transition {3a49d17e189b2a26eb73e1d43cf2d0ac state=OPEN, ts=1429660996053, server=c2024.halxg.cloudera.com,16020,1429660802192} to {3a49d17e189b2a26eb73e1d43cf2d0ac state=OFFLINE, ts=1429661107575, server=c2024.halxg.cloudera.com,16020,1429660802192}
> 143328 2015-04-21 17:05:07,576 INFO [MASTER_SERVER_OPERATIONS-c2020:16000-0] handler.ServerShutdownHandler: Reassigning 0 region(s) that c2024.halxg.cloudera.com,16020,1429660802192 was carrying (and 0 regions(s) that were opening on this server)
> {code}
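>
> To make the suspected gap concrete: the handler transitions every region that was OPEN on the dead server to OFFLINE, but the list it actually reassigns from appears to be a different, already-empty collection (note the "regions=[]" in the first line above). The following is a minimal, self-contained sketch of that shape only; the collection and method names are invented for illustration and this is not the real ServerShutdownHandler code:
> {code}
> import java.util.ArrayList;
> import java.util.LinkedHashMap;
> import java.util.List;
> import java.util.Map;
>
> // Sketch of the suspected accounting gap; NOT the real ServerShutdownHandler.
> public class ShutdownAccountingSketch {
>   public static void main(String[] args) {
>     // What RegionStates knows was OPEN on the dead server.
>     Map<String, String> openOnDeadServer = new LinkedHashMap<>();
>     openOnDeadServer.put("8d63312bc39a39727afea627bb20fee4", "OPEN");
>     openOnDeadServer.put("64c97bf39441e09977332c02e628a8c2", "OPEN");
>
>     // The list handed over for reassignment; in the log above it shows up
>     // as "regions=[]", i.e. it is already empty at this point.
>     List<String> regionsToAssign = new ArrayList<>();
>
>     // Step 1: every region on the dead server is transitioned OPEN -> OFFLINE,
>     // exactly as the RegionStates "Transition ... to ... OFFLINE" lines show.
>     for (Map.Entry<String, String> e : openOnDeadServer.entrySet()) {
>       e.setValue("OFFLINE");
>       System.out.println("Transition " + e.getKey() + " to OFFLINE");
>     }
>
>     // Step 2: reassignment is driven only by regionsToAssign, which is empty,
>     // so nothing is ever handed back to assignment.
>     System.out.println("Reassigning " + regionsToAssign.size()
>         + " region(s) that the dead server was carrying");
>     // Result: the regions sit OFFLINE while log replay waits for them forever.
>   }
> }
> {code}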
> ... and indeed these regions are never assigned. I see DLR on occasion
> timing out like this:
> {code}
> 2015-04-22 21:27:31,327 ERROR [RS_LOG_REPLAY_OPS-c2021:16020-1-Writer-0] wal.WALSplitter: Exiting thread
> ....
> Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.NotServingRegionException): org.apache.hadoop.hbase.NotServingRegionException: Region IntegrationTestBigLinkedList,\xB3333333(,1429660019160.8d63312bc39a39727afea627bb20fee4. is not online on c2024.halxg.cloudera.com,16020,1429661908290
> {code}
> ... which is one of the regions in the set that was moved to OFFLINE but never assigned.
> Digging....
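>
> The replay side of that failure is essentially a bounded retry against a region that never comes back. Here is a minimal, self-contained sketch of that pattern only; the exception class, retry count, and backoff are stand-ins, not the real HBase client path:
> {code}
> // Sketch of a replay attempt retrying a region that is never reassigned; the
> // exception type and retry limits are stand-ins, not real HBase classes.
> public class ReplayRetrySketch {
>   static class NotServingRegionSketchException extends Exception {
>     NotServingRegionSketchException(String msg) { super(msg); }
>   }
>
>   // Pretend RPC: the region was moved to OFFLINE but never assigned, so every
>   // attempt fails the same way.
>   static void replayEditsToRegion(String region) throws NotServingRegionSketchException {
>     throw new NotServingRegionSketchException("Region " + region + " is not online");
>   }
>
>   public static void main(String[] args) throws InterruptedException {
>     String region = "8d63312bc39a39727afea627bb20fee4";
>     int maxAttempts = 5;
>     for (int attempt = 1; attempt <= maxAttempts; attempt++) {
>       try {
>         replayEditsToRegion(region);
>         return; // success: edits applied
>       } catch (NotServingRegionSketchException e) {
>         System.out.println("Attempt " + attempt + " failed: " + e.getMessage());
>         Thread.sleep(200L * attempt); // back off before retrying
>       }
>     }
>     // Once the retries are exhausted, the writer thread gives up, which is
>     // the kind of "Exiting thread" ERROR shown in the log above.
>     System.out.println("Giving up on " + region + "; exiting writer thread");
>   }
> }
> {code}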