[ https://issues.apache.org/jira/browse/HBASE-12743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14281147#comment-14281147 ]
stack commented on HBASE-12743:
-------------------------------
bq. I'm trying to reproduce it from
TableNamespaceManager.isTableAvailableAndInitialized(). Suggestions?
[~gustavoanatoly] Are you trying to reproduce the failure with distributed log
replay (DLR) running? If so, ITBLL + DLR + chaos monkey at a bit of scale on a
cluster of 4-5 nodes seems to turn it up pretty easily. Or are you chasing the
particular exception posted?
[~jeffreyz] I will. My little cluster is currently occupied with another
issue. I will be back to help on DLR once I am done with the current problem.
> [ITBLL] Master fails rejoining cluster stuck splitting logs; Distributed log
> replay=true
> ----------------------------------------------------------------------------------------
>
> Key: HBASE-12743
> URL: https://issues.apache.org/jira/browse/HBASE-12743
> Project: HBase
> Issue Type: Bug
> Reporter: stack
> Fix For: 1.0.0, 2.0.0, 1.1.0
>
>
> Master has been stuck for two days trying to rejoin the cluster after the
> chaos monkey killed and restarted it.
> After retrying to get the namespace 350 times, the Master goes down:
> {code}
> 2014-12-20 18:43:54,285 INFO [c2020:16020.activeMasterManager]
> client.RpcRetryingCaller: Call exception, tries=349, retries=350,
> started=6885331 ms ago, cancelled=false, msg=row 'default' on table
> 'hbase:namespace' at
> region=hbase:namespace,,1417551886199.ecdcd0172cd3e32d291bc282771895da.,
> hostname=c2023.halxg.cloudera.com,16020,1418988286696, seqNum=6000000190
> 2014-12-20 18:43:54,303 WARN [c2020:16020.activeMasterManager]
> master.TableNamespaceManager: Caught exception in initializing namespace
> table manager
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after
> attempts=350, exceptions:
> Sat Dec 20 16:49:08 PST 2014,
> RpcRetryingCaller{globalStartTime=1419122948954, pause=100, retries=350},
> org.apache.hadoop.hbase.NotServingRegionException:
> org.apache.hadoop.hbase.NotServingRegionException: Region
> hbase:namespace,,1417551886199.ecdcd0172cd3e32d291bc282771895da. is not
> online on c2023.halxg.cloudera.com,16020,1418988286696
> at
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2722)
> at
> org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:851)
> at
> org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:1695)
> at
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:30434)
> {code}
> Seems like distributed log replay was on at the time:
> 2014-12-20 16:49:03,665 INFO [RS_LOG_REPLAY_OPS-c2021:16020-0]
> wal.WALSplitter: DistributedLogReplay = true
> Seems easy enough to reproduce.
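
For readers unfamiliar with the failure mode in the log above, here is a minimal sketch of a bounded retry loop that gives up after a fixed number of attempts and reports every failure it saw. This is not HBase's RpcRetryingCaller; the names BoundedRetry, RetriesExhausted, call, maxAttempts and pauseMs are hypothetical, and the fixed pause is a simplification made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

// Hypothetical illustration only; not the HBase client implementation.
class BoundedRetry {

    /** Unchecked exception that carries every failure seen before giving up. */
    static class RetriesExhausted extends RuntimeException {
        final List<Exception> failures;

        RetriesExhausted(List<Exception> failures) {
            super("Failed after attempts=" + failures.size());
            this.failures = failures;
        }
    }

    /**
     * Calls op up to maxAttempts times, sleeping pauseMs between attempts.
     * Returns the first successful result, or throws RetriesExhausted.
     */
    static <T> T call(Callable<T> op, int maxAttempts, long pauseMs) {
        List<Exception> failures = new ArrayList<>();
        for (int tries = 0; tries < maxAttempts; tries++) {
            try {
                return op.call();
            } catch (Exception e) {
                // Remember the failure, e.g. a "region not online" error.
                failures.add(e);
            }
            if (tries < maxAttempts - 1) {
                try {
                    Thread.sleep(pauseMs);
                } catch (InterruptedException ie) {
                    // Restore the interrupt flag and stop retrying.
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        throw new RetriesExhausted(failures);
    }
}
```

If the operation can never succeed, as when the region it needs is never brought online, a loop of this shape only delays the failure: the caller waits out every attempt and then goes down with the accumulated exceptions.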