[ https://issues.apache.org/jira/browse/HBASE-3047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915992#action_12915992 ]
HBase Review Board commented on HBASE-3047:
-------------------------------------------
Message from: "Jonathan Gray" <[email protected]>
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/915/#review1349
-----------------------------------------------------------
Overall this looks like a good improvement over what we had. I'm still a little
confused about isRunningCluster (or isProperRunningCluster per the comments).
Repeating from the inline comments: is there ever a time when a single region is
deployed and we don't want to trigger the failover codepath?
Isn't the case we're really protecting against here that the cluster was not
shut down properly, so the cluster status flag is up when it shouldn't be?
And does this handle the case where the cluster is killed quickly and then
restarted, so the master ephemeral node is actually still there? Then the RS
will see both the master node and the cluster-up node and start up, but
potentially without a real master.
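To make that concern concrete, here's a rough sketch (not code from the patch) of
a check that consults only the ZooKeeper flags; the znode paths and class name are
assumptions for illustration. The point is that such a check can be fooled by a
stale cluster-up flag, or by a master ephemeral node whose session has not yet
expired after a quick kill/restart:
{code}
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

// Sketch only: the znode paths below are assumed for illustration.
public class ClusterStateCheck {
  private static final String CLUSTER_UP_ZNODE = "/hbase/shutdown"; // assumed path
  private static final String MASTER_ZNODE = "/hbase/master";       // assumed path

  private final ZooKeeper zk;

  public ClusterStateCheck(ZooKeeper zk) {
    this.zk = zk;
  }

  /**
   * True if both the cluster-up flag and a master node exist in ZooKeeper.
   * Either can be stale after a crash, which is exactly why region assignments
   * should also be consulted before choosing the failover path.
   */
  public boolean looksLikeRunningCluster()
      throws KeeperException, InterruptedException {
    boolean clusterUpFlag = zk.exists(CLUSTER_UP_ZNODE, false) != null;
    boolean masterNode = zk.exists(MASTER_ZNODE, false) != null;
    return clusterUpFlag && masterNode;
  }
}
{code}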
trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
<http://review.cloudera.org/r/915/#comment4482>
Why is this called an "implementation"? Doesn't the HRI represent the actual
connection object? I get that it's an implementation of HRI, but normally that
suffix is used in the names of classes implementing an interface. No biggie, we
should just be consistent; the name seems odd to me (I think I was referring to
this stuff as "connection" elsewhere in the class, in method and variable names).
trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
<http://review.cloudera.org/r/915/#comment4481>
Is this really the exception we want to throw (commons.lang)? Or is this just a
short-term, temporary choice?
trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
<http://review.cloudera.org/r/915/#comment4483>
yay thanks
trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
<http://review.cloudera.org/r/915/#comment4484>
So the case we are adding for here (but just throwing an exception for now) is
that the master came up and did not think it was a fresh cluster (because the
cluster status flag was up in zk? maybe note that in the comments above?), but
we determine the cluster was not actually running because ROOT and META are not
assigned.
What about the case where other regions are assigned? Should this check
actually be whether _any_ regions are assigned? I think we discussed this, and
looking for root/meta covers most cases, but maybe add a TODO?
Though even in the failover case we'll need to handle ROOT/META not being
properly assigned, so if _any_ regions are assigned we would trigger failover;
if no regions are assigned we would assume it actually is a cluster startup and
go into the branch of code which currently throws the exception.
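A minimal sketch of the decision being suggested above (not code from the patch;
the inputs stand in for whatever the master would actually consult, e.g. the
catalog tracker and the regions reported by regionservers):
{code}
/** Sketch only: decide startup vs. failover from what is currently assigned. */
public final class StartupOrFailover {

  /** True if the master should run the failover codepath rather than fresh startup. */
  public static boolean shouldRunFailoverPath(boolean rootAssigned,
                                              boolean metaAssigned,
                                              int assignedUserRegions) {
    if (rootAssigned || metaAssigned || assignedUserRegions > 0) {
      // At least one region is deployed somewhere: treat this as failover,
      // even though ROOT/META may still need to be (re)assigned as part of it.
      return true;
    }
    // Nothing is assigned anywhere: assume a fresh cluster startup
    // (the branch of code that currently just throws an exception).
    return false;
  }
}
{code}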
trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
<http://review.cloudera.org/r/915/#comment4485>
Add javadoc describing what this method checks to decide whether the cluster is
already running.
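One possible shape for that javadoc, purely as a suggestion (wording is mine, not
from the patch):
{code}
/**
 * Returns true if this master appears to be joining an already-running
 * cluster (failover) rather than performing a fresh startup. Currently this
 * is decided from the cluster status flag in ZooKeeper and from whether ROOT
 * and META have server locations; see the discussion about whether user
 * region assignments should be consulted as well.
 */
{code}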
trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
<http://review.cloudera.org/r/915/#comment4486>
So this method would be the "proper running cluster" check?
Isn't it the case that if a single region is deployed anywhere, we are not in
startup, we are in failover?
trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
<http://review.cloudera.org/r/915/#comment4487>
looks good
- Jonathan
> If new master crashes, restart is messy
> ---------------------------------------
>
> Key: HBASE-3047
> URL: https://issues.apache.org/jira/browse/HBASE-3047
> Project: HBase
> Issue Type: Bug
> Reporter: stack
> Fix For: 0.90.0
>
> Attachments: 3047.txt
>
>
> If the master crashes, the cluster-is-up flag is left stuck on.
> On restart of the cluster, regionservers may come up before the master.
> They'll have registered themselves in zk by the time the master assumes its
> role, and the master will think it's joining an up-and-running cluster when
> in fact this is a fresh startup. Another problem is that there'll be a bad
> root region location up in zk. Same for meta, and at the moment we're not
> handling a bad root and meta very well.
> Here's a sample of the kind of issues we're running into:
> {code}
> 2010-09-25 23:53:13,938 FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled exception. Starting shutdown.
> java.io.IOException: Call to /10.20.20.188:60020 failed on local exception: java.io.IOException: Connection reset by peer
>   at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:781)
>   at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:750)
>   at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:255)
>   at $Proxy1.getProtocolVersion(Unknown Source)
>   at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:412)
>   at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:388)
>   at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:435)
>   at org.apache.hadoop.hbase.ipc.HBaseRPC.waitForProxy(HBaseRPC.java:345)
>   at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:889)
>   at org.apache.hadoop.hbase.catalog.CatalogTracker.getCachedConnection(CatalogTracker.java:350)
>   at org.apache.hadoop.hbase.catalog.CatalogTracker.getRootServerConnection(CatalogTracker.java:209)
>   at org.apache.hadoop.hbase.catalog.CatalogTracker.getMetaServerConnection(CatalogTracker.java:241)
>   at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMeta(CatalogTracker.java:286)
>   at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMetaServerConnectionDefault(CatalogTracker.java:326)
>   at org.apache.hadoop.hbase.catalog.MetaReader.fullScan(MetaReader.java:157)
>   at org.apache.hadoop.hbase.catalog.MetaReader.fullScan(MetaReader.java:140)
>   at org.apache.hadoop.hbase.master.AssignmentManager.rebuildUserRegions(AssignmentManager.java:753)
>   at org.apache.hadoop.hbase.master.AssignmentManager.processFailover(AssignmentManager.java:174)
>   at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:314)
> Caused by: java.io.IOException: Connection reset by peer
>   at sun.nio.ch.FileDispatcher.read0(Native Method)
>   at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>   at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
>   at sun.nio.ch.IOUtil.read(IOUtil.java:206)
> {code}
> Notice that we think it's a case of processFailover, so we think we can just
> scan meta to fix up our in-memory picture of the running cluster, only the
> scan of meta fails because meta isn't assigned.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.