[
https://issues.apache.org/jira/browse/HBASE-3147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
stack updated HBASE-3147:
-------------------------
Attachment: HBASE-3147-v6.patch
Here is what I'll commit. As Jon suggests, it removes the check for root or
meta carrying from inside the shutdown handler, since we now do that check on
the outside. The patch also includes a missing hookup that testing found.
There is still work to do on this issue. What seems to be happening is that a
watcher is not being triggered; I need to figure out how that happens. I'll
see a regionserver with all of its opener handlers stuck waiting on
notification that meta has been deployed. Other servers will have had their
watcher triggered, but not one or two in the cluster. The master is then
stuck timing out this regionserver's assignments and reassigning, calling
open over the rpc, which adds the region to the queue, but since all openers
are stuck waiting on meta, the queues don't get processed.
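
To make the starvation concrete, here is a minimal, self-contained sketch of
the pattern described above. It is not HBase code: the class name, the
`simulate` method, and the use of a `CountDownLatch` to stand in for the
"meta has been deployed" notification are all assumptions made for
illustration. A fixed pool of opener handlers blocks on the notification; if
it never arrives, every handler is parked and the queued opens are never
processed.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; none of these names come from HBase.
public class OpenerStarvationSketch {

    /**
     * Queues 'regions' open requests on a pool of 'handlers' threads and
     * returns how many opens completed within 'waitMillis'.
     */
    public static int simulate(int handlers, int regions,
                               boolean watcherFires, long waitMillis) {
        // Stands in for the notification that meta has been deployed.
        CountDownLatch metaDeployed = new CountDownLatch(1);
        AtomicInteger opened = new AtomicInteger();
        ExecutorService openers = Executors.newFixedThreadPool(handlers);

        for (int i = 0; i < regions; i++) {
            openers.submit(() -> {
                try {
                    metaDeployed.await();      // handler parks here
                    opened.incrementAndGet();  // the "open" itself
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        if (watcherFires) {
            metaDeployed.countDown();          // watcher triggered normally
        }

        openers.shutdown();
        try {
            openers.awaitTermination(waitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        int result = opened.get();
        openers.shutdownNow();                 // release any stuck handlers
        return result;
    }

    public static void main(String[] args) {
        System.out.println("watcher fired:  opened=" + simulate(2, 5, true, 500));
        System.out.println("watcher missed: opened=" + simulate(2, 5, false, 500));
    }
}
```

With the watcher fired, all five opens complete; with it missed, none do,
even though three of the five requests were only ever sitting in the queue.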
> Regions stuck in transition after rolling restart, perpetual timeout handling
> but nothing happens
> -------------------------------------------------------------------------------------------------
>
> Key: HBASE-3147
> URL: https://issues.apache.org/jira/browse/HBASE-3147
> Project: HBase
> Issue Type: Bug
> Reporter: stack
> Fix For: 0.90.0
>
> Attachments: HBASE-3147-v6.patch
>
>
> The rolling restart script is great for bringing on the weird stuff. On my
> little loaded cluster if I run it, it horks the cluster and it doesn't
> recover. I notice two issues that need fixing:
> 1. We'll miss noticing that a server was carrying .META. and it never gets
> assigned -- the shutdown handlers get stuck in perpetual wait on a .META.
> assign that will never happen.
> 2. Perpetual cycling of this sequence for each region not successfully assigned:
> {code}
> 2010-10-23 21:37:57,404 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: usertable,user510588360,1287547556587.7f2d92497d2d03917afd574ea2aca55b. state=PENDING_OPEN, ts=1287869814294
> 2010-10-23 21:37:57,404 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_OPEN or OPENING for too long, reassigning region=usertable,user510588360,1287547556587.7f2d92497d2d03917afd574ea2aca55b.
> 2010-10-23 21:37:57,404 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x2bd57d1475046a Attempting to transition node 7f2d92497d2d03917afd574ea2aca55b from RS_ZK_REGION_OPENING to M_ZK_REGION_OFFLINE
> 2010-10-23 21:37:57,404 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x2bd57d1475046a Attempt to transition the unassigned node for 7f2d92497d2d03917afd574ea2aca55b from RS_ZK_REGION_OPENING to M_ZK_REGION_OFFLINE failed, the node existed but was in the state M_ZK_REGION_OFFLINE
> 2010-10-23 21:37:57,404 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region transitioned OPENING to OFFLINE so skipping timeout, region=usertable,user510588360,1287547556587.7f2d92497d2d03917afd574ea2aca55b.
>
> ,,,
> {code}
> The timeout period elapses again and then the same sequence repeats.
> This is what I've been working on.
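
The WARN line in the quoted log reads like a failed expected-state
transition: the master asks to move the node from RS_ZK_REGION_OPENING to
M_ZK_REGION_OFFLINE, but the node is already M_ZK_REGION_OFFLINE, so nothing
changes. The sketch below shows that general pattern with an in-memory
`AtomicReference`. It is an illustration under assumptions, not the ZKAssign
implementation; the class, the enum, and the `transition` helper are
hypothetical, and only the state names are taken from the log.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration only; this is not how ZKAssign is written.
public class ExpectedStateTransitionSketch {

    // State names borrowed from the log above; the enum itself is made up.
    public enum State { M_ZK_REGION_OFFLINE, RS_ZK_REGION_OPENING, RS_ZK_REGION_OPENED }

    /**
     * Moves the node to 'target' only if it is currently in 'expected'.
     * Returns false, leaving the node untouched, when the current state
     * does not match.
     */
    public static boolean transition(AtomicReference<State> node,
                                     State expected, State target) {
        return node.compareAndSet(expected, target);
    }

    public static void main(String[] args) {
        // The node is already OFFLINE, as in the WARN line.
        AtomicReference<State> node =
                new AtomicReference<>(State.M_ZK_REGION_OFFLINE);

        boolean moved = transition(node,
                State.RS_ZK_REGION_OPENING, State.M_ZK_REGION_OFFLINE);

        System.out.println("moved=" + moved + ", state=" + node.get());
    }
}
```

Because the failed attempt leaves the node exactly where it was, repeating
the same request after each timeout cannot make progress on its own, which
matches the cycling seen in the log.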