[
https://issues.apache.org/jira/browse/HBASE-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
stack resolved HBASE-25.
------------------------
Resolution: Invalid
Fix Version/s: 0.1.0
On a cluster that was running lots of other heavy-duty processes concurrently,
we were seeing lots of regionservers going down because they could not connect
to the master within the lease interval. At Jim Firby's suggestion, I added
logging of how long we were actually sleeping even though we had asked to sleep
for only 3 seconds. Last night during an upload I caught a message that said
we'd slept > 30 seconds, longer than the default sleep period (see HBASE-501).
I'm guessing this phenomenon of threads oversleeping is what we have been
calling 'hung server' up to now.
Closing as invalid. Can reopen if the added logging does NOT account for
region servers failing to check in with the master within the lease period.
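For anyone wanting to reproduce the measurement, here is a rough sketch of the kind of oversleep logging described above. It is not the code that went into HBase: the class name `SleepAudit`, its methods, and the 1-second tolerance are all made up for illustration. It only shows the idea of timing a sleep with a monotonic clock and complaining when the thread was away much longer than requested.

```java
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical helper (not HBase code): sleeps for a requested period and
 * reports how long the thread was actually away.
 */
class SleepAudit {

    /** Milliseconds slept beyond the request; never negative. */
    public static long oversleepMillis(long requestedMillis, long actualMillis) {
        return Math.max(0L, actualMillis - requestedMillis);
    }

    /** True when the sleep ran over the request by more than toleranceMillis. */
    public static boolean overslept(long requestedMillis, long actualMillis,
                                    long toleranceMillis) {
        return oversleepMillis(requestedMillis, actualMillis) > toleranceMillis;
    }

    /** Sleeps, then returns the measured elapsed time in milliseconds. */
    public static long timedSleep(long requestedMillis) throws InterruptedException {
        long start = System.nanoTime(); // monotonic, unaffected by wall-clock changes
        Thread.sleep(requestedMillis);
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) throws InterruptedException {
        long requested = 50L;   // the comment above used 3 seconds; shortened for a demo
        long tolerance = 1000L; // assumed threshold, not taken from the issue
        long actual = timedSleep(requested);
        if (overslept(requested, actual, tolerance)) {
            System.err.println("Requested sleep of " + requested
                + "ms but slept " + actual + "ms");
        }
    }
}
```

With the numbers from the comment above (3 seconds requested, more than 30 seconds actually slept), `oversleepMillis(3000, 33000)` would report an overrun of 30000 ms, which is the sort of gap that would otherwise look like a hung server.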
> [hbase] Stuck regionserver?
> ---------------------------
>
> Key: HBASE-25
> URL: https://issues.apache.org/jira/browse/HBASE-25
> Project: Hadoop HBase
> Issue Type: Bug
> Components: regionserver
> Reporter: stack
> Assignee: stack
> Priority: Trivial
> Fix For: 0.1.0
>
>
> Looking in logs, a regionserver went down because it could not contact the
> master after 60 seconds. Watching logging, the HRS is repeatedly checking
> all 150 loaded regions over and over again w/ a pause of about 5 seconds
> between runs... then there is a suspicious 60+ second gap with no logging as
> though the regionserver had hung up on something:
> {code}
> 2007-12-03 13:14:54,178 DEBUG hbase.HRegionServer - flushing region
> postlog,img151/60/plakatlepperduzy1hh7.jpg,1196614355635
> 2007-12-03 13:14:54,178 DEBUG hbase.HRegion - Not flushing cache for region
> postlog,img151/60/plakatlepperduzy1hh7.jpg,1196614355635: snapshotMemcaches()
> determined that there was nothing to do
> 2007-12-03 13:14:54,205 DEBUG hbase.HRegionServer - flushing region
> postlog,img247/230/seanpaul4li.jpg,1196615889965
> 2007-12-03 13:14:54,205 DEBUG hbase.HRegion - Not flushing cache for region
> postlog,img247/230/seanpaul4li.jpg,1196615889965: snapshotMemcaches()
> determined that there was nothing to do
> 2007-12-03 13:16:04,305 FATAL hbase.HRegionServer - unable to report to
> master for 67467 milliseconds - aborting server
> 2007-12-03 13:16:04,455 INFO hbase.Leases -
> regionserver/0:0:0:0:0:0:0:0:60020 closing leases
> 2007-12-03 13:16:04,455 INFO hbase.Leases$LeaseMonitor -
> regionserver/0:0:0:0:0:0:0:0:60020.leaseChecker exiting
> {code}
> Master seems to be running fine scanning its ~700 regions. Then you see this
> in its log, before the HRS shuts itself down.
> {code}
> 2007-12-03 13:14:31,416 INFO hbase.Leases - HMaster.leaseChecker lease
> expired 153260899/153260899
> 2007-12-03 13:14:31,417 INFO hbase.HMaster - XX.XX.XX.102:60020 lease expired
> {code}
> ... and we go on to process shutdown.