[https://issues.apache.org/jira/browse/HBASE-11813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14108647#comment-14108647]
Hadoop QA commented on HBASE-11813:
-----------------------------------
{color:red}-1 overall{color}. Here are the results of testing the latest
attachment
http://issues.apache.org/jira/secure/attachment/12664068/catch_all_exceptions.txt
against trunk revision .
ATTACHMENT ID: 12664068
{color:green}+1 @author{color}. The patch does not contain any @author
tags.
{color:red}-1 tests included{color}. The patch doesn't appear to include
any new or modified tests.
Please justify why no new tests are needed for this
patch.
Also please list what manual steps were performed to
verify this patch.
{color:red}-1 patch{color}. The patch command could not apply the patch.
Console output:
https://builds.apache.org/job/PreCommit-HBASE-Build/10556//console
This message is automatically generated.
> CellScanner#advance may infinitely recurse
> ------------------------------------------
>
> Key: HBASE-11813
> URL: https://issues.apache.org/jira/browse/HBASE-11813
> Project: HBase
> Issue Type: Bug
> Reporter: Andrew Purtell
> Assignee: stack
> Priority: Blocker
> Fix For: 0.99.0, 2.0.0, 0.98.6
>
> Attachments: 11813.098.txt, 11813.098.txt, 11813.master.txt,
> catch_all_exceptions.txt
>
>
> On user@hbase, a user reported:
> {quote}
> we face a serious issue with our HBase production cluster for two days now.
> Every couple minutes, a random RegionServer gets stuck and does not process
> any requests. In addition this causes the other RegionServers to freeze
> within a minute which brings down the entire cluster. Stopping the affected
> RegionServer unblocks the cluster and everything comes back to normal.
> {quote}
> Subsequent troubleshooting reveals that RPC is getting stuck because we are
> losing RPC handlers. In the .out files we have this:
> {noformat}
> Exception in thread "defaultRpcServer.handler=5,queue=2,port=60020"
> java.lang.StackOverflowError
> at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
> at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
> at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
> at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
> [...]
> Exception in thread "defaultRpcServer.handler=5,queue=2,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=18,queue=0,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=23,queue=2,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=24,queue=0,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=2,queue=2,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=11,queue=2,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=25,queue=1,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=20,queue=2,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=19,queue=1,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=15,queue=0,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=1,queue=1,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=7,queue=1,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=4,queue=1,port=60020"
> java.lang.StackOverflowError
> {noformat}
> That is the anonymous CellScanner instance we create from
> CellUtil#createCellScanner:
> {code}
> return new CellScanner() {
>   private final Iterator<? extends CellScannable> iterator =
>       cellScannerables.iterator();
>   private CellScanner cellScanner = null;
>
>   @Override
>   public Cell current() {
>     return this.cellScanner != null ? this.cellScanner.current() : null;
>   }
>
>   @Override
>   public boolean advance() throws IOException {
>     if (this.cellScanner == null) {
>       if (!this.iterator.hasNext()) return false;
>       this.cellScanner = this.iterator.next().cellScanner();
>     }
>     if (this.cellScanner.advance()) return true;
>     this.cellScanner = null;
>     return advance(); // <--- recurses once per exhausted CellScannable
>   }
> };
> {code}
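> That final return statement is the immediate problem: every CellScannable that has no more cells adds another advance() frame to the stack, so a long enough run of such CellScannables overflows the handler thread's stack.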
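For illustration, here is a minimal sketch of one way to remove the recursion. It applies the same logic as the quoted advance(), rewritten as a loop, so that skipping any number of exhausted CellScannables uses constant stack depth. The Cell, CellScanner, and CellScannable declarations are simplified stand-ins that model only the methods used in the snippet above (the real HBase types have more). The class name and the scannableOf helper are hypothetical. This is not necessarily what the attached patches do.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Simplified stand-ins for the HBase types in the quoted snippet (assumption:
// only the methods used there are modeled; the real interfaces have more).
interface Cell {}

interface CellScanner {
  Cell current();

  boolean advance() throws IOException;
}

interface CellScannable {
  CellScanner cellScanner();
}

public class IterativeCellScannerSketch {

  // Same contract as the quoted anonymous CellScanner, but advance() loops
  // instead of recursing, so skipping N exhausted children costs O(1) stack.
  static CellScanner createCellScanner(final List<? extends CellScannable> cellScannerables) {
    return new CellScanner() {
      private final Iterator<? extends CellScannable> iterator = cellScannerables.iterator();
      private CellScanner cellScanner = null;

      @Override
      public Cell current() {
        return this.cellScanner != null ? this.cellScanner.current() : null;
      }

      @Override
      public boolean advance() throws IOException {
        while (true) {
          if (this.cellScanner == null) {
            if (!this.iterator.hasNext()) return false; // no children left
            this.cellScanner = this.iterator.next().cellScanner();
          }
          if (this.cellScanner.advance()) return true; // current child has a cell
          this.cellScanner = null; // child exhausted: try the next one, no new frame
        }
      }
    };
  }

  // Hypothetical helper for the demo: a CellScannable over a fixed list of cells.
  static CellScannable scannableOf(final List<Cell> cells) {
    return () -> new CellScanner() {
      private int index = -1;

      @Override
      public Cell current() {
        return index >= 0 && index < cells.size() ? cells.get(index) : null;
      }

      @Override
      public boolean advance() {
        index++;
        return index < cells.size();
      }
    };
  }

  public static void main(String[] args) throws IOException {
    List<Cell> empty = new ArrayList<>();
    List<CellScannable> scannables = new ArrayList<>();
    // A run this long would very likely overflow the stack with the recursive version.
    for (int i = 0; i < 1_000_000; i++) {
      scannables.add(scannableOf(empty));
    }
    Cell only = new Cell() {};
    List<Cell> one = new ArrayList<>();
    one.add(only);
    scannables.add(scannableOf(one));

    CellScanner scanner = createCellScanner(scannables);
    System.out.println(scanner.advance()); // true
    System.out.println(scanner.current() == only); // true
    System.out.println(scanner.advance()); // false
  }
}
```

The loop body is unchanged from the original, so callers see the same behavior. The only difference is that the tail call `return advance()` becomes another trip around the loop.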
> We should also fix this so the RegionServer aborts if it loses a handler to
> an Error.
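The second point, aborting instead of silently losing a handler, can be sketched as follows. This minimal example assumes a simple queue-draining handler thread and a hypothetical Aborter callback. The real RPC handler loop and the RegionServer abort path are not shown in this issue and will differ. Ordinary exceptions are assumed to be reported per call, while any java.lang.Error (such as the StackOverflowError above) triggers an abort rather than ending the thread unnoticed.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicReference;

public class AbortOnErrorHandlerSketch {

  // Hypothetical stand-in for whatever the server exposes to shut itself down.
  interface Aborter {
    void abort(String why, Throwable cause);
  }

  // Hypothetical handler loop: runs queued calls one at a time.
  static Runnable handlerLoop(final BlockingQueue<Runnable> calls, final Aborter aborter) {
    return () -> {
      while (!Thread.currentThread().isInterrupted()) {
        Runnable call;
        try {
          call = calls.take();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
        try {
          call.run();
        } catch (Exception e) {
          // Assumption: a real server would return this as the call's error response.
          System.err.println("call failed: " + e);
        } catch (Error e) {
          // The stack has unwound to this frame, so it is safe to request an abort here.
          aborter.abort("handler " + Thread.currentThread().getName() + " hit an Error", e);
          return;
        }
      }
    };
  }

  static void recurseForever() {
    recurseForever();
  }

  public static void main(String[] args) throws InterruptedException {
    BlockingQueue<Runnable> calls = new LinkedBlockingQueue<>();
    AtomicReference<Throwable> abortCause = new AtomicReference<>();
    Thread handler =
        new Thread(handlerLoop(calls, (why, cause) -> abortCause.set(cause)), "handler-0");
    handler.start();
    calls.put(() -> { throw new IllegalStateException("bad request"); }); // handler survives
    calls.put(AbortOnErrorHandlerSketch::recurseForever); // handler requests an abort
    handler.join();
    System.out.println(abortCause.get() instanceof StackOverflowError); // true
  }
}
```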
--
This message was sent by Atlassian JIRA
(v6.2#6252)