Hector concurrentHClient pool gives out more connections than its quota
-----------------------------------------------------------------------
Key: CASSANDRA-2157
URL: https://issues.apache.org/jira/browse/CASSANDRA-2157
Project: Cassandra
Issue Type: Bug
Components: Core
Affects Versions: 0.7.0
Reporter: Yang Yang
Hector's ConcurrentHClient.java can give up on borrowing a connection from the
pool, at line 85 (all references below are to the latest 0.7.0 head):
    } else {
      try {
        cassandraClient = availableClientQueue.poll(maxWaitTimeWhenExhausted,
            TimeUnit.MILLISECONDS);
        if ( cassandraClient == null ) {
          numBlocked.decrementAndGet();
          throw new PoolExhaustedException(String.format(
              "maxWaitTimeWhenExhausted exceeded for thread %s on host %s",
              new Object[]{
                Thread.currentThread().getName(),
                cassandraHost.getName()}
              ));
        }
      } catch (InterruptedException ie) {
        //monitor.incCounter(Counter.POOL_EXHAUSTED);
        numActive.decrementAndGet();
      }
So if we specify a maxWaitTimeWhenExhausted, the borrow can give up and also do
a numActive.decrementAndGet().
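To illustrate the give-up path: a timed poll() on an empty BlockingQueue returns null once the wait expires, which is the condition the snippet above turns into a PoolExhaustedException. The following is only a minimal sketch using java.util.concurrent directly; the class name PollTimeoutSketch and the String element type are hypothetical stand-ins and not part of Hector.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch class; not part of Hector.
class PollTimeoutSketch {
    // Returns true if a timed poll on an empty queue gives up and yields null.
    static boolean timedPollGivesUp(long maxWaitMillis) {
        // Stand-in for the pool's queue of idle clients; empty means "exhausted".
        BlockingQueue<String> availableClientQueue = new ArrayBlockingQueue<>(1);
        try {
            String client = availableClientQueue.poll(maxWaitMillis, TimeUnit.MILLISECONDS);
            return client == null;
        } catch (InterruptedException ie) {
            // Restore the interrupt flag and report that no timeout was observed.
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```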
However, in HConnectionManager.java, in the main loop of

    public void operateWithFailover(Operation<?> op) throws HectorException {

the call

    client = getClientFromLBPolicy(excludeHosts);

can throw an exception. The catch block has a clause for PoolExhaustedException:
    } else if ( he instanceof PoolExhaustedException ) {
      retryable = true;
      --retries;
      if ( hostPools.size() == 1 ) {
        throw he;
      }
      monitor.incCounter(Counter.POOL_EXHAUSTED);
      excludeHosts.add(client.cassandraHost);
    }
I guess this clause was written for the timeout scenario above and is supposed
to catch it. But getClientFromLBPolicy() constructs a general HectorException
from the PoolExhaustedException thrown by borrowClient(), so the instanceof
check never matches. As a result, every pool-borrow timeout propagates
immediately to the caller, which I guess is not the original intention.
So I think getClientFromLBPolicy() needs to rethrow the original exception
directly, so that the logic in the catch block is triggered.
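The effect of wrapping versus rethrowing can be sketched as follows. This is not Hector's code: SketchHectorException, SketchPoolExhaustedException, and the methods of RethrowSketch are hypothetical stand-ins that only mimic the exception hierarchy and the instanceof test described above.

```java
// Hypothetical stand-ins for Hector's exception types; not the real classes.
class SketchHectorException extends RuntimeException {
    SketchHectorException(String msg) { super(msg); }
    SketchHectorException(String msg, Throwable cause) { super(msg, cause); }
}

class SketchPoolExhaustedException extends SketchHectorException {
    SketchPoolExhaustedException(String msg) { super(msg); }
}

class RethrowSketch {
    // Simulates a borrow that times out on an exhausted pool.
    static void borrowClient() {
        throw new SketchPoolExhaustedException("maxWaitTimeWhenExhausted exceeded");
    }

    // Wrapping variant: the subtype is lost, so an instanceof test upstream fails.
    static void getClientWrapping() {
        try {
            borrowClient();
        } catch (SketchHectorException e) {
            throw new SketchHectorException("borrow failed", e);
        }
    }

    // Rethrowing variant: the original exception reaches the caller unchanged.
    static void getClientRethrowing() {
        borrowClient();
    }

    // Mirrors the instanceof check in the failover catch block.
    static boolean isHandledAsPoolExhausted(Runnable lookup) {
        try {
            lookup.run();
            return false;
        } catch (SketchHectorException he) {
            return he instanceof SketchPoolExhaustedException;
        }
    }
}
```

With the wrapping variant the check is false, so the retry-on-another-host branch is skipped; with the rethrowing variant it is true.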
But after I made those changes, I found that the pool often reports a negative
ActiveNum() and a TillExhausted higher than the quota, which does not make
sense.
The cause is that every code path goes through the releaseClient() call in the
finally {} clause. So when the pool borrow fails, numActive.decrementAndGet()
has already been executed, and it is executed again in the finally clause.
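The double decrement can be reproduced in isolation. The sketch below is hypothetical (CounterSketch and its methods are not Hector code) and assumes the failed borrow has already undone its own increment of numActive before throwing; the guarded variant shows one possible way to avoid the second decrement, namely releasing only when a client was actually obtained.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the counter accounting; not part of Hector.
class CounterSketch {
    final AtomicInteger numActive = new AtomicInteger();

    // Assumed behaviour: the borrow counts itself active, fails, and undoes that.
    Object borrowFailing() {
        numActive.incrementAndGet();
        numActive.decrementAndGet();
        throw new IllegalStateException("pool exhausted");
    }

    void release() {
        numActive.decrementAndGet();
    }

    // Buggy shape: finally always releases, even when nothing was borrowed.
    int buggy() {
        Object client = null;
        try {
            client = borrowFailing();
        } catch (IllegalStateException e) {
            // swallowed here only to keep the sketch short
        } finally {
            release();
        }
        return numActive.get();
    }

    // Guarded shape: release only if the borrow actually returned a client.
    int guarded() {
        Object client = null;
        try {
            client = borrowFailing();
        } catch (IllegalStateException e) {
            // swallowed here only to keep the sketch short
        } finally {
            if (client != null) {
                release();
            }
        }
        return numActive.get();
    }
}
```

On a fresh instance the buggy shape leaves the counter below zero, which matches the negative ActiveNum() described above, while the guarded shape leaves it at zero.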
This ends up creating many connections to the server, which bogs it down; we
have seen it create a huge CPU load.