Hector concurrentHClient pool gives out more connections than its quota
-----------------------------------------------------------------------

                 Key: CASSANDRA-2157
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2157
             Project: Cassandra
          Issue Type: Bug
          Components: Core
    Affects Versions: 0.7.0
            Reporter: Yang Yang


Hector's ConcurrentHClient.java can give up while waiting to borrow a
connection from the pool, at line 85 (all line references below are to the
latest 0.7.0 head):


     } else {

        try {
          cassandraClient = availableClientQueue.poll(maxWaitTimeWhenExhausted, TimeUnit.MILLISECONDS);
          if ( cassandraClient == null ) {
            numBlocked.decrementAndGet();
            throw new PoolExhaustedException(String.format("maxWaitTimeWhenExhausted exceeded for thread %s on host %s",
                new Object[]{
                Thread.currentThread().getName(),
                cassandraHost.getName()}
            ));
          }
        } catch (InterruptedException ie) {
          //monitor.incCounter(Counter.POOL_EXHAUSTED);
          numActive.decrementAndGet();
        }

So if we specify a max wait time, the borrow can give up and do a
numActive.decrementAndGet().
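
For reference, here is a minimal sketch of the timed-poll behaviour the
snippet above relies on: BlockingQueue.poll(timeout, unit) returns null when
the wait elapses with nothing available. The class and method names
(TimedPollSketch, tryBorrow) are made up for illustration and are not part of
Hector.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a pool's "borrow with a max wait" step.
class TimedPollSketch {
    // Returns true if a client was obtained within maxWaitMillis,
    // false if the wait timed out or the thread was interrupted.
    static boolean tryBorrow(BlockingQueue<String> queue, long maxWaitMillis) {
        try {
            String client = queue.poll(maxWaitMillis, TimeUnit.MILLISECONDS);
            return client != null; // null means the wait timed out
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return false;
        }
    }

    public static void main(String[] args) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        System.out.println(tryBorrow(queue, 10)); // empty queue: times out
        queue.add("client-1");
        System.out.println(tryBorrow(queue, 10)); // element available
    }
}
```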


But in HConnectionManager.java:

  public void operateWithFailover(Operation<?> op) throws HectorException {

In the main loop of this method, the call

        client =  getClientFromLBPolicy(excludeHosts);

can throw an exception. The catch block has a clause for this case:

        } else if ( he instanceof PoolExhaustedException ) {
          retryable = true;
          --retries;
          if ( hostPools.size() == 1 ) {
            throw he;
          }
          monitor.incCounter(Counter.POOL_EXHAUSTED);
          excludeHosts.add(client.cassandraHost);
        }

I guess this clause was written for the timeout scenario above, so it is
supposed to catch that exception. But getClientFromLBPolicy() constructs a
general HectorException from the PoolExhaustedException thrown by
borrowClient(), so the instanceof check never matches. As a result, every
pool-borrow timeout propagates immediately to the caller, which I guess is
not the original intention.

So I think getClientFromLBPolicy() needs to rethrow the original exception
directly, so as to trigger the logic in the catch block.
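
To illustrate the difference between wrapping and rethrowing, here is a small
self-contained sketch. The exception classes and methods below
(SketchException, SketchPoolExhausted, borrowWrapped, borrowRethrown,
isRetryable) are stand-ins invented for this example; they only mimic the
shape of the Hector code described above and are not its real API.

```java
// Stand-in exception types defined only for this sketch.
class SketchException extends RuntimeException {
    SketchException(String msg) { super(msg); }
    SketchException(Throwable cause) { super(cause); }
}

class SketchPoolExhausted extends SketchException {
    SketchPoolExhausted(String msg) { super(msg); }
}

class RethrowSketch {
    // Hypothetical borrow that always times out.
    static Object borrow() {
        throw new SketchPoolExhausted("max wait time exceeded");
    }

    // Wrapping: the caller sees only the general type, the subtype is lost.
    static Object borrowWrapped() {
        try {
            return borrow();
        } catch (SketchException e) {
            throw new SketchException(e);
        }
    }

    // Rethrowing: the caller still sees the original subtype.
    static Object borrowRethrown() {
        try {
            return borrow();
        } catch (SketchException e) {
            throw e;
        }
    }

    // Mirrors the catch clause's shape: retry only on pool exhaustion.
    static boolean isRetryable(Runnable attempt) {
        try {
            attempt.run();
            return false;
        } catch (SketchException e) {
            return e instanceof SketchPoolExhausted;
        }
    }

    public static void main(String[] args) {
        System.out.println(isRetryable(RethrowSketch::borrowWrapped));  // false
        System.out.println(isRetryable(RethrowSketch::borrowRethrown)); // true
    }
}
```

In the wrapped case the instanceof test fails, so the retry branch is never
taken; in the rethrown case it matches.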

But after I made that change, I often found the pool's ActiveNum() to be
negative and TillExhausted to be higher than the quota, which does not make
sense. The cause is that every code path goes through releaseClient() in the
finally {} clause: when the pool borrow fails, numActive.decrementAndGet()
has already been executed, and it is then executed a second time in the
finally clause.
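
The double decrement can be reproduced with a small single-threaded sketch.
Everything here (DoubleDecrementSketch, borrowFailing, remaining, the
guarded-release variant) is hypothetical and only models the counting
described above; the guard is one possible way to avoid the problem, not
necessarily the fix Hector should adopt.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of a pool's active-connection counter.
class DoubleDecrementSketch {
    private final AtomicInteger numActive = new AtomicInteger();

    // Borrow that fails like a timed-out poll: it counts the caller as
    // active, then undoes that increment itself before throwing.
    Object borrowFailing() {
        numActive.incrementAndGet();
        numActive.decrementAndGet(); // failure path already decrements
        throw new IllegalStateException("pool exhausted");
    }

    void release() { numActive.decrementAndGet(); }

    int active() { return numActive.get(); }

    // Remaining capacity as a pool might report it: quota minus active.
    int remaining(int quota) { return quota - numActive.get(); }

    // Pattern from the report: release runs in finally on every path.
    void operateAlwaysRelease() {
        try {
            borrowFailing();
        } catch (IllegalStateException e) {
            // give up on this attempt
        } finally {
            release(); // second decrement for a client never obtained
        }
    }

    // Guarded variant: release only a client that was actually borrowed.
    void operateGuardedRelease() {
        Object client = null;
        try {
            client = borrowFailing();
        } catch (IllegalStateException e) {
            // give up on this attempt
        } finally {
            if (client != null) {
                release();
            }
        }
    }

    public static void main(String[] args) {
        DoubleDecrementSketch buggy = new DoubleDecrementSketch();
        buggy.operateAlwaysRelease();
        System.out.println(buggy.active());      // -1
        System.out.println(buggy.remaining(10)); // 11, above the quota of 10

        DoubleDecrementSketch guarded = new DoubleDecrementSketch();
        guarded.operateGuardedRelease();
        System.out.println(guarded.active());    // 0
    }
}
```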



This ends up creating many connections to the server, which bogs it down; we
have seen it cause a huge CPU load.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
