Hi everyone

We are having problems with our cluster (7 nodes, version 2.0.17) when running 
"massive deletes" on one of the nodes (via the cqlsh command line). At the 
beginning everything is fine, but after a while we start getting constant 
NoHostAvailableException errors from the DataStax driver:

Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All 
host(s) tried for query failed (tried: /172.31.7.243:9042 
(com.datastax.driver.core.exceptions.DriverException: Timeout while trying to 
acquire available connection (you may want to increase the driver number of 
per-host connections)), /172.31.7.245:9042 
(com.datastax.driver.core.exceptions.DriverException: Timeout while trying to 
acquire available connection (you may want to increase the driver number of 
per-host connections)), /172.31.7.246:9042 
(com.datastax.driver.core.exceptions.DriverException: Timeout while trying to 
acquire available connection (you may want to increase the driver number of 
per-host connections)), /172.31.7.247:9042, /172.31.7.232:9042, 
/172.31.7.233:9042, /172.31.7.244:9042 [only showing errors of first 3 hosts, 
use getErrors() for more details])
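
For reference, this is roughly how the per-host connection limits that the error 
message refers to can be raised with the 2.x Java driver; the contact point and 
the numbers below are only illustrative:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.HostDistance;
import com.datastax.driver.core.PoolingOptions;

// Illustrative values only: raise the core/max connections the driver keeps
// open to each node (LOCAL distance covers the nodes of our single DC).
PoolingOptions pooling = new PoolingOptions()
    .setCoreConnectionsPerHost(HostDistance.LOCAL, 4)
    .setMaxConnectionsPerHost(HostDistance.LOCAL, 8);

Cluster cluster = Cluster.builder()
    .addContactPoint("172.31.7.243")
    .withPoolingOptions(pooling)
    .build();

My feeling, though, is that a bigger pool would only hide the symptom if the 
nodes themselves cannot keep up.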


All the nodes are up and running according to nodetool status:

UN  172.31.7.244  152.21 GB  256     14.5%  58abea69-e7ba-4e57-9609-24f3673a7e58  RAC1
UN  172.31.7.245  168.4 GB   256     14.5%  bc11b4f0-cf96-4ca5-9a3e-33cc2b92a752  RAC1
UN  172.31.7.246  177.71 GB  256     13.7%  8dc7bb3d-38f7-49b9-b8db-a622cc80346c  RAC1
UN  172.31.7.247  158.57 GB  256     14.1%  94022081-a563-4042-81ab-75ffe4d13194  RAC1
UN  172.31.7.243  176.83 GB  256     14.6%  0dda3410-db58-42f2-9351-068bdf68f530  RAC1
UN  172.31.7.233  159 GB     256     13.6%  01e013fb-2f57-44fb-b3c5-fd89d705bfdd  RAC1
UN  172.31.7.232  166.05 GB  256     15.0%  4d009603-faa9-4add-b3a2-fe24ec16a7c1

but two of them have a high CPU load, especially 172.31.7.232, because I am 
running a lot of deletes using cqlsh on that node.

I know that deletes generate tombstones, but with 7 nodes in the cluster I do 
not think it is normal for all the hosts to become inaccessible.

We have a replication factor of 3, and for the deletes I am not setting any 
consistency level (so they use the default, ONE).
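
Just to make the consistency part concrete: if the deletes were issued through 
the Java driver instead of cqlsh, the consistency level could be set explicitly 
per statement, roughly like this (keyspace, table and key are made up):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

Cluster cluster = Cluster.builder().addContactPoint("172.31.7.243").build();
Session session = cluster.connect();

// Hypothetical keyspace/table/key; ONE is what the deletes currently default to.
Statement delete = new SimpleStatement(
        "DELETE FROM my_keyspace.my_table WHERE id = 42")
    .setConsistencyLevel(ConsistencyLevel.ONE);

session.execute(delete);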

I checked the nodes with high CPU (near 96%), and GC activity stays at about 
1.6% (the JVM is using only 3 GB of the 10 GB it has assigned). But looking at 
the thread pool stats (nodetool tpstats), the pending column for the mutation 
stage grows without stopping; could that be the problem?
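
The deletes are currently fired one after another from cqlsh, but in case 
pacing them turns out to be the answer, a driver-side loop that caps how many 
deletes are in flight at once would look roughly like this sketch (keyspace, 
table and keys are made up):

import java.util.concurrent.Semaphore;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.google.common.util.concurrent.FutureCallback;
import com.google.common.util.concurrent.Futures;

// Sketch: throttle asynchronous deletes so only a bounded number are in
// flight at once, which should give the mutation stage a chance to drain.
public class ThrottledDeletes {
    public static void main(String[] args) throws Exception {
        Cluster cluster = Cluster.builder().addContactPoint("172.31.7.243").build();
        Session session = cluster.connect();

        final Semaphore inFlight = new Semaphore(32);       // illustrative limit

        for (long id = 0; id < 1000000; id++) {              // hypothetical keys
            inFlight.acquire();                              // block while 32 deletes are pending
            ResultSetFuture f = session.executeAsync(new SimpleStatement(
                    "DELETE FROM my_keyspace.my_table WHERE id = " + id));
            Futures.addCallback(f, new FutureCallback<Object>() {
                public void onSuccess(Object result) { inFlight.release(); }
                public void onFailure(Throwable t)   { inFlight.release(); }
            });
        }

        inFlight.acquire(32);   // wait for the last in-flight deletes to finish
        cluster.close();
    }
}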

I cannot find the root cause of the timeouts. I have already increased the 
timeouts, but I do not think that is a solution, because the timeouts point to 
a different kind of problem. Does anyone have a tip on how to determine where 
the problem is?
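
For completeness, a timeout increase on the driver side looks roughly like this 
(the value is only an example); there are also the per-request timeouts in 
cassandra.yaml on the server side:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.SocketOptions;

// Illustrative only: raise the driver's per-request read timeout to 30 s.
Cluster cluster = Cluster.builder()
    .addContactPoint("172.31.7.243")
    .withSocketOptions(new SocketOptions().setReadTimeoutMillis(30000))
    .build();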

Thanks in advance
