Hello Paco,
> the mutation stage's pending column grows without stopping; could that
> be the problem?
> CPU (near 96%)

Yes, basically I think you are overusing this cluster.

> but two of them have a high CPU load, especially the 232, because I am
> running a lot of deletes using cqlsh on that node.

Solutions would be to run the deletes at a slower, constant pace, against
all the nodes, using a balancing policy (rough sketch below), or to add
capacity if all the nodes are facing the issue and you cannot slow the
deletes down.
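Something along these lines with the 2.x DataStax Java driver is a minimal
sketch of that idea; the contact point is one of your nodes, while the
keyspace, table, key list, and pacing value are made-up placeholders:

import java.util.Arrays;
import java.util.List;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.RoundRobinPolicy;

public class ThrottledDeletes {
    public static void main(String[] args) throws InterruptedException {
        // Round-robin rotates the coordinator across every node, instead of
        // funnelling all deletes through the one node cqlsh is attached to.
        Cluster cluster = Cluster.builder()
                .addContactPoint("172.31.7.244")
                .withLoadBalancingPolicy(new RoundRobinPolicy())
                .build();
        Session session = cluster.connect("my_keyspace");   // placeholder keyspace

        PreparedStatement delete =
                session.prepare("DELETE FROM my_table WHERE id = ?"); // placeholder table

        List<String> ids = Arrays.asList("k1", "k2", "k3"); // stand-in for the real keys
        for (String id : ids) {
            session.execute(delete.bind(id));
            Thread.sleep(50); // crude constant pace; tune while watching pending mutations
        }
        cluster.close();
    }
}

The sleep is the bluntest possible throttle; the point is only that the
coordinator role rotates and the cluster sees a steady, bounded stream of
deletes rather than a burst on a single node.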
You should also have a look at iowait and steal, to see whether the CPUs
really are 100% busy or are masking another issue (a disk not answering
fast enough, or a hardware / shared-instance problem). I had some noisy
neighbours at some point while using Cassandra on AWS.

> I cannot find the reason that originates the timeouts.

I don't find that so strange while some/all of the nodes are being
overused.

> I have already increased the timeouts, but I do not think that is a
> solution because the timeouts indicate another type of error.

Any relevant logs on the Cassandra nodes (other than the dropped mutations
INFO)?

> 7 nodes, version 2.0.17

Note: be aware that this Cassandra version is quite old and no longer
supported, so you might be facing issues that have already been fixed. I
know that upgrading is not straightforward, but 2.0 --> 2.1 brings an
amazing set of optimisations and some fixes too. You should try it out :-).

C*heers,
-----------------------
Alain Rodriguez - al...@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2016-04-04 14:33 GMT+02:00 Paco Trujillo <f.truji...@genetwister.nl>:

> Hi everyone,
>
> We are having problems with our cluster (7 nodes, version 2.0.17) when
> running “massive deletes” on one of the nodes (via the cql command
> line). At the beginning everything is fine, but after a while we start
> getting constant NoHostAvailableExceptions from the DataStax driver:
>
> Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException:
> All host(s) tried for query failed (tried: /172.31.7.243:9042
> (com.datastax.driver.core.exceptions.DriverException: Timeout while trying
> to acquire available connection (you may want to increase the driver number
> of per-host connections)), /172.31.7.245:9042
> (com.datastax.driver.core.exceptions.DriverException: Timeout while trying
> to acquire available connection (you may want to increase the driver number
> of per-host connections)), /172.31.7.246:9042
> (com.datastax.driver.core.exceptions.DriverException: Timeout while trying
> to acquire available connection (you may want to increase the driver number
> of per-host connections)), /172.31.7.247:9042, /172.31.7.232:9042,
> /172.31.7.233:9042, /172.31.7.244:9042 [only showing errors of first 3
> hosts, use getErrors() for more details])
>
> All the nodes are up and running:
>
> UN  172.31.7.244  152.21 GB  256  14.5%  58abea69-e7ba-4e57-9609-24f3673a7e58  RAC1
> UN  172.31.7.245  168.4 GB   256  14.5%  bc11b4f0-cf96-4ca5-9a3e-33cc2b92a752  RAC1
> UN  172.31.7.246  177.71 GB  256  13.7%  8dc7bb3d-38f7-49b9-b8db-a622cc80346c  RAC1
> UN  172.31.7.247  158.57 GB  256  14.1%  94022081-a563-4042-81ab-75ffe4d13194  RAC1
> UN  172.31.7.243  176.83 GB  256  14.6%  0dda3410-db58-42f2-9351-068bdf68f530  RAC1
> UN  172.31.7.233  159 GB     256  13.6%  01e013fb-2f57-44fb-b3c5-fd89d705bfdd  RAC1
> UN  172.31.7.232  166.05 GB  256  15.0%  4d009603-faa9-4add-b3a2-fe24ec16a7c1
>
> but two of them have a high CPU load, especially the 232, because I am
> running a lot of deletes using cqlsh on that node.
>
> I know that deletes generate tombstones, but with 7 nodes in the cluster
> I do not think it is normal that all the hosts become inaccessible.
>
> We have a replication factor of 3, and for the deletes I am not setting
> any consistency level (so it is using the default, ONE).
>
> I checked the nodes with high CPU (near 96%): GC activity remains around
> 1.6% (using only 3 GB of the 10 GB assigned). But looking at the thread
> pool stats, the mutation stage's pending column grows without stopping;
> could that be the problem?
>
> I cannot find the reason that originates the timeouts. I have already
> increased the timeouts, but I do not think that is a solution because the
> timeouts indicate another type of error. Does anyone have a tip to help
> determine where the problem is?
>
> Thanks in advance
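For reference, the "you may want to increase the driver number of per-host
connections" hint in the errors above maps to PoolingOptions in the 2.x
Java driver. A minimal sketch, with illustrative connection counts:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.HostDistance;
import com.datastax.driver.core.PoolingOptions;

public class WiderPool {
    public static void main(String[] args) {
        // A wider per-host pool gives the driver more in-flight capacity,
        // so requests spend less time waiting for a free connection.
        PoolingOptions pooling = new PoolingOptions()
                .setCoreConnectionsPerHost(HostDistance.LOCAL, 4)   // illustrative
                .setMaxConnectionsPerHost(HostDistance.LOCAL, 16);  // illustrative

        Cluster cluster = Cluster.builder()
                .addContactPoint("172.31.7.244")
                .withPoolingOptions(pooling)
                .build();

        cluster.connect().close(); // no keyspace; just showing the wiring
        cluster.close();
    }
}

If the nodes themselves are saturated, though, a wider pool only moves the
queue around; pacing the deletes, as sketched earlier, is the more durable
fix.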