Vassil Hristov created CASSANDRA-4274:
-----------------------------------------
Summary: Cassandra cluster becomes very slow and loses data after
node failure
Key: CASSANDRA-4274
URL: https://issues.apache.org/jira/browse/CASSANDRA-4274
Project: Cassandra
Issue Type: Bug
Affects Versions: 1.0.8
Environment: Linux version 2.6.32-5-amd64 (Debian 2.6.32-41)
([email protected]) (gcc version 4.3.5 (Debian 4.3.5-4) ) #1 SMP Mon Jan 16
16:22:28 UTC 2012
Debian GNU/Linux 6.0
Cassandra 1.0.8 Debian package
Reporter: Vassil Hristov
Hi,
in a nutshell: today we experienced a problem with one of our clusters. Our
application became very slow and it turned out to be caused by Cassandra.
Additionally, some data was not persisted properly. A reboot of one of the
nodes fixed the problem, in the sense that the application is now responsive
again and data is written properly; the lost data, however, was not recovered.
Now some more details.
The setup: we have 2 nodes, running Cassandra 1.0.8. In our Java application,
we use Hector to connect to the nodes. We store some log data in Cassandra. The
relevant method looks like this:
{code:java}
storeMessage(mutator, key, message, ttl);
storeMessageInIndex(mutator, key, message, ttl);
{code}
In the first method, the entire message is stored in the column family
{{cfMainData}} under the provided key, and in the second we maintain a manual
index, which is stored in a different column family ({{cfDateOrderedMessages}})
under the same key.
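For context, here is a minimal sketch of what the two methods do (assuming
Hector's {{Mutator}} API; the column names and serializers are illustrative,
not our actual code):

{code:java}
import me.prettyprint.cassandra.serializers.LongSerializer;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

// Stores the entire message in cfMainData under the provided key.
void storeMessage(Mutator<String> mutator, String key, String message, int ttl) {
    mutator.addInsertion(key, "cfMainData",
            HFactory.createColumn("body", message, ttl,
                    StringSerializer.get(), StringSerializer.get()));
}

// Maintains the manual, date-ordered index in cfDateOrderedMessages under the
// same key; using a timestamp as the column name is an illustrative guess.
void storeMessageInIndex(Mutator<String> mutator, String key, String message, int ttl) {
    mutator.addInsertion(key, "cfDateOrderedMessages",
            HFactory.createColumn(System.currentTimeMillis(), key, ttl,
                    LongSerializer.get(), StringSerializer.get()));
}
{code}

Both insertions are queued on the same mutator; if it is executed once after
both calls, they reach the cluster in a single batch_mutate request.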
The problem: our support reported that certain operations take extremely long
(200+ seconds, compared to the usual <1 second). According to {{nodetool ring}}
both nodes were up and running. When we checked the data, some of it was not
accessible. What's really odd, though, is that the index was maintained
properly while the main data was missing. That is, the index would hold a key
X, but {{cfMainData[X]}} would return no results.
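A sketch of how this can be checked through Hector (the keyspace handle and
the slice parameters are assumptions, not our actual code):

{code:java}
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.ColumnSlice;
import me.prettyprint.hector.api.factory.HFactory;

// Returns true if cfMainData holds no columns for the given key, even though
// the key is present in the cfDateOrderedMessages index.
boolean mainDataMissing(Keyspace keyspace, String key) {
    ColumnSlice<String, String> slice = HFactory
            .createSliceQuery(keyspace, StringSerializer.get(),
                    StringSerializer.get(), StringSerializer.get())
            .setColumnFamily("cfMainData")
            .setKey(key)
            .setRange("", "", false, 10)
            .execute()
            .get();
    return slice.getColumns().isEmpty();
}
{code}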
After restarting one of the nodes (192.186.1.7, for the log reference),
everything went back to normal and has been working correctly since.
I am well aware that it's very likely that you won't be able to reproduce the
problem (we cannot either). However, maybe you'll figure out why the 'broken'
node wasn't marked as such. The behaviour I would have expected is that all
writes would fail, since quorum cannot be reached: with two nodes and both
holding replicas, a quorum means both must respond. The result would again be
lost data, but it would be a more consistent behaviour.
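For reference, a minimal sketch of configuring quorum reads and writes with
Hector (cluster name, host and keyspace name are placeholders, not our actual
setup):

{code:java}
import me.prettyprint.cassandra.model.ConfigurableConsistencyLevel;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.HConsistencyLevel;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;

static Keyspace quorumKeyspace() {
    Cluster cluster = HFactory.getOrCreateCluster("LogCluster", "192.186.1.7:9160");

    // Require a majority of replicas to acknowledge every read and write.
    // With two nodes and RF=2, a quorum is both replicas, so writes should
    // fail with UnavailableException while one node is marked down.
    ConfigurableConsistencyLevel policy = new ConfigurableConsistencyLevel();
    policy.setDefaultReadConsistencyLevel(HConsistencyLevel.QUORUM);
    policy.setDefaultWriteConsistencyLevel(HConsistencyLevel.QUORUM);

    return HFactory.createKeyspace("LogKeyspace", cluster, policy);
}
{code}

Failing fast like this would have cost us the same writes, but the
inconsistency between the index and the main data could not have appeared.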
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira