Greetings Cassandra Developers! We've been trying to benchmark Cassandra performance and have developed a test client written in C++ that uses multiple threads to send out a large number of write and read requests (as fast as the server can handle them).
One of the results we're seeing is a bit surprising, and I'm hoping someone here can help shed some light on the topic. As far as I can tell, it hasn't been discussed on the mailing list.

Most of the requests return in a reasonable amount of time (tens or hundreds of milliseconds), but every once in a while, the server seems to just "stop" for up to several seconds. During this time, all the reads and writes take several seconds to complete, and network traffic in and out of the system drops off to nearly zero. When plotted on a graph, these appear as very large spikes every few minutes, though without any particular pattern to how often they occur. Even though the average response time is very good (and therefore we get a reasonable number of requests/sec), these occasional outliers are a showstopper for our potential applications.

We've experimented with a number of machines of different capabilities, including a range of physical machines and clusters of machines on Amazon's EC2. We've also used different numbers of nodes in the cluster and different values for ReplicationFactor. All are qualitatively similar, though the numbers vary as expected (i.e. faster machines improve both the average and maximum numbers, but the max values are still on the order of seconds). I know Cassandra has lots of configuration parameters that can be tweaked, but most of the other parameters are left at the default values of Cassandra 0.6.2 or 0.6.3.

Has anyone else seen nodes "hang" for several seconds like this? I'm not sure if this is a Java VM issue (e.g. garbage collection) or something specific to the Cassandra application.

I'll be happy to share more details of our experiments either on the mailing list or with interested parties offline. But I thought I'd start with a brief description and see how consistent it is with other experiences. I'm sort of expecting to see "Well, of course you'll see that kind of behavior because you didn't change..."
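
To make the mean-versus-maximum distinction concrete, here is a minimal sketch of the kind of measurement described above. It is not the actual test client: `run_benchmark`, `summarize`, and the `request` callback are hypothetical names, and `request` is only a stand-in for a single Cassandra read or write.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Summary of per-request latencies, in milliseconds.
struct LatencyStats {
    std::size_t count = 0;
    double mean_ms = 0.0;
    double max_ms = 0.0;
    std::size_t outliers = 0;  // requests slower than the threshold
};

// Reduce raw samples to mean, max, and the number of samples above threshold_ms.
LatencyStats summarize(const std::vector<double>& samples_ms, double threshold_ms) {
    LatencyStats s;
    s.count = samples_ms.size();
    if (samples_ms.empty()) return s;
    double sum = 0.0;
    for (double v : samples_ms) {
        sum += v;
        s.max_ms = std::max(s.max_ms, v);
        if (v > threshold_ms) ++s.outliers;
    }
    s.mean_ms = sum / static_cast<double>(s.count);
    return s;
}

// Call `request` from several threads, timing each call with a monotonic clock.
// `request` is a hypothetical placeholder for one read or write against the server.
std::vector<double> run_benchmark(const std::function<void()>& request,
                                  int num_threads, int requests_per_thread) {
    assert(num_threads > 0 && requests_per_thread >= 0);
    std::vector<double> all_ms;
    std::mutex m;
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            // Collect samples locally so the lock is taken once per thread.
            std::vector<double> local;
            local.reserve(static_cast<std::size_t>(requests_per_thread));
            for (int i = 0; i < requests_per_thread; ++i) {
                auto start = std::chrono::steady_clock::now();
                request();
                auto end = std::chrono::steady_clock::now();
                local.push_back(
                    std::chrono::duration<double, std::milli>(end - start).count());
            }
            std::lock_guard<std::mutex> lock(m);
            all_ms.insert(all_ms.end(), local.begin(), local.end());
        });
    }
    for (auto& w : workers) w.join();
    return all_ms;
}
```

Calling `summarize(run_benchmark(request, 8, 10000), 1000.0)` would report how many requests took longer than one second alongside the mean and maximum. The thread count, request count, and threshold are illustrative values, not the ones used in our experiments.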
I'm also interested in comparing notes with anyone else who has been doing read/write throughput benchmarks with Cassandra.

Thanks in advance for any information or suggestions you may have!

-- 
Peter Fales
Alcatel-Lucent
Member of Technical Staff
1960 Lucent Lane
Room: 9H-505
Naperville, IL 60566-7033
Email: peter.fa...@alcatel-lucent.com
Phone: 630 979 8031