Greetings Cassandra Developers!

We've been trying to benchmark Cassandra performance and have 
developed a test client written in C++ that uses multiple threads to 
send out a large number of write and read requests (as fast as the
server can handle them).   

One of the results we're seeing is a bit surprising, and I'm hoping
someone here can help shed some light on the topic - as far as I can
tell, it hasn't been discussed on the mailing list.

Most of the requests return in a reasonable amount of time (10s or
100s of milliseconds), but every once in a while, the server seems to
just "stop" for up to several seconds.  During this time, all the
reads and writes take several seconds to complete, and network traffic
in and out of the system drops off to nearly zero.  When plotted on a
graph, these appear as very large spikes every few minutes, though without
any particular pattern to how often the spikes occur.  Even though
the average response time is very good (and therefore we get a reasonable
number of requests/sec), these occasional outliers are a showstopper for
our potential applications.
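
A minimal sketch of how such outliers can be separated from the average,
assuming the client records each request's latency in milliseconds.  This is
not the test client described above; the names LatencySummary, percentile,
and summarize are hypothetical and exist only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Summary of per-request latencies, all in milliseconds.
struct LatencySummary {
    double mean_ms;
    double p99_ms;
    double max_ms;
};

// Nearest-rank percentile over a sorted, non-empty sample set.
// pct is an integer in the range 1..100.
double percentile(const std::vector<double>& sorted_ms, int pct) {
    assert(!sorted_ms.empty());
    assert(pct >= 1 && pct <= 100);
    const std::size_t n = sorted_ms.size();
    // Integer ceiling of (pct * n / 100), so no floating-point rounding.
    std::size_t rank = (static_cast<std::size_t>(pct) * n + 99) / 100;
    if (rank < 1) rank = 1;
    if (rank > n) rank = n;
    return sorted_ms[rank - 1];
}

// Takes the samples by value so the caller's vector is left unsorted.
LatencySummary summarize(std::vector<double> samples_ms) {
    assert(!samples_ms.empty());
    std::sort(samples_ms.begin(), samples_ms.end());
    const double sum =
        std::accumulate(samples_ms.begin(), samples_ms.end(), 0.0);
    LatencySummary s;
    s.mean_ms = sum / static_cast<double>(samples_ms.size());
    s.p99_ms = percentile(samples_ms, 99);
    s.max_ms = samples_ms.back();
    return s;
}
```

With a made-up sample of 998 requests at 10 ms and 2 requests at 3000 ms, the
mean and the 99th percentile both stay low while the maximum exposes the
stall, which is why the maximum (or a very high percentile) is worth tracking
alongside the average.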

We've experimented with a number of different machines of different 
capabilities including a range of physical machines, and clusters of
machines on Amazon's EC2.  We've also used different numbers of nodes
in the cluster and different values for ReplicationFactor.  The results
are all qualitatively similar, though the numbers vary as expected (i.e.
faster machines improve both the average and maximum numbers, but the
max values are still on the order of seconds).

I know Cassandra has lots of configuration parameters that can be
tweaked, but most of the other parameters are left at the default
values of Cassandra 0.6.2 or 0.6.3.

Has anyone else seen nodes "hang" for several seconds like this?  I'm
not sure if this is a Java VM issue (e.g. garbage collection) or something
specific to the Cassandra application.   I'll be happy to share more 
details of our experiments either on the mailing list, or with interested
parties offline.  But I thought I'd start with a brief description and 
see how consistent it is with other experiences.   I'm sort of expecting
to see "Well, of course you'll see that kind of behavior because you
didn't change..."

I'm also interested in comparing notes with anyone else who has been doing
read/write throughput benchmarks with Cassandra.

Thanks in advance for any information or suggestions you may have!

-- 
Peter Fales
Alcatel-Lucent
Member of Technical Staff
1960 Lucent Lane
Room: 9H-505
Naperville, IL 60566-7033
Email: peter.fa...@alcatel-lucent.com
Phone: 630 979 8031
