Thank you for the benchmarks. What version of Cassandra are you using? I saw about an 80% performance improvement on single-node reads after using a trunk build with the results from https://issues.apache.org/jira/browse/CASSANDRA-688 (result caching) and playing around with the configuration. I am not yet running this in production, though, so I cannot provide any real numbers.
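For anyone wanting to try the same thing: on trunk/0.6-era builds that include CASSANDRA-688, the row cache is enabled per column family in storage-conf.xml. A minimal sketch, assuming a build with that patch; the RowsCached value here is purely illustrative, not a tuned recommendation:

```xml
<!-- Hypothetical example: enable the CASSANDRA-688 row cache for this CF.
     RowsCached takes an absolute row count (or a percentage on some builds);
     200000 below is an arbitrary illustration value. -->
<ColumnFamily Name="Super1" ColumnType="Super"
              CompareWith="UTF8Type" CompareSubcolumnsWith="UTF8Type"
              RowsCached="200000" />
```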
That said, I have no intention of deploying a single node. I keep seeing performance concerns from folks on small or single-node clusters. My impression so far is that Cassandra may not be the right solution for these types of deployments. My main interest in Cassandra is the linear scalability of reads and writes. From my own tests and some of the discussion on these lists, it seems Cassandra can thrash around a lot when the number of nodes is <= the replication factor * 2, particularly if a node goes down. I understand this is a design trade-off of sorts and I am fine with it. Any sort of distributed, fault-tolerant system is well served by using lots of commodity hardware.

What I found most valuable for my evaluation was to put a good test together with some real data from our system and then add nodes, remove nodes, break nodes, etc. and watch what happens. Once I finish with this, it looks like I will have some solid numbers to do capacity planning: figuring out exactly how much hardware to purchase and when I will need to add more.

Apologies to the original poster if that got a little long-winded, but hopefully it will be useful information for folks.

Cheers,
-Nate

On Tue, Feb 2, 2010 at 7:27 AM, envio user <enviou...@gmail.com> wrote:
> All,
>
> Here are some tests [batch_insert() and get_slice()] I performed on Cassandra.
>
> H/W: single node, quad core (8 cores), 8 GB RAM.
> Two separate physical disks, one for the commit log and another for the data.
> storage-conf.xml
> ================
> <KeysCachedFraction>0.4</KeysCachedFraction>
> <CommitLogRotationThresholdInMB>256</CommitLogRotationThresholdInMB>
> <MemtableSizeInMB>128</MemtableSizeInMB>
> <MemtableObjectCountInMillions>0.2</MemtableObjectCountInMillions>
> <MemtableFlushAfterMinutes>1440</MemtableFlushAfterMinutes>
> <ConcurrentReads>16</ConcurrentReads>
>
> Data Model:
>
> <ColumnFamily ColumnType="Super" CompareWith="UTF8Type"
>               CompareSubcolumnsWith="UTF8Type" Name="Super1" />
>
> TEST1A
> ======
> /home/sun>python stress.py -n 100000 -t 100 -y super -u 1 -c 25 -r -o insert -i 10
> WARNING: multiprocessing not present, threading will be used.
> Benchmark may not be accurate!
> total,interval_op_rate,avg_latency,elapsed_time
> 19039,1903,0.0532085509215,10
> 52052,3301,0.0302550313445,20
> 82274,3022,0.0330235137811,30
> 100000,1772,0.0337765234716,40
>
> TEST1B
> ======
> /home/sun>python stress.py -n 100000 -t 100 -y super -u 1 -c 25 -r -o read -i 10
> WARNING: multiprocessing not present, threading will be used.
> Benchmark may not be accurate!
> total,interval_op_rate,avg_latency,elapsed_time
> 16472,1647,0.0615632034523,10
> 39375,2290,0.04384300123,20
> 65259,2588,0.0385473697268,30
> 91613,2635,0.0379411213277,40
> 100000,838,0.0331208069702,50
> /home/sun>
>
> *** I deleted all the data (all: commitlog, data, ...) and restarted Cassandra. ***
> I am OK with TEST1A and TEST1B. I want to populate the SCF with > 500
> columns and read 25 columns per key.
>
> TEST2A
> ======
> /home/sun>python stress.py -n 100000 -t 100 -y super -u 1 -c 600 -r -o insert -i 10
> WARNING: multiprocessing not present, threading will be used.
> Benchmark may not be accurate!
> total,interval_op_rate,avg_latency,elapsed_time
> .............
> .............
> 84216,144,0.689481827031,570
> 85768,155,0.625061393859,580
> 87307,153,0.648041650953,590
> 88785,147,0.671928719674,600
> 90488,170,0.611753724284,610
> 91983,149,0.677673689896,620
> 93490,150,0.63891824366,630
> 95017,152,0.65472143182,640
> 96612,159,0.64355712789,650
> 98098,148,0.673311280851,660
> 99622,152,0.486848112166,670
> 100000,37,0.174115514629,680
>
> I understand nobody will write 600 columns at a time. I just need to
> populate the data, hence I did this test.
>
> [r...@fc10mc1 ~]# ls -l /var/lib/cassandra/commitlog/
> total 373880
> -rw-r--r-- 1 root root 268462742 2010-02-03 02:00 CommitLog-1265141714717.log
> -rw-r--r-- 1 root root 114003919 2010-02-03 02:00 CommitLog-1265142593543.log
>
> [r...@fc10mc1 ~]# ls -l /cassandra/lib/cassandra/data/Keyspace1/
> total 3024232
> -rw-r--r-- 1 root root 1508524822 2010-02-03 02:00 Super1-192-Data.db
> -rw-r--r-- 1 root root 92725 2010-02-03 02:00 Super1-192-Filter.db
> -rw-r--r-- 1 root root 2639957 2010-02-03 02:00 Super1-192-Index.db
> -rw-r--r-- 1 root root 100838971 2010-02-03 02:02 Super1-279-Data.db
> -rw-r--r-- 1 root root 8725 2010-02-03 02:02 Super1-279-Filter.db
> -rw-r--r-- 1 root root 176481 2010-02-03 02:02 Super1-279-Index.db
> -rw-r--r-- 1 root root 1478775337 2010-02-03 02:03 Super1-280-Data.db
> -rw-r--r-- 1 root root 90805 2010-02-03 02:03 Super1-280-Filter.db
> -rw-r--r-- 1 root root 2588072 2010-02-03 02:03 Super1-280-Index.db
> [r...@fc10mc1 ~]#
>
> [r...@fc10mc1 ~]# du -hs /cassandra/lib/cassandra/data/Keyspace1/
> 2.9G /cassandra/lib/cassandra/data/Keyspace1/
>
> TEST2B
> ======
> /home/sun>python stress.py -n 100000 -t 100 -y super -u 1 -c 25 -r -o read -i 10
> WARNING: multiprocessing not present, threading will be used.
> Benchmark may not be accurate!
> total,interval_op_rate,avg_latency,elapsed_time
> .................
> ................
> 66962,382,0.261044957001,180
> 70598,363,0.276139952824,190
> 74490,389,0.25678327989,200
> 78252,376,0.263047518976,210
> 82031,377,0.266485546846,220
> 86008,397,0.248498579411,230
> 89699,369,0.274926948857,240
> 93590,389,0.256867142883,250
> 97328,373,0.267352432985,260
> 100000,267,0.217604277555,270
>
> This test is more worrying for us. We can't even get 1,000 reads per
> second. Is there any limitation in Cassandra that prevents it from
> working well with a larger number of columns, or am I doing something
> wrong here? Please let me know.
>
> Attached are the nodeprobe (tpstats), iostat, and vmstat output taken during the tests.
>
> Thanks in advance,
> -Aita
>
> Some changes I made to stress.py to accommodate more columns:
>
> 157c156
> <     columns = [Column('A' + str(j), data, 0) for j in xrange(columns_per_key)]
> ---
> >     columns = [Column(chr(ord('A') + j), data, 0) for j in xrange(columns_per_key)]
> 159c158
> <     supers = [SuperColumn('A' + str(j), columns) for j in xrange(supers_per_key)]
> ---
> >     supers = [SuperColumn(chr(ord('A') + j), columns) for j in xrange(supers_per_key)]
> 187c186
> <     parent = ColumnParent('Super1', 'A' + str(j))
> ---
> >     parent = ColumnParent('Super1', chr(ord('A') + j))
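One way to read the on-disk numbers in the quoted TEST2A results: dividing the Data.db bytes by the 100,000 keys x 600 columns gives a rough per-column cost, which also shows how much larger each row is than in the 25-column TEST1B runs. A back-of-envelope sketch (note the three SSTables likely contain overlapping, not-yet-compacted data, so this overstates the steady-state size):

```python
# Data.db sizes (bytes) taken from the quoted ls -l output:
# Super1-192, Super1-279 and Super1-280.
data_bytes = 1508524822 + 100838971 + 1478775337
keys = 100000
cols_per_key = 600

per_column = data_bytes / float(keys * cols_per_key)  # ~51 bytes per (key, column)
per_row = data_bytes / float(keys)                    # ~31 KB per row
print(round(per_column, 1), round(per_row))
```

At roughly 31 KB per row, a 25-column get_slice() must still locate and deserialize within much bigger rows than in TEST1B, which is consistent with the latency jump even though the read width is unchanged.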
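A side note on the quoted stress.py diff: the stock script builds single-character names with chr(ord('A') + j), which only works for small column counts (under Python 2, chr() raises ValueError once ord('A') + j exceeds 255, so it fails well before 600 columns), while 'A' + str(j) yields a unique printable name for any count. A minimal sketch of the working scheme:

```python
# The 'A' + str(j) naming scheme from the diff above: unique, printable
# column names for any column count (here the 600 used in TEST2A).
cols = ['A' + str(j) for j in range(600)]
print(len(cols), len(set(cols)))  # 600 600 -- all names are distinct

# By contrast, chr(ord('A') + j) leaves plain ASCII after only a few dozen
# columns and, on Python 2, raises ValueError once ord('A') + j > 255.
```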