RE: Cassandra benchmark shows OK throughput but high read latency (> 100ms)?

2010-02-16 Thread Dr. Martin Grabmüller
 The other problem is: if I keep mixed write and read (e.g., 8 write threads
 plus 7 read threads) against the 2-node cluster continuously, the read
 latency will go up gradually (along with the size of the Cassandra data
 file), and at the end it will become ~40ms (up from ~20ms) even with only 15
 threads. During this process the data file grew from 1.6GB to over 3GB even
 though I kept writing the same key/values to Cassandra. It seems that
 Cassandra keeps appending to sstable data files and will only clean them up
 during node cleanup or compact (please correct me if this is incorrect).

In my tests I have observed that good read latency depends on keeping
the number of data files low.  In my current test setup, I have stored
1.9 TB of data on a single node, which is in 21 data files, and read
latency is between 10 and 60ms (for small reads; larger reads of course
take more time).  In earlier stages of my test, I had up to 5000
data files, and read performance was quite bad: my configured 10-second
RPC timeout was regularly encountered.
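
For reference, that timeout corresponds to the RpcTimeoutInMillis setting in
storage-conf.xml; assuming the 0.5-era element name, a 10-second value would
look like:

  <RpcTimeoutInMillis>10000</RpcTimeoutInMillis>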

The number of data files is reduced whenever Cassandra compacts them,
which happens either automatically, when enough data files have been generated
by continuous writing, or when triggered by nodeprobe compact, cleanup etc.
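
For example, using the nodeprobe syntax shown elsewhere in this digest
(replace <host> with the node's address), manual compaction and cleanup can
be triggered with:

  nodeprobe -host <host> compact
  nodeprobe -host <host> cleanup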

So my advice is to keep the write throughput low enough so that Cassandra
can keep up with compacting the data files.  For high write throughput, you need
fast drives, if possible on different RAIDs, which are configured as
different DataDirectories for Cassandra.  On my setup (6 drives in a single
RAID-5 configuration), compaction is quite slow: sequential reads/writes
are done at 150 MB/s, whereas during compaction, read/write performance
drops to a few MB/s.  You definitely want more than one logical drive,
so that Cassandra can alternate between them when flushing memtables and
when compacting.

I would really be interested whether my observations are shared by other
people on this list.

Thanks!

Martin


Question about Token selection for order-preserving partitioner

2010-02-16 Thread Nguyễn Minh Kha
Hi,

I read the wiki topic Operations (http://wiki.apache.org/cassandra/Operations)
and I don't understand how to use Token selection for the order-preserving
partitioner (application-dependent).
I want to create blog comments using TimeUUIDType and the order-preserving
partitioner for range queries; this cluster runs on 3 nodes.

How do I add tokens, configure seeds and everything else needed to run this
cluster?

Thank you very much.

KhaNguyễn


Re: Question about Token selection for order-preserving partitioner

2010-02-16 Thread Wojciech Kaczmarek
Hi!

2010/2/16 Nguyễn Minh Kha nminh...@gmail.com


 I read the wiki topic 
 Operations (http://wiki.apache.org/cassandra/Operations) and I don't understand
 how to use Token selection for order-preserving
 partitioner (application-dependent).
 I want create blog comment use TimeUUIDType and order-preserving for range
 query, this cluster run in 3 nodes.


How to add token, config seeds and more to run this cluster?

 Thank very much.


The order-preserving partitioner will use keys as tokens. The type you're
talking about could be the column type but that's another thing.


Re: Cassandra benchmark shows OK throughput but high read latency (> 100ms)?

2010-02-16 Thread Brandon Williams
On Tue, Feb 16, 2010 at 2:32 AM, Dr. Martin Grabmüller 
martin.grabmuel...@eleven.de wrote:

 In my tests I have observed that good read latency depends on keeping
 the number of data files low.  In my current test setup, I have stored
 1.9 TB of data on a single node, which is in 21 data files, and read
 latency is between 10 and 60ms (for small reads, larger read of course
 take more time).  In earlier stages of my test, I had up to 5000
 data files, and read performance was quite bad: my configured 10-second
 RPC timeout was regularly encountered.


I believe it is known that crossing sstables is O(NlogN) but I'm unable to
find the ticket on this at the moment.  Perhaps Stu Hood will jump in and
enlighten me, but in any case I believe
https://issues.apache.org/jira/browse/CASSANDRA-674 will eventually solve
it.

Keeping write volume low enough that compaction can keep up is one solution,
and throwing hardware at the problem is another, if necessary.  Also, the
row caching in trunk (soon to be 0.6 we hope) helps greatly for repeat hits.

-Brandon


cassandra freezes

2010-02-16 Thread Boris Shulman
Hello, I'm running some benchmarks on 2 Cassandra nodes, each running
on an 8-core machine with 16G RAM, 10G of it for the Java heap. I've noticed
that during benchmarks with numerous writes Cassandra just freezes for
several minutes (in those benchmarks I'm writing batches of 10 columns
with 1K of data each for every key in a single CF). Usually after
performing 50K writes I get a TimeOutException and Cassandra
just freezes. What configuration changes can I make in order to
prevent this? Is it possible that my setup just can't handle the load?
How can I calculate the number of Cassandra nodes needed for a desired load?


Re: Question about Token selection for order-preserving partitioner

2010-02-16 Thread Nguyễn Minh Kha
Hi!

I think my question was not clear.

I don't know how to configure these nodes for this cluster. I want to use
TimeUUIDType and the order-preserving partitioner for range queries.
How do I configure InitialToken and seeds for node01, node02, node03?

I see that the wiki topic Operations explains InitialToken for the
order-preserving partitioner:

[With order preserving partitioners, your key distribution will be
application-dependent. You should still take your best guess at specifying
initial tokens (guided by sampling actual data, if possible), but you will
be more dependent on active load balancing (see below) and/or adding new
nodes to hot spots.]

I don't understand [application-dependent]. And I don't know how to configure
the seeds section on every node.

Sorry for my English.

On Tue, Feb 16, 2010 at 7:58 PM, Wojciech Kaczmarek
kaczmare...@gmail.comwrote:

 Hi!

 2010/2/16 Nguyễn Minh Kha nminh...@gmail.com


 I read the wiki topic 
 Operations (http://wiki.apache.org/cassandra/Operations) and I don't
 understand how to use Token selection for order-preserving
 partitioner (application-dependent).
 I want create blog comment use TimeUUIDType and order-preserving for range
 query, this cluster run in 3 nodes.


 How to add token, config seeds and more to run this cluster?

 Thank very much.


 The order-preserving partitioner will use keys as tokens. The type you're
 talking about could be the column type but that's another thing.




Re: Question about Token selection for order-preserving partitioner

2010-02-16 Thread Wojciech Kaczmarek
Hi!

My comment was supposed to mean that you don't use TimeUUIDType as a token,
because it's a possible column type, not a key type (at least that is what I
know from the wiki and my short experience; I didn't check the source). You've
probably confused the sorting of columns within a row (which depends on the
column's type) with the sorting of your keys, which is possible when using the
order-preserving partitioner. The sorting of keys seems to depend only on the
partitioner class used in the config.

Application-dependent key distribution means that you should predict how
your keys will be generated in order to choose initial tokens for your nodes.
If you're in doubt and this is a test environment, you could just leave
InitialToken empty.
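
For example, a minimal per-node sketch of the relevant storage-conf.xml pieces
(hostnames and the token value are placeholders; the seed list is normally the
same on every node, while InitialToken is either left empty or set per node):

  <Seeds>
      <Seed>node01</Seed>
      <Seed>node02</Seed>
  </Seeds>
  <!-- with the order-preserving partitioner the token is a key-like string -->
  <InitialToken></InitialToken>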

It would help if you told us what the semantics of your keys are.

2010/2/16 Nguyễn Minh Kha nminh...@gmail.com

 Hi!

 I think, my question is not clearly.

 I don't know to config these nodes for this cluster. I want to use
 TimeUUIDType and order-preserving partitioner for range query.
 How to config InitialToken and seeds for node01, node02, node03.

 I see the wiki topic Operations explain InitialToken for order-preserving
 partitioner:

 [With order preserving partioners, your key distribution will be
 application-dependent. You should still take your best guess at specifying
 initial tokens (guided by sampling actual data, if possible), but you will
 be more dependent on active load balancing (see below) and/or adding new
 nodes to hot spots.]'

 I don't know [application-dependent]. And I don't know config seeds section
 every node.

 Sorry for my english.


 On Tue, Feb 16, 2010 at 7:58 PM, Wojciech Kaczmarek kaczmare...@gmail.com
  wrote:

 Hi!

 2010/2/16 Nguyễn Minh Kha nminh...@gmail.com


 I read the wiki topic 
 Operations (http://wiki.apache.org/cassandra/Operations) and I don't
 understand how to use Token selection for order-preserving
 partitioner (application-dependent).
 I want create blog comment use TimeUUIDType and order-preserving for
 range query, this cluster run in 3 nodes.


 How to add token, config seeds and more to run this cluster?

 Thank very much.


 The order-preserving partitioner will use keys as tokens. The type you're
 talking about could be the column type but that's another thing.





Re: Nodeprobe Not Working Properly

2010-02-16 Thread Shahan Khan


I can ping to the other server using db1a instead of the host name.

192.168.1.13 db1a
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
# Auto-generated hostname. Please do not remove this comment.
127.0.0.1 db1b.domain.com localhost db1b localhost.localdomain

db1b:~$ ping db1a
PING db1a (192.168.1.13) 56(84) bytes of data.
64 bytes from db1a (192.168.1.13): icmp_seq=1 ttl=64 time=0.252 ms
64 bytes from db1a (192.168.1.13): icmp_seq=2 ttl=64 time=0.228 ms
64 bytes from db1a (192.168.1.13): icmp_seq=3 ttl=64 time=0.233 ms
--- db1a ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1998ms
rtt min/avg/max/mdev = 0.228/0.237/0.252/0.020 ms

The same issue happens in reverse, when I switch db1a and db1b.

Other possibly useful information:


db1b:~# sudo ifconfig
eth0      Link encap:Ethernet  HWaddr 00:18:51:07:64:ea
          inet addr:192.168.1.14  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::218:51ff:fe07:64ea/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:938664 errors:0 dropped:0 overruns:0 frame:0
          TX packets:649625 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:429396142 (409.5 MiB)  TX bytes:133120553 (126.9 MiB)

eth0:0    Link encap:Ethernet  HWaddr 00:18:51:07:64:ea
          inet addr:xx.xx.xxx.xx  Bcast:68.71.241.31  Mask:255.255.255.224
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:41715 errors:0 dropped:0 overruns:0 frame:0
          TX packets:41715 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:5095781 (4.8 MiB)  TX bytes:5095781 (4.8 MiB)

db1b:~# lsof -i
COMMAND   PID  USER   FD   TYPE    DEVICE SIZE NODE NAME
portmap   171  daemon  4u  IPv4     11228      UDP  *:sunrpc
portmap   171  daemon  5u  IPv4 112286667      TCP  *:sunrpc (LISTEN)
master    343  root   12u  IPv4 112287239      TCP  db1b.domain.com:smtp (LISTEN)
jsvc    14672  root   24u  IPv6 113014281      TCP  *:53586 (LISTEN)
jsvc    14672  root   26u  IPv6 113014282      TCP  *:http-alt (LISTEN)
jsvc    14672  root   27u  IPv6 113014285      TCP  *:39979 (LISTEN)
jsvc    14672  root   30u  IPv6 113014297      TCP  192.168.1.14:afs3-fileserver (LISTEN)
jsvc    14672  root   35u  IPv6 113014299      UDP  192.168.1.14:afs3-callback
jsvc    14672  root   44u  IPv6 113014301      TCP  192.168.1.14:9160 (LISTEN)

Thank you,
Shahan


On Mon, 15 Feb 2010 20:52:02 -0600, Brandon Williams wrote:

 On Mon, Feb 15, 2010 at 8:46 PM, Shahan Khan wrote:

  I tried Brandon's suggestion, but am still getting the same error on the
  remote server.

  Any other suggestions? Is it possible that its a bug?

  Thanks,

  Shahan

  db1a = 192.168.1.13
  db1b = 192.168.1.14

  =

  db1a:~# nodeprobe -host 192.168.1.14 ring

  Error connecting to remote JMX agent!

  java.rmi.ConnectException: Connection refused to host: 127.0.0.1; nested
  exception is:

 For some reason, db1a indicates that 192.168.1.14 resolves to
 127.0.0.1...what does your /etc/hosts look like on this machine?

  db1b:~# nodeprobe -host 192.168.1.14 info

  Token(bytes[eaaca3c3bd3caba3e14ee0f85d5cda8a])
  Load : 4 KB
  Generation No : 1266260277
  Uptime (seconds) : 9933
  Heap Memory (MB) : 54.83 / 1016.13

 So the problem with db1b before was indeed incorrect host resolution for
 db1a.

 -Brandon

 



Re: Nodeprobe Not Working Properly

2010-02-16 Thread Brandon Williams
On Tue, Feb 16, 2010 at 11:08 AM, Shahan Khan cont...@shahan.me wrote:

 I can ping to the other server using db1a instead of the host name.


By 'host name' I assume you mean IP address.


 192.168.1.13db1a
 ::1 localhost ip6-localhost ip6-loopback
 fe00::0 ip6-localnet
 ff00::0 ip6-mcastprefix
 ff02::1 ip6-allnodes
 ff02::2 ip6-allrouters
 ff02::3 ip6-allhosts
 # Auto-generated hostname. Please do not remove this comment.
 127.0.0.1 db1b.domain.com localhost  db1b localhost.localdomain
 

 db1b:~$ ping db1a PING db1a (192.168.1.13) 56(84) bytes of data. 64 bytes
 from db1a (192.168.1.13): icmp_seq=1 ttl=64 time=0.252 ms 64 bytes from db1a
 (192.168.1.13): icmp_seq=2

So db1b's host resolution appears to be ok.  Is this output from db1a, or
db1b?  It appears to be db1b, but your last issue was with db1a resolving
db1b's IP address.

Cassandra doesn't do anything magical with hostname resolution, it relies on
the underlying system for that.

-Brandon


Re: Cassandra benchmark shows OK throughput but high read latency (> 100ms)?

2010-02-16 Thread Weijun Li
Dumped 50mil records into my 2-node cluster overnight, and made sure that
there aren't many data files (only around 30) per Martin's suggestion. The
size of the data directory is 63GB. Now when I read records from the cluster
the read latency is still ~44ms, and there's no write happening during the
read. And iostat shows that the disk (RAID10, 4x 250GB 15k SAS) is
saturated:

Device:  rrqm/s  wrqm/s     r/s    w/s    rsec/s   wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda       47.67   67.67  190.33  17.00  23933.33   677.33    118.70      5.24  25.25   4.64  96.17
sda1       0.00    0.00    0.00   0.00      0.00     0.00      0.00      0.00   0.00   0.00   0.00
sda2      47.67   67.67  190.33  17.00  23933.33   677.33    118.70      5.24  25.25   4.64  96.17
sda3       0.00    0.00    0.00   0.00      0.00     0.00      0.00      0.00   0.00   0.00   0.00

CPU usage is low.

Does this mean disk i/o is the bottleneck in my case? Will it help if I
increase KCF (KeysCachedFraction) to cache the entire sstable index?

Also, this is almost a read-only test; in reality our write/read ratio is
close to 1:1, so I'm guessing read latency will go even higher in that case
because it will be difficult for Cassandra to find a good moment to compact
data files that are busy being written.

Thanks,
-Weijun


On Tue, Feb 16, 2010 at 6:06 AM, Brandon Williams dri...@gmail.com wrote:

 On Tue, Feb 16, 2010 at 2:32 AM, Dr. Martin Grabmüller 
 martin.grabmuel...@eleven.de wrote:

 In my tests I have observed that good read latency depends on keeping
 the number of data files low.  In my current test setup, I have stored
 1.9 TB of data on a single node, which is in 21 data files, and read
 latency is between 10 and 60ms (for small reads, larger read of course
 take more time).  In earlier stages of my test, I had up to 5000
 data files, and read performance was quite bad: my configured 10-second
 RPC timeout was regularly encountered.


 I believe it is known that crossing sstables is O(NlogN) but I'm unable to
 find the ticket on this at the moment.  Perhaps Stu Hood will jump in and
 enlighten me, but in any case I believe
 https://issues.apache.org/jira/browse/CASSANDRA-674 will eventually solve
 it.

 Keeping write volume low enough that compaction can keep up is one
 solution, and throwing hardware at the problem is another, if necessary.
  Also, the row caching in trunk (soon to be 0.6 we hope) helps greatly for
 repeat hits.

 -Brandon



Re: Cassandra benchmark shows OK throughput but high read latency (> 100ms)?

2010-02-16 Thread Weijun Li
One more thought about Martin's suggestion: is it possible to put the data
files into multiple directories located on different physical
disks? This should help with the i/o bottleneck.

Has anybody tested the row-caching feature in trunk (shoot for 0.6?)?

-Weijun

On Tue, Feb 16, 2010 at 9:50 AM, Weijun Li weiju...@gmail.com wrote:

 Dumped 50mil records into my 2-node cluster overnight, made sure that
 there's not many data files (around 30 only) per Martin's suggestion. The
 size of the data directory is 63GB. Now when I read records from the cluster
 the read latency is still ~44ms, --there's no write happening during the
 read. And iostats shows that the disk (RAID10, 4 250GB 15k SAS) is
 saturated:

 Device:  rrqm/s  wrqm/s     r/s    w/s    rsec/s   wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
 sda       47.67   67.67  190.33  17.00  23933.33   677.33    118.70      5.24  25.25   4.64  96.17
 sda1       0.00    0.00    0.00   0.00      0.00     0.00      0.00      0.00   0.00   0.00   0.00
 sda2      47.67   67.67  190.33  17.00  23933.33   677.33    118.70      5.24  25.25   4.64  96.17
 sda3       0.00    0.00    0.00   0.00      0.00     0.00      0.00      0.00   0.00   0.00   0.00

 CPU usage is low.

 Does this mean disk i/o is the bottleneck for my case? Will it help if I
 increase KCF to cache all sstable index?

 Also, this is the almost a read-only mode test, and in reality, our
 write/read ratio is close to 1:1 so I'm guessing read latency will even go
 higher in that case because there will be difficult for cassandra to find a
 good moment to compact the data files that are being busy written.

 Thanks,
 -Weijun



 On Tue, Feb 16, 2010 at 6:06 AM, Brandon Williams dri...@gmail.comwrote:

 On Tue, Feb 16, 2010 at 2:32 AM, Dr. Martin Grabmüller 
 martin.grabmuel...@eleven.de wrote:

 In my tests I have observed that good read latency depends on keeping
 the number of data files low.  In my current test setup, I have stored
 1.9 TB of data on a single node, which is in 21 data files, and read
 latency is between 10 and 60ms (for small reads, larger read of course
 take more time).  In earlier stages of my test, I had up to 5000
 data files, and read performance was quite bad: my configured 10-second
 RPC timeout was regularly encountered.


 I believe it is known that crossing sstables is O(NlogN) but I'm unable to
 find the ticket on this at the moment.  Perhaps Stu Hood will jump in and
 enlighten me, but in any case I believe
 https://issues.apache.org/jira/browse/CASSANDRA-674 will eventually solve
 it.

 Keeping write volume low enough that compaction can keep up is one
 solution, and throwing hardware at the problem is another, if necessary.
  Also, the row caching in trunk (soon to be 0.6 we hope) helps greatly for
 repeat hits.

 -Brandon





Re: Cassandra benchmark shows OK throughput but high read latency (> 100ms)?

2010-02-16 Thread Brandon Williams
On Tue, Feb 16, 2010 at 11:50 AM, Weijun Li weiju...@gmail.com wrote:

 Dumped 50mil records into my 2-node cluster overnight, made sure that
 there's not many data files (around 30 only) per Martin's suggestion. The
 size of the data directory is 63GB. Now when I read records from the cluster
 the read latency is still ~44ms, --there's no write happening during the
 read. And iostats shows that the disk (RAID10, 4 250GB 15k SAS) is
 saturated:

 Device:  rrqm/s  wrqm/s     r/s    w/s    rsec/s   wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
 sda       47.67   67.67  190.33  17.00  23933.33   677.33    118.70      5.24  25.25   4.64  96.17
 sda1       0.00    0.00    0.00   0.00      0.00     0.00      0.00      0.00   0.00   0.00   0.00
 sda2      47.67   67.67  190.33  17.00  23933.33   677.33    118.70      5.24  25.25   4.64  96.17
 sda3       0.00    0.00    0.00   0.00      0.00     0.00      0.00      0.00   0.00   0.00   0.00

 CPU usage is low.

 Does this mean disk i/o is the bottleneck for my case? Will it help if I
 increase KCF to cache all sstable index?


That's exactly what this means.  Disk is slow :(


 Also, this is the almost a read-only mode test, and in reality, our
 write/read ratio is close to 1:1 so I'm guessing read latency will even go
 higher in that case because there will be difficult for cassandra to find a
 good moment to compact the data files that are being busy written.


Reads that cause disk seeks are always going to slow things down, since disk
seeks are inherently the slowest operation in a machine.  Writes in
Cassandra should always be fast, as they do not cause any disk seeks.

-Brandon


Re: Cassandra benchmark shows OK throughput but high read latency (> 100ms)?

2010-02-16 Thread Brandon Williams
On Tue, Feb 16, 2010 at 11:56 AM, Weijun Li weiju...@gmail.com wrote:

 One more thoughts about Martin's suggestion: is it possible to put the data
 files into multiple directories that are located in different physical
 disks? This should help to improve the i/o bottleneck issue.


Yes, you can already do this, just add more DataFileDirectory directives
pointed at multiple drives.
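
For example, a sketch of the relevant storage-conf.xml section with two
drives (the paths are placeholders):

  <DataFileDirectories>
      <DataFileDirectory>/disk1/cassandra/data</DataFileDirectory>
      <DataFileDirectory>/disk2/cassandra/data</DataFileDirectory>
  </DataFileDirectories>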


 Has anybody tested the row-caching feature in trunk (shoot for 0.6?)?


Row cache and key cache both help tremendously if your read pattern has a
decent repeat rate.  Completely random io can only be so fast, however.

-Brandon


Re: Cassandra benchmark shows OK throughput but high read latency (> 100ms)?

2010-02-16 Thread Weijun Li
Thanks for the DataFileDirectory trick, I'll give it a try.

Just noticed the impact of the number of data files: node A has 13 data files
with a read latency of 20ms and node B has 27 files with a read latency of 60ms.
After I ran nodeprobe compact on node B its read latency went up to 150ms.
The read latency of node A became as low as 10ms. Is this normal behavior?
I'm using the random partitioner and the hardware/JVM settings are exactly the
same for these two nodes.

Another problem is that Java heap usage is always around 900MB out of 6GB. Is there
any way to utilize all of the heap space to decrease the read latency?

-Weijun

On Tue, Feb 16, 2010 at 10:01 AM, Brandon Williams dri...@gmail.com wrote:

 On Tue, Feb 16, 2010 at 11:56 AM, Weijun Li weiju...@gmail.com wrote:

 One more thoughts about Martin's suggestion: is it possible to put the
 data files into multiple directories that are located in different physical
 disks? This should help to improve the i/o bottleneck issue.


 Yes, you can already do this, just add more DataFileDirectory directives
 pointed at multiple drives.


 Has anybody tested the row-caching feature in trunk (shoot for 0.6?)?


 Row cache and key cache both help tremendously if your read pattern has a
 decent repeat rate.  Completely random io can only be so fast, however.

 -Brandon



Re: Cassandra benchmark shows OK throughput but high read latency (> 100ms)?

2010-02-16 Thread Brandon Williams
On Tue, Feb 16, 2010 at 12:16 PM, Weijun Li weiju...@gmail.com wrote:

 Thanks for for DataFileDirectory trick and I'll give a try.

 Just noticed the impact of number of data files: node A has 13 data files
 with read latency of 20ms and node B has 27 files with read latency of 60ms.
 After I ran nodeprobe compact on node B its read latency went up to 150ms.
 The read latency of node A became as low as 10ms. Is this normal behavior?
 I'm using random partitioner and the hardware/JVM settings are exactly the
 same for these two nodes.


It sounds like the latency jumped to 150ms because the newly written file
was not in the OS cache.

Another problem is that Java heap usage is always 900mb out of 6GB? Is there
 any way to utilize all of the heap space to decrease the read latency?


By default, Cassandra will use a 1GB heap, as set in bin/cassandra.in.sh.
You can adjust the jvm heap there via the -Xmx option, but generally you
want to balance the jvm against the OS cache.  With 6GB, I would probably give
2GB to the jvm.  If you aren't having issues now, increasing the jvm's
memory probably won't provide any performance gains, though it's worth noting
that with the row cache in 0.6 this may change.
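
For example, a sketch of that change: in bin/cassandra.in.sh, swap the default
heap flags (-Xmx1G plus whatever -Xms value it ships with) inside the JVM_OPTS
definition for something like

  -Xms2G -Xmx2G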

-Brandon


Re: cassandra freezes

2010-02-16 Thread Tatu Saloranta
On Tue, Feb 16, 2010 at 6:25 AM, Boris Shulman shulm...@gmail.com wrote:
 Hello, I'm running some benchmarks on 2 cassandra nodes each running
 on 8 cores machine with 16G RAM, 10G for Java heap. I've noticed that
 during benchmarks with numerous writes cassandra just freeze for
 several minutes (in those benchmarks I'm writing batches of 10 columns
 with 1K data each for every key in a single CF). Usually after
 performing 50K writes I'm getting a TimeOutException and cassandra
 just freezes. What configuration changes can I make in order to
 prevent this? Is it possible that my setup just can't handle the load?
 How can I calculate the number of casandra nodes for a desired load?

One thing that can cause seeming lockups is the garbage collector, so
enabling GC debug output would be helpful to see GC activity. Some
collectors (CMS specifically) can stop the system for a very long time,
up to minutes. This is not necessarily the root cause, but it is easy to
rule out.
Beyond this, getting a stack trace during the lockup would make sense.
That can pinpoint what the threads are doing, or what they are blocked on
in case there is a deadlock or heavy contention on some shared
resource.
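
For reference, one common way to get that output on the Sun JVM of this era is
to add flags along these lines to the JVM options in bin/cassandra.in.sh (the
log path is just an example), and to grab a thread dump with jstack while the
node is frozen:

  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/log/cassandra/gc.log

  jstack <cassandra-pid>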

-+ Tatu +-


Re: Cassandra benchmark shows OK throughput but high read latency (> 100ms)?

2010-02-16 Thread Stu Hood
 After I ran nodeprobe compact on node B its read latency went up to 150ms.
The compaction process can take a while to finish... in 0.5 you need to watch 
the logs to figure out when it has actually finished, and then you should start 
seeing the improvement in read latency.

 Is there any way to utilize all of the heap space to decrease the read 
 latency?
In 0.5 you can adjust the number of keys that are cached by changing the 
'KeysCachedFraction' parameter in your config file. In 0.6 you can additionally 
cache rows. You don't want to use up all of the memory on your box for those 
caches though: you'll want to leave at least 50% for your OS's disk cache, 
which will store the full row content.


-Original Message-
From: Weijun Li weiju...@gmail.com
Sent: Tuesday, February 16, 2010 12:16pm
To: cassandra-user@incubator.apache.org
Subject: Re: Cassandra benchmark shows OK throughput but high read latency (> 100ms)?

Thanks for for DataFileDirectory trick and I'll give a try.

Just noticed the impact of number of data files: node A has 13 data files
with read latency of 20ms and node B has 27 files with read latency of 60ms.
After I ran nodeprobe compact on node B its read latency went up to 150ms.
The read latency of node A became as low as 10ms. Is this normal behavior?
I'm using random partitioner and the hardware/JVM settings are exactly the
same for these two nodes.

Another problem is that Java heap usage is always 900mb out of 6GB? Is there
any way to utilize all of the heap space to decrease the read latency?

-Weijun

On Tue, Feb 16, 2010 at 10:01 AM, Brandon Williams dri...@gmail.com wrote:

 On Tue, Feb 16, 2010 at 11:56 AM, Weijun Li weiju...@gmail.com wrote:

 One more thoughts about Martin's suggestion: is it possible to put the
 data files into multiple directories that are located in different physical
 disks? This should help to improve the i/o bottleneck issue.


 Yes, you can already do this, just add more DataFileDirectory directives
 pointed at multiple drives.


 Has anybody tested the row-caching feature in trunk (shoot for 0.6?)?


 Row cache and key cache both help tremendously if your read pattern has a
 decent repeat rate.  Completely random io can only be so fast, however.

 -Brandon





Re: Cassandra benchmark shows OK throughput but high read latency (> 100ms)?

2010-02-16 Thread Weijun Li
Still having high read latency with 50mil records in the 2-node cluster
(replication factor 2). I restarted both nodes but read latency is still above
60ms and disk i/o saturation is high. I tried compact and repair but they don't
help much. When I reduced the client threads from 15 to 5 it looks a lot better,
but throughput is kind of low. I changed to 16 flushing threads instead of the
default 8; could that cause the disk saturation issue?

For benchmarks with decent throughput and latency, how many client threads do
people use? Can anyone share their storage-conf.xml from a well-tuned
high-volume cluster?

-Weijun

On Tue, Feb 16, 2010 at 10:31 AM, Stu Hood stu.h...@rackspace.com wrote:

  After I ran nodeprobe compact on node B its read latency went up to
 150ms.
 The compaction process can take a while to finish... in 0.5 you need to
 watch the logs to figure out when it has actually finished, and then you
 should start seeing the improvement in read latency.

  Is there any way to utilize all of the heap space to decrease the read
 latency?
 In 0.5 you can adjust the number of keys that are cached by changing the
 'KeysCachedFraction' parameter in your config file. In 0.6 you can
 additionally cache rows. You don't want to use up all of the memory on your
 box for those caches though: you'll want to leave at least 50% for your OS's
 disk cache, which will store the full row content.


 -Original Message-
 From: Weijun Li weiju...@gmail.com
 Sent: Tuesday, February 16, 2010 12:16pm
 To: cassandra-user@incubator.apache.org
 Subject: Re: Cassandra benchmark shows OK throughput but high read latency
 (> 100ms)?

 Thanks for for DataFileDirectory trick and I'll give a try.

 Just noticed the impact of number of data files: node A has 13 data files
 with read latency of 20ms and node B has 27 files with read latency of
 60ms.
 After I ran nodeprobe compact on node B its read latency went up to
 150ms.
 The read latency of node A became as low as 10ms. Is this normal behavior?
 I'm using random partitioner and the hardware/JVM settings are exactly the
 same for these two nodes.

 Another problem is that Java heap usage is always 900mb out of 6GB? Is
 there
 any way to utilize all of the heap space to decrease the read latency?

 -Weijun

 On Tue, Feb 16, 2010 at 10:01 AM, Brandon Williams dri...@gmail.com
 wrote:

  On Tue, Feb 16, 2010 at 11:56 AM, Weijun Li weiju...@gmail.com wrote:
 
  One more thoughts about Martin's suggestion: is it possible to put the
  data files into multiple directories that are located in different
 physical
  disks? This should help to improve the i/o bottleneck issue.
 
 
  Yes, you can already do this, just add more DataFileDirectory
 directives
  pointed at multiple drives.
 
 
  Has anybody tested the row-caching feature in trunk (shoot for 0.6?)?
 
 
  Row cache and key cache both help tremendously if your read pattern has a
  decent repeat rate.  Completely random io can only be so fast, however.
 
  -Brandon
 





Re: Question about Token selection for order-preserving partitioner

2010-02-16 Thread Nguyễn Minh Kha
Hi,

Thanks Wojciech. I use TimeUUIDType for the Comments CF; I'm not using it for the initial
token.

On Tue, Feb 16, 2010 at 11:45 PM, Wojciech Kaczmarek
kaczmare...@gmail.comwrote:

 Hi!

 My comment was supposed to mean that you don't use TimeUUIDType as a token
 because it's a possible column type, not a key type (at least is what I know
 from a wiki and my short experience, I didn't check the source). You've
 probably mistaken sorting of different columns within a row (which depends
 on a type of column) with sorting of your keys which is possible when using
 order-preserving partitioner. The sorting of keys seems to depend only on a
 specific partitioner class used in config.

 Application-specific keys distribution means that you should predict how
 your keys will be generated in order to set seeds for your nodes. If you're
 in doubt and this is your test environment you could just leave InitialToken
 empty.

 It would help if you told us what the semantics of your keys are.


 2010/2/16 Nguyễn Minh Kha nminh...@gmail.com

 Hi!

 I think, my question is not clearly.

 I don't know to config these nodes for this cluster. I want to use
 TimeUUIDType and order-preserving partitioner for range query.
 How to config InitialToken and seeds for node01, node02, node03.

 I see the wiki topic Operations explain InitialToken for order-preserving
 partitioner:

 [With order preserving partioners, your key distribution will be
 application-dependent. You should still take your best guess at specifying
 initial tokens (guided by sampling actual data, if possible), but you will
 be more dependent on active load balancing (see below) and/or adding new
 nodes to hot spots.]'

 I don't know [application-dependent]. And I don't know config seeds
 section every node.

 Sorry for my english.


 On Tue, Feb 16, 2010 at 7:58 PM, Wojciech Kaczmarek 
 kaczmare...@gmail.com wrote:

 Hi!

 2010/2/16 Nguyễn Minh Kha nminh...@gmail.com


 I read the wiki topic 
 Operations (http://wiki.apache.org/cassandra/Operations) and I don't
 understand how to use Token selection for order-preserving
 partitioner (application-dependent).
 I want create blog comment use TimeUUIDType and order-preserving for
 range query, this cluster run in 3 nodes.


 How to add token, config seeds and more to run this cluster?

 Thank very much.


 The order-preserving partitioner will use keys as tokens. The type you're
 talking about could be the column type but that's another thing.






-- 
Nguyen Minh Kha

NCT Corporation
Email : kh...@nct.vn
Mobile : 090 696 1314
Y!M : iminhkha


Testing row cache feature in trunk: write should put record in cache

2010-02-16 Thread Weijun Li
Just started to play with the row cache feature in trunk: it seems to be
working fine so far, except that for the RowsCached parameter you need to specify
a number of rows rather than a percentage (e.g., 20% doesn't work). Thanks
for this great feature, which improves read latency dramatically so that disk
i/o is no longer a serious bottleneck.
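
For what it's worth, a sketch of where that setting lives, with the column
family name and value as placeholders: in storage-conf.xml it is an attribute
on the column family definition, e.g.

  <ColumnFamily Name="Standard1" CompareWith="BytesType" RowsCached="100000"/>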

The problem is: when you write to Cassandra it doesn't seem to put the new
keys in the row cache (it is said to update, rather than invalidate, the entry
if it is already in the cache). Is it easy to implement this feature? What are
the classes that would have to be touched for this? I'm guessing that
RowMutationVerbHandler would be the one to insert the entry into the row cache?

-Weijun


Re: Testing row cache feature in trunk: write should put record in cache

2010-02-16 Thread Jonathan Ellis
On Tue, Feb 16, 2010 at 7:11 PM, Weijun Li weiju...@gmail.com wrote:
 Just started to play with the row cache feature in trunk: it seems to be
 working fine so far except that for RowsCached parameter you need to specify
 number of rows rather than a percentage (e.g., 20% doesn't work).

20% works, but it's 20% of the rows at server startup.  So on a fresh
start that is zero.

Maybe we should just get rid of the % feature...

 The problem is: when you write to Cassandra it doesn't seem to put the new
 keys in row cache (it is said to update instead invalidate if the entry is
 already in cache). Is it easy to implement this feature?

It's deliberately not done.  For many (most?) workloads you don't want
fresh writes blowing away your read cache.  The code is in
Table.apply:

ColumnFamily cachedRow = cfs.getRawCachedRow(mutation.key());
// only rows already present in the cache are updated; a cache miss is not populated
if (cachedRow != null)
    cachedRow.addAll(columnFamily);

I think it would be okay to have a WriteThrough option for what you're
asking, though.

-Jonathan


Re: Cassandra benchmark shows OK throughput but high read latency (> 100ms)?

2010-02-16 Thread Jonathan Ellis
Have you tried increasing KeysCachedFraction?

On Tue, Feb 16, 2010 at 6:15 PM, Weijun Li weiju...@gmail.com wrote:
 Still have high read latency with 50mil records in the 2-node cluster
 (replica 2). I restarted both nodes but read latency is still above 60ms and
 disk i/o saturation is high. Tried compact and repair but doesn't help much.
 When I reduced the client threads from 15 to 5 it looks a lot better but
 throughput is kind of low. I changed using flushing thread of 16 instead the
 defaulted 8, could that cause the disk saturation issue?

 For benchmark with decent throughput and latency, how many client threads do
 they use? Can anyone share your storage-conf.xml in well-tuned high volume
 cluster?

 -Weijun

 On Tue, Feb 16, 2010 at 10:31 AM, Stu Hood stu.h...@rackspace.com wrote:

  After I ran nodeprobe compact on node B its read latency went up to
  150ms.
 The compaction process can take a while to finish... in 0.5 you need to
 watch the logs to figure out when it has actually finished, and then you
 should start seeing the improvement in read latency.

  Is there any way to utilize all of the heap space to decrease the read
  latency?
 In 0.5 you can adjust the number of keys that are cached by changing the
 'KeysCachedFraction' parameter in your config file. In 0.6 you can
 additionally cache rows. You don't want to use up all of the memory on your
 box for those caches though: you'll want to leave at least 50% for your OS's
 disk cache, which will store the full row content.


 -Original Message-
 From: Weijun Li weiju...@gmail.com
 Sent: Tuesday, February 16, 2010 12:16pm
 To: cassandra-user@incubator.apache.org
 Subject: Re: Cassandra benchmark shows OK throughput but high read latency
 (> 100ms)?

 Thanks for for DataFileDirectory trick and I'll give a try.

 Just noticed the impact of number of data files: node A has 13 data files
 with read latency of 20ms and node B has 27 files with read latency of
 60ms.
 After I ran nodeprobe compact on node B its read latency went up to
 150ms.
 The read latency of node A became as low as 10ms. Is this normal behavior?
 I'm using random partitioner and the hardware/JVM settings are exactly the
 same for these two nodes.

 Another problem is that Java heap usage is always 900mb out of 6GB? Is
 there
 any way to utilize all of the heap space to decrease the read latency?

 -Weijun

 On Tue, Feb 16, 2010 at 10:01 AM, Brandon Williams dri...@gmail.com
 wrote:

  On Tue, Feb 16, 2010 at 11:56 AM, Weijun Li weiju...@gmail.com wrote:
 
  One more thoughts about Martin's suggestion: is it possible to put the
  data files into multiple directories that are located in different
  physical
  disks? This should help to improve the i/o bottleneck issue.
 
 
  Yes, you can already do this, just add more DataFileDirectory
  directives
  pointed at multiple drives.
 
 
  Has anybody tested the row-caching feature in trunk (shoot for 0.6?)?
 
 
  Row cache and key cache both help tremendously if your read pattern has
  a
  decent repeat rate.  Completely random io can only be so fast, however.
 
  -Brandon
 






Re: Testing row cache feature in trunk: write should put record in cache

2010-02-16 Thread Jonathan Ellis
On Tue, Feb 16, 2010 at 7:17 PM, Jonathan Ellis jbel...@gmail.com wrote:
 On Tue, Feb 16, 2010 at 7:11 PM, Weijun Li weiju...@gmail.com wrote:
 Just started to play with the row cache feature in trunk: it seems to be
 working fine so far except that for RowsCached parameter you need to specify
 number of rows rather than a percentage (e.g., 20% doesn't work).

 20% works, but it's 20% of the rows at server startup.  So on a fresh
 start that is zero.

 Maybe we should just get rid of the % feature...

(Actually, it shouldn't be hard to update this on flush, if you want
to open a ticket.)


Re: Cassandra benchmark shows OK throughput but high read latency (> 100ms)?

2010-02-16 Thread Weijun Li
Yes, my KeysCachedFraction is already 0.3 but it doesn't relieve the disk i/o.
I compacted the data down to a single 60GB file (it took quite a while to finish
and increased latency as expected) but that doesn't help much either.

If I set KCF to 1 (meaning cache the entire sstable index), how much memory will
it take for 50mil keys? Is the index a straight key-offset map? I guess a key
is 16 bytes and an offset is 8 bytes. Will KCF=1 help to reduce disk i/o?
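
As a rough back-of-envelope, taking that guess at face value: 50 million keys
at 16 + 8 = 24 bytes of raw key-plus-offset data each is about 1.2 GB, and Java
object and map overhead can easily multiply that several times over, so caching
the full index for 50 million keys would likely need a few GB of heap.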

-Weijun

On Tue, Feb 16, 2010 at 5:18 PM, Jonathan Ellis jbel...@gmail.com wrote:

 Have you tried increasing KeysCachedFraction?

 On Tue, Feb 16, 2010 at 6:15 PM, Weijun Li weiju...@gmail.com wrote:
  Still have high read latency with 50mil records in the 2-node cluster
  (replica 2). I restarted both nodes but read latency is still above 60ms
 and
  disk i/o saturation is high. Tried compact and repair but doesn't help
 much.
  When I reduced the client threads from 15 to 5 it looks a lot better but
  throughput is kind of low. I changed using flushing thread of 16 instead
 the
  defaulted 8, could that cause the disk saturation issue?
 
  For benchmark with decent throughput and latency, how many client threads
 do
  they use? Can anyone share your storage-conf.xml in well-tuned high
 volume
  cluster?
 
  -Weijun
 
  On Tue, Feb 16, 2010 at 10:31 AM, Stu Hood stu.h...@rackspace.com
 wrote:
 
   After I ran nodeprobe compact on node B its read latency went up to
   150ms.
  The compaction process can take a while to finish... in 0.5 you need to
  watch the logs to figure out when it has actually finished, and then you
  should start seeing the improvement in read latency.
 
   Is there any way to utilize all of the heap space to decrease the read
   latency?
  In 0.5 you can adjust the number of keys that are cached by changing the
  'KeysCachedFraction' parameter in your config file. In 0.6 you can
  additionally cache rows. You don't want to use up all of the memory on
 your
  box for those caches though: you'll want to leave at least 50% for your
 OS's
  disk cache, which will store the full row content.
 
 
  -Original Message-
  From: Weijun Li weiju...@gmail.com
  Sent: Tuesday, February 16, 2010 12:16pm
  To: cassandra-user@incubator.apache.org
  Subject: Re: Cassandra benchmark shows OK throughput but high read
 latency
  (> 100ms)?
 
  Thanks for for DataFileDirectory trick and I'll give a try.
 
  Just noticed the impact of number of data files: node A has 13 data
 files
  with read latency of 20ms and node B has 27 files with read latency of
  60ms.
  After I ran nodeprobe compact on node B its read latency went up to
  150ms.
  The read latency of node A became as low as 10ms. Is this normal
 behavior?
  I'm using random partitioner and the hardware/JVM settings are exactly
 the
  same for these two nodes.
 
  Another problem is that Java heap usage is always 900mb out of 6GB? Is
  there
  any way to utilize all of the heap space to decrease the read latency?
 
  -Weijun
 
  On Tue, Feb 16, 2010 at 10:01 AM, Brandon Williams dri...@gmail.com
  wrote:
 
   On Tue, Feb 16, 2010 at 11:56 AM, Weijun Li weiju...@gmail.com
 wrote:
  
   One more thoughts about Martin's suggestion: is it possible to put
 the
   data files into multiple directories that are located in different
   physical
   disks? This should help to improve the i/o bottleneck issue.
  
  
   Yes, you can already do this, just add more DataFileDirectory
   directives
   pointed at multiple drives.
  
  
   Has anybody tested the row-caching feature in trunk (shoot for 0.6?)?
  
  
   Row cache and key cache both help tremendously if your read pattern
 has
   a
   decent repeat rate.  Completely random io can only be so fast,
 however.
  
   -Brandon
  
 
 
 
 



Re: Testing row cache feature in trunk: write should put record in cache

2010-02-16 Thread Weijun Li
Just tried to make a quick change to enable it, but it didn't work out :-(

    ColumnFamily cachedRow = cfs.getRawCachedRow(mutation.key());

    // What I modified
    if( cachedRow == null ) {
        cfs.cacheRow(mutation.key());
        cachedRow = cfs.getRawCachedRow(mutation.key());
    }

    if (cachedRow != null)
        cachedRow.addAll(columnFamily);

How can I open a ticket for you to make the change (enable row cache write
through with an option)?

Thanks,
-Weijun

On Tue, Feb 16, 2010 at 5:20 PM, Jonathan Ellis jbel...@gmail.com wrote:

 On Tue, Feb 16, 2010 at 7:17 PM, Jonathan Ellis jbel...@gmail.com wrote:
  On Tue, Feb 16, 2010 at 7:11 PM, Weijun Li weiju...@gmail.com wrote:
  Just started to play with the row cache feature in trunk: it seems to be
  working fine so far except that for RowsCached parameter you need to
 specify
  number of rows rather than a percentage (e.g., 20% doesn't work).
 
  20% works, but it's 20% of the rows at server startup.  So on a fresh
  start that is zero.
 
  Maybe we should just get rid of the % feature...

 (Actually, it shouldn't be hard to update this on flush, if you want
 to open a ticket.)



Re: Testing row cache feature in trunk: write should put record in cache

2010-02-16 Thread Jonathan Ellis
... tell you what, if you write the option-processing part in
DatabaseDescriptor I will do the actual cache part. :)

On Tue, Feb 16, 2010 at 11:07 PM, Jonathan Ellis jbel...@gmail.com wrote:
 https://issues.apache.org/jira/secure/CreateIssue!default.jspa, but
 this is pretty low priority for me.

 On Tue, Feb 16, 2010 at 8:37 PM, Weijun Li weiju...@gmail.com wrote:
 Just tried to make quick change to enable it but it didn't work out :-(

    ColumnFamily cachedRow = cfs.getRawCachedRow(mutation.key());

     // What I modified
     if( cachedRow == null ) {
         cfs.cacheRow(mutation.key());
         cachedRow = cfs.getRawCachedRow(mutation.key());
     }

     if (cachedRow != null)
         cachedRow.addAll(columnFamily);

 How can I open a ticket for you to make the change (enable row cache write
 through with an option)?

 Thanks,
 -Weijun

 On Tue, Feb 16, 2010 at 5:20 PM, Jonathan Ellis jbel...@gmail.com wrote:

 On Tue, Feb 16, 2010 at 7:17 PM, Jonathan Ellis jbel...@gmail.com wrote:
  On Tue, Feb 16, 2010 at 7:11 PM, Weijun Li weiju...@gmail.com wrote:
  Just started to play with the row cache feature in trunk: it seems to
  be
  working fine so far except that for RowsCached parameter you need to
  specify
  number of rows rather than a percentage (e.g., 20% doesn't work).
 
  20% works, but it's 20% of the rows at server startup.  So on a fresh
  start that is zero.
 
  Maybe we should just get rid of the % feature...

 (Actually, it shouldn't be hard to update this on flush, if you want
 to open a ticket.)