RE: Cassandra benchmark shows OK throughput but high read latency (>100ms)?
The other problem is: if I keep mixed writes and reads (e.g., 8 write threads plus 7 read threads) running continuously against the two-node cluster, the read latency goes up gradually (along with the size of the Cassandra data files), and in the end it reaches ~40ms (up from ~20ms) even with only 15 threads. During this process the data files grew from 1.6 GB to over 3 GB even though I kept writing the same keys/values to Cassandra. It seems that Cassandra keeps appending to sstable data files and only cleans them up during node cleanup or compaction (please correct me if this is incorrect).

In my tests I have observed that good read latency depends on keeping the number of data files low. In my current test setup, I have stored 1.9 TB of data on a single node, spread across 21 data files, and read latency is between 10 and 60ms (for small reads; larger reads of course take more time). In earlier stages of my test, I had up to 5000 data files, and read performance was quite bad: my configured 10-second RPC timeout was regularly encountered. The number of data files is reduced whenever Cassandra compacts them, which happens either automatically, when enough data files have been generated by continuous writing, or when triggered by nodeprobe compact, cleanup, etc.

So my advice is to keep the write throughput low enough that Cassandra can keep up with compacting the data files. For high write throughput, you need fast drives, if possible on different RAIDs, configured as different DataDirectories for Cassandra. On my setup (6 drives in a single RAID-5 configuration), compaction is quite slow: sequential reads/writes are done at 150 MB/s, whereas during compaction, read/write performance drops to a few MB/s. You definitely want more than one logical drive, so that Cassandra can alternate between them when flushing memtables and when compacting.

I would really be interested to hear whether my observations are shared by other people on this list. Thanks! Martin
Question about Token selection for order-preserving partitioner
Hi, I read the wiki topic Operations (http://wiki.apache.org/cassandra/Operations) and I don't understand how to use token selection with the order-preserving partitioner (application-dependent). I want to create blog comments using TimeUUIDType and the order-preserving partitioner for range queries; this cluster runs on 3 nodes. How do I add tokens, configure seeds, and so on to run this cluster? Thank you very much. KhaNguyễn
Re: Question about Token selection for order-preserving partitioner
Hi! 2010/2/16 Nguyễn Minh Kha nminh...@gmail.com: I read the wiki topic Operations (http://wiki.apache.org/cassandra/Operations) and I don't understand how to use token selection with the order-preserving partitioner (application-dependent). I want to create blog comments using TimeUUIDType and the order-preserving partitioner for range queries; this cluster runs on 3 nodes. How do I add tokens, configure seeds, and so on to run this cluster? Thank you very much.

The order-preserving partitioner will use keys as tokens. The type you're talking about would be the column type, but that's another thing.
Re: Cassandra benchmark shows OK throughput but high read latency (>100ms)?
On Tue, Feb 16, 2010 at 2:32 AM, Dr. Martin Grabmüller martin.grabmuel...@eleven.de wrote: In my tests I have observed that good read latency depends on keeping the number of data files low. In my current test setup, I have stored 1.9 TB of data on a single node, which is in 21 data files, and read latency is between 10 and 60ms (for small reads; larger reads of course take more time). In earlier stages of my test, I had up to 5000 data files, and read performance was quite bad: my configured 10-second RPC timeout was regularly encountered.

I believe it is known that crossing sstables is O(N log N), but I'm unable to find the ticket on this at the moment. Perhaps Stu Hood will jump in and enlighten me, but in any case I believe https://issues.apache.org/jira/browse/CASSANDRA-674 will eventually solve it. Keeping write volume low enough that compaction can keep up is one solution, and throwing hardware at the problem is another, if necessary. Also, the row caching in trunk (soon to be 0.6, we hope) helps greatly for repeat hits. -Brandon
cassandra freezes
Hello, I'm running some benchmarks on 2 Cassandra nodes, each on an 8-core machine with 16 GB RAM, 10 GB of which is Java heap. I've noticed that during benchmarks with numerous writes, Cassandra just freezes for several minutes (in those benchmarks I'm writing batches of 10 columns with 1 KB of data each for every key in a single CF). Usually after performing 50K writes I get a TimeOutException and Cassandra just freezes. What configuration changes can I make to prevent this? Is it possible that my setup just can't handle the load? How can I calculate the number of Cassandra nodes needed for a desired load?
Re: Question about Token selection for order-preserving partitioner
Hi! I think my question was not clear. I don't know how to configure the nodes of this cluster. I want to use TimeUUIDType and the order-preserving partitioner for range queries. How do I configure InitialToken and seeds for node01, node02, and node03? The wiki topic Operations explains InitialToken for the order-preserving partitioner: [With order preserving partitioners, your key distribution will be application-dependent. You should still take your best guess at specifying initial tokens (guided by sampling actual data, if possible), but you will be more dependent on active load balancing (see below) and/or adding new nodes to hot spots.] I don't understand [application-dependent], and I don't know how to configure the seeds section on every node. Sorry for my English. On Tue, Feb 16, 2010 at 7:58 PM, Wojciech Kaczmarek kaczmare...@gmail.com wrote: [...]
Re: Question about Token selection for order-preserving partitioner
Hi! My comment was supposed to mean that you don't use TimeUUIDType as a token, because it's a possible column type, not a key type (at least that is what I know from the wiki and my short experience; I didn't check the source). You've probably confused the sorting of columns within a row (which depends on the column type) with the sorting of your keys, which is possible when using the order-preserving partitioner. The sorting of keys seems to depend only on the specific partitioner class used in the config. Application-dependent key distribution means that you should predict how your keys will be generated in order to choose initial tokens for your nodes. If you're in doubt and this is a test environment, you could just leave InitialToken empty. It would help if you told us what the semantics of your keys are. 2010/2/16 Nguyễn Minh Kha nminh...@gmail.com: [...]
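To make the advice above concrete: with the order-preserving partitioner, tokens are keys, so the initial tokens should be picked from your expected key range. A sketch of a storage-conf.xml fragment for a three-node cluster, assuming (purely for illustration) that keys are lowercase strings spread roughly uniformly over a-z; the host names and token values are hypothetical:

```xml
<!-- storage-conf.xml fragment (sketch, same on every node except InitialToken) -->
<Partitioner>org.apache.cassandra.dht.OrderPreservingPartitioner</Partitioner>

<!-- Seeds: the same list of one or two well-known nodes on every node -->
<Seeds>
    <Seed>node01</Seed>
    <Seed>node02</Seed>
</Seeds>

<!-- Under OPP a token is a key. With keys spread over a-z, each node takes
     roughly a third of the range (hypothetical split points):
       node01: <InitialToken>h</InitialToken>
       node02: <InitialToken>p</InitialToken>
       node03: leave InitialToken empty and rebalance later if needed -->
```

Since the real key distribution is application-dependent, these split points would need to be adjusted after sampling actual keys, or corrected later with active load balancing.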
Re: Nodeprobe Not Working Properly
I can ping the other server using db1a instead of the IP address. /etc/hosts:

192.168.1.13 db1a
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
# Auto-generated hostname. Please do not remove this comment.
127.0.0.1 db1b.domain.com localhost db1b localhost.localdomain

db1b:~$ ping db1a
PING db1a (192.168.1.13) 56(84) bytes of data.
64 bytes from db1a (192.168.1.13): icmp_seq=1 ttl=64 time=0.252 ms
64 bytes from db1a (192.168.1.13): icmp_seq=2 ttl=64 time=0.228 ms
64 bytes from db1a (192.168.1.13): icmp_seq=3 ttl=64 time=0.233 ms
--- db1a ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1998ms
rtt min/avg/max/mdev = 0.228/0.237/0.252/0.020 ms

The same issue happens in reverse, when I switch db1a and db1b. Other possibly useful information:

db1b:~# sudo ifconfig
eth0   Link encap:Ethernet HWaddr 00:18:51:07:64:ea
       inet addr:192.168.1.14 Bcast:192.168.1.255 Mask:255.255.255.0
       inet6 addr: fe80::218:51ff:fe07:64ea/64 Scope:Link
       UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
       RX packets:938664 errors:0 dropped:0 overruns:0 frame:0
       TX packets:649625 errors:0 dropped:0 overruns:0 carrier:0
       collisions:0 txqueuelen:0
       RX bytes:429396142 (409.5 MiB) TX bytes:133120553 (126.9 MiB)
eth0:0 Link encap:Ethernet HWaddr 00:18:51:07:64:ea
       inet addr:xx.xx.xxx.xx Bcast:68.71.241.31 Mask:255.255.255.224
       UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
lo     Link encap:Local Loopback
       inet addr:127.0.0.1 Mask:255.0.0.0
       inet6 addr: ::1/128 Scope:Host
       UP LOOPBACK RUNNING MTU:16436 Metric:1
       RX packets:41715 errors:0 dropped:0 overruns:0 frame:0
       TX packets:41715 errors:0 dropped:0 overruns:0 carrier:0
       collisions:0 txqueuelen:0
       RX bytes:5095781 (4.8 MiB) TX bytes:5095781 (4.8 MiB)

db1b:~# lsof -i
COMMAND   PID   USER  FD  TYPE DEVICE SIZE NODE NAME
portmap   171 daemon  4u  IPv4 11228       UDP *:sunrpc
portmap   171 daemon  5u  IPv4 112286667   TCP *:sunrpc (LISTEN)
master    343   root 12u  IPv4 112287239   TCP db1b.domain.com:smtp (LISTEN)
jsvc    14672   root 24u  IPv6 113014281   TCP *:53586 (LISTEN)
jsvc    14672   root 26u  IPv6 113014282   TCP *:http-alt (LISTEN)
jsvc    14672   root 27u  IPv6 113014285   TCP *:39979 (LISTEN)
jsvc    14672   root 30u  IPv6 113014297   TCP 192.168.1.14:afs3-fileserver (LISTEN)
jsvc    14672   root 35u  IPv6 113014299   UDP 192.168.1.14:afs3-callback
jsvc    14672   root 44u  IPv6 113014301   TCP 192.168.1.14:9160 (LISTEN)

Thank you, Shahan

On Mon, 15 Feb 2010 20:52:02 -0600, Brandon Williams wrote: On Mon, Feb 15, 2010 at 8:46 PM, Shahan Khan wrote: I tried Brandon's suggestion, but am still getting the same error on the remote server. Any other suggestions? Is it possible that it's a bug? Thanks, Shahan. db1a = 192.168.1.13, db1b = 192.168.1.14

db1a:~# nodeprobe -host 192.168.1.14 ring
Error connecting to remote JMX agent!
java.rmi.ConnectException: Connection refused to host: 127.0.0.1; nested exception is:

For some reason, db1a indicates that 192.168.1.14 resolves to 127.0.0.1... what does your /etc/hosts look like on this machine?

db1b:~# nodeprobe -host 192.168.1.14 info
Token(bytes[eaaca3c3bd3caba3e14ee0f85d5cda8a])
Load             : 4 KB
Generation No    : 1266260277
Uptime (seconds) : 9933
Heap Memory (MB) : 54.83 / 1016.13

So the problem with db1b before was indeed incorrect host resolution for db1a. -Brandon
Re: Nodeprobe Not Working Properly
On Tue, Feb 16, 2010 at 11:08 AM, Shahan Khan cont...@shahan.me wrote: I can ping the other server using db1a instead of the host name.

By 'host name' I assume you mean IP address.

192.168.1.13 db1a
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
# Auto-generated hostname. Please do not remove this comment.
127.0.0.1 db1b.domain.com localhost db1b localhost.localdomain

db1b:~$ ping db1a
PING db1a (192.168.1.13) 56(84) bytes of data.
64 bytes from db1a (192.168.1.13): icmp_seq=1 ttl=64 time=0.252 ms
64 bytes from db1a (192.168.1.13): icmp_seq=2

So db1b's host resolution appears to be ok. Is this output from db1a, or db1b? It appears to be db1b, but your last issue was with db1a resolving db1b's IP address. Cassandra doesn't do anything magical with hostname resolution; it relies on the underlying system for that. -Brandon
Re: Cassandra benchmark shows OK throughput but high read latency (>100ms)?
Dumped 50 million records into my 2-node cluster overnight, and made sure that there aren't many data files (only around 30) per Martin's suggestion. The size of the data directory is 63 GB. Now when I read records from the cluster the read latency is still ~44ms, and there's no write happening during the read. And iostat shows that the disk (RAID10, 4x 250GB 15k SAS) is saturated:

Device:  rrqm/s  wrqm/s     r/s    w/s    rsec/s  wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda       47.67   67.67  190.33  17.00  23933.33  677.33    118.70      5.24  25.25   4.64  96.17
sda1       0.00    0.00    0.00   0.00      0.00    0.00      0.00      0.00   0.00   0.00   0.00
sda2      47.67   67.67  190.33  17.00  23933.33  677.33    118.70      5.24  25.25   4.64  96.17
sda3       0.00    0.00    0.00   0.00      0.00    0.00      0.00      0.00   0.00   0.00   0.00

CPU usage is low. Does this mean disk I/O is the bottleneck for my case? Will it help if I increase KeysCachedFraction to cache all sstable index entries? Also, this is almost a read-only test, and in reality our write/read ratio is close to 1:1, so I'm guessing read latency will go even higher in that case, because it will be difficult for Cassandra to find a good moment to compact the data files that are busy being written. Thanks, -Weijun

On Tue, Feb 16, 2010 at 6:06 AM, Brandon Williams dri...@gmail.com wrote: [...]
Re: Cassandra benchmark shows OK throughput but high read latency (>100ms)?
One more thought about Martin's suggestion: is it possible to put the data files into multiple directories located on different physical disks? That should help with the I/O bottleneck. Has anybody tested the row-caching feature in trunk (slated for 0.6?)? -Weijun

On Tue, Feb 16, 2010 at 9:50 AM, Weijun Li weiju...@gmail.com wrote: [...]
Re: Cassandra benchmark shows OK throughput but high read latency (>100ms)?
On Tue, Feb 16, 2010 at 11:50 AM, Weijun Li weiju...@gmail.com wrote: [...] CPU usage is low. Does this mean disk I/O is the bottleneck for my case? Will it help if I increase KeysCachedFraction to cache all sstable index entries?

That's exactly what this means. Disk is slow :(

Also, this is almost a read-only test, and in reality our write/read ratio is close to 1:1, so I'm guessing read latency will go even higher in that case, because it will be difficult for Cassandra to find a good moment to compact the data files that are busy being written.

Reads that cause disk seeks are always going to slow things down, since disk seeks are inherently the slowest operation in a machine. Writes in Cassandra should always be fast, as they do not cause any disk seeks. -Brandon
Re: Cassandra benchmark shows OK throughput but high read latency (>100ms)?
On Tue, Feb 16, 2010 at 11:56 AM, Weijun Li weiju...@gmail.com wrote: One more thought about Martin's suggestion: is it possible to put the data files into multiple directories located on different physical disks? That should help with the I/O bottleneck.

Yes, you can already do this; just add more DataFileDirectory directives pointed at multiple drives.

Has anybody tested the row-caching feature in trunk (slated for 0.6?)?

Row cache and key cache both help tremendously if your read pattern has a decent repeat rate. Completely random I/O can only be so fast, however. -Brandon
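For reference, the multiple-data-directory setup Brandon describes goes in storage-conf.xml, roughly like this (the paths are hypothetical; use one directory per physical drive or RAID set):

```xml
<!-- storage-conf.xml fragment (sketch): one DataFileDirectory per drive -->
<DataFileDirectories>
    <DataFileDirectory>/mnt/disk1/cassandra/data</DataFileDirectory>
    <DataFileDirectory>/mnt/disk2/cassandra/data</DataFileDirectory>
</DataFileDirectories>
<!-- Keeping the commit log on yet another spindle avoids its sequential
     writes competing with reads and compaction -->
<CommitLogDirectory>/mnt/disk3/cassandra/commitlog</CommitLogDirectory>
```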
Re: Cassandra benchmark shows OK throughput but high read latency (>100ms)?
Thanks for the DataFileDirectory trick, I'll give it a try. Just noticed the impact of the number of data files: node A has 13 data files with a read latency of 20ms, and node B has 27 files with a read latency of 60ms. After I ran nodeprobe compact on node B, its read latency went up to 150ms, while the read latency of node A became as low as 10ms. Is this normal behavior? I'm using the random partitioner, and the hardware/JVM settings are exactly the same for these two nodes. Another problem is that Java heap usage is always ~900MB out of 6GB. Is there any way to utilize all of the heap space to decrease the read latency? -Weijun

On Tue, Feb 16, 2010 at 10:01 AM, Brandon Williams dri...@gmail.com wrote: [...]
Re: Cassandra benchmark shows OK throughput but high read latency (>100ms)?
On Tue, Feb 16, 2010 at 12:16 PM, Weijun Li weiju...@gmail.com wrote: [...] After I ran nodeprobe compact on node B, its read latency went up to 150ms, while the read latency of node A became as low as 10ms. Is this normal behavior? I'm using the random partitioner, and the hardware/JVM settings are exactly the same for these two nodes.

It sounds like the latency jumped to 150ms because the newly written file was not in the OS cache.

Another problem is that Java heap usage is always ~900MB out of 6GB. Is there any way to utilize all of the heap space to decrease the read latency?

By default, Cassandra will use a 1GB heap, as set in bin/cassandra.in.sh. You can adjust the jvm heap there via the -Xmx option, but generally you want to balance the jvm against the OS cache. With 6GB, I would probably give 2GB to the jvm. If you aren't having issues now, increasing the jvm's memory probably won't provide any performance gains, though it's worth noting that with the row cache in 0.6 this may change. -Brandon
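Concretely, the heap setting lives in the JVM_OPTS section of bin/cassandra.in.sh; a sketch of the 2GB adjustment Brandon suggests (the surrounding options vary by version, so only the heap flags are shown):

```sh
# bin/cassandra.in.sh fragment (sketch): raise the heap from the 1GB
# default to 2GB, leaving the rest of RAM for the OS page cache
JVM_OPTS="$JVM_OPTS -Xms2G -Xmx2G"
```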
Re: cassandra freezes
On Tue, Feb 16, 2010 at 6:25 AM, Boris Shulman shulm...@gmail.com wrote: [...] I've noticed that during benchmarks with numerous writes Cassandra just freezes for several minutes. [...] Usually after performing 50K writes I get a TimeOutException and Cassandra just freezes. What configuration changes can I make to prevent this?

One thing that can cause seeming lockups is the garbage collector, so enabling GC debug output would be helpful to see GC activity. Some collectors (CMS specifically) can stop the system for a very long time, up to minutes. This is not necessarily the root cause, but it is easy to rule out. Beyond this, getting a stack trace during the lockup would make sense. That can pinpoint what the threads are doing, or what they are blocked on in case there is a deadlock or heavy contention on some shared resource. -+ Tatu +-
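Both diagnostics Tatu suggests use standard JVM tooling; a sketch (the GC flags go into JVM_OPTS in bin/cassandra.in.sh, and the log path and pid are placeholders):

```sh
# GC debug output: standard HotSpot flags, added to JVM_OPTS
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/log/cassandra/gc.log

# During a freeze, capture a thread dump to see what threads are blocked on
jstack <cassandra-pid> > threads.txt   # or: kill -QUIT <cassandra-pid>
                                       # (dump goes to Cassandra's stdout)
```

Long "concurrent mode failure" or full-GC pauses in the GC log would point at the collector; if the GC log is quiet during a freeze, the thread dump is the next place to look.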
Re: Cassandra benchmark shows OK throughput but high read latency (>100ms)?
After I ran nodeprobe compact on node B its read latency went up to 150ms.

The compaction process can take a while to finish... in 0.5 you need to watch the logs to figure out when it has actually finished, and then you should start seeing the improvement in read latency.

Is there any way to utilize all of the heap space to decrease the read latency?

In 0.5 you can adjust the number of keys that are cached by changing the 'KeysCachedFraction' parameter in your config file. In 0.6 you can additionally cache rows. You don't want to use up all of the memory on your box for those caches, though: you'll want to leave at least 50% for your OS's disk cache, which will store the full row content.

-----Original Message----- From: Weijun Li weiju...@gmail.com Sent: Tuesday, February 16, 2010 12:16pm To: cassandra-user@incubator.apache.org Subject: Re: Cassandra benchmark shows OK throughput but high read latency (>100ms)? [...]
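As a reference point for Stu's suggestion: in 0.5 the key-cache fraction is a per-column-family attribute in storage-conf.xml. A sketch, with the CF name and value hypothetical (the shipped default is 0.01, i.e. index entries for 1% of keys):

```xml
<!-- storage-conf.xml fragment (0.5, sketch): cache index entries for 10%
     of keys in this CF instead of the 0.01 default -->
<ColumnFamily Name="Standard1" CompareWith="BytesType"
              KeysCachedFraction="0.1"/>
```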
Re: Cassandra benchmark shows OK throughput but high read latency (>100ms)?
Still have high read latency with 50 million records in the 2-node cluster (replication factor 2). I restarted both nodes, but read latency is still above 60ms and disk I/O saturation is high. Tried compact and repair, but they don't help much. When I reduced the client threads from 15 to 5 it looks a lot better, but throughput is kind of low. I changed to 16 flushing threads instead of the default 8; could that cause the disk saturation issue? For benchmarks with decent throughput and latency, how many client threads do people use? Can anyone share the storage-conf.xml of a well-tuned high-volume cluster? -Weijun

On Tue, Feb 16, 2010 at 10:31 AM, Stu Hood stu.h...@rackspace.com wrote: [...]
Re: Question about Token selection for order-preserving partitioner
Hi, thanks Wojciech. I use TimeUUIDType for the CF Comments; I'm not using it for the initial token. On Tue, Feb 16, 2010 at 11:45 PM, Wojciech Kaczmarek kaczmare...@gmail.com wrote: [...] -- Nguyen Minh Kha, NCT Corporation, Email: kh...@nct.vn, Mobile: 090 696 1314, Y!M: iminhkha
Testing row cache feature in trunk: write should put record in cache
Just started to play with the row cache feature in trunk: it seems to be working fine so far, except that for the RowsCached parameter you need to specify a number of rows rather than a percentage (e.g., 20% doesn't work). Thanks for this great feature, which improves read latency dramatically, so that disk I/O is no longer a serious bottleneck. The problem is: when you write to Cassandra, it doesn't seem to put new keys in the row cache (it is said to update, instead of invalidate, an entry that is already in the cache). Is it easy to implement this feature? Which classes would need to be touched? I'm guessing that RowMutationVerbHandler would be the one to insert the entry into the row cache? -Weijun
Re: Testing row cache feature in trunk: write should put record in cache
On Tue, Feb 16, 2010 at 7:11 PM, Weijun Li weiju...@gmail.com wrote: Just started to play with the row cache feature in trunk: it seems to be working fine so far except that for RowsCached parameter you need to specify number of rows rather than a percentage (e.g., 20% doesn't work).

20% works, but it's 20% of the rows at server startup. So on a fresh start that is zero. Maybe we should just get rid of the % feature...

The problem is: when you write to Cassandra it doesn't seem to put the new keys in row cache (it is said to update instead invalidate if the entry is already in cache). Is it easy to implement this feature?

It's deliberately not done. For many (most?) workloads you don't want fresh writes blowing away your read cache. The code is in Table.apply:

    ColumnFamily cachedRow = cfs.getRawCachedRow(mutation.key());
    if (cachedRow != null)
        cachedRow.addAll(columnFamily);

I think it would be okay to have a WriteThrough option for what you're asking, though. -Jonathan
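Jonathan's point can be illustrated with a toy model (my own hypothetical classes, not Cassandra's internals): by default a write only refreshes a row that is *already* cached, so a write-heavy workload cannot evict read-hot rows; a hypothetical WriteThrough option would additionally populate the cache on write.

```java
import java.util.*;

// Toy model of the behavior described above; not Cassandra code.
public class RowCacheSketch {
    final Map<String, String> cache = new HashMap<>();
    final boolean writeThrough;

    RowCacheSketch(boolean writeThrough) { this.writeThrough = writeThrough; }

    void applyWrite(String key, String value) {
        if (cache.containsKey(key)) {
            cache.put(key, value);   // keep an already-cached row fresh
        } else if (writeThrough) {
            cache.put(key, value);   // optional: populate cache on write
        }
        // default: a miss leaves the cache untouched, protecting read-hot rows
    }

    public static void main(String[] args) {
        RowCacheSketch def = new RowCacheSketch(false);
        def.applyWrite("k1", "v1");
        System.out.println(def.cache.containsKey("k1")); // false: not populated

        RowCacheSketch wt = new RowCacheSketch(true);
        wt.applyWrite("k1", "v1");
        System.out.println(wt.cache.containsKey("k1")); // true
    }
}
```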
Re: Cassandra benchmark shows OK throughput but high read latency (>100ms)?
Have you tried increasing KeysCachedFraction?

On Tue, Feb 16, 2010 at 6:15 PM, Weijun Li weiju...@gmail.com wrote: Still seeing high read latency with 50M records in the 2-node cluster (replication factor 2). I restarted both nodes, but read latency is still above 60ms and disk I/O saturation is high. I tried compact and repair, but they don't help much. When I reduced the client threads from 15 to 5 it looks a lot better, but throughput is rather low. I changed to 16 flushing threads instead of the default 8; could that cause the disk saturation issue? For benchmarks with decent throughput and latency, how many client threads are used? Can anyone share a storage-conf.xml from a well-tuned high-volume cluster? -Weijun

On Tue, Feb 16, 2010 at 10:31 AM, Stu Hood stu.h...@rackspace.com wrote: "After I ran nodeprobe compact on node B its read latency went up to 150ms." The compaction process can take a while to finish... in 0.5 you need to watch the logs to figure out when it has actually finished, and then you should start seeing the improvement in read latency. "Is there any way to utilize all of the heap space to decrease the read latency?" In 0.5 you can adjust the number of keys that are cached by changing the 'KeysCachedFraction' parameter in your config file. In 0.6 you can additionally cache rows. You don't want to use up all of the memory on your box for those caches, though: you'll want to leave at least 50% for your OS's disk cache, which will store the full row content.

-----Original Message----- From: Weijun Li weiju...@gmail.com Sent: Tuesday, February 16, 2010 12:16pm To: cassandra-user@incubator.apache.org Subject: Re: Cassandra benchmark shows OK throughput but high read latency (>100ms)?

Thanks for the DataFileDirectory trick; I'll give it a try. Just noticed the impact of the number of data files: node A has 13 data files with a read latency of 20ms, and node B has 27 files with a read latency of 60ms. After I ran nodeprobe compact on node B, its read latency went up to 150ms. The read latency of node A became as low as 10ms. Is this normal behavior? I'm using the random partitioner, and the hardware/JVM settings are exactly the same for these two nodes. Another problem is that Java heap usage is always ~900MB out of 6GB. Is there any way to utilize all of the heap space to decrease the read latency? -Weijun

On Tue, Feb 16, 2010 at 10:01 AM, Brandon Williams dri...@gmail.com wrote: On Tue, Feb 16, 2010 at 11:56 AM, Weijun Li weiju...@gmail.com wrote: One more thought about Martin's suggestion: is it possible to put the data files into multiple directories located on different physical disks? That should help with the I/O bottleneck. Yes, you can already do this: just add more DataFileDirectory directives pointed at multiple drives. "Has anybody tested the row-caching feature in trunk (shoot for 0.6?)" Row cache and key cache both help tremendously if your read pattern has a decent repeat rate. Completely random I/O can only be so fast, however. -Brandon
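For reference, spreading data files across drives is just a matter of listing several DataFileDirectory entries in storage-conf.xml. A 0.5-era fragment might look like the following (the paths are placeholders for whatever mount points your drives use):

```xml
<DataFileDirectories>
    <DataFileDirectory>/disk1/cassandra/data</DataFileDirectory>
    <DataFileDirectory>/disk2/cassandra/data</DataFileDirectory>
    <DataFileDirectory>/disk3/cassandra/data</DataFileDirectory>
</DataFileDirectories>
```

As Martin notes in the message above, the commit log ideally lives on yet another drive (CommitLogDirectory) so that sequential log writes don't compete with sstable reads and compaction.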
Re: Testing row cache feature in trunk: write should put record in cache
On Tue, Feb 16, 2010 at 7:17 PM, Jonathan Ellis jbel...@gmail.com wrote: 20% works, but it's 20% of the rows at server startup. So on a fresh start that is zero. Maybe we should just get rid of the % feature...

(Actually, it shouldn't be hard to update this on flush, if you want to open a ticket.)
Re: Cassandra benchmark shows OK throughput but high read latency (>100ms)?
Yes, my KeysCachedFraction is already 0.3, but it doesn't relieve the disk I/O. I compacted the data into a single ~60GB file (it took quite a while to finish, and increased latency as expected), but that doesn't help much either. If I set KeysCachedFraction to 1 (meaning cache the entire sstable index), how much memory will it take for 50M keys? Is the index a straight key-to-offset map? I guess a key is 16 bytes and an offset is 8 bytes. Will KeysCachedFraction=1 help to reduce disk I/O? -Weijun

On Tue, Feb 16, 2010 at 5:18 PM, Jonathan Ellis jbel...@gmail.com wrote: Have you tried increasing KeysCachedFraction? [snip: earlier messages quoted in full above]
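A back-of-the-envelope answer to the memory question (my own assumptions, not measured Cassandra numbers): the cached index is roughly a key-to-offset map, but on the JVM each entry costs well beyond key bytes plus an 8-byte offset once the object header, the String's backing array, and the map entry itself are counted. Assuming roughly 64 bytes of such per-entry overhead:

```java
// Back-of-the-envelope key cache sizing; the 64-byte per-entry overhead
// below is an assumption about JVM object costs, not a measured figure.
public class KeyCacheEstimate {
    // raw key bytes + 8-byte offset + assumed JVM per-entry overhead
    static long bytesPerEntry(int keyBytes, int perEntryOverhead) {
        return keyBytes + 8L + perEntryOverhead;
    }

    static long totalMB(long keys, int keyBytes, int perEntryOverhead) {
        return keys * bytesPerEntry(keyBytes, perEntryOverhead) / (1024 * 1024);
    }

    public static void main(String[] args) {
        // 50M keys, 16-byte keys, ~64 bytes assumed overhead per entry
        System.out.println(totalMB(50_000_000L, 16, 64) + " MB"); // ~4196 MB
    }
}
```

So under these assumptions, KeysCachedFraction=1 for 50 million keys would need on the order of 4GB of heap, not the ~1.2GB the raw key and offset bytes suggest, which reinforces Stu's advice to leave at least half the box's memory to the OS page cache instead.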
Re: Testing row cache feature in trunk: write should put record in cache
Just tried to make a quick change to enable it, but it didn't work out :-(

    ColumnFamily cachedRow = cfs.getRawCachedRow(mutation.key());
    // What I modified
    if (cachedRow == null) {
        cfs.cacheRow(mutation.key());
        cachedRow = cfs.getRawCachedRow(mutation.key());
    }
    if (cachedRow != null)
        cachedRow.addAll(columnFamily);

How can I open a ticket for you to make the change (enable row-cache write-through with an option)? Thanks, -Weijun

On Tue, Feb 16, 2010 at 5:20 PM, Jonathan Ellis jbel...@gmail.com wrote: (Actually, it shouldn't be hard to update this on flush, if you want to open a ticket.) [snip: earlier messages quoted in full above]
Re: Testing row cache feature in trunk: write should put record in cache
... tell you what, if you write the option-processing part in DatabaseDescriptor, I will do the actual cache part. :)

On Tue, Feb 16, 2010 at 11:07 PM, Jonathan Ellis jbel...@gmail.com wrote: https://issues.apache.org/jira/secure/CreateIssue!default.jspa, but this is pretty low priority for me.

On Tue, Feb 16, 2010 at 8:37 PM, Weijun Li weiju...@gmail.com wrote: Just tried to make quick change to enable it but it didn't work out :-( [...] How can I open a ticket for you to make the change (enable row cache write through with an option)? Thanks, -Weijun [snip: rest of quoted thread]