RE: Effective allocation of multiple disks
You can list multiple DataFileDirectories, and Cassandra will scatter files across all of them. Use 1 disk for the commitlog, and 3 disks for data directories. See http://wiki.apache.org/cassandra/CassandraHardware#Disk

Thanks,
Stu

-Original Message-
From: Eric Rosenberry epros...@gmail.com
Sent: Wednesday, March 10, 2010 2:00am
To: cassandra-user@incubator.apache.org
Subject: Effective allocation of multiple disks

Based on the documentation, it is clear that with Cassandra you want to have one disk for commitlog, and one disk for data.

My question is: if you think your workload is going to require more I/O performance to the data disks than a single disk can handle, how would you recommend effectively utilizing additional disks?

It would seem a number of vendors sell 1U boxes with four 3.5 inch disks. If we use one for commitlog, is there a way to have Cassandra itself equally split data across the three remaining disks? Or is this something that needs to be handled at the hardware level, or at the operating system/file system level? Options include a hardware RAID controller in a RAID 0 stripe (this is more $$$, and for what gain?), or utilizing a volume manager like LVM.

Along those same lines, if you do implement some type of striping, what RAID stripe size is recommended? (I think Todd Burruss asked this earlier but I did not see a response.)

Thanks for any input!
-Eric
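For illustration, the layout Stu describes would look roughly like this in a 0.5/0.6-era storage-conf.xml; the mount points here are examples, not defaults:

    <CommitLogDirectory>/mnt/disk1/cassandra/commitlog</CommitLogDirectory>
    <DataFileDirectories>
        <DataFileDirectory>/mnt/disk2/cassandra/data</DataFileDirectory>
        <DataFileDirectory>/mnt/disk3/cassandra/data</DataFileDirectory>
        <DataFileDirectory>/mnt/disk4/cassandra/data</DataFileDirectory>
    </DataFileDirectories>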
RE: CassandraHardware link on the wiki FrontPage
Anyone can edit any page once they have an account: click the Login link at the top right next to the search box to create an account.

Thanks,
Stu

-Original Message-
From: Eric Rosenberry e...@rosenberry.org
Sent: Wednesday, March 10, 2010 2:52am
To: cassandra-user@incubator.apache.org
Subject: CassandraHardware link on the wiki FrontPage

Would it be possible to add a link to the CassandraHardware page from the FrontPage of the wiki? I think other new folks to Cassandra may find it useful. ;-) (I would do it myself, though that page is Immutable.)

http://wiki.apache.org/cassandra/FrontPage
http://wiki.apache.org/cassandra/CassandraHardware

Thanks!
-Eric
Re: Effective allocation of multiple disks
Yea, I suppose major compactions are the wildcard here. Nonetheless, the situation where you only have 1 SSTable should be very rare. I'll open a ticket though, because we really ought to be able to utilize those disks more thoroughly, and I have some ideas there.

-Original Message-
From: Anthony Molinaro antho...@alumni.caltech.edu
Sent: Wednesday, March 10, 2010 3:38pm
To: cassandra-user@incubator.apache.org
Subject: Re: Effective allocation of multiple disks

This is incorrect, as discussed a few weeks ago. I have a setup with multiple disks, and as soon as compaction occurs all the data ends up on one disk. If you need the additional I/O, you will want RAID 0; simply listing multiple DataFileDirectories will not work.

-Anthony

On Wed, Mar 10, 2010 at 02:08:13AM -0600, Stu Hood wrote:

You can list multiple DataFileDirectories, and Cassandra will scatter files across all of them. Use 1 disk for the commitlog, and 3 disks for data directories. See http://wiki.apache.org/cassandra/CassandraHardware#Disk

Thanks,
Stu

-Original Message-
From: Eric Rosenberry epros...@gmail.com
Sent: Wednesday, March 10, 2010 2:00am
To: cassandra-user@incubator.apache.org
Subject: Effective allocation of multiple disks

Based on the documentation, it is clear that with Cassandra you want to have one disk for commitlog, and one disk for data. My question is: if you think your workload is going to require more I/O performance to the data disks than a single disk can handle, how would you recommend effectively utilizing additional disks? It would seem a number of vendors sell 1U boxes with four 3.5 inch disks. If we use one for commitlog, is there a way to have Cassandra itself equally split data across the three remaining disks? Or is this something that needs to be handled at the hardware level, or at the operating system/file system level? Options include a hardware RAID controller in a RAID 0 stripe (this is more $$$, and for what gain?), or utilizing a volume manager like LVM. Along those same lines, if you do implement some type of striping, what RAID stripe size is recommended? (I think Todd Burruss asked this earlier but I did not see a response.)

Thanks for any input!
-Eric

--
Anthony Molinaro antho...@alumni.caltech.edu
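For reference, the OS-level striping Anthony recommends can be set up with mdadm; this is only a sketch, and the device names, filesystem, and mount point are assumptions to adapt to your hardware:

    # Stripe three data disks into one RAID 0 device (device names are examples)
    mdadm --create /dev/md0 --level=0 --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd
    # Create a filesystem and mount it where your DataFileDirectory points
    mkfs.ext3 /dev/md0
    mount /dev/md0 /var/lib/cassandra/data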
Re: Hackathon?!?
Definitely on board!

-Original Message-
From: Dan Di Spaltro dan.dispal...@gmail.com
Sent: Tuesday, March 9, 2010 8:05pm
To: cassandra-user@incubator.apache.org
Subject: Re: Hackathon?!?

Alright guys, we have settled on a date for the Cassandra meetup on... April 15th, better known as Tax Day! We can host it here at Cloudkick, unless a cooler startup wants to host it.

http://maps.google.com/maps/ms?ie=UTF8&hl=en&msa=0&msid=100290781618196563860.000478354937656785449&z=19
1499 Potrero Ave, San Francisco, CA 94110

Bottom line, it would be great to get some folks together and spend some time doing an intro, covering some deployments and data models, and trying to address all the other burning questions out there. We pushed it out from PyCon and hopefully settled on a good day. Let's get a count of how many folks are interested!

Thanks,

On Tue, Feb 9, 2010 at 3:10 PM, Reuben Smith reuben.sm...@gmail.com wrote:

I live in the city and I'd like to add my vote for an Intro to Cassandra night.

Reuben

On Tue, Feb 9, 2010 at 10:43 AM, Dan Di Spaltro dan.dispal...@gmail.com wrote:

I think the tentative plan would be to push this out a bit farther from PyCon, to get a bigger attendance. It sounds like an Intro to Cassandra would be a better theme; focus on the education piece. But it will happen! So stay tuned.

On Tue, Feb 9, 2010 at 3:53 AM, Wayne Lewis wa...@lewisclan.org wrote:

Hi Dan, are you still planning for end of Feb? Please add me to the very interested list. Thanks!

Wayne Lewis

On Jan 26, 2010, at 8:42 PM, Dan Di Spaltro wrote:

Would anyone be interested in a Cassandra hack-a-thon at the end of February in San Francisco? I think it would be great to get everyone together, since the last hack-a-thon was at the Twitter office back around OSCON time. We could provide space in the Mission area, or someone else could too; our office is in a pretty interesting area (http://maps.google.com/maps/ms?ie=UTF8&hl=en&msa=0&msid=100290781618196563860.000478354937656785449&z=17). Tell me what you guys think!

--
Dan Di Spaltro
RE: Latest check-in to trunk/ is broken
Run `ant clean` before building. A few files moved around.

-Original Message-
From: Cool BSD c...@coolbsd.com
Sent: Monday, March 8, 2010 5:18pm
To: cassandra-user cassandra-user@incubator.apache.org
Subject: Latest check-in to trunk/ is broken

Version info:

$ svn info
Path: .
URL: https://svn.apache.org/repos/asf/incubator/cassandra/trunk
Repository Root: https://svn.apache.org/repos/asf
Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68
Revision: 920560
Node Kind: directory
Schedule: normal
Last Changed Author: gdusbabek
Last Changed Rev: 920537
Last Changed Date: 2010-03-08 14:00:51 -0800 (Mon, 08 Mar 2010)

and error message:

build-project:
    [echo] apache-cassandra: /net/f5/shared/nosql/cassandra/archive/svn/build.xml
    [javac] Compiling 277 source files to /net/f5/shared/nosql/cassandra/archive/svn/build/classes
    [javac] /net/f5/shared/nosql/cassandra/archive/svn/src/java/org/apache/cassandra/db/CompactionManager.java:112: reference to SSTableReader is ambiguous, both class org.apache.cassandra.io.sstable.SSTableReader in org.apache.cassandra.io.sstable and class org.apache.cassandra.io.SSTableReader in org.apache.cassandra.io match
    [javac] private void updateEstimateFor(ColumnFamilyStore cfs, Set<List<SSTableReader>> buckets)
    [javac]     ^
    [javac] /net/f5/shared/nosql/cassandra/archive/svn/src/java/org/apache/cassandra/db/CompactionManager.java:138: reference to SSTableReader is ambiguous, both class org.apache.cassandra.io.sstable.SSTableReader in org.apache.cassandra.io.sstable and class org.apache.cassandra.io.SSTableReader in org.apache.cassandra.io match
    [javac] public Future<List<SSTableReader>> submitAnticompaction(final ColumnFamilyStore cfStore, final Collection<Range> ranges, final InetAddress target)
    [javac]     ^
    [javac] /net/f5/shared/nosql/cassandra/archive/svn/src/java/org/apache/cassandra/db/CompactionManager.java:240: reference to SSTableReader is ambiguous, both class org.apache.cassandra.io.sstable.SSTableReader in org.apache.cassandra.io.sstable and class org.apache.cassandra.io.SSTableReader in org.apache.cassandra.io match
    [javac] int doCompaction(ColumnFamilyStore cfs, Collection<SSTableReader> sstables, int gcBefore) throws IOException
    [javac]     ^
    [javac] /net/f5/shared/nosql/cassandra/archive/svn/src/java/org/apache/cassandra/db/CompactionManager.java:341: reference to SSTableReader is ambiguous, both class org.apache.cassandra.io.sstable.SSTableReader in org.apache.cassandra.io.sstable and class org.apache.cassandra.io.SSTableReader in org.apache.cassandra.io match
    [javac] private List<SSTableReader> doAntiCompaction(ColumnFamilyStore cfs, Collection<SSTableReader> sstables, Collection<Range> ranges, InetAddress target)
    [javac]     ^
    [javac] /net/f5/shared/nosql/cassandra/archive/svn/src/java/org/apache/cassandra/db/CompactionManager.java:341: reference to SSTableReader is ambiguous, both class org.apache.cassandra.io.sstable.SSTableReader in org.apache.cassandra.io.sstable and class org.apache.cassandra.io.SSTableReader in org.apache.cassandra.io match
    [javac] private List<SSTableReader> doAntiCompaction(ColumnFamilyStore cfs, Collection<SSTableReader> sstables, Collection<Range> ranges, InetAddress target)
    [javac]     ^
    [javac] /net/f5/shared/nosql/cassandra/archive/svn/src/java/org/apache/cassandra/db/CompactionManager.java:451: reference to SSTableReader is ambiguous, both class org.apache.cassandra.io.sstable.SSTableReader in org.apache.cassandra.io.sstable and class org.apache.cassandra.io.SSTableReader in org.apache.cassandra.io match
    [javac] static Set<List<SSTableReader>> getBuckets(Iterable<SSTableReader> files, long min)
    [javac]     ^
    [javac] /net/f5/shared/nosql/cassandra/archive/svn/src/java/org/apache/cassandra/db/CompactionManager.java:451: reference to SSTableReader is ambiguous, both class org.apache.cassandra.io.sstable.SSTableReader in org.apache.cassandra.io.sstable and class org.apache.cassandra.io.SSTableReader in org.apache.cassandra.io match
    [javac] static Set<List<SSTableReader>> getBuckets(Iterable<SSTableReader> files, long min)
    [javac]     ^
    [javac] /net/f5/shared/nosql/cassandra/archive/svn/src/java/org/apache/cassandra/db/CompactionManager.java:498: reference to SSTableScanner is ambiguous, both class org.apache.cassandra.io.sstable.SSTableScanner in org.apache.cassandra.io.sstable and class org.apache.cassandra.io.SSTableScanner in org.apache.cassandra.io match
    [javac] private Set<SSTableScanner> scanners;
    [javac]     ^
    [javac] /net/f5/shared/nosql/cassandra/archive/svn/src/java/org/apache/cassandra/db/CompactionManager.java:500: reference to SSTableReader is ambiguous, both class
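For reference, the fix Stu suggests is just a clean rebuild from the checkout root; presumably the stale classes under build/classes (compiled before the SSTable files moved packages) are what let javac see both copies:

    $ cd cassandra-trunk   # your checkout root; directory name is an example
    $ ant clean
    $ ant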
Re: Dynamically Switching from Ordered Partitioner to Random?
But rather than switching, you should definitely try the 'loadbalance' approach first, and see whether OrderPreservingPartitioner works out for you.

-Original Message-
From: Chris Goffinet goffi...@digg.com
Sent: Friday, March 5, 2010 1:43pm
To: cassandra-user@incubator.apache.org
Subject: Re: Dynamically Switching from Ordered Partitioner to Random?

At this time, you have to re-import the data.

-Chris

On Fri, Mar 5, 2010 at 11:42 AM, shiv shivaji shivaji...@yahoo.com wrote:

I started with the ordered partitioner as I was hoping to make use of the map-reduce functionality. However, my data was likely lumped onto 2 key machines, with most of it on one (as seen from another thread; there were also machine failures to blame for the uneven distribution).

One solution I am trying is to load balance. Is there anything else I can try to convert the partitioner to random on a live system? I know this sounds like an odd request; I'm curious about my options, though. I did see a post mentioning that one can compute the MD5 hash of each key, insert using that, and keep a mapping table from key to MD5 hash. Unfortunately, the data is already loaded using an ordered partitioner and I was wondering if there is a way to switch to random now.

Shiv

--
Chris Goffinet
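The MD5-key workaround Shiv mentions amounts to hashing each application key before insert and keeping a lookup table alongside; a minimal sketch of that idea (the class and method names here are made up for illustration):

    import java.math.BigInteger;
    import java.security.MessageDigest;

    public class Md5Keys {
        // Returns the hex MD5 digest of the application key; rows stored under
        // this digest spread evenly even under an order-preserving partitioner.
        public static String rowKeyFor(String applicationKey) {
            try {
                MessageDigest md5 = MessageDigest.getInstance("MD5");
                byte[] digest = md5.digest(applicationKey.getBytes("UTF-8"));
                return String.format("%032x", new BigInteger(1, digest));
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
    }
    // A separate column family would then map applicationKey -> rowKeyFor(applicationKey)
    // so original keys remain discoverable without scanning.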
Re: Connect during bootstrapping?
You are probably in the portion of bootstrap where data to be transferred is split out to disk, which can take a while: see https://issues.apache.org/jira/browse/CASSANDRA-579

Look for a 'streaming' subdirectory in your data directories to confirm.

-Original Message-
From: Brian Frank Cooper coop...@yahoo-inc.com
Sent: Tuesday, March 2, 2010 11:50pm
To: cassandra-user@incubator.apache.org
Subject: Re: Connect during bootstrapping?

Thanks for the note. Can you help me with something else? I can't seem to get any data to transfer during bootstrapping... I must be doing something wrong. Here is what I did: I took 0.6.0-beta2 and loaded 2 machines with 60-70GB each. Then I started a third node, with AutoBootstrap true. The node claims it is bootstrapping:

INFO - Auto DiskAccessMode determined to be mmap
INFO - Saved Token not found. Using Rb0mePN3PheW3haA
INFO - Creating new commitlog segment /home/cooperb/cassandra/commitlog/CommitLog-1267594407761.log
INFO - Starting up server gossip
INFO - Joining: getting load information
INFO - Sleeping 9 ms to wait for load information...
INFO - Node /98.137.30.37 is now part of the cluster
INFO - Node /98.137.30.38 is now part of the cluster
INFO - InetAddress /98.137.30.37 is now UP
INFO - InetAddress /98.137.30.38 is now UP
INFO - Joining: getting bootstrap token
INFO - New token will be user148315419 to assume load from /98.137.30.38
INFO - Joining: sleeping 3 for pending range setup
INFO - Bootstrapping

But when I run nodetool streams, no streams are transferring:

Mode: Bootstrapping
Not sending any streams.
Not receiving any streams.

And it doesn't look like the node is getting any data. Any ideas? Thanks for the help...

Brian

On 3/2/10 12:22 PM, Jonathan Ellis jbel...@gmail.com wrote:

On Tue, Mar 2, 2010 at 1:54 PM, Brian Frank Cooper coop...@yahoo-inc.com wrote:

Hi folks, I'm running 0.5 and I had 2 nodes up and running, then added a 3rd node in bootstrap mode. I understand from other discussion list threads that the new node doesn't serve reads while it is bootstrapping, but does that mean it won't connect at all?

It doesn't start the Thrift listener until it is bootstrapped, so yes. (You can tell when it's bootstrapped by when it appears in nodeprobe ring. 0.6 also adds bootstrap progress reporting via JMX.)

When I try to connect from my java client, or cassandra-cli, I get the exception below. Is it the expected behavior? (Also, cassandra-cli says Connected to xxx.yahoo.com even though it isn't really connected...)

This is fixed in https://issues.apache.org/jira/browse/CASSANDRA-807 for 0.6, fwiw.

-Jonathan

--
Brian Cooper
Principal Research Scientist
Yahoo! Research
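To run Stu's check, list the data directories configured in storage-conf.xml and look for the streaming subdirectory; the path below is an example matching Brian's layout, not a default:

    $ ls /home/cooperb/cassandra/data/streaming
    (sstable files accumulating here suggest data is still being split out to
    disk for transfer, before any streams show up in nodetool)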
Re: Is Cassandra a document based DB?
"In HBase you have table:row:family:key:val:version, which some people might consider richer"

Cassandra is actually table:family:row:key:val[:subval], where subvals are the columns stored in a supercolumn (which can be easily arranged by timestamp to give the versioned approach).

-Original Message-
From: Erik Holstad erikhols...@gmail.com
Sent: Monday, March 1, 2010 3:49pm
To: cassandra-user@incubator.apache.org
Subject: Re: Is Cassandra a document based DB?

On Mon, Mar 1, 2010 at 4:41 AM, Brandon Williams dri...@gmail.com wrote:

On Mon, Mar 1, 2010 at 5:34 AM, HHB hubaghd...@yahoo.ca wrote:

What are the advantages/disadvantages of Cassandra over HBase?

Ease of setup: all nodes are the same.
No single point of failure: all nodes are the same.
Speed: http://www.brianfrankcooper.net/pubs/ycsb-v4.pdf
Richer model: supercolumns.

I think that there are people that would be of a different opinion here. Cassandra has, as I've understood it, table:key:name:val, and in cases the val is a serialized data structure. In HBase you have table:row:family:key:val:version, which some people might consider richer.

Multi-datacenter awareness.

There are likely other things I'm forgetting, but those stand out for me.

-Brandon

--
Regards Erik
Re: StackOverflowError on high load
Ran,

There are bounds to how large your data directory will grow, relative to the actual data. Please read up on compaction: http://wiki.apache.org/cassandra/MemtableSSTable , and if you have a significant number of deletes occurring, also read http://wiki.apache.org/cassandra/DistributedDeletes

The key mitigation is to ensure that minor compactions get a chance to occur regularly. This will happen automatically, but the faster you write data to your nodes, the more behind on compactions they can get. We consider this a bug, and CASSANDRA-685 will be exploring solutions so that your client automatically backs off as a node becomes overloaded.

Thanks,
Stu

-Original Message-
From: Ran Tavory ran...@gmail.com
Sent: Sunday, February 21, 2010 9:01am
To: cassandra-user@incubator.apache.org
Subject: Re: StackOverflowError on high load

This sort of explains it, yes, but what solution can I use? I do see the OPP writes go faster than the RP, so it makes sense that when using the OPP there's a higher chance a host will fall behind with compaction and eventually crash. It's not a nice feature, but hopefully there are mitigations. So my question is: what are the mitigations? What should I tell my admin to do in order to prevent this? Telling him to increase the directory size 2x isn't going to cut it, as the directory just keeps growing and is not bounded... I'm also not clear on whether CASSANDRA-804 is going to be a real fix.

Thanks

On Sat, Feb 20, 2010 at 9:36 PM, Jonathan Ellis jbel...@gmail.com wrote:

If OPP is configured w/ imbalanced ranges (or less balanced than RP) then that would explain it. OPP is actually slightly faster in terms of raw speed.

On Sat, Feb 20, 2010 at 2:31 PM, Ran Tavory ran...@gmail.com wrote:

Interestingly, I ran the same load but this time with a random partitioner and, although from time to time test2 was a little behind with its compaction task, it did not crash and was able to eventually close the gaps that were opened. Does this make sense? Is there a reason why the random partitioner is less likely to be faulty in this scenario? The scenario is about 1300 writes/sec of small amounts of data to a single CF on a cluster with two nodes and no replication. With the order-preserving partitioner, after a few hours of load the compaction pool is behind on one of the hosts and eventually this host crashes; with the random partitioner it doesn't crash.

thanks

On Sat, Feb 20, 2010 at 6:27 AM, Jonathan Ellis jbel...@gmail.com wrote:

Looks like test1 started gc storming, so test2 treats it as dead and starts doing hinted handoff for it, which increases test2's load, even though test1 is not completely dead yet.

On Thu, Feb 18, 2010 at 1:16 AM, Ran Tavory ran...@gmail.com wrote:

I found another interesting graph, attached. I looked at the write-count and write-latency of the CF I'm writing to and I see a few interesting things:
1. The host test2 crashed at 18:00.
2. At 16:00, after a few hours of load, both hosts dropped their write-count. test1 (which did not crash) started slowing down first, and then test2 slowed.
3. At 16:00 I start seeing high write-latency on test2 only. This goes on for about 2h until finally at 18:00 it crashes.
Does this help?

On Thu, Feb 18, 2010 at 7:44 AM, Ran Tavory ran...@gmail.com wrote:

I ran the process again and after a few hours the same node crashed the same way. Now I can tell for sure this is indeed what Jonathan proposed - the data directory needs to be 2x of what it is - but it looks like a design problem. How large do I need to tell my admin to make it, then? Here's what I see when the server crashes:

$ df -h /outbrain/cassandra/data/
Filesystem                  Size  Used Avail Use% Mounted on
/dev/mapper/cassandra-data   97G   46G   47G  50% /outbrain/cassandra/data

The directory is 97G and when the host crashes it's at 50% use.

I'm also monitoring various JMX counters and I see that COMPACTION-POOL PendingTasks grows for a while on this host (not on the other host, btw, which is fine; just this host) and then stays flat for 3 hours. After 3 hours of flat it crashes. I'm attaching the graph.

When I restart cassandra on this host (no change to the file allocation size, just a restart) it does manage to compact the data files pretty fast, so after a minute I get 12% use; I wonder what made it crash before that doesn't now? (Could be the load, which isn't running now.)

$ df -h /outbrain/cassandra/data/
Filesystem                  Size  Used Avail Use% Mounted on
/dev/mapper/cassandra-data   97G   11G   82G  12% /outbrain/cassandra/data

The question is what size does the data directory need to be? It's not 2x the size of the data I expect to have (I only have 11G of real data after compaction and the dir is 97G, so it
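A quick way to watch the compaction backlog Ran describes, without a full JMX setup, is the tpstats command used elsewhere in these threads:

    $ bin/nodeprobe -host localhost tpstats
    (watch the COMPACTION-POOL row; Pending should drain back toward zero
    between bursts of writes rather than climbing for hours)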
Re: Cassandra benchmark shows OK throughput but high read latency (> 100ms)?
"After I ran nodeprobe compact on node B its read latency went up to 150ms."

The compaction process can take a while to finish... in 0.5 you need to watch the logs to figure out when it has actually finished, and then you should start seeing the improvement in read latency.

"Is there any way to utilize all of the heap space to decrease the read latency?"

In 0.5 you can adjust the number of keys that are cached by changing the 'KeysCachedFraction' parameter in your config file. In 0.6 you can additionally cache rows. You don't want to use up all of the memory on your box for those caches though: you'll want to leave at least 50% for your OS's disk cache, which will store the full row content.

-Original Message-
From: Weijun Li weiju...@gmail.com
Sent: Tuesday, February 16, 2010 12:16pm
To: cassandra-user@incubator.apache.org
Subject: Re: Cassandra benchmark shows OK throughput but high read latency (> 100ms)?

Thanks for the DataFileDirectory trick; I'll give it a try. Just noticed the impact of the number of data files: node A has 13 data files with a read latency of 20ms, and node B has 27 files with a read latency of 60ms. After I ran nodeprobe compact on node B its read latency went up to 150ms, while the read latency of node A became as low as 10ms. Is this normal behavior? I'm using the random partitioner, and the hardware/JVM settings are exactly the same for these two nodes.

Another problem is that Java heap usage always stays at 900MB out of 6GB. Is there any way to utilize all of the heap space to decrease the read latency?

-Weijun

On Tue, Feb 16, 2010 at 10:01 AM, Brandon Williams dri...@gmail.com wrote:

On Tue, Feb 16, 2010 at 11:56 AM, Weijun Li weiju...@gmail.com wrote:

One more thought about Martin's suggestion: is it possible to put the data files into multiple directories that are located on different physical disks? This should help to improve the I/O bottleneck issue.

Yes, you can already do this: just add more DataFileDirectory directives pointed at multiple drives.

Has anybody tested the row-caching feature in trunk (shooting for 0.6)?

Row cache and key cache both help tremendously if your read pattern has a decent repeat rate. Completely random I/O can only be so fast, however.

-Brandon
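For reference, the 0.5-era knob Stu mentions lives in storage-conf.xml; the value below is an example, not a recommendation:

    <!-- Fraction of keys per sstable whose index positions are kept in memory -->
    <KeysCachedFraction>0.05</KeysCachedFraction>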
Re: TimeOutExceptions and Cluster Performance
The combination of 'too many open files' and lots of memtable flushes could mean you have tons and tons of sstables on disk. This can make reads especially slow. If you are seeing the timeouts on reads a lot more often than on writes, then this explanation might make sense, and you should watch https://issues.apache.org/jira/browse/CASSANDRA-685.

Thanks,
Stu

-Original Message-
From: Jonathan Ellis jbel...@gmail.com
Sent: Friday, February 12, 2010 9:43pm
To: cassandra-user@incubator.apache.org
Subject: Re: TimeOutExceptions and Cluster Performance

There are a lot more details that would be useful, but if you are on the verge of OOMing and something actually is running out, then that's probably the culprit; when the JVM gets low on RAM it will consume all your CPU trying to GC enough to continue. (You mentioned seeing high CPU on one core, which tends to corroborate this; to confirm, you can look at the thread using the CPU: http://publib.boulder.ibm.com/infocenter/javasdk/tools/index.jsp?topic=/com.ibm.java.doc.igaa/_1vg0001475cb4a-1190e2e0f74-8000_1007.html)

Look at your executor queues, in the output of nodeprobe tpstats if you have no other metrics system. You are probably just swamping it with writes; if you have 1000s of ops in any of the pending queues, that's bad.

-Jonathan

On Fri, Feb 12, 2010 at 7:40 PM, Stephen Hamer stephen.ha...@gmail.com wrote:

Hi,

I'm running a 5 node Cassandra cluster and am having a very tough time getting reasonable performance from it. Many of the requests are failing with TimeOutException. This is making it difficult to use Cassandra in a production setting.

The cluster was running fine for a week or two (it was created 3 weeks ago) but has started to degrade in the last week. The cluster was originally only 3 nodes, but when performance started to degrade I added another two nodes. This doesn't seem to have helped, though.

Requests being made from my application are balanced across the cluster in a round-robin fashion. Many of these requests are failing with TimeOutException. When this occurs I can look at the DB servers and see several of them fully utilizing 1 core. I can turn off my application while this is going on (which stops all reads and writes to Cassandra); the cluster seems to stay in this state for several more hours before returning to a resting state.

When the CPU is loaded I see lots of messages about enqueuing, sorting, and writing memtables, so I have tried adjusting the memtable size down to 16MB and raised MemtableFlushAfterMinutes to 1440. This doesn't seem to have affected anything, though.

I was seeing errors about too many file descriptors being open, so I added "ulimit -n 32768" to cassandra.in.sh. This seems to have fixed that. I was also seeing lots of out of memory exceptions, so I raised the heap size to 4GB. This has helped but not eliminated the OOM issues.

I'm not sure if it's related to any of the performance issues, but I see lots of log entries about DigestMismatchExceptions. I've included a sample of the exceptions below.

My Cassandra cluster is almost unusable in its current state because of the number of timeout exceptions that I'm seeing. I suspect that this is because of a configuration problem or something I have improperly set up. It feels like the database has entered a bad state which is causing it to churn as much as it is, but I have no way to verify this. What steps can I take to address the performance issues I am seeing and the consistent stream of TimeOutExceptions?

Thanks,
Stephen

Here are some specifics about the cluster configuration:

5 Large EC2 instances - 7.5 GB RAM, 2 cores (64-bit, 1-1.2GHz), data and commit logs stored on separate EBS volumes. Boxes are running Debian 5.

r...@prod-cassandra4 ~/cassandra # bin/nodeprobe -host localhost ring
Address         Status  Load      Range                                     Ring
                                  101279862673517536112907910111793343978
10.254.55.191   Up      2.94 GB   27246729060092122727944947571993545      |--|
10.214.119.127  Up      3.67 GB   34209800341332764076889844611182786881   |  ^
10.215.122.208  Up      11.86 GB  42649376116143870288751410571644302377   v  |
10.215.30.47    Up      6.37 GB   81374929113514034361049243620869663203   |  ^
10.208.246.160  Up      5.15 GB   101279862673517536112907910111793343978  |--|

I am running the 0.5 release of Cassandra (at commit 44e8c2e...).

Here are some of my configuration options, from the memory, disk, and performance section of storage-conf.xml (I've only included options that I've changed from the defaults):

<Partitioner>org.apache.cassandra.dht.RandomPartitioner</Partitioner>
<ReplicationFactor>3</ReplicationFactor>
<SlicedBufferSizeInKB>512</SlicedBufferSizeInKB>
<FlushDataBufferSizeInMB>64</FlushDataBufferSizeInMB>
<FlushIndexBufferSizeInMB>16</FlushIndexBufferSizeInMB>
<ColumnIndexSizeInKB>64</ColumnIndexSizeInKB>
<MemtableSizeInMB>16</MemtableSizeInMB>
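For reference, the file-descriptor fix Stephen describes is a single line appended to cassandra.in.sh, so the limit is raised in the shell before the JVM starts:

    # raise the open-file limit for the Cassandra process
    ulimit -n 32768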
Re: OOM Exception
PS: If this turns out to actually be the problem, I'll open a ticket for it.

Thanks,
Stu

-Original Message-
From: Stu Hood stuart.h...@rackspace.com
Sent: Sunday, December 13, 2009 12:28pm
To: cassandra-user@incubator.apache.org
Subject: Re: OOM Exception

With 248G per box, you probably have slightly more than 1/2 billion items? One current implementation detail in Cassandra is that it loads 1/128th of the index into memory for faster lookups. This means you might have something like 4.5 million keys in memory at the moment.

The '128' value is a constant at SSTable.INDEX_INTERVAL. You should be able to recompile with '1024' to allow for an 8 times larger database, but understand that this will have a negative effect on your read performance.

Thanks,
Stu

-Original Message-
From: Dan Di Spaltro dan.dispal...@gmail.com
Sent: Sunday, December 13, 2009 12:06pm
To: cassandra-user@incubator.apache.org
Subject: Re: OOM Exception

What consistencyLevel are you inserting the elements at? If you do ./bin/nodeprobe -host localhost tpstats on each machine, do you see one metric that has a lot of pending items?

On Sun, Dec 13, 2009 at 8:14 AM, Brian Burruss bburr...@real.com wrote:

Another OOM exception. The only thing interesting about my testing is that there are 2 servers, RF=2, W=1, R=1... there is 248G of data on each server. I have -Xmx3G assigned to each server.

2009-12-12 22:04:37,436 ERROR [pool-1-thread-309] [Cassandra.java:734] Internal error processing get
java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space
at org.apache.cassandra.service.StorageProxy.weakReadLocal(StorageProxy.java:523)
at org.apache.cassandra.service.StorageProxy.readProtocol(StorageProxy.java:373)
at org.apache.cassandra.service.CassandraServer.readColumnFamily(CassandraServer.java:92)
at org.apache.cassandra.service.CassandraServer.multigetColumns(CassandraServer.java:265)
at org.apache.cassandra.service.CassandraServer.multigetInternal(CassandraServer.java:320)
at org.apache.cassandra.service.CassandraServer.get(CassandraServer.java:253)
at org.apache.cassandra.service.Cassandra$Processor$get.process(Cassandra.java:724)
at org.apache.cassandra.service.Cassandra$Processor.process(Cassandra.java:712)
at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:253)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:636)

From: Brian Burruss
Sent: Saturday, December 12, 2009 7:45 AM
To: cassandra-user@incubator.apache.org
Subject: OOM Exception

This happened after cassandra was running for a couple of days. I have -Xmx3G on the JVM. Is there any other info you need so this makes sense? thx!

2009-12-11 21:38:37,216 ERROR [HINTED-HANDOFF-POOL:1] [DebuggableThreadPoolExecutor.java:157] Error in ThreadPoolExecutor
java.lang.OutOfMemoryError: Java heap space
at org.apache.cassandra.io.BufferedRandomAccessFile.<init>(BufferedRandomAccessFile.java:151)
at org.apache.cassandra.io.BufferedRandomAccessFile.<init>(BufferedRandomAccessFile.java:144)
at org.apache.cassandra.io.SSTableWriter.<init>(SSTableWriter.java:53)
at org.apache.cassandra.db.ColumnFamilyStore.doFileCompaction(ColumnFamilyStore.java:911)
at org.apache.cassandra.db.ColumnFamilyStore.doFileCompaction(ColumnFamilyStore.java:855)
at org.apache.cassandra.db.ColumnFamilyStore.doMajorCompactionInternal(ColumnFamilyStore.java:698)
at org.apache.cassandra.db.ColumnFamilyStore.doMajorCompaction(ColumnFamilyStore.java:670)
at org.apache.cassandra.db.HintedHandOffManager.deliverAllHints(HintedHandOffManager.java:190)
at org.apache.cassandra.db.HintedHandOffManager.access$000(HintedHandOffManager.java:75)
at org.apache.cassandra.db.HintedHandOffManager$1.run(HintedHandOffManager.java:249)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:636)

--
Dan Di Spaltro
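The recompile Stu suggests is a one-constant change; sketched below against the 0.5-era source (the exact class holding INDEX_INTERVAL may differ between versions):

    // org.apache.cassandra.io.SSTable -- sample every Nth key into the in-memory index.
    // Raising 128 to 1024 shrinks the index sample ~8x at the cost of slower key lookups.
    public static final int INDEX_INTERVAL = 1024; // default is 128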
Re: quorum / hinted handoff
You need a quorum relative to your replication factor. You mentioned in the first e-mail that you have RF=2, so you need a quorum of 2. If you use RF=3, then you need a quorum of 2 as well.

-Original Message-
From: B. Todd Burruss bburr...@real.com
Sent: Friday, November 20, 2009 4:14pm
To: cassandra-user@incubator.apache.org
Subject: Re: quorum / hinted handoff

Not really. It seems that if I start with 3 nodes and remove 1 of them, I should still have a quorum, which is 2. This is not what I experience.

On Fri, 2009-11-20 at 16:03 -0600, Jonathan Ellis wrote:

Oh, okay. Then it's working as expected. Does it make more sense to you now? :)

-Jonathan

On Fri, Nov 20, 2009 at 3:43 PM, B. Todd Burruss bburr...@real.com wrote:

This was on the build I got yesterday, 882359. ... and you are correct about if you start with 2 nodes and take one down - there isn't a quorum and the write/read fails. I tested that as well. thx!

On Fri, 2009-11-20 at 15:30 -0600, Jonathan Ellis wrote:

On Fri, Nov 20, 2009 at 11:31 AM, B. Todd Burruss bburr...@real.com wrote:

One more point on this: if I only start a cluster with 2 nodes, and I use the same config setup (RF=2, etc.), it works fine. It's only when I start with the 3 nodes and remove 1. In fact, I remove the node before I do any reads or writes at all, on a completely fresh database.

That sounds like a bug. If you have 2 nodes, RF of 2, and take one node down, then quorum anything should always fail. Is this on trunk still?

-Jonathan
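To make Stu's arithmetic concrete: quorum is computed from the replication factor, not from the cluster size. A minimal illustration (the helper name is made up):

    // quorum = floor(RF / 2) + 1
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }
    // RF=2 -> quorum of 2: both replicas must be up, so losing either fails quorum ops
    // RF=3 -> quorum of 2: one replica can be down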
Re: bandwidth limiting Cassandra's replication and access control
Hey Ted,

Would you mind creating a ticket for this issue in JIRA? A lot of discussion has gone on, and a place to collect the design and feedback would be a good start.

Thanks,
Stu

-Original Message-
From: Ted Zlatanov t...@lifelogs.com
Sent: Wednesday, November 11, 2009 3:28pm
To: cassandra-user@incubator.apache.org
Cc: cassandra-...@incubator.apache.org
Subject: Re: bandwidth limiting Cassandra's replication and access control

On Wed, 11 Nov 2009 07:40:00 -0800 Coe, Robin robin@bluecoat.com wrote:

CR> Just going to chime in here, because I have experience writing apps that use JAAS and JNDI to authenticate against LDAP and JDBC services. However, I only just started looking at Cassandra this week, so I'm not certain of the premise behind controlling access to the Cassandra service.

CR> IMO, auth services should be left to the application layer that interfaces to Cassandra and not built into Cassandra. In the tutorial snippet included below, the access being granted is at the codebase level, not the transaction level. Since users of Cassandra will generally be fronted by a service layer, the java security manager isn't going to suffice. What this snippet could do, though, and may be the rationale for the request, is to ensure that unauthorized users cannot instantiate a new Cassandra server. However, if a user has physical access to the machine on which Cassandra is installed, they could easily bypass that layer of security.

CR> So, I guess I'm wondering whether this discussion pertains to application-layer security, i.e., permission to execute Thrift transactions, or Cassandra service security? Or is it strictly a utility function, to create a map of users to specific Keyspaces, to simplify the Thrift API?

(note followups to the devel list)

I mentioned I didn't know JAAS, so I appreciate any help you can give. Specifically, I don't know yet what the difference is between the codebase level and the transaction level in JAAS terms. Can you explain?

I am interested in controlling the Thrift client API, not the Gossip replication service. The authenticating clients will not have physical access to the machine, and all the authentication tokens will have to be passed over a Thrift login call. How would you use JAAS+JNDI to control that? The access point is CassandraServer.java, as Jonathan mentioned.

Ted