Re: datastax community ami -- broken? --- datastax-agent conflicts with opscenter-agent
Hi Joaquin, A quick word of praise - addressing the issue so quickly reflects really well on DataStax. cheers

On Sat, Dec 7, 2013 at 8:14 AM, Joaquin Casares joaq...@datastax.com wrote: Hello again John, The AMI has been patched and tested for both DSE and C* and works for the standard 3 node test. The new code has been pushed to the 2.4 branch so launching a new set of instances will give you an updated AMI. You should now have the newest version of OpsCenter installed, along with the new DataStax Agents (that replace the OpsCenter Agents). Also, I've patched the two bugs for the motd and for allowing the other nodes to join. The issue came from a new release of nodetool that contained some unexpected text that choked up the AMI code as it waited for nodes to come online. Let me know if you see any further issues. Thanks, Joaquin Casares DataStax Software Engineer in Test http://www.datastax.com/what-we-offer/products-services/training/virtual-training

On Fri, Dec 6, 2013 at 2:02 PM, Joaquin Casares joaq...@datastax.com wrote: Hey John, Thanks for letting us know. I'm also seeing that the motd gets stuck, but if I ctrl-c during the message and try a `nodetool status` there doesn't appear to be an issue. I'm currently investigating why it's getting stuck. Are you seeing something similar? What happens if you try to run a `sudo service cassandra restart`? Could you send me your /var/log/cassandra/system.log if this still fails? Also, I realize now that the package name for the newest version of OpsCenter changed from opscenter-free to opscenter. I committed that change to our dev AMI and am testing it now. Once this change is made you will no longer have to install agents via OpsCenter since they should already be on the system. That being said, you won't hit the current OpsCenter/DataStax Agent version mismatch you've been hitting. Also, we currently only have one AMI. Each time an instance is launched the newest version of the code is pulled down from https://github.com/riptano/ComboAMI to ensure the code never gets stale and can easily keep up with DSE/C* releases as well as AMI code fixes. I'll reply again as soon as I figure out and patch this motd issue. Thanks, Joaquin Casares DataStax Software Engineer in Test http://www.datastax.com/what-we-offer/products-services/training/virtual-training

On Fri, Dec 6, 2013 at 7:16 AM, John R. Frank j...@mit.edu wrote: Hi C* experts, In the last 18hrs or so, I have been having trouble getting cassandra instances to launch using the datastax community AMI. Has anyone else seen this? The instance comes up but then cassandra fails to run. The most informative error message that I've seen so far is in the opscenter agent install log (below) --- see especially this line: datastax-agent conflicts with opscenter-agent I have also attached the ami.log I run into these issues when launching with either the browser page: https://aws.amazon.com/amis/datastax-auto-clustering-ami-2-2 Or command line, like this: ec2-run-instances ami-814ec2e8 -t m1.large --region us-east-1 --key buildbot -n $num_instances -d --clustername Foo --totalnodes $num_instances --version community --opscenter yes Is there a newer AMI? Any advice? jrf

Some agent installations failed: - 10.6.133.241: Failure installing agent on 10.6.133.241. Error output: Unable to install the opscenter-agent package. Please check your apt-get configuration as well as the agent install log (/var/log/opscenter-agent/installer.log). Standard output: Removing old opscenter-agent files.
opscenter-agent: unrecognized service
Reading package lists... Building dependency tree... Reading state information...
0 upgraded, 0 newly installed, 0 to remove and 171 not upgraded.
Starting agent installation process for version 3.2.2
Reading package lists... Building dependency tree... Reading state information...
sysstat is already the newest version.
0 upgraded, 0 newly installed, 0 to remove and 171 not upgraded.
Selecting previously unselected package opscenter-agent.
dpkg: regarding .../opscenter-agent.deb containing opscenter-agent:
 datastax-agent conflicts with opscenter-agent
 opscenter-agent (version 3.2.2) is to be installed.
 opscenter-agent provides opscenter-agent and is to be installed.
dpkg: error processing opscenter_agent_setup.vYRzL0Tevn/opscenter-agent.deb (--install):
 conflicting packages - not installing opscenter-agent
Errors were encountered while processing:
 opscenter_agent_setup.vYRzL0Tevn/opscenter-agent.deb
FAILURE: Unable to install the opscenter-agent package. Please check your apt-get configuration as well as the agent install log (/var/log/opscenter-agent/installer.log). Exit code: 1

-- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4
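For anyone who hits the same dpkg conflict before relaunching from the patched AMI, a rough manual workaround is to remove the old agent and install the new one by hand. This is only a sketch: it assumes the DataStax apt repository is already configured on the node and that the replacement package and service are named datastax-agent, as with the newer OpsCenter releases.

  # remove the old agent that the new package conflicts with
  sudo dpkg --remove opscenter-agent
  sudo apt-get update
  # install and start the replacement agent
  sudo apt-get install datastax-agent
  sudo service datastax-agent restart

The cleaner fix, per Joaquin's reply, is simply to launch fresh instances so the updated AMI ships the correct agent in the first place.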
Re: How would you model that?
How about something like using a time-range as the key (e.g an hour depending on your update rate) and a composite (time:user) as the column name cheers On Fri, Nov 8, 2013 at 10:45 PM, Laing, Michael michael.la...@nytimes.comwrote: You could try this: CREATE TABLE user_activity (shard text, user text, ts timeuuid, primary key (shard, ts)); select user, ts from user_activity where shard in ('00', '01', ...) order by ts desc; Grab each user and ts the first time you see that user. Use as many shards as you think you need to control row size and spread the load. Set ttls to expire user_activity entries when you are no longer interested in them. ml On Fri, Nov 8, 2013 at 6:10 AM, pavli...@gmail.com pavli...@gmail.comwrote: Hey guys, I need to retrieve a list of distinct users based on their activity datetime. How can I model a table to store that kind of information? The straightforward decision was this: CREATE TABLE user_activity (user text primary key, ts timeuuid); but it turned out it is impossible to do a select like this: select * from user_activity order by ts; as it fails with ORDER BY is only supported when the partition key is restricted by an EQ or an IN. How would you model the thing? Just need to have a list of users based on their last activity timestamp... Thanks! -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
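A minimal cqlsh sketch of the sharded variant Michael describes. The keyspace name, the 16-way sharding, the sample shard values and the 7-day TTL are illustrative assumptions, not part of the original suggestion:

  cqlsh <<'CQL'
  -- assumes keyspace demo already exists
  CREATE TABLE demo.user_activity (
    shard text,
    user text,
    ts timeuuid,
    PRIMARY KEY (shard, ts)
  );

  -- on each activity event: pick a shard (e.g. hash(user) % 16) and let old entries expire
  INSERT INTO demo.user_activity (shard, user, ts)
  VALUES ('03', 'alice', now()) USING TTL 604800;

  -- reader: fan out across the shards, newest first, keep the first ts seen per user
  SELECT user, ts FROM demo.user_activity
  WHERE shard IN ('00', '01', '02', '03')
  ORDER BY ts DESC;
  CQL

The shard count is the knob for both row size and write distribution; the de-duplication of users still happens client side, as described above.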
Re: Storage management during rapid growth
I can't comment on the technical question, however one thing I learnt with managing the growth of data is that the $/GB of storage tends to drop at a rate that can absorb a moderate proportion of the increase in cost due to the increase in size of data. I'd recommend having a wet-finger-in-the-air stab at projecting the growth in data sizes versus the historical trends in the decrease in cost of storage. cheers

On Fri, Nov 1, 2013 at 7:15 AM, Dave Cowen d...@luciddg.com wrote: Hi, all - I'm currently managing a small Cassandra cluster, several nodes with local SSD storage. It's difficult for us to forecast the growth of the Cassandra data over the next couple of years for various reasons, but it is virtually guaranteed to grow substantially. During this time, there may be times where it is desirable to increase the amount of storage available to each node, but, assuming we are not I/O bound, keep from expanding the cluster horizontally with additional nodes that have local storage. In addition, expanding with local SSDs is costly. My colleagues and I have had several discussions of a couple of other options that don't involve scaling horizontally or adding SSDs:
1) Move to larger, cheaper spinning-platter disks. However, when monitoring the performance of our cluster, we see sustained periods - especially during repair/compaction/cleanup - of several hours where there are 2000 IOPS. It will be hard to get to that level of performance in each node with spinning platter disks, and we'd prefer not to take that kind of performance hit during maintenance operations.
2) Move some nodes to a SAN solution, ensuring that there is a mix of storage, drives, LUNs and RAIDs so that there isn't a single point of failure. While we're aware that this is frowned on in the Cassandra community due to Cassandra's design, a SAN seems like the obvious way of being able to quickly add storage to a cluster without having to juggle local drives, and provides a level of performance between local spinning platter drives and local SSDs.
So, the questions:
1) Has anyone moved from SSDs to spinning-platter disks, or managed a cluster that contained both? Do the numbers we're seeing exaggerate the performance hit we'd see if we moved to spinners?
2) Have you successfully used a SAN or a hybrid SAN solution (some local, some SAN-based) to dynamically add storage to the cluster? What type of SAN have you used, and what issues have you run into?
3) Am I missing a way of economically scaling storage?
Thanks for any insight. Dave
-- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
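To make that trade-off concrete, here is a toy projection in the spirit of the wet-finger-in-the-air stab. Every number below is a made-up assumption (starting size, growth rate, price decline); only the shape of the comparison matters:

  awk 'BEGIN {
    data_tb = 10; cost_gb = 0.50                  # assumed: 10 TB today at $0.50/GB
    for (y = 0; y <= 3; y++) {
      printf "year %d: %6.1f TB  *  $%.2f/GB  ~= $%.0f\n", y, data_tb, cost_gb, data_tb * 1024 * cost_gb
      data_tb *= 1.8                              # assumed: data grows ~80% per year
      cost_gb *= 0.75                             # assumed: $/GB falls ~25% per year
    }
  }'

With those made-up rates the yearly spend still rises, but much more slowly than the raw data volume, which is the point being made above.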
Re: Recommended hardware
Far from an expert opinion, however one configuration I have seen talked about is 3 x m1.xlarge in AWS. I have tested 4 x m1.xlarge and 4 x m1.large. The m1.xlarge was fine for our tests (we were hitting it pretty hard), the m1.large was erratic - from that I took away that you either need to give Cassandra sufficient resources or know how to tune it properly (I don't). cheers

On Tue, Sep 24, 2013 at 2:17 AM, Tim Dunphy bluethu...@gmail.com wrote: Hello, I am running Cassandra 2.0 on 2GB of memory and a 10GB HD in a virtual cloud environment. It's supporting a php application running on the same node. Mostly this instance runs smoothly but runs low on memory. Depending on how much the site is used, the VM will swap out sometimes excessively. I realize this setup may not be enough to support a cassandra instance. I was wondering if there were any recommended hardware specs someone could point me to for both physical and virtual (cloud) type environments. Thank you, Tim Sent from my iPhone
-- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
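For the specific 2GB node that is swapping, the usual first steps are to stop the box from swapping at all and to pin the JVM heap to something the machine can actually give it. A rough sketch; the heap numbers are illustrative for a 2GB host, not a recommendation:

  # see how much memory and swap are really in use
  free -m
  # Cassandra behaves badly once the JVM starts swapping; turn swap off
  sudo swapoff -a          # and remove the swap entry from /etc/fstab to make it permanent
  # cap the heap in conf/cassandra-env.sh instead of letting it auto-size, e.g.
  #   MAX_HEAP_SIZE="1G"
  #   HEAP_NEWSIZE="200M"
  sudo service cassandra restart

None of this changes the underlying point of the replies: 2GB shared with a PHP application is below what Cassandra is normally given, so tuning only buys a little headroom.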
Re: cassandra just gone..no heap dump, no log info
A random guess - possibly an OOM (Out of Memory) kill, where Linux will kill a process to recover memory when it is desperately low on memory. Have a look in either your syslog output or the output of dmesg. cheers

On Wed, Sep 18, 2013 at 10:21 PM, Hiller, Dean dean.hil...@nrel.gov wrote: Anyone know how to debug cassandra processes just exiting? There is no info in the cassandra logs and there is no heap dump file (which in the past has shown up in the /opt/cassandra/bin directory for me). This occurs when running a map/reduce job that puts severe load on the system. The logs look completely fine. I find it odd:
1. No logs of why it exited at all
2. No heap dump, which would imply there would be no logs as it crashed
Is there any other way a process can die and linux would log it somehow? (like running out of memory) Thanks, Dean
-- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
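A quick way to check that guess on the node that died. The paths assume a Debian-style layout; on RHEL/CentOS the log is /var/log/messages:

  # the kernel OOM killer logs the kill in the kernel ring buffer and in syslog
  dmesg | grep -i -E 'out of memory|killed process'
  sudo grep -i -E 'out of memory|killed process' /var/log/syslog

If the java process shows up there, the JVM never got a chance to write a heap dump or log anything, which matches the symptoms Dean describes.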
Question about 'duplicate' columns
I've been thinking through some cases that I can see happening at some point and thought I'd ask on the list to see if my understanding is correct. Say a bunch of columns have been loaded 'a long time ago', i.e. long enough in the past that they have been compacted. My understanding is that if some of these columns get reloaded then they are likely to sit in additional sstables until the larger sstable is called up for compaction, which might be a while. The case that springs to mind is filling small gaps in data by doing bulk loads around the gap to make sure that the gap is filled. Have I understood correctly? thanks -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
Re: Question about 'duplicate' columns
On Tue, Aug 6, 2013 at 6:10 PM, Aaron Morton aa...@thelastpickle.com wrote: Yes. If you overwrite much older data with new data both versions of the column will remain on disk until compaction gets to work on both fragments of the row. thanks Cheers - Aaron Morton Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com

On 6/08/2013, at 6:48 PM, Franc Carter franc.car...@sirca.org.au wrote: I've been thinking through some cases that I can see happening at some point and thought I'd ask on the list to see if my understanding is correct. Say a bunch of columns have been loaded 'a long time ago', i.e. long enough in the past that they have been compacted. My understanding is that if some of these columns get reloaded then they are likely to sit in additional sstables until the larger sstable is called up for compaction, which might be a while. The case that springs to mind is filling small gaps in data by doing bulk loads around the gap to make sure that the gap is filled. Have I understood correctly? thanks
-- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
-- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
Re: schema management
On Wed, Jul 3, 2013 at 2:06 AM, Silas Smith silas.sm...@gmail.com wrote: Franc, We manage our schema through the Astyanax driver. It runs in a listener at application startup. We read a self-defined schema version, update the schema if needed based on the version number, and then write the new schema version number. There is a chance two or more app servers will try to update the schema at the same time but in our testing we haven't seen any problems result from this even when we forced many servers to all update the schema with many different updates at the same time. And besides we typically do a rolling restart anyway. Todd, Mutagen Cassandra looks pretty similar to what we're doing, but is perhaps a bit more elegant. Will take a look at that now :) Cheers Thanks all, I'll likely stick to cassandra-cli scripts for this project and then look in to Cassandra-Mutagen cheers On Mon, Jul 1, 2013 at 5:55 PM, Franc Carter franc.car...@sirca.org.auwrote: On Tue, Jul 2, 2013 at 10:33 AM, Todd Fast t...@digitalexistence.comwrote: Franc-- I think you will find Mutagen Cassandra very interesting; it is similar to schema management tools like Flyway for SQL databases: Oops - forgot to mention in my original email that we will be looking into Mutagen Cassandra in the medium term. I'm after something with a low barrier to entry initially as we are quite time constrained. cheers Mutagen Cassandra is a framework (based on Mutagen) that provides schema versioning and mutation for Apache Cassandra. Mutagen is a lightweight framework for applying versioned changes (known as mutations) to a resource, in this case a Cassandra schema. Mutagen takes into account the resource's existing state and only applies changes that haven't yet been applied. Schema mutation with Mutagen helps you make manageable changes to the schema of live Cassandra instances as you update your software, and is especially useful when used across development, test, staging, and production environments to automatically keep schemas in sync. https://github.com/toddfast/mutagen-cassandra Todd On Mon, Jul 1, 2013 at 5:23 PM, sankalp kohli kohlisank...@gmail.comwrote: You can generate schema through the code. That is also one option. On Mon, Jul 1, 2013 at 4:10 PM, Franc Carter franc.car...@sirca.org.au wrote: Hi, I've been giving some thought to the way we deploy schemas and am looking for something better than out current approach, which is to use cassandra-cli scripts. What do people use for this ? cheers -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
schema management
Hi, I've been giving some thought to the way we deploy schemas and am looking for something better than our current approach, which is to use cassandra-cli scripts. What do people use for this? cheers -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
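For reference, the cassandra-cli approach being replaced is usually just a versioned script applied against one node. A minimal sketch; the keyspace, column family and replication settings are placeholders:

  # schema.cli lives in version control and is applied with cassandra-cli -f
  cat > schema.cli <<'EOF'
  create keyspace app
    with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
    and strategy_options = {replication_factor:3};
  use app;
  create column family events
    with comparator = UTF8Type
    and key_validation_class = UTF8Type
    and default_validation_class = UTF8Type;
  EOF
  cassandra-cli -h 127.0.0.1 -f schema.cli

The weakness the rest of the thread gets at is that a flat script like this has no notion of versions or incremental changes, which is what Mutagen Cassandra adds.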
Re: schema management
On Tue, Jul 2, 2013 at 10:33 AM, Todd Fast t...@digitalexistence.comwrote: Franc-- I think you will find Mutagen Cassandra very interesting; it is similar to schema management tools like Flyway for SQL databases: Oops - forgot to mention in my original email that we will be looking into Mutagen Cassandra in the medium term. I'm after something with a low barrier to entry initially as we are quite time constrained. cheers Mutagen Cassandra is a framework (based on Mutagen) that provides schema versioning and mutation for Apache Cassandra. Mutagen is a lightweight framework for applying versioned changes (known as mutations) to a resource, in this case a Cassandra schema. Mutagen takes into account the resource's existing state and only applies changes that haven't yet been applied. Schema mutation with Mutagen helps you make manageable changes to the schema of live Cassandra instances as you update your software, and is especially useful when used across development, test, staging, and production environments to automatically keep schemas in sync. https://github.com/toddfast/mutagen-cassandra Todd On Mon, Jul 1, 2013 at 5:23 PM, sankalp kohli kohlisank...@gmail.comwrote: You can generate schema through the code. That is also one option. On Mon, Jul 1, 2013 at 4:10 PM, Franc Carter franc.car...@sirca.org.auwrote: Hi, I've been giving some thought to the way we deploy schemas and am looking for something better than out current approach, which is to use cassandra-cli scripts. What do people use for this ? cheers -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
Re: crashed while running repair
On Sat, Jun 22, 2013 at 11:21 AM, sankalp kohli kohlisank...@gmail.comwrote: Looks like memory map failed. In a 64 bit system, you should have unlimited virtual memory but Linux has a limit on the number of maps. Looks at these two places. http://stackoverflow.com/questions/8892143/error-when-opening-a-lucene-index-map-failed https://blog.kumina.nl/2011/04/cassandra-java-io-ioerror-java-io-ioexception-map-failed/ That sounds very plausible, I have a CF with a very large number of files as I used the default sstable_size_in_mb, I'm following another thread on how to recover from that. cheers On Fri, Jun 21, 2013 at 3:22 PM, Franc Carter franc.car...@sirca.org.auwrote: Hi, I am experimenting with Cassandra-1.2.4, and got a crash while running repair. The nodes has 24GB of ram with an 8GB heap. Any ideas on my I may have missed in the config ? Log is below ERROR [Thread-136019] 2013-06-22 06:30:05,861 CassandraDaemon.java (line 174) Exception in thread Thread[Thread-136019,5,main] FSReadError in /var/lib/cassandra/data/cut3/Price/cut3-Price-ib-44369-Index.db at org.apache.cassandra.io.util.MmappedSegmentedFile$Builder.createSegments(MmappedSegmentedFile.java:200) at org.apache.cassandra.io.util.MmappedSegmentedFile$Builder.complete(MmappedSegmentedFile.java:168) at org.apache.cassandra.io.sstable.SSTableWriter.closeAndOpenReader(SSTableWriter.java:340) at org.apache.cassandra.io.sstable.SSTableWriter.closeAndOpenReader(SSTableWriter.java:319) at org.apache.cassandra.streaming.IncomingStreamReader.streamIn(IncomingStreamReader.java:194) at org.apache.cassandra.streaming.IncomingStreamReader.read(IncomingStreamReader.java:122) at org.apache.cassandra.net.IncomingTcpConnection.stream(IncomingTcpConnection.java:238) at org.apache.cassandra.net.IncomingTcpConnection.handleStream(IncomingTcpConnection.java:178) at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:78) Caused by: java.io.IOException: Map failed at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:748) at org.apache.cassandra.io.util.MmappedSegmentedFile$Builder.createSegments(MmappedSegmentedFile.java:192) ... 8 more Caused by: java.lang.OutOfMemoryError: Map failed at sun.nio.ch.FileChannelImpl.map0(Native Method) at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:745) ... 9 more ERROR [Thread-136019] 2013-06-22 06:30:05,865 FileUtils.java (line 375) Stopping gossiper thanks -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
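A quick way to confirm the map-count theory from those links, and the standard knob to turn if it is the cause. The 1048575 value is just a commonly used large setting, not a tuned number:

  # how many memory-mapped regions the Cassandra process already holds
  sudo wc -l /proc/$(pgrep -f CassandraDaemon)/maps
  # the kernel limit those maps run into
  sysctl vm.max_map_count                    # 65530 by default on most distros
  # raise it, and persist the change across reboots
  sudo sysctl -w vm.max_map_count=1048575
  echo 'vm.max_map_count = 1048575' | sudo tee -a /etc/sysctl.conf

The longer-term fix, as discussed in the related thread on sstable sizes, is a larger sstable_size_in_mb so there are far fewer files to map in the first place.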
Re: Updated sstable size for LCS, ran upgradesstables, file sizes didn't change
On Sat, Jun 22, 2013 at 10:42 AM, Wei Zhu wz1...@yahoo.com wrote: I think the new SSTable will be in the new size. In order to do that, you need to trigger a compaction so that the new SSTables will be generated. for LCS, there is no major compaction though. You can run a nodetool repair and hopefully you will bring some new SSTables and compactions will kick in. Or you can change the $CFName.json file under your data directory and move every SSTable to level 0. You need to stop your node, write a simple script to alter that file and start the node again. I think it will be helpful to have a nodetool command to change the SSTable Size and trigger the rebuild of the SSTables. I'd find that useful as well cheers Thanks. -Wei -- *From: *Robert Coli rc...@eventbrite.com *To: *user@cassandra.apache.org *Sent: *Friday, June 21, 2013 4:51:29 PM *Subject: *Re: Updated sstable size for LCS, ran upgradesstables, file sizes didn't change On Fri, Jun 21, 2013 at 4:40 PM, Andrew Bialecki andrew.biale...@gmail.com wrote: However when we run alter the column family and then run nodetool upgradesstables -a keyspace columnfamily, the files in the data directory have been re-written, but the file sizes are the same. Is this the expected behavior? If not, what's the right way to upgrade them. If this is expected, how can we benchmark the read/write performance with varying sstable sizes. It is expected, upgradesstables/scrub/clean compactions work on a single sstable at a time, they are not capable of combining or splitting them. In theory you could probably : 1) start out with the largest size you want to test 2) stop your node 3) use sstable_split [1] to split sstables 4) start node, test 5) repeat 2-4 I am not sure if there is anything about level compaction which makes this infeasible. =Rob [1] https://github.com/pcmanus/cassandra/tree/sstable_split -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
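A very rough sketch of Wei's manifest trick for a single column family. It assumes the 1.2-era leveled manifest keeps its usual generations/members JSON layout and that jq is installed; <keyspace> and <cf> are placeholders, and the backup matters because this is untested advice, not a supported procedure:

  sudo service cassandra stop
  cd /var/lib/cassandra/data/<keyspace>/<cf>
  sudo cp <cf>.json <cf>.json.bak
  # move every sstable back to level 0 so LCS re-levels (and re-sizes) everything
  sudo sh -c "jq '{generations: [{generation: 0, members: [.generations[].members[]]}]}' <cf>.json.bak > <cf>.json"
  sudo service cassandra start

Expect a long burst of compaction afterwards, since the node effectively re-levels the whole column family.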
Re: Compaction not running
On Fri, Jun 21, 2013 at 6:16 PM, aaron morton aa...@thelastpickle.comwrote: Do you think it's worth posting an issue, or not enough traceable evidence ? If you can reproduce it then certainly file a bug. I'll keep my eye on it to see if it happens again and there is a pattern cheers Cheers - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com On 20/06/2013, at 9:41 PM, Franc Carter franc.car...@sirca.org.au wrote: On Thu, Jun 20, 2013 at 7:27 PM, aaron morton aa...@thelastpickle.comwrote: nodetool compactionstats, gives pending tasks: 13120 If there are no errors in the log, I would say this is a bug. This happened after the node ran out of file descriptors, so an edge case wouldn't surprise me. I've rebuilt the node (blown the data way and am running a nodetool rebuild). Do you think it's worth posting an issue, or not enough traceable evidence ? cheers Cheers - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com On 19/06/2013, at 11:41 AM, Franc Carter franc.car...@sirca.org.au wrote: On Wed, Jun 19, 2013 at 9:34 AM, Bryan Talbot btal...@aeriagames.comwrote: Manual compaction for LCS doesn't really do much. It certainly doesn't compact all those little files into bigger files. What makes you think that compactions are not occurring? Yeah, that's what I thought, however:- nodetool compactionstats, gives pending tasks: 13120 Active compaction remaining time :n/a when I run nodetool compact in a loop the pending tasks goes down gradually. This node also has vastly higher latencies (x10) than the other nodes. I saw this with a previous CF than I 'manually compacted', and when the pending tasks reached low numbers (stuck on 9) then latencies were back to low milliseconds cheers -Bryan On Tue, Jun 18, 2013 at 3:59 PM, Franc Carter franc.car...@sirca.org.au wrote: On Sat, Jun 15, 2013 at 11:49 AM, Franc Carter franc.car...@sirca.org.au wrote: On Sat, Jun 15, 2013 at 8:48 AM, Robert Coli rc...@eventbrite.comwrote: On Wed, Jun 12, 2013 at 3:26 PM, Franc Carter franc.car...@sirca.org.au wrote: We are running a test system with Leveled compaction on Cassandra-1.2.4. While doing an initial load of the data one of the nodes ran out of file descriptors and since then it hasn't been automatically compacting. You have (at least) two options : 1) increase file descriptors available to Cassandra with ulimit, if possible 2) increase the size of your sstables with levelled compaction, such that you have fewer of them Oops, I wasn't clear enough. I have increased the number of file descriptors and no longer have a file descriptor issue. However the node still doesn't compact automatically. If I run a 'nodetool compact' it will do a small amount of compaction and then stop. 
The Column Family is using LCS Any ideas on this - compaction is still not automatically running for one of my nodes thanks cheers =Rob -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
crashed while running repair
Hi, I am experimenting with Cassandra-1.2.4, and got a crash while running repair. The nodes have 24GB of RAM with an 8GB heap. Any ideas on what I may have missed in the config? Log is below
ERROR [Thread-136019] 2013-06-22 06:30:05,861 CassandraDaemon.java (line 174) Exception in thread Thread[Thread-136019,5,main]
FSReadError in /var/lib/cassandra/data/cut3/Price/cut3-Price-ib-44369-Index.db
at org.apache.cassandra.io.util.MmappedSegmentedFile$Builder.createSegments(MmappedSegmentedFile.java:200)
at org.apache.cassandra.io.util.MmappedSegmentedFile$Builder.complete(MmappedSegmentedFile.java:168)
at org.apache.cassandra.io.sstable.SSTableWriter.closeAndOpenReader(SSTableWriter.java:340)
at org.apache.cassandra.io.sstable.SSTableWriter.closeAndOpenReader(SSTableWriter.java:319)
at org.apache.cassandra.streaming.IncomingStreamReader.streamIn(IncomingStreamReader.java:194)
at org.apache.cassandra.streaming.IncomingStreamReader.read(IncomingStreamReader.java:122)
at org.apache.cassandra.net.IncomingTcpConnection.stream(IncomingTcpConnection.java:238)
at org.apache.cassandra.net.IncomingTcpConnection.handleStream(IncomingTcpConnection.java:178)
at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:78)
Caused by: java.io.IOException: Map failed
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:748)
at org.apache.cassandra.io.util.MmappedSegmentedFile$Builder.createSegments(MmappedSegmentedFile.java:192)
... 8 more
Caused by: java.lang.OutOfMemoryError: Map failed
at sun.nio.ch.FileChannelImpl.map0(Native Method)
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:745)
... 9 more
ERROR [Thread-136019] 2013-06-22 06:30:05,865 FileUtils.java (line 375) Stopping gossiper
thanks -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
Re: Compaction not running
On Thu, Jun 20, 2013 at 7:27 PM, aaron morton aa...@thelastpickle.comwrote: nodetool compactionstats, gives pending tasks: 13120 If there are no errors in the log, I would say this is a bug. This happened after the node ran out of file descriptors, so an edge case wouldn't surprise me. I've rebuilt the node (blown the data way and am running a nodetool rebuild). Do you think it's worth posting an issue, or not enough traceable evidence ? cheers Cheers - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com On 19/06/2013, at 11:41 AM, Franc Carter franc.car...@sirca.org.au wrote: On Wed, Jun 19, 2013 at 9:34 AM, Bryan Talbot btal...@aeriagames.comwrote: Manual compaction for LCS doesn't really do much. It certainly doesn't compact all those little files into bigger files. What makes you think that compactions are not occurring? Yeah, that's what I thought, however:- nodetool compactionstats, gives pending tasks: 13120 Active compaction remaining time :n/a when I run nodetool compact in a loop the pending tasks goes down gradually. This node also has vastly higher latencies (x10) than the other nodes. I saw this with a previous CF than I 'manually compacted', and when the pending tasks reached low numbers (stuck on 9) then latencies were back to low milliseconds cheers -Bryan On Tue, Jun 18, 2013 at 3:59 PM, Franc Carter franc.car...@sirca.org.auwrote: On Sat, Jun 15, 2013 at 11:49 AM, Franc Carter franc.car...@sirca.org.au wrote: On Sat, Jun 15, 2013 at 8:48 AM, Robert Coli rc...@eventbrite.comwrote: On Wed, Jun 12, 2013 at 3:26 PM, Franc Carter franc.car...@sirca.org.au wrote: We are running a test system with Leveled compaction on Cassandra-1.2.4. While doing an initial load of the data one of the nodes ran out of file descriptors and since then it hasn't been automatically compacting. You have (at least) two options : 1) increase file descriptors available to Cassandra with ulimit, if possible 2) increase the size of your sstables with levelled compaction, such that you have fewer of them Oops, I wasn't clear enough. I have increased the number of file descriptors and no longer have a file descriptor issue. However the node still doesn't compact automatically. If I run a 'nodetool compact' it will do a small amount of compaction and then stop. The Column Family is using LCS Any ideas on this - compaction is still not automatically running for one of my nodes thanks cheers =Rob -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
Re: Performance Difference between Cassandra version
On Thu, Jun 20, 2013 at 9:18 AM, Raihan Jamal jamalrai...@gmail.com wrote: I am trying to see whether there will be any performance difference between Cassandra 1.0.8 vs Cassandra 1.2.2 for reading the data mainly? Has anyone seen any major performance difference? We are part way through a performance comparison between 1.0.9 with Size Tiered Compaction and 1.2.4 with Leveled Compaction - for our use case it looks like a significant performance improvement on the read side. We are finding compaction lags when we do very large bulk loads, but for us this is an initialisation task and that's a reasonable trade-off cheers -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
Re: Large number of files for Leveled Compaction
On Mon, Jun 17, 2013 at 3:37 PM, Franc Carter franc.car...@sirca.org.auwrote: On Mon, Jun 17, 2013 at 3:28 PM, Wei Zhu wz1...@yahoo.com wrote: default value of 5MB is way too small in practice. Too many files in one directory is not a good thing. It's not clear what should be a good number. I have heard people are using 50MB, 75MB, even 100MB. Do your own test o find a right number. Interesting - 50MB is the low end of what people are using - 5MB is a lot lower. I'll try a 50MB set Oops, forgot to ask - is there a way to get Cassandra to rebuild the sstables as bigger once I have updated the column family definition ? thanks cheers -Wei -- *From: *Franc Carter franc.car...@sirca.org.au *To: *user@cassandra.apache.org *Sent: *Sunday, June 16, 2013 10:15:22 PM *Subject: *Re: Large number of files for Leveled Compaction On Mon, Jun 17, 2013 at 2:59 PM, Manoj Mainali mainalima...@gmail.comwrote: Not in the case of LeveledCompaction. Only SizeTieredCompaction merges smaller sstables into large ones. With the LeveledCompaction, the sstables are always of fixed size but they are grouped into different levels. You can refer to this page http://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra on details of how LeveledCompaction works. Yes, but it seems I've misinterpreted that page ;-( I took this paragraph In figure 3, new sstables are added to the first level, L0, and immediately compacted with the sstables in L1 (blue). When L1 fills up, extra sstables are promoted to L2 (violet). Subsequent sstables generated in L1 will be compacted with the sstables in L2 with which they overlap. As more data is added, leveled compaction results in a situation like the one shown in figure 4. to mean that once a level fills up it gets compacted into a higher level cheers Cheers Manoj On Mon, Jun 17, 2013 at 1:54 PM, Franc Carter franc.car...@sirca.org.au wrote: On Mon, Jun 17, 2013 at 2:47 PM, Manoj Mainali mainalima...@gmail.comwrote: With LeveledCompaction, each sstable size is fixed and is defined by sstable_size_in_mb in the compaction configuration of CF definition and default value is 5MB. In you case, you may have not defined your own value, that is why your each sstable is 5MB. And if you dataset is huge, you will see a lot of sstable counts. Ok, seems like I do have (at least) an incomplete understanding. I realise that the minimum size is 5MB, but I thought compaction would merge these into a smaller number of larger sstables ? thanks Cheers Manoj On Fri, Jun 7, 2013 at 1:44 PM, Franc Carter franc.car...@sirca.org.au wrote: Hi, We are trialling Cassandra-1.2(.4) with Leveled compaction as it looks like it may be a win for us. The first step of testing was to push a fairly large slab of data into the Column Family - we did this much faster ( x100) than we would in a production environment. This has left the Column Family with about 140,000 files in the Column Family directory which seems way too high. On two of the nodes the CompactionStats show 2 outstanding tasks and on a third node there are over 13,000 outstanding tasks. However from looking at the log activity it looks like compaction has finished on all nodes. Is this number of files expected/normal ? 
cheers -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
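For completeness, the size change itself is just a CQL alter (the keyspace and table names below are placeholders); the catch, as this thread establishes, is that existing sstables keep their old size until something rewrites them:

  cqlsh <<'CQL'
  ALTER TABLE mykeyspace.mytable
    WITH compaction = {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 50};
  CQL

After that, newly written sstables come out at 50MB, while the old 5MB ones are only rewritten by ongoing compaction, a repair that streams data in, or the level-0 manifest reset mentioned earlier in this archive; running upgradesstables alone leaves the sizes as they are.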
Re: Compaction not running
On Sat, Jun 15, 2013 at 11:49 AM, Franc Carter franc.car...@sirca.org.auwrote: On Sat, Jun 15, 2013 at 8:48 AM, Robert Coli rc...@eventbrite.com wrote: On Wed, Jun 12, 2013 at 3:26 PM, Franc Carter franc.car...@sirca.org.au wrote: We are running a test system with Leveled compaction on Cassandra-1.2.4. While doing an initial load of the data one of the nodes ran out of file descriptors and since then it hasn't been automatically compacting. You have (at least) two options : 1) increase file descriptors available to Cassandra with ulimit, if possible 2) increase the size of your sstables with levelled compaction, such that you have fewer of them Oops, I wasn't clear enough. I have increased the number of file descriptors and no longer have a file descriptor issue. However the node still doesn't compact automatically. If I run a 'nodetool compact' it will do a small amount of compaction and then stop. The Column Family is using LCS Any ideas on this - compaction is still not automatically running for one of my nodes thanks cheers =Rob -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
Re: Compaction not running
On Wed, Jun 19, 2013 at 9:34 AM, Bryan Talbot btal...@aeriagames.comwrote: Manual compaction for LCS doesn't really do much. It certainly doesn't compact all those little files into bigger files. What makes you think that compactions are not occurring? Yeah, that's what I thought, however:- nodetool compactionstats, gives pending tasks: 13120 Active compaction remaining time :n/a when I run nodetool compact in a loop the pending tasks goes down gradually. This node also has vastly higher latencies (x10) than the other nodes. I saw this with a previous CF than I 'manually compacted', and when the pending tasks reached low numbers (stuck on 9) then latencies were back to low milliseconds cheers -Bryan On Tue, Jun 18, 2013 at 3:59 PM, Franc Carter franc.car...@sirca.org.auwrote: On Sat, Jun 15, 2013 at 11:49 AM, Franc Carter franc.car...@sirca.org.au wrote: On Sat, Jun 15, 2013 at 8:48 AM, Robert Coli rc...@eventbrite.comwrote: On Wed, Jun 12, 2013 at 3:26 PM, Franc Carter franc.car...@sirca.org.au wrote: We are running a test system with Leveled compaction on Cassandra-1.2.4. While doing an initial load of the data one of the nodes ran out of file descriptors and since then it hasn't been automatically compacting. You have (at least) two options : 1) increase file descriptors available to Cassandra with ulimit, if possible 2) increase the size of your sstables with levelled compaction, such that you have fewer of them Oops, I wasn't clear enough. I have increased the number of file descriptors and no longer have a file descriptor issue. However the node still doesn't compact automatically. If I run a 'nodetool compact' it will do a small amount of compaction and then stop. The Column Family is using LCS Any ideas on this - compaction is still not automatically running for one of my nodes thanks cheers =Rob -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
Re: Large number of files for Leveled Compaction
On Fri, Jun 7, 2013 at 2:44 PM, Franc Carter franc.car...@sirca.org.au wrote: Hi, We are trialling Cassandra-1.2(.4) with Leveled compaction as it looks like it may be a win for us. The first step of testing was to push a fairly large slab of data into the Column Family - we did this much faster ( x100) than we would in a production environment. This has left the Column Family with about 140,000 files in the Column Family directory which seems way too high. On two of the nodes the CompactionStats show 2 outstanding tasks and on a third node there are over 13,000 outstanding tasks. However from looking at the log activity it looks like compaction has finished on all nodes. Is this number of files expected/normal?

An addendum to this. None of the *Data.db files are bigger than 5MB (including on the nodes that have finished compaction). I'm wondering if I have misunderstood Leveled Compaction, I thought that there should be data files of 50MB and 500MB (the dataset is 190GB) cheers cheers
-- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
-- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
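A rough sanity check on those numbers, assuming the 5MB default sstable size and the 190GB figure mentioned above (whether that figure is per node or cluster-wide changes the result, so treat this purely as an order-of-magnitude check; the data path is a placeholder):

  # how many sstables actually exist vs. total files in the directory
  ls /var/lib/cassandra/data/<keyspace>/<cf>/*-Data.db | wc -l
  ls /var/lib/cassandra/data/<keyspace>/<cf>/ | wc -l
  # at 5 MB per sstable, ~190 GB implies on the order of 40,000 sstables,
  # each made up of several component files (Data, Index, Filter, Statistics, ...)
  echo $(( 190 * 1024 / 5 ))

So a six-figure file count is roughly what the 5MB default produces on a dataset this size; it is an argument for raising sstable_size_in_mb, as the rest of the thread concludes, rather than a sign that compaction is broken.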
Re: Large number of files for Leveled Compaction
On Mon, Jun 17, 2013 at 2:47 PM, Manoj Mainali mainalima...@gmail.com wrote: With LeveledCompaction, each sstable size is fixed and is defined by sstable_size_in_mb in the compaction configuration of the CF definition, and the default value is 5MB. In your case, you may have not defined your own value, that is why each of your sstables is 5MB. And if your dataset is huge, you will see a high sstable count.

Ok, seems like I do have (at least) an incomplete understanding. I realise that the minimum size is 5MB, but I thought compaction would merge these into a smaller number of larger sstables? thanks Cheers Manoj

On Fri, Jun 7, 2013 at 1:44 PM, Franc Carter franc.car...@sirca.org.au wrote: Hi, We are trialling Cassandra-1.2(.4) with Leveled compaction as it looks like it may be a win for us. The first step of testing was to push a fairly large slab of data into the Column Family - we did this much faster ( x100) than we would in a production environment. This has left the Column Family with about 140,000 files in the Column Family directory which seems way too high. On two of the nodes the CompactionStats show 2 outstanding tasks and on a third node there are over 13,000 outstanding tasks. However from looking at the log activity it looks like compaction has finished on all nodes. Is this number of files expected/normal? cheers
-- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
-- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
Re: Large number of files for Leveled Compaction
On Mon, Jun 17, 2013 at 2:59 PM, Manoj Mainali mainalima...@gmail.comwrote: Not in the case of LeveledCompaction. Only SizeTieredCompaction merges smaller sstables into large ones. With the LeveledCompaction, the sstables are always of fixed size but they are grouped into different levels. You can refer to this page http://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra on details of how LeveledCompaction works. Yes, but it seems I've misinterpreted that page ;-( I took this paragraph In figure 3, new sstables are added to the first level, L0, and immediately compacted with the sstables in L1 (blue). When L1 fills up, extra sstables are promoted to L2 (violet). Subsequent sstables generated in L1 will be compacted with the sstables in L2 with which they overlap. As more data is added, leveled compaction results in a situation like the one shown in figure 4. to mean that once a level fills up it gets compacted into a higher level cheers Cheers Manoj On Mon, Jun 17, 2013 at 1:54 PM, Franc Carter franc.car...@sirca.org.auwrote: On Mon, Jun 17, 2013 at 2:47 PM, Manoj Mainali mainalima...@gmail.comwrote: With LeveledCompaction, each sstable size is fixed and is defined by sstable_size_in_mb in the compaction configuration of CF definition and default value is 5MB. In you case, you may have not defined your own value, that is why your each sstable is 5MB. And if you dataset is huge, you will see a lot of sstable counts. Ok, seems like I do have (at least) an incomplete understanding. I realise that the minimum size is 5MB, but I thought compaction would merge these into a smaller number of larger sstables ? thanks Cheers Manoj On Fri, Jun 7, 2013 at 1:44 PM, Franc Carter franc.car...@sirca.org.auwrote: Hi, We are trialling Cassandra-1.2(.4) with Leveled compaction as it looks like it may be a win for us. The first step of testing was to push a fairly large slab of data into the Column Family - we did this much faster ( x100) than we would in a production environment. This has left the Column Family with about 140,000 files in the Column Family directory which seems way too high. On two of the nodes the CompactionStats show 2 outstanding tasks and on a third node there are over 13,000 outstanding tasks. However from looking at the log activity it looks like compaction has finished on all nodes. Is this number of files expected/normal ? cheers -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
Re: Large number of files for Leveled Compaction
On Mon, Jun 17, 2013 at 3:28 PM, Wei Zhu wz1...@yahoo.com wrote: default value of 5MB is way too small in practice. Too many files in one directory is not a good thing. It's not clear what should be a good number. I have heard people are using 50MB, 75MB, even 100MB. Do your own test o find a right number. Interesting - 50MB is the low end of what people are using - 5MB is a lot lower. I'll try a 50MB set cheers -Wei -- *From: *Franc Carter franc.car...@sirca.org.au *To: *user@cassandra.apache.org *Sent: *Sunday, June 16, 2013 10:15:22 PM *Subject: *Re: Large number of files for Leveled Compaction On Mon, Jun 17, 2013 at 2:59 PM, Manoj Mainali mainalima...@gmail.comwrote: Not in the case of LeveledCompaction. Only SizeTieredCompaction merges smaller sstables into large ones. With the LeveledCompaction, the sstables are always of fixed size but they are grouped into different levels. You can refer to this page http://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra on details of how LeveledCompaction works. Yes, but it seems I've misinterpreted that page ;-( I took this paragraph In figure 3, new sstables are added to the first level, L0, and immediately compacted with the sstables in L1 (blue). When L1 fills up, extra sstables are promoted to L2 (violet). Subsequent sstables generated in L1 will be compacted with the sstables in L2 with which they overlap. As more data is added, leveled compaction results in a situation like the one shown in figure 4. to mean that once a level fills up it gets compacted into a higher level cheers Cheers Manoj On Mon, Jun 17, 2013 at 1:54 PM, Franc Carter franc.car...@sirca.org.auwrote: On Mon, Jun 17, 2013 at 2:47 PM, Manoj Mainali mainalima...@gmail.comwrote: With LeveledCompaction, each sstable size is fixed and is defined by sstable_size_in_mb in the compaction configuration of CF definition and default value is 5MB. In you case, you may have not defined your own value, that is why your each sstable is 5MB. And if you dataset is huge, you will see a lot of sstable counts. Ok, seems like I do have (at least) an incomplete understanding. I realise that the minimum size is 5MB, but I thought compaction would merge these into a smaller number of larger sstables ? thanks Cheers Manoj On Fri, Jun 7, 2013 at 1:44 PM, Franc Carter franc.car...@sirca.org.au wrote: Hi, We are trialling Cassandra-1.2(.4) with Leveled compaction as it looks like it may be a win for us. The first step of testing was to push a fairly large slab of data into the Column Family - we did this much faster ( x100) than we would in a production environment. This has left the Column Family with about 140,000 files in the Column Family directory which seems way too high. On two of the nodes the CompactionStats show 2 outstanding tasks and on a third node there are over 13,000 outstanding tasks. However from looking at the log activity it looks like compaction has finished on all nodes. Is this number of files expected/normal ? 
cheers -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
Re: Compaction not running
On Sat, Jun 15, 2013 at 8:48 AM, Robert Coli rc...@eventbrite.com wrote: On Wed, Jun 12, 2013 at 3:26 PM, Franc Carter franc.car...@sirca.org.au wrote: We are running a test system with Leveled compaction on Cassandra-1.2.4. While doing an initial load of the data one of the nodes ran out of file descriptors and since then it hasn't been automatically compacting. You have (at least) two options : 1) increase file descriptors available to Cassandra with ulimit, if possible 2) increase the size of your sstables with levelled compaction, such that you have fewer of them Oops, I wasn't clear enough. I have increased the number of file descriptors and no longer have a file descriptor issue. However the node still doesn't compact automatically. If I run a 'nodetool compact' it will do a small amount of compaction and then stop. The Column Family is using LCS cheers =Rob -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
Compaction not running
Hi, We are running a test system with Leveled compaction on Cassandra-1.2.4. While doing an initial load of the data one of the nodes ran out of file descriptors and since then it hasn't been automatically compacting. Any suggestions on how to fix this ? thanks -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
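A sketch of checking and raising the descriptor limit on the affected node. The 100000 value is illustrative, and depending on how Cassandra was installed the setting may belong in /etc/security/limits.d/cassandra.conf rather than limits.conf:

  # current limit and current usage for the running process
  sudo grep 'open files' /proc/$(pgrep -f CassandraDaemon)/limits
  sudo ls /proc/$(pgrep -f CassandraDaemon)/fd | wc -l
  # raise the limit for the cassandra user and restart
  echo 'cassandra - nofile 100000' | sudo tee -a /etc/security/limits.conf
  sudo service cassandra restart

Raising the limit stops the immediate errors; as the rest of the thread shows, the node may still need further attention (or a rebuild) before compaction catches up again.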
Re: Cassandra (1.2.5) + Pig (0.11.1) Errors with large column families
-- Forwarded message -- From: Mark Lewandowski mark.e.lewandow...@gmail.com Date: Jun 8, 2013 8:03 AM Subject: Cassandra (1.2.5) + Pig (0.11.1) Errors with large column families To: user@cassandra.apache.org Cc:
I'm currently trying to get Cassandra (1.2.5) and Pig (0.11.1) to play nice together. I'm running a basic script:
rows = LOAD 'cassandra://keyspace/colfam' USING CassandraStorage();
dump rows;
This fails for my column family which has ~100,000 rows. However, if I modify the script to this:
rows = LOAD 'cassandra://betable_games/bets' USING CassandraStorage();
rows = limit rows 7000;
dump rows;
Then it seems to work. 7000 is about as high as I've been able to get it before it fails. The error I keep getting is:
2013-06-07 14:58:49,119 [Thread-4] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local_0001
java.lang.RuntimeException: org.apache.thrift.TException: Message length exceeded: 4480
at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$StaticRowIterator.maybeInit(ColumnFamilyRecordReader.java:384)
at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$StaticRowIterator.computeNext(ColumnFamilyRecordReader.java:390)
at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$StaticRowIterator.computeNext(ColumnFamilyRecordReader.java:313)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
at org.apache.cassandra.hadoop.ColumnFamilyRecordReader.getProgress(ColumnFamilyRecordReader.java:103)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.getProgress(PigRecordReader.java:169)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.getProgress(MapTask.java:514)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:539)
at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:214)
Caused by: org.apache.thrift.TException: Message length exceeded: 4480
at org.apache.thrift.protocol.TBinaryProtocol.checkReadLength(TBinaryProtocol.java:393)
at org.apache.thrift.protocol.TBinaryProtocol.readBinary(TBinaryProtocol.java:363)
at org.apache.cassandra.thrift.Column.read(Column.java:535)
at org.apache.cassandra.thrift.ColumnOrSuperColumn.read(ColumnOrSuperColumn.java:507)
at org.apache.cassandra.thrift.KeySlice.read(KeySlice.java:408)
at org.apache.cassandra.thrift.Cassandra$get_range_slices_result.read(Cassandra.java:12905)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at org.apache.cassandra.thrift.Cassandra$Client.recv_get_range_slices(Cassandra.java:734)
at org.apache.cassandra.thrift.Cassandra$Client.get_range_slices(Cassandra.java:718)
at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$StaticRowIterator.maybeInit(ColumnFamilyRecordReader.java:346)
... 13 more
I've seen a similar problem on this mailing list using Cassandra-1.2.3, however the fixes suggested on that thread (increasing thrift_framed_transport_size_in_mb and thrift_max_message_length_in_mb in cassandra.yaml) did not appear to have any effect. Has anyone else seen this issue, and how can I fix it? Thanks, -Mark
Large number of files for Leveled Compaction
Hi, We are trialling Cassandra-1.2(.4) with Leveled compaction as it looks like it may be a win for us. The first step of testing was to push a fairly large slab of data into the Column Family - we did this much faster ( x100) than we would in a production environment. This has left the Column Family with about 140,000 files in the Column Family directory which seems way too high. On two of the nodes the CompactionStats show 2 outstanding tasks and on a third node there are over 13,000 outstanding tasks. However from looking at the log activity it looks like compaction has finished on all nodes. Is this number of files expected/normal ? cheers -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 8355 2514 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
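[Editor's note: as a rough sanity check on the file count, here is a back-of-the-envelope calculation. The numbers are assumptions, not measurements: the 5 MB default sstable_size_in_mb that Cassandra 1.2 used for Leveled compaction (worth double-checking against your exact version), and roughly 8 component files per SSTable.]

    files_on_disk = 140000
    components_per_sstable = 8      # Data, Index, Filter, Statistics, Summary, CompressionInfo, TOC, ... (varies by version)
    sstable_size_mb = 5             # assumed 1.2 default for Leveled compaction
    sstables = files_on_disk / components_per_sstable            # ~17,500 SSTables
    print sstables, sstables * sstable_size_mb / 1024.0          # ~85 GB of data in the CF

If that arithmetic holds, a file count of this order is what Leveled compaction with the small default SSTable size produces rather than a sign of a stuck node, and raising sstable_size_in_mb on the column family is the usual lever for bringing it down.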
Re: two-node cassandra cluster
On Fri, Aug 24, 2012 at 8:25 PM, Jason Axelson ja...@engagestage.com wrote: Hi, I have an application that will be very dormant most of the time but will need high-bursting a few days out of the month. Since we are deploying on EC2 I would like to keep only one Cassandra server up most of the time and then on burst days I want to bring one more server up (with more RAM and CPU than the first) to help serve the load. What is the best way to do this? Should I take a different approach? Some notes about what I plan to do: * Bring the node up and repair it immediately * After the burst time is over decommission the powerful node * Use the always-on server as the seed node * My main question is how to get the nodes to share all the data since I want a replication factor of 2 (so both nodes have all the data) but that won't work while there is only one server. Should I bring up 2 extra servers instead of just one? Thanks, Jason Caveat: I haven't tried what I am about to suggest. Could you run the cluster on smaller instances for most of the time and then, when you need more performance, increase the instance size to get more CPU/Memory? If you use EBS with provisioned IOPs you should be able to make the transition reasonably quickly. cheers -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
Re: Ball is rolling on High Performance Cassandra Cookbook second edition
does not understand what a column family is they will likely use cassandra incorrectly. This is my view as well. One of the big hurdles I noticed with developers moving to Cassandra is that there is a strong tendency to apply RDBMS thinking to Cassandra - this is unsurprising, the majority of data store conceptualisation exists in this framework. I can see using names that have connections with RDBMS is likely to encourage this. cheers Maybe this is just a semantics debate because a table in a column oriented database is different than a table in a row oriented database, but the column family data model is one of the cornerstones of Cassandra. Globally replacing column family with table for the text is not a good idea. We will have to be smart about it. As thrift, the cli, the internals, the high level clients will be like this for some time. I definitely plan to add an entire chapter on CQL. I think we can put it after the CLI chapter, the introduction of CQL can attempt to cover the ground between the old school and the new school thinking. Edward -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
Re: enforcing ordering
On Thu, May 10, 2012 at 9:05 PM, aaron morton aa...@thelastpickle.com wrote: Kewl. I'd be interested to know what you come up with. Hi, it's taken some thought, however we now have a data model that we like and it does indeed make our major concerns non-existent ;-) So I thought I'd explain it in case someone is doing something sufficiently close to us that the model is useful (and in case we are doing something silly - I hope not). The data set consists of 30 years of daily data for several million entities, the data for each is a small number of different record types ( 10) where entity,date,record_type is unique. Each record_type can have a couple of hundred key/value pairs. The query that we need to do is Set_of_Values = Get(set_of_entities, date_range, set_of_keys) Where set_of_keys is likely to be most of the keys that are valid for the entities. One slight complication (the one that sparked my initial question) is that there are also corrections that completely replace the data for an entity,date,record_type, multiple versions of the corrections can be transmitted, but only one correction per entity/day/record_type The data model that we have designed has a single Column Family keyed by the entity with a composite column name consisting of date,version,record_type with the value being a protobuf packing of the key/value pairs from the record. The version is the 'receipt date of the data' - 'date the data is for'. The properties of this that we like are:- * Record insertion is idempotent allowing for multiple active/active order independent loaders, this is a really big win for us(1). * The random partitioner gives us good scalability across the entity dimension which is the largest dimension. * The column ordering makes it easy to find the most recent 'correct' value for an entity on a day. * The column ordering gives us reasonably efficient date range queries There are a couple of implications of this data model:- * We store more data than we have to in the ideal world. * We push the work of decoding/extracting information out of the protobuf on to the clients along with some of the version management. My view is that this is a reasonable trade-off for systems that can have large numbers of clients that are independent of each other as scaling client machines is not hard. Feedback welcome cheers (1) It's important as it allows us to use a large number of loading processes to insert the historical data that is pretty large in a short period of time. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 10/05/2012, at 3:03 PM, Franc Carter wrote: On Tue, May 8, 2012 at 8:21 PM, Franc Carter franc.car...@sirca.org.au wrote: On Tue, May 8, 2012 at 8:09 PM, aaron morton aa...@thelastpickle.com wrote: Can you store the corrections in a separate CF? We sat down and thought about this harder - it looks like a good solution for us that may make other hard problems go away - thanks. cheers Yes, I thought of that, but that turns one read in to two ;-( When the client reads the key, reads from the original and the corrections CF at the same time. Apply the correction only on the client side. When you have confirmed the ingest has completed, run a background job to apply the corrections, store the updated values and delete the correction data. I was thinking down this path, but I ended up chasing the rabbit down a deep hole of race conditions . . . 
cheers Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 8/05/2012, at 9:35 PM, Franc Carter wrote: Hi, I'm wondering if there is a common 'pattern' to address a scenario we will have to deal with. We will be storing a set of Column/Value pairs per Key where the Column/Values are read from a set of files that we download regularly. We need the loading to be resilient and we can receive corrections for some of the Column/Values that can only be loaded after the initial data has been inserted. The challenge we have is that we have a strong preference for active/active loading of data and can't see how to achieve this without some form of serialisation (which Cassandra doesn't support - correct ?) thanks -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW
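[Editor's note: for readers who want to see what a model like the one described above can look like in code, here is a minimal, illustrative pycassa sketch. It is not the poster's actual implementation; the keyspace and column family names are made up, and it assumes a column family created with a CompositeType(AsciiType, IntegerType, AsciiType) comparator for the (date, version, record_type) column names, with the protobuf bytes as the value.]

    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily

    pool = ConnectionPool('HistoryKeyspace', ['localhost:9160'])
    records = ColumnFamily(pool, 'EntityRecords')

    def store(entity, day, version, record_type, packed_record):
        # Idempotent: replaying a file just overwrites the same column, so multiple
        # active/active loaders can insert in any order.
        records.insert(entity, {(day, version, record_type): packed_record})

    def latest_for_day(entity, day):
        # Columns sort by (date, version, record_type); slice out one day and keep
        # the highest version per record_type - the client-side version management.
        best = {}
        for (d, version, rtype), value in records.get(entity, column_start=(day,), column_finish=(day,)).items():
            if rtype not in best or version > best[rtype][0]:
                best[rtype] = (version, value)
        return best

A date-range query is the same slice with different start/finish components, and the protobuf unpacking happens entirely in the client, as described in the thread.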
Re: Query
On Mon, Jun 4, 2012 at 7:36 PM, MOHD ARSHAD SALEEM marshadsal...@tataelxsi.co.in wrote: Hi all, I wanted to know how to read and write data using the Cassandra APIs. Is there any link related to a sample program? I did a Proof of Concept using a Python client - PyCassa ( https://github.com/pycassa/pycassa) which works well cheers Regards Arshad -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
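[Editor's note: for anyone landing on this thread looking for a starting point, a minimal pycassa read/write looks roughly like this; the keyspace, column family and row contents below are made up for illustration.]

    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily

    pool = ConnectionPool('DemoKeyspace', ['localhost:9160'])
    users = ColumnFamily(pool, 'Users')

    # write a row (an insert is also an update in Cassandra)
    users.insert('user1', {'name': 'Alice', 'email': 'alice@example.com'})

    # read the whole row back, or just selected columns
    print users.get('user1')
    print users.get('user1', columns=['email'])

    pool.dispose()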
Re: Number of keyspaces
On Wed, May 23, 2012 at 8:09 PM, aaron morton aa...@thelastpickle.com wrote: We were thinking of doing a major compaction after each year is 'closed off'. Not a terrible idea. Years tend to happen annually, so their growth pattern is well understood. This would mean that compactions for the current year were dealing with a smaller amount of data and hence be faster and have less impact on a day-to-day basis. Older data is compacted into higher tiers / generations so will not be included when compacting new data (background http://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra). That said, there is a chance that at some point the big older files get compacted. i.e. if you get (by default) 4 X 100GB files they will get compacted into 1. I'm a bit nervous about leveled compaction as it's new(ish). It feels a bit like a premature optimisation. Yep, that's certainly possible - it's a habit I tend towards ;-( cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 23/05/2012, at 1:52 PM, Franc Carter wrote: On Wed, May 23, 2012 at 7:42 AM, aaron morton aa...@thelastpickle.com wrote: 1 KS with 24 CF's will use roughly the same resources as 24 KS's with 1 CF. Each CF: * loads the bloom filter for each SSTable * samples the index for each sstable * uses row and key cache * has a current memtable and potentially memtables waiting to flush. * has secondary index CF's I would generally avoid a data model that calls for CF's to be added in response to new entities or new data. Older data will be moved to larger files, and not included in compaction for newer data. We were thinking of doing a major compaction after each year is 'closed off'. This would mean that compactions for the current year were dealing with a smaller amount of data and hence be faster and have less impact on a day-to-day basis. Our query patterns will only infrequently cross year boundaries. Are we being naive ? cheers Hope that helps. - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 23/05/2012, at 3:31 AM, Luís Ferreira wrote: I have 24 keyspaces, each with a column family and am considering changing it to 1 keyspace with 24 CFs. Would this be beneficial? On May 22, 2012, at 12:56 PM, samal wrote: Not ideally, now cass has global memtable tuning. Each cf corresponds to memory in ram. Year wise cf means it will be in read only state for next year, memtable will still consume ram. On 22-May-2012 5:01 PM, Franc Carter franc.car...@sirca.org.au wrote: On Tue, May 22, 2012 at 9:19 PM, aaron morton aa...@thelastpickle.com wrote: It's more the number of CF's than keyspaces. Oh - does increasing the number of Column Families affect performance ? The design we are working on at the moment is considering using a Column Family per year. We were thinking this would isolate compactions to a more manageable size as we don't update previous years. cheers Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 22/05/2012, at 6:58 PM, R. Verlangen wrote: Yes, it does. However there's no real answer what's the limit: it depends on your hardware and cluster configuration. You might even want to search the archives of this mailinglist, I remember this has been asked before. Cheers! 2012/5/21 Luís Ferreira zamith...@gmail.com Hi, Does the number of keyspaces affect the overall cassandra performance? 
Cumprimentos, Luís Ferreira -- With kind regards, Robin Verlangen www.robinverlangen.nl -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 Cumprimentos, Luís Ferreira -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
Re: Number of keyspaces
On Tue, May 22, 2012 at 9:19 PM, aaron morton aa...@thelastpickle.comwrote: It's more the number of CF's than keyspaces. Oh - does increasing the number of Column Families affect performance ? The design we are working on at the moment is considering using a Column Family per year. We were thinking this would isolate compactions to a more manageable size as we don't update previous years. cheers Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 22/05/2012, at 6:58 PM, R. Verlangen wrote: Yes, it does. However there's no real answer what's the limit: it depends on your hardware and cluster configuration. You might even want to search the archives of this mailinglist, I remember this has been asked before. Cheers! 2012/5/21 Luís Ferreira zamith...@gmail.com Hi, Does the number of keyspaces affect the overall cassandra performance? Cumprimentos, Luís Ferreira -- With kind regards, Robin Verlangen www.robinverlangen.nl -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
Re: Number of keyspaces
On Wed, May 23, 2012 at 7:42 AM, aaron morton aa...@thelastpickle.comwrote: 1 KS with 24 CF's will use roughly the same resources as 24 KS's with 1 CF. Each CF: * loads the bloom filter for each SSTable * samples the index for each sstable * uses row and key cache * has a current memtable and potentially memtables waiting to flush. * had secondary index CF's I would generally avoid a data model that calls for CF's to be added in response to new entities or new data. Older data will move moved to larger files, and not included in compaction for newer data. We were thinking of doing a major compaction after each year is 'closed off'. This would mean that compactions for the current year were dealing with a smaller amount of data and hence be faster and have less impact on a day-to-day basis. Our query patterns will only infrequently cross year boundaries. Are we being naive ? cheers Hope that helps. - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 23/05/2012, at 3:31 AM, Luís Ferreira wrote: I have 24 keyspaces, each with a columns family and am considering changing it to 1 keyspace with 24 CFs. Would this be beneficial? On May 22, 2012, at 12:56 PM, samal wrote: Not ideally, now cass has global memtable tuning. Each cf correspond to memory in ram. Year wise cf means it will be in read only state for next year, memtable will still consume ram. On 22-May-2012 5:01 PM, Franc Carter franc.car...@sirca.org.au wrote: On Tue, May 22, 2012 at 9:19 PM, aaron morton aa...@thelastpickle.comwrote: It's more the number of CF's than keyspaces. Oh - does increasing the number of Column Families affect performance ? The design we are working on at the moment is considering using a Column Family per year. We were thinking this would isolate compactions to a more manageable size as we don't update previous years. cheers Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 22/05/2012, at 6:58 PM, R. Verlangen wrote: Yes, it does. However there's no real answer what's the limit: it depends on your hardware and cluster configuration. You might even want to search the archives of this mailinglist, I remember this has been asked before. Cheers! 2012/5/21 Luís Ferreira zamith...@gmail.com Hi, Does the number of keyspaces affect the overall cassandra performance? Cumprimentos, Luís Ferreira -- With kind regards, Robin Verlangen www.robinverlangen.nl -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 Cumprimentos, Luís Ferreira -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
Re: enforcing ordering
On Thu, May 10, 2012 at 9:05 PM, aaron morton aa...@thelastpickle.com wrote: Kewl. I'd be interested to know what you come up with. Sure - I'll post details once we have them nailed down. I suspect that it will be 'obvious in hindsight', I'm still suffering from RDBMS brain - which is interesting because I am not a database guy, but yet I still have these ingrained ways of thinking cheers Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 10/05/2012, at 3:03 PM, Franc Carter wrote: On Tue, May 8, 2012 at 8:21 PM, Franc Carter franc.car...@sirca.org.au wrote: On Tue, May 8, 2012 at 8:09 PM, aaron morton aa...@thelastpickle.com wrote: Can you store the corrections in a separate CF? We sat down and thought about this harder - it looks like a good solution for us that may make other hard problems go away - thanks. cheers Yes, I thought of that, but that turns one read in to two ;-( When the client reads the key, reads from the original and the corrections CF at the same time. Apply the correction only on the client side. When you have confirmed the ingest has completed, run a background job to apply the corrections, store the updated values and delete the correction data. I was thinking down this path, but I ended up chasing the rabbit down a deep hole of race conditions . . . cheers Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 8/05/2012, at 9:35 PM, Franc Carter wrote: Hi, I'm wondering if there is a common 'pattern' to address a scenario we will have to deal with. We will be storing a set of Column/Value pairs per Key where the Column/Values are read from a set of files that we download regularly. We need the loading to be resilient and we can receive corrections for some of the Column/Values that can only be loaded after the initial data has been inserted. The challenge we have is that we have a strong preference for active/active loading of data and can't see how to achieve this without some form of serialisation (which Cassandra doesn't support - correct ?) thanks -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
Re: enforcing ordering
On Tue, May 8, 2012 at 8:21 PM, Franc Carter franc.car...@sirca.org.auwrote: On Tue, May 8, 2012 at 8:09 PM, aaron morton aa...@thelastpickle.comwrote: Can you store the corrections in a separate CF? We sat down and thought about this harder - it looks like a good solution for us that may makel other hard problems go away - thanks. cheers Yes, I thought of that, but that turns on read in to two ;-( When the client reads the key, reads from the original the corrects CF at the same time. Apply the correction only on the client side. When you have confirmed the ingest has completed, run a background jobs to apply the corrections, store the updated values and delete the correction data. I was thinking down this path, but I ended up chasing the rabbit down a deep hole of race conditions . . . cheers Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 8/05/2012, at 9:35 PM, Franc Carter wrote: Hi, I'm wondering if there is a common 'pattern' to address a scenario we will have to deal with. We will be storing a set of Column/Value pairs per Key where the Column/Values are read from a set of files that we download regularly. We need the loading to be resilient and we can receive corrections for some of the Column/Values that can only be loaded after the initial data has been inserted. The challenge we have is that we have a strong preference for active/active loading of data and can't see how to achieve this without some form of serialisation (which Cassandra doesn't support - correct ?) thanks -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
enforcing ordering
Hi, I'm wondering if there is a common 'pattern' to address a scenario we will have to deal with. We will be storing a set of Column/Value pairs per Key where the Column/Values are read from a set of files that we download regularly. We need the loading to be resilient and we can receive corrections for some of the Column/Values that can only be loaded after the initial data has been inserted. The challenge we have is that we have a strong preference for active/active loading of data and can't see how to achieve this without some form of serialisation (which Cassandra doesn't support - correct ?) thanks -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
Re: enforcing ordering
On Tue, May 8, 2012 at 8:09 PM, aaron morton aa...@thelastpickle.com wrote: Can you store the corrections in a separate CF? Yes, I thought of that, but that turns one read in to two ;-( When the client reads the key, reads from the original and the corrections CF at the same time. Apply the correction only on the client side. When you have confirmed the ingest has completed, run a background job to apply the corrections, store the updated values and delete the correction data. I was thinking down this path, but I ended up chasing the rabbit down a deep hole of race conditions . . . cheers Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 8/05/2012, at 9:35 PM, Franc Carter wrote: Hi, I'm wondering if there is a common 'pattern' to address a scenario we will have to deal with. We will be storing a set of Column/Value pairs per Key where the Column/Values are read from a set of files that we download regularly. We need the loading to be resilient and we can receive corrections for some of the Column/Values that can only be loaded after the initial data has been inserted. The challenge we have is that we have a strong preference for active/active loading of data and can't see how to achieve this without some form of serialisation (which Cassandra doesn't support - correct ?) thanks -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
Re: 200TB in Cassandra ?
On Fri, Apr 20, 2012 at 6:27 AM, aaron morton aa...@thelastpickle.com wrote: Couple of ideas: * take a look at compression in 1.X http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-compression * is there repetition in the binary data ? Can you save space by implementing content addressable storage ? The data is already very highly space optimised. We've come to the conclusion that Cassandra is probably not the right fit for the use case this time cheers Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 20/04/2012, at 12:55 AM, Dave Brosius wrote: I think your math is 'relatively' correct. It would seem to me you should focus on how you can reduce the amount of storage you are using per item, if at all possible, if that node count is prohibitive. On 04/19/2012 07:12 AM, Franc Carter wrote: Hi, One of the projects I am working on is going to need to store about 200TB of data - generally in manageable binary chunks. However, after doing some rough calculations based on rules of thumb I have seen for how much storage should be on each node I'm worried. 200TB with RF=3 is 600TB = 600,000GB Which is 1000 nodes at 600GB per node I'm hoping I've missed something as 1000 nodes is not viable for us. cheers -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
Re: 200TB in Cassandra ?
On Sat, Apr 21, 2012 at 1:05 AM, Jake Luciani jak...@gmail.com wrote: What other solutions are you considering? Any OLTP style access of 200TB of data will require substantial IO. We currently use an in-house written database because when we first started our system there was nothing that handled our problem economically. We would like to use something more off the shelf to reduce maintenance and development costs. We've been looking at Hadoop for the computational component. However it looks like HDFS does not map to our storage patterns well as the latency is quite high. In addition the resilience model of the Name Node is a concern in our environment. We were thinking through whether using Cassandra for the Hadoop data store is viable for us, however we've come to the conclusion that it doesn't map well in this case. Do you know how big your working dataset will be? The system is batch, jobs could range between very small up to a moderate percentage of the data set. It's even possible that we could need to read the entire data set. How much we get resident is a cost/performance trade-off we need to make cheers -Jake On Fri, Apr 20, 2012 at 3:30 AM, Franc Carter franc.car...@sirca.org.au wrote: On Fri, Apr 20, 2012 at 6:27 AM, aaron morton aa...@thelastpickle.com wrote: Couple of ideas: * take a look at compression in 1.X http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-compression * is there repetition in the binary data ? Can you save space by implementing content addressable storage ? The data is already very highly space optimised. We've come to the conclusion that Cassandra is probably not the right fit for the use case this time cheers Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 20/04/2012, at 12:55 AM, Dave Brosius wrote: I think your math is 'relatively' correct. It would seem to me you should focus on how you can reduce the amount of storage you are using per item, if at all possible, if that node count is prohibitive. On 04/19/2012 07:12 AM, Franc Carter wrote: Hi, One of the projects I am working on is going to need to store about 200TB of data - generally in manageable binary chunks. However, after doing some rough calculations based on rules of thumb I have seen for how much storage should be on each node I'm worried. 200TB with RF=3 is 600TB = 600,000GB Which is 1000 nodes at 600GB per node I'm hoping I've missed something as 1000 nodes is not viable for us. cheers -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- http://twitter.com/tjake -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
200TB in Cassandra ?
Hi, One of the projects I am working on is going to need to store about 200TB of data - generally in manageable binary chunks. However, after doing some rough calculations based on rules of thumb I have seen for how much storage should be on each node I'm worried. 200TB with RF=3 is 600TB = 600,000GB Which is 1000 nodes at 600GB per node I'm hoping I've missed something as 1000 nodes is not viable for us. cheers -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
Re: RE 200TB in Cassandra ?
On Thu, Apr 19, 2012 at 9:38 PM, Romain HARDOUIN romain.hardo...@urssaf.frwrote: Cassandra supports data compression and depending on your data, you can gain a reduction in data size up to 4x. The data is gzip'd already ;-) 600 TB is a lot, hence requires lots of servers... Franc Carter franc.car...@sirca.org.au a écrit sur 19/04/2012 13:12:19 : Hi, One of the projects I am working on is going to need to store about 200TB of data - generally in manageable binary chunks. However, after doing some rough calculations based on rules of thumb I have seen for how much storage should be on each node I'm worried. 200TB with RF=3 is 600TB = 600,000GB Which is 1000 nodes at 600GB per node I'm hoping I've missed something as 1000 nodes is not viable for us. cheers -- Franc Carter | Systems architect | Sirca Ltd franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
Re: 200TB in Cassandra ?
On Thu, Apr 19, 2012 at 10:07 PM, John Doe jd...@yahoo.com wrote: Franc Carter franc.car...@sirca.org.au One of the projects I am working on is going to need to store about 200TB of data - generally in manageable binary chunks. However, after doing some rough calculations based on rules of thumb I have seen for how much storage should be on each node I'm worried. 200TB with RF=3 is 600TB = 600,000GB Which is 1000 nodes at 600GB per node I'm hoping I've missed something as 1000 nodes is not viable for us. Why only 600GB per node? I had seen comments that you didn't want to put 'too much' data on to a single node and had seen the figure of 400GB thrown around as an approximate figure - I rounded up to 600GB to make the maths easy ;-) I'm hoping that my understanding is flawed ;-) cheers JD -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
Re: RE 200TB in Cassandra ?
On Thu, Apr 19, 2012 at 10:16 PM, Yiming Sun yiming@gmail.com wrote: 600 TB is really a lot, even 200 TB is a lot. In our organization, storage at such scale is handled by our storage team and they purchase specialized (and very expensive) equipment from storage hardware vendors because at this scale, performance and reliability is absolutely critical. Yep that's what we currently do. We have 200TB sitting on a set of high end disk arrays which are running RAID6. I'm in the early stages of looking at whether this is still the best approach. but it sounds like your team may not be able to afford such equipment. 600GB per node will require a cloud and you need a data center to house them... but 2TB disks are common place nowadays and you can jam multiple 2TB disks into each node to reduce the number of machines needed. It all depends on what budget you have. The bit I am trying to understand is whether my figure of 400GB/node in practice for Cassandra is correct, or whether we can push the GB/node higher and if so how high cheers -- Y. On Thu, Apr 19, 2012 at 7:54 AM, Franc Carter franc.car...@sirca.org.au wrote: On Thu, Apr 19, 2012 at 9:38 PM, Romain HARDOUIN romain.hardo...@urssaf.fr wrote: Cassandra supports data compression and depending on your data, you can gain a reduction in data size up to 4x. The data is gzip'd already ;-) 600 TB is a lot, hence requires lots of servers... Franc Carter franc.car...@sirca.org.au a écrit sur 19/04/2012 13:12:19 : Hi, One of the projects I am working on is going to need to store about 200TB of data - generally in manageable binary chunks. However, after doing some rough calculations based on rules of thumb I have seen for how much storage should be on each node I'm worried. 200TB with RF=3 is 600TB = 600,000GB Which is 1000 nodes at 600GB per node I'm hoping I've missed something as 1000 nodes is not viable for us. cheers -- Franc Carter | Systems architect | Sirca Ltd franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
Re: Largest 'sensible' value
On Wed, Apr 4, 2012 at 8:56 AM, Jonathan Ellis jbel...@gmail.com wrote: We use 2MB chunks for our CFS implementation of HDFS: http://www.datastax.com/dev/blog/cassandra-file-system-design thanks On Mon, Apr 2, 2012 at 4:23 AM, Franc Carter franc.car...@sirca.org.au wrote: Hi, We are in the early stages of thinking about a project that needs to store data that will be accessed by Hadoop. One of the concerns we have is around the Latency of HDFS as our use case is is not for reading all the data and hence we will need custom RecordReaders etc. I've seen a couple of comments that you shouldn't put large chunks in to a value - however 'large' is not well defined for the range of people using these solutions ;-) Doe anyone have a rough rule of thumb for how big a single value can be before we are outside sanity? thanks -- Franc Carter | Systems architect | Sirca Ltd franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
Largest 'sensible' value
Hi, We are in the early stages of thinking about a project that needs to store data that will be accessed by Hadoop. One of the concerns we have is around the Latency of HDFS as our use case is not for reading all the data and hence we will need custom RecordReaders etc. I've seen a couple of comments that you shouldn't put large chunks in to a value - however 'large' is not well defined for the range of people using these solutions ;-) Does anyone have a rough rule of thumb for how big a single value can be before we are outside sanity? thanks -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
Re: Largest 'sensible' value
On Tue, Apr 3, 2012 at 4:18 AM, Ben Coverston ben.covers...@datastax.com wrote: This is a difficult question to answer for a variety of reasons, but I'll give it a try, maybe it will be helpful, maybe not. The most obvious problem with this is that Thrift is buffer based, not streaming. That means that whatever the size of your chunk it needs to be received, deserialized, and processed by cassandra within a timeframe that we call the rpc_timeout (by default this is 10 seconds). Thanks. I suspect that 'not streaming' is the key, and not just from the Cassandra side - our use case has a subtle assumption of streaming on the client side. We could chop it up in to buckets and put each one in a time ordered column, but that defeats the purpose of why I was considering Cassandra - to avoid the latency of seeks in HDFS cheers Bigger buffers mean larger allocations, larger allocations mean that the JVM is working harder, and is more prone to fragmentation on the heap. With mixed workloads (lots of high latency, large requests and many very small low latency requests) larger buffers can also, over time, clog up the thread pool in a way that can cause your shorter queries to have to wait for your longer running queries to complete (to free up worker threads) making everything slow. This isn't a problem unique to Cassandra, everything that uses worker queues runs into some variant of this problem. As with everything else, you'll probably need to test your specific use case to see what 'too big' is for you. On Mon, Apr 2, 2012 at 9:23 AM, Franc Carter franc.car...@sirca.org.au wrote: Hi, We are in the early stages of thinking about a project that needs to store data that will be accessed by Hadoop. One of the concerns we have is around the Latency of HDFS as our use case is not for reading all the data and hence we will need custom RecordReaders etc. I've seen a couple of comments that you shouldn't put large chunks in to a value - however 'large' is not well defined for the range of people using these solutions ;-) Does anyone have a rough rule of thumb for how big a single value can be before we are outside sanity? thanks -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- Ben Coverston DataStax -- The Apache Cassandra Company -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
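[Editor's note: if the chunking route discussed above ever becomes attractive despite the seek-latency concern, here is a minimal pycassa sketch of the idea. The keyspace/column family names and the 2MB chunk size are illustrative only (the size is borrowed from the CFS figure mentioned earlier in the thread), and it assumes a comparator under which the zero-padded chunk names sort in order.]

    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily

    CHUNK_SIZE = 2 * 1024 * 1024   # ~2MB per column, similar to the CFS chunk size

    pool = ConnectionPool('DemoKeyspace', ['localhost:9160'])
    blobs = ColumnFamily(pool, 'Blobs')

    def put_blob(key, data):
        # one row per blob, one column per chunk; zero-padded names keep them ordered
        for i in xrange(0, len(data), CHUNK_SIZE):
            blobs.insert(key, {'chunk-%08d' % (i // CHUNK_SIZE): data[i:i + CHUNK_SIZE]})

    def get_blob(key):
        # columns come back in comparator order, so concatenating rebuilds the blob
        # (a real implementation would page with column_start rather than one big slice)
        return ''.join(blobs.get(key, column_count=100000).values())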
Re: sstable image/pic ?
2012/2/28 Hontvári József Levente hontv...@flyordie.com * Does the column name get stored for every col/val for every key (which sort of worries me for long column names) Yes, the column name is stored with each value for every key, but it may not matter if you switch on compression, which AFAIK has only advantages and will be the default. I am also worried about the storage space, so I did a test. Yes - I'm using compression - I've seen the same outcome in one of our own systems. There is a MySQL table which I intend to move to Cassandra. It has about 40 columns with very long column names, the average is 15 characters. The column values are mostly 2-4 byte integers. On the other hand many columns are empty, specifically not NULL but 0. AFAIK MySQL is also able to optimize NOT NULL columns with 0 values to a single bit. In Cassandra I simply did not store a column if its value is the default 0. The table size, only data without indexes, in MySQL was about 2.5 GB with 7 million rows. In Cassandra it was about 12 GB without compression, and 3.4 GB with compression (which also includes a single index for the row keys). So with compression switched on, in this specific case the storage requirements are roughly the same on Cassandra and MySQL. Good to know - thanks * Is data in an sstable sorted by key then column or column then key Sorted by key and then sorted by column. thanks -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
sstable image/pic ?
Hi, does anyone know of a picture/image that shows the layout of keys/columns/values in an sstable - I haven't been able to find one and am having a hard time visualising the layout from various descriptions and various overviews thanks -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
data model advice
Hi, I've finished my first model and experiments with Cassandra with results I'm pretty happy with - so I thought I'd move on to something harder. We have a set of data that has a large number of entities (which is our primary search key), for each of the entities we have a smallish (100) number of sets of data. Each set has a further set that contains column/value pairs. The queries will be for an Entity, for one or more days for one or more of the subsets. Conceptually I would like to model it like this:- Entity { Day1: { TypeA: {col1:val1, col2:val2, . . . } TypeB: {col1:val1, col3:val3, . . . } . . } . . . DayN: { TypeB: {col3:val3, col5:val5, . . . } TypeD: {col3:val3, col6:val6, . . . } . . } } My understanding of the Cassandra data model is that I run out of map-depth to do this in my simplistic approach as the Days are super columns, the types are columns and then I don't have a col/val map left for data. Does anyone have advice on a good approach ? thanks -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
Re: data model advice
On Fri, Feb 24, 2012 at 2:54 PM, Martin Arrowsmith arrowsmith.mar...@gmail.com wrote: Hi Franc, Or, you can consider using composite columns. It is not recommended to use Super Columns anymore. Thanks, I'll look in to composite columns cheers Best wishes, Martin On Thu, Feb 23, 2012 at 7:51 PM, Indranath Ghosh indrana...@gmail.comwrote: How about using a composite row key like the following: Entity.Day1.TypeA: {col1:val1, col2:val2, . . . } Entity.Day1.TypeB: {col1:val1, col2:val2, . . . } . . Entity.DayN.TypeA: {col1:val1, col2:val2, . . . } Entity.DayN.TypeB: {col1:val1, col2:val2, . . . } It is better to avoid super columns.. -indra On Thu, Feb 23, 2012 at 6:36 PM, Franc Carter franc.car...@sirca.org.auwrote: Hi, I've finished my first model and experiments with Cassandra with result I'm pretty happy with - so I thought I'd move on to something harder. We have a set of data that has a large number of entities (which is our primary search key), for each of the entities we have a smallish (100) number of sets of data. Each set has a further set the contains column/vale pairs. The queries will be for an Entity, for one or more days for one or more of the subsets. Conceptually I would like to model like it like this:- Entity { Day1: { TypeA: {col1:val1, col2:val2, . . . } TypeB: {col1:val1, col3:val3, . . . } . . } . . . DayN: { TypeB: {col3:val3, col5:val5, . . . } TypeD: {col3:val3, col6:val6, . . . } . . } } My understanding of the Cassandra data model is that I run out of map-dept to do this in my simplistic approach as the Days are super columns, the types are column and then I don't have a col/val map left for data. Does anyone have advice on a good approach ? thanks -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Indranath Ghosh Phone: 408-813-9207* -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
Re: data model advice
On Fri, Feb 24, 2012 at 2:54 PM, Martin Arrowsmith arrowsmith.mar...@gmail.com wrote: Hi Franc, Or, you can consider using composite columns. It is not recommended to use Super Columns anymore. Best wishes, On first read it would seem that there is fair bit of overhead with composite columns as it's my understanding that the column name is stored with each value - or have I missed something ? cheers Martin On Thu, Feb 23, 2012 at 7:51 PM, Indranath Ghosh indrana...@gmail.comwrote: How about using a composite row key like the following: Entity.Day1.TypeA: {col1:val1, col2:val2, . . . } Entity.Day1.TypeB: {col1:val1, col2:val2, . . . } . . Entity.DayN.TypeA: {col1:val1, col2:val2, . . . } Entity.DayN.TypeB: {col1:val1, col2:val2, . . . } It is better to avoid super columns.. -indra On Thu, Feb 23, 2012 at 6:36 PM, Franc Carter franc.car...@sirca.org.auwrote: Hi, I've finished my first model and experiments with Cassandra with result I'm pretty happy with - so I thought I'd move on to something harder. We have a set of data that has a large number of entities (which is our primary search key), for each of the entities we have a smallish (100) number of sets of data. Each set has a further set the contains column/vale pairs. The queries will be for an Entity, for one or more days for one or more of the subsets. Conceptually I would like to model like it like this:- Entity { Day1: { TypeA: {col1:val1, col2:val2, . . . } TypeB: {col1:val1, col3:val3, . . . } . . } . . . DayN: { TypeB: {col3:val3, col5:val5, . . . } TypeD: {col3:val3, col6:val6, . . . } . . } } My understanding of the Cassandra data model is that I run out of map-dept to do this in my simplistic approach as the Days are super columns, the types are column and then I don't have a col/val map left for data. Does anyone have advice on a good approach ? thanks -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Indranath Ghosh Phone: 408-813-9207* -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
Re: List all keys with RandomPartitioner
On Wed, Feb 22, 2012 at 8:47 PM, Flavio Baronti f.baro...@list-group.comwrote: I need to iterate over all the rows in a column family stored with RandomPartitioner. When I reach the end of a key slice, I need to find the token of the last key in order to ask for the next slice. I saw in an old email that the token for a specific key can be recoveder through FBUtilities.hash(). That class however is inside the full Cassandra jar, not inside the client-specific part. Is there a way to iterate over all the keys which does not require the server-side Cassandra jar? Does this help ? http://wiki.apache.org/cassandra/FAQ#iter_world cheers Thanks Flavio -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
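[Editor's note: for reference, the approach in that FAQ entry looks like the following from a client that wraps the paging for you. This is a pycassa sketch - a different client from the one Flavio is using, and the keyspace/column family names are made up - but the same idea applies to any client built on get_range_slices.]

    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily

    pool = ConnectionPool('MyKeyspace', ['localhost:9160'])
    cf = ColumnFamily(pool, 'MyCF')

    # get_range pages through get_range_slices internally, so no token arithmetic is
    # needed; with RandomPartitioner the keys come back in token order, not key order.
    for key, columns in cf.get_range(buffer_size=1024):
        print key, len(columns)

    pool.dispose()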
Re: reads/s suddenly dropped
On Mon, Feb 20, 2012 at 9:42 PM, Franc Carter franc.car...@sirca.org.auwrote: On Mon, Feb 20, 2012 at 12:00 PM, aaron morton aa...@thelastpickle.comwrote: Aside from iostats.. nodetool cfstats will give you read and write latency for each CF. This is the latency for the operation on each node. Check that to see if latency is increasing. Take a look at nodetool compactionstats to see if compactions are running at the same time. The IO is throttled but if you are on aws it may not be throttled enough. compaction had finished The sweet spot for non netflix deployments seems to be a m1.xlarge with 16GB. THe JVM can have 8 and the rest can be used for memmapping files. Here is a good post about choosing EC2 sizes… http://perfcap.blogspot.co.nz/2011/03/understanding-and-using-amazon-ebs.html Thanks - good article. I'll go up to m1.xlarge and explore that behaviour the m1.xlarge is giving much better and more consistent results thanks cheers Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 20/02/2012, at 9:31 AM, Franc Carter wrote: On Mon, Feb 20, 2012 at 4:10 AM, Philippe watche...@gmail.com wrote: Perhaps your dataset can no longer be held in memory. Check iostats I have been flushing the keycache and dropping the linux disk caches before each to avoid testing memory reads. One possibility that I thought of is that the success keys are now 'far enough away' that they are not being included in the previous read and hence the seek penalty has to be paid a lot more often - viable ? cheers Le 19 févr. 2012 11:24, Franc Carter franc.car...@sirca.org.au a écrit : I've been testing Cassandra - primarily looking at reads/second for our fairly data model - one unique key with a row of columns that we always request. I've now setup the cluster with with m1.large (2 cpus 8GB) I had loaded a months worth of data in and was doing random requests as a torture test - and getting very nice results. I then loaded another days worth of day and repeated the tests while the load was running - still good. I then started loading more days and at some point the performance dropped by close to an order of magnitude ;-( Any ideas on what to look for ? thanks -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
Re: reads/s suddenly dropped
On Mon, Feb 20, 2012 at 12:00 PM, aaron morton aa...@thelastpickle.comwrote: Aside from iostats.. nodetool cfstats will give you read and write latency for each CF. This is the latency for the operation on each node. Check that to see if latency is increasing. Take a look at nodetool compactionstats to see if compactions are running at the same time. The IO is throttled but if you are on aws it may not be throttled enough. compaction had finished The sweet spot for non netflix deployments seems to be a m1.xlarge with 16GB. THe JVM can have 8 and the rest can be used for memmapping files. Here is a good post about choosing EC2 sizes… http://perfcap.blogspot.co.nz/2011/03/understanding-and-using-amazon-ebs.html Thanks - good article. I'll go up to m1.xlarge and explore that behaviour cheers Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 20/02/2012, at 9:31 AM, Franc Carter wrote: On Mon, Feb 20, 2012 at 4:10 AM, Philippe watche...@gmail.com wrote: Perhaps your dataset can no longer be held in memory. Check iostats I have been flushing the keycache and dropping the linux disk caches before each to avoid testing memory reads. One possibility that I thought of is that the success keys are now 'far enough away' that they are not being included in the previous read and hence the seek penalty has to be paid a lot more often - viable ? cheers Le 19 févr. 2012 11:24, Franc Carter franc.car...@sirca.org.au a écrit : I've been testing Cassandra - primarily looking at reads/second for our fairly data model - one unique key with a row of columns that we always request. I've now setup the cluster with with m1.large (2 cpus 8GB) I had loaded a months worth of data in and was doing random requests as a torture test - and getting very nice results. I then loaded another days worth of day and repeated the tests while the load was running - still good. I then started loading more days and at some point the performance dropped by close to an order of magnitude ;-( Any ideas on what to look for ? thanks -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
reads/s suddenly dropped
I've been testing Cassandra - primarily looking at reads/second for our fairly data model - one unique key with a row of columns that we always request. I've now set up the cluster with m1.large (2 cpus 8GB) I had loaded a month's worth of data in and was doing random requests as a torture test - and getting very nice results. I then loaded another day's worth of data and repeated the tests while the load was running - still good. I then started loading more days and at some point the performance dropped by close to an order of magnitude ;-( Any ideas on what to look for ? thanks -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
Re: reads/s suddenly dropped
On Mon, Feb 20, 2012 at 4:10 AM, Philippe watche...@gmail.com wrote: Perhaps your dataset can no longer be held in memory. Check iostats I have been flushing the keycache and dropping the linux disk caches before each test to avoid testing memory reads. One possibility that I thought of is that the successive keys are now 'far enough away' that they are not being included in the previous read and hence the seek penalty has to be paid a lot more often - viable ? cheers On 19 Feb 2012 11:24, Franc Carter franc.car...@sirca.org.au wrote: I've been testing Cassandra - primarily looking at reads/second for our fairly simple data model - one unique key with a row of columns that we always request. I've now set up the cluster with m1.large (2 CPUs, 8GB). I had loaded a month's worth of data in and was doing random requests as a torture test - and getting very nice results. I then loaded another day's worth of data and repeated the tests while the load was running - still good. I then started loading more days and at some point the performance dropped by close to an order of magnitude ;-( Any ideas on what to look for ? thanks -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
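A sketch of the kind of pre-test reset described above, under the assumption that the interesting part is dropping the Linux page cache (which needs root). The Cassandra-side cache clearing is deliberately left as a placeholder because the exact nodetool subcommand differs between versions.

    # Sketch of the pre-test reset described above. Dropping the Linux page
    # cache needs root; the Cassandra-side cache clearing is left as a
    # placeholder because the exact nodetool subcommand differs by version.
    import subprocess
    import time

    def drop_os_caches():
        subprocess.check_call(["sync"])
        with open("/proc/sys/vm/drop_caches", "w") as f:
            f.write("3\n")                  # free page cache, dentries and inodes

    def timed(fn):
        start = time.time()
        fn()
        return time.time() - start

    # usage, where run_query is whatever read workload is being benchmarked:
    # drop_os_caches()
    # print("cold %.2fs, warm %.2fs" % (timed(run_query), timed(run_query)))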
Re: Key cache hit rate issue
On 17/02/2012 8:53 AM, Eran Chinthaka Withana eran.chinth...@gmail.com wrote: Hi Jonathan, Thanks for the reply. Yes there is a possibility that the keys can be distributed in multiple SSTables, but my data access patterns are such that I always read/write the whole row. So I expect all the data to be in the same SSTable (please correct me if I'm wrong). For some reason 16637958 (the keys cached) has become a golden number and I don't see key cache increasing beyond that. I also checked memory and I have about 4GB left in JVM memory and didn't see any issues on logs. I have seen the same thing with the keycache size becoming static cheers Thanks, Eran Chinthaka Withana On Thu, Feb 16, 2012 at 1:20 PM, Jonathan Ellis jbel...@gmail.com wrote: So, you have roughly 1/6 of your (physical) row keys cached and about 1/4 cache hit rate, which doesn't sound unreasonable to me. Remember, each logical key may be spread across multiple physical sstables -- each (key, sstable) pair is one entry in the key cache. On Thu, Feb 16, 2012 at 1:48 PM, Eran Chinthaka Withana eran.chinth...@gmail.com wrote: Hi Aaron, Here it is. Keyspace: Read Count: 1123637972 Read Latency: 5.757938114343114 ms. Write Count: 128201833 Write Latency: 0.0682576607387509 ms. Pending Tasks: 0 Column Family: YY SSTable count: 18 Space used (live): 103318720685 Space used (total): 103318720685 Number of Keys (estimate): 92404992 Memtable Columns Count: 1425580 Memtable Data Size: 359655747 Memtable Switch Count: 2522 Read Count: 1123637972 Read Latency: 14.731 ms. Write Count: 128201833 Write Latency: NaN ms. Pending Tasks: 0 Bloom Filter False Postives: 1488 Bloom Filter False Ratio: 0.0 Bloom Filter Space Used: 331522920 Key cache capacity: 16637958 Key cache size: 16637958 Key cache hit rate: 0.2708 Row cache: disabled Compacted row minimum size: 51 Compacted row maximum size: 6866 Compacted row mean size: 2560 Thanks, Eran Chinthaka Withana On Thu, Feb 16, 2012 at 12:30 AM, aaron morton aa...@thelastpickle.com wrote: Its in the order of 261 to 8000 and the ratio is 0.00. But i guess 8000 is bit high. Is there a way to fix/improve it? Sorry I don't understand what you mean. But if the ratio is 0.0 all is good. Could you include the full output from cfstats for the CF you are looking at ? Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 15/02/2012, at 1:00 PM, Eran Chinthaka Withana wrote: Its in the order of 261 to 8000 and the ratio is 0.00. But i guess 8000 is bit high. Is there a way to fix/improve it? Thanks, Eran Chinthaka Withana On Tue, Feb 14, 2012 at 3:42 PM, aaron morton aa...@thelastpickle.com wrote: Out of interest what does cfstats say about the bloom filter stats ? A high false positive could lead to a low key cache hit rate. Also, is there a way to warm start the key cache, meaning pre-load the amount of keys I set as keys_cached? See key_cache_save_period when creating the CF. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 15/02/2012, at 5:54 AM, Eran Chinthaka Withana wrote: Hi, I'm using Cassandra 1.0.7 and I've set the keys_cached to about 80% (using the numerical values). This is visible in cfstats too. But I'm getting less than 20% (or sometimes even 0%) key cache hit rate. Well, the data access pattern is not the issue here as I know they are retrieving the same row multiple times. I'm using hector client with dynamic load balancing policy with consistency ONE for both reads and writes. 
Any ideas on how to find the issue and fix this? Here is what I see on cfstats. Key cache capacity: 16637958 Key cache size: 16637958 Key cache hit rate: 0.045454545454545456 Also, is there a way to warm start the key cache, meaning pre-load the amount of keys I set as keys_cached? Thanks, Eran -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
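A rough back-of-envelope version of Jonathan's point, using the cfstats figures quoted above. The sstables-per-read figure is an assumption, not something reported in the thread.

    # Back-of-envelope version of the reasoning above, using the cfstats figures
    # quoted in this thread. Each (row key, sstable) pair is a separate key-cache
    # entry, so the cache covers fewer logical rows than its capacity suggests.
    keys_estimate  = 92404992    # "Number of Keys (estimate)"
    sstable_count  = 18          # "SSTable count"
    cache_capacity = 16637958    # "Key cache capacity"

    print("capacity / logical keys = %.2f" % (float(cache_capacity) / keys_estimate))

    # Assumption (not reported in the thread): a typical read touches ~2 sstables,
    # so the pool of distinct (key, sstable) entries the workload can ask for is
    # roughly twice the key count.
    sstables_per_read = 2
    fraction = float(cache_capacity) / (keys_estimate * sstables_per_read)
    print("cache covers ~%.0f%% of reachable (key, sstable) entries" % (100 * fraction))

With these numbers the cache holds roughly 18% of logical keys but only about 9% of the (key, sstable) entries under that assumption, so the reported ~27% hit rate on a skewed access pattern is not unreasonable.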
Re: stalled bootstrap
On Wed, Feb 15, 2012 at 10:21 AM, aaron morton aa...@thelastpickle.com wrote: The assertion looks like a bug. Can you run it with DEBUG logging ? Sorry - I had to blow the instances away. I'm on a tight timeline for the Proof of Concept I am doing and rebuilding a 4-node cluster from scratch was going to be way faster. If I get time I'll try to reproduce it towards the end of the project - sorry. Do you have compression enabled ? Yes - SnappyCompressor Can you please submit a ticket here https://issues.apache.org/jira/browse/CASSANDRA with the extra info and update the email thread. Would you still like this even though I can't get much detail ? I *think* that the node this is happening on is failing to create the temp file in IncomingStreamReader.streamIn and then it's trying to delete the file before it retries. Extra debugging may be a help. The assertion is hiding the original error. Can you check if the new node can create files in the data directory ? I'll try these if I can get time to retest - thanks for the pointers cheers Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 15/02/2012, at 12:42 AM, Franc Carter wrote: Hi, I'm running the DataStax 1.0.7 AMI in EC2. I started with two nodes and have just added a third node on the way to expanding to a four node cluster. The bootstrapping was going along ok for a while, but has stalled. In /var/log/cassandra/system.log I am seeing this repeated continuously (tmp file changes each time) INFO [Thread-529373] 2012-02-14 11:36:18,350 StreamInSession.java (line 120) Streaming of file /raid0/cassandra/data/OpsCenter/rollups7200-hc-1-Data.db sections=2 progress=0/42387 - 0% from org.apache.cassandra.streaming.StreamInSession@6ebcf58a failed: requesting a retry. ERROR [Thread-529373] 2012-02-14 11:36:18,351 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[Thread-529373,5,main] java.lang.AssertionError: attempted to delete non-existing file rollups7200-tmp-hc-529319-Data.db at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:49) at org.apache.cassandra.streaming.IncomingStreamReader.retry(IncomingStreamReader.java:172) at org.apache.cassandra.streaming.IncomingStreamReader.read(IncomingStreamReader.java:92) at org.apache.cassandra.net.IncomingTcpConnection.stream(IncomingTcpConnection.java:185) at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:81) Any advice on how to resolve this ? thanks -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
stalled bootstrap
Hi, I'm running the DataStax 1.0.7 AMI in EC2. I started with two nodes and have just added a third node on the way to expanding to a four node cluster. The bootstrapping was going along ok for a while, but has stalled. In /var/log/cassandra/system.log I am seeing this repeated continuously (tmp file changes each time) INFO [Thread-529373] 2012-02-14 11:36:18,350 StreamInSession.java (line 120) Streaming of file /raid0/cassandra/data/OpsCenter/rollups7200-hc-1-Data.db sections=2 progress=0/42387 - 0% from org.apache.cassandra.streaming.StreamInSession@6ebcf58a failed: requesting a retry. ERROR [Thread-529373] 2012-02-14 11:36:18,351 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[Thread-529373,5,main] java.lang.AssertionError: attempted to delete non-existing file rollups7200-tmp-hc-529319-Data.db at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:49) at org.apache.cassandra.streaming.IncomingStreamReader.retry(IncomingStreamReader.java:172) at org.apache.cassandra.streaming.IncomingStreamReader.read(IncomingStreamReader.java:92) at org.apache.cassandra.net.IncomingTcpConnection.stream(IncomingTcpConnection.java:185) at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:81) Any advice on how to resolve this ? thanks -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
Re: active/pending queue lengths
On Tue, Feb 14, 2012 at 8:01 PM, aaron morton aa...@thelastpickle.comwrote: And the output from tpstats is ? I can't reproduce it at the moment ;-( nodetool is throwing 'Failed to retrieve RMIServer stub:' - which I'm guessing/hoping is related to the stalled bootstrap. A - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 14/02/2012, at 12:43 PM, Franc Carter wrote: On Tue, Feb 14, 2012 at 6:06 AM, aaron morton aa...@thelastpickle.comwrote: What CL are you reading at ? Quorum Write ops go to RF number of nodes, read ops go to RF number of nodes 10% (the default probability that Read Repair will be running) of the time and CL number of nodes 90% of the time. With 2 nodes and RF 2 the QUOURM is 2, every request will involve all nodes. Yep, the thing tat confuses is the different behaviour for reading from one node versus two As to why the pending list gets longer, do you have some more info ? What process are you using to measure ? It's hard to guess why. In this setup every node will have the data and should be able to do a local read and then on the other node. I have four pycassa clients, two making requests to one server and two making requests to the other (or all four making requests to the same server). The requested keys don't overlap and I would expect/assume the keys are in the keycache I am looking at the output of nodetool -h tpstats cheers Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 14/02/2012, at 12:47 AM, Franc Carter wrote: Hi, I've been looking at tpstats as various test queries run and I noticed something I don't understand. I have a two node cluster with RF=2 on which I run 4 parallel queries, each job goes through a list of keys doing a multiget for 2 keys at a time. If two of the queries go to one node and the other two go to a different node then the pending queue on the node gets much longer than if they all go to the one node. I'm clearly missing something here as I would have expected the opposite cheers -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
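For reference, the four test clients described above amount to something like the following pycassa sketch. Keyspace, column family and host names are placeholders, not values taken from the thread.

    # Approximate reconstruction of the four test clients: each walks its own
    # disjoint key list, multigetting two keys per call at QUORUM. Keyspace,
    # column family and host names are placeholders.
    import pycassa

    pool = pycassa.ConnectionPool("MyKeyspace", server_list=["node1:9160"])
    cf = pycassa.ColumnFamily(pool, "MyCF")
    cf.read_consistency_level = pycassa.ConsistencyLevel.QUORUM

    def run_client(keys, batch=2):
        for i in range(0, len(keys), batch):
            cf.multiget(keys[i:i + batch])

    # Two processes pointed at node1 and two at node2 (separate ConnectionPools),
    # each given a non-overlapping key list, reproduce the setup in the thread.

Because the reads are at QUORUM with RF=2, every multiget involves both nodes regardless of which coordinator the client connects to, which is part of why the pending-queue behaviour is surprising.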
nodetool removetoken
I teminated (ec2 destruction) a node that I was wedged during bootstrap. However when I try to removetoken I get 'Token not found'. It looks a bit like this issue ? https://issues.apache.org/jira/browse/CASSANDRA-3737 nodetool -h 127.0.0.1 ring gives this Address DC RackStatus State Load OwnsToken 85070591730234615865843651857942052864 10.253.65.203 us-east 1a Up Normal 11.18 GB 50.00% 0 10.252.82.64us-east 1a Down Joining 320.45 KB 25.00% 42535295865117307932921825928971026432 10.253.86.224 us-east 1a Up Normal 11.01 GB 25.00% 85070591730234615865843651857942052864 and nodetool -h 127.0.0.1 removetoken 42535295865117307932921825928971026432 gives xception in thread main java.lang.UnsupportedOperationException: Token not found. at org.apache.cassandra.service.StorageService.removeToken(StorageService.java:2369) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:93) at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:27) at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:208) at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:120) at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:262) at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:836) at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:761) at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1427) at javax.management.remote.rmi.RMIConnectionImpl.access$200(RMIConnectionImpl.java:72) at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1265) at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1360) at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:788) at sun.reflect.GeneratedMethodAccessor165.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:305) at sun.rmi.transport.Transport$1.run(Transport.java:159) at java.security.AccessController.doPrivileged(Native Method) at sun.rmi.transport.Transport.serviceCall(Transport.java:155) at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:535) at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:790) at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:649) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Any ideas on how to deal with this ? thanks -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
Re: nodetool removetoken
On Wed, Feb 15, 2012 at 8:49 AM, Brandon Williams dri...@gmail.com wrote: Before 1.0.8, use https://issues.apache.org/jira/browse/CASSANDRA-3337 to remove it. I'm missing something ;-( I don't see a solution in this link . . cheers On Tue, Feb 14, 2012 at 3:44 PM, Franc Carter franc.car...@sirca.org.au wrote: I teminated (ec2 destruction) a node that I was wedged during bootstrap. However when I try to removetoken I get 'Token not found'. It looks a bit like this issue ? https://issues.apache.org/jira/browse/CASSANDRA-3737 nodetool -h 127.0.0.1 ring gives this Address DC RackStatus State Load OwnsToken 85070591730234615865843651857942052864 10.253.65.203 us-east 1a Up Normal 11.18 GB 50.00% 0 10.252.82.64us-east 1a Down Joining 320.45 KB 25.00% 42535295865117307932921825928971026432 10.253.86.224 us-east 1a Up Normal 11.01 GB 25.00% 85070591730234615865843651857942052864 and nodetool -h 127.0.0.1 removetoken 42535295865117307932921825928971026432 gives xception in thread main java.lang.UnsupportedOperationException: Token not found. at org.apache.cassandra.service.StorageService.removeToken(StorageService.java:2369) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:93) at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:27) at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:208) at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:120) at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:262) at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:836) at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:761) at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1427) at javax.management.remote.rmi.RMIConnectionImpl.access$200(RMIConnectionImpl.java:72) at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1265) at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1360) at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:788) at sun.reflect.GeneratedMethodAccessor165.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:305) at sun.rmi.transport.Transport$1.run(Transport.java:159) at java.security.AccessController.doPrivileged(Native Method) at sun.rmi.transport.Transport.serviceCall(Transport.java:155) at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:535) at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:790) at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:649) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Any ideas on how to deal with this ? 
thanks -- Franc Carter | Systems architect | Sirca Ltd franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
Re: nodetool removetoken
On Wed, Feb 15, 2012 at 9:25 AM, Rob Coli rc...@palominodb.com wrote: On Tue, Feb 14, 2012 at 2:02 PM, Franc Carter franc.car...@sirca.org.auwrote: On Wed, Feb 15, 2012 at 8:49 AM, Brandon Williams dri...@gmail.comwrote: Before 1.0.8, use https://issues.apache.org/jira/browse/CASSANDRA-3337 to remove it. I'm missing something ;-( I don't see a solution in this link . . The solution is a patch : https://issues.apache.org/jira/secure/attachment/12500248/3337.txt If you apply this patch to your cassandra server, it will generate a JMX endpoint which will allow you to kill the token. Ahh - thanks cheers =Rob -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
Re: keycache persisted to disk ?
2012/2/13 R. Verlangen ro...@us2.nl This is because of the warm up of Cassandra as it starts. On a start it will start fetching the rows that were cached: this will have to be loaded from the disk, as there is nothing in the cache yet. You can read more about this at http://wiki.apache.org/cassandra/LargeDataSetConsiderations I actually have the opposite 'problem'. I have a pair of servers that have been static since mid last week, but have seen performance vary significantly (x10) for exactly the same query. I hypothesised it was various caches so I shut down Cassandra, flushed the O/S buffer cache and then brought it back up. The performance wasn't significantly different to the pre-flush performance cheers 2012/2/13 Franc Carter franc.car...@sirca.org.au On Mon, Feb 13, 2012 at 5:03 PM, zhangcheng zhangch...@jike.com wrote: ** I think the keycaches and row caches are both persisted to disk when shut down, and restored from disk when restarted, which then improves the performance. Thanks - that would explain at least some of what I am seeing cheers 2012-02-13 -- zhangcheng -- *From:* Franc Carter *Sent:* 2012-02-13 13:53:56 *To:* user *Cc:* *Subject:* keycache persisted to disk ? Hi, I am testing Cassandra on Amazon and finding performance can vary fairly wildly. I'm leaning towards it being an artifact of the AWS I/O system but have one other possibility. Are keycaches persisted to disk and restored on a clean shutdown and restart ? cheers -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
Re: keycache persisted to disk ?
On Mon, Feb 13, 2012 at 7:21 PM, Peter Schuller peter.schul...@infidyne.com wrote: I actually have the opposite 'problem'. I have a pair of servers that have been static since mid last week, but have seen performance vary significantly (x10) for exactly the same query. I hypothesised it was various caches so I shut down Cassandra, flushed the O/S buffer cache and then brought it back up. The performance wasn't significantly different to the pre-flush performance I don't get this thread at all :) Why would restarting with clean caches be expected to *improve* performance? I was expecting it to reduce performance due to cleaning of keycache and O/S buffer cache - performance stayed roughly the same And why is key cache loading involved other than to delay start-up and hopefully pre-populating caches for better (not worse) performance? If you want to figure out why queries seem to be slow relative to normal, you'll need to monitor the behavior of the nodes. Look at disk I/O statistics primarily (everyone reading this running Cassandra who isn't intimately familiar with iostat -x -k 1 should go and read up on it right away; make sure you understand the utilization and avg queue size columns), CPU usage, whether compaction is happening, etc. Yep - I've been looking at these - I don't see anything in iostat/dstat etc. that points strongly to a problem. There is quite a bit of I/O load, but it looks roughly uniform on slow and fast instances of the queries. The last compaction ran 4 days ago - which was before I started seeing variable performance One easy way to see sudden bursts of poor behavior is to be heavily reliant on cache, and then have sudden decreases in performance due to compaction evicting data from page cache while also generating more I/O. Unlikely to be a cache issue - In one case an immediate second run of exactly the same query performed significantly worse. But that's total speculation. It is also the case that you cannot expect consistent performance on EC2 and that might be it. Variable performance from EC2 is my lead theory at the moment. But my #1 advice: Log into the node while it is being slow, and observe. Figure out what the bottleneck is. iostat, top, nodetool tpstats, nodetool netstats, nodetool compactionstats. I know why it is slow - it's clearly I/O bound. I am trying to hunt down why it is sometimes much faster even though I have (tried) to replicate the same conditions -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com) -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
Re: keycache persisted to disk ?
2012/2/13 R. Verlangen ro...@us2.nl I also noticed that, Cassandra appears to perform better under a continues load. Are you sure the rows you're quering are actually in the cache? I'm making an assumption . . . I don't yet know enough about cassandra to prove they are in the cache. I have my keycache set to 2 million, and am only querying ~900,000 keys. so after the first time I'm assuming they are in the cache. cheers 2012/2/13 Franc Carter franc.car...@sirca.org.au 2012/2/13 R. Verlangen ro...@us2.nl This is because of the warm up of Cassandra as it starts. On a start it will start fetching the rows that were cached: this will have to be loaded from the disk, as there is nothing in the cache yet. You can read more about this at http://wiki.apache.org/cassandra/LargeDataSetConsiderations I actually has the opposite 'problem'. I have a pair of servers that have been static since mid last week, but have seen performance vary significantly (x10) for exactly the same query. I hypothesised it was various caches so I shut down Cassandra, flushed the O/S buffer cache and then bought it back up. The performance wasn't significantly different to the pre-flush performance cheers 2012/2/13 Franc Carter franc.car...@sirca.org.au On Mon, Feb 13, 2012 at 5:03 PM, zhangcheng zhangch...@jike.comwrote: ** I think the keycaches and rowcahches are bothe persisted to disk when shutdown, and restored from disk when restart, then improve the performance. Thanks - that would explain at least some of what I am seeing cheers 2012-02-13 -- zhangcheng -- *发件人:* Franc Carter *发送时间:* 2012-02-13 13:53:56 *收件人:* user *抄送:* *主题:* keycache persisted to disk ? Hi, I am testing Cassandra on Amazon and finding performance can vary fairly wildly. I'm leaning towards it being an artifact of the AWS I/O system but have one other possibility. Are keycaches persisted to disk and restored on a clean shutdown and restart ? cheers -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
Re: keycache persisted to disk ?
On Mon, Feb 13, 2012 at 7:49 PM, Peter Schuller peter.schul...@infidyne.com wrote: I'm making an assumption . . . I don't yet know enough about cassandra to prove they are in the cache. I have my keycache set to 2 million, and am only querying ~900,000 keys. so after the first time I'm assuming they are in the cache. Note that the key cache only caches the index positions in the data file, and not the actual data. The key cache will only ever eliminate the I/O that would have been required to lookup the index entry; it doesn't help to eliminate seeking to get the data (but as usual, it may still be in the operating system page cache). Yep - I haven't enabled row caches, my calculations at the moment indicate that the hit-ratio won't be great - but I'll be testing that later -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com) -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
Re: keycache persisted to disk ?
On Mon, Feb 13, 2012 at 7:48 PM, Peter Schuller peter.schul...@infidyne.com wrote: Yep - I've been looking at these - I don't see anything in iostat/dstat etc that point strongly to a problem. There is quite a bit of I/O load, but it looks roughly uniform on slow and fast instances of the queries. The last compaction ran 4 days ago - which was before I started seeing variable performance [snip] I now why it is slow - it's clearly I/O bound. I am trying to hunt down why it is sometimes much faster even though I have (tried) to replicate the same conditions What does clearly I/O bound mean, and what is quite a bit of I/O load? the servers spending 50% of the time in io-wait In general, if you have queries that come in at some rate that is determined by outside sources (rather than by the time the last query took to execute), That's an interesting approach - is that likely to give close to optimal performance ? you will typically either get more queries than your cluster can take, or fewer. If fewer, there is a non-trivially sized grey area where overall I/O throughput needed is lower than that available, but the closer you are to capacity the more often requests have to wait for other I/O to complete, for purely statistical reasons. If you're running close to maximum capacity, it would be expected that the variation in query latency is high. That may well explain it - I'll have to think about what that means for our use case as load will be extremely bursty That said, if you're seeing consistently bad latencies for a while where you sometimes see consistently good latencies, that sounds different but would hopefully be observable somehow. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com) -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
Re: keycache persisted to disk ?
On Mon, Feb 13, 2012 at 7:51 PM, Peter Schuller peter.schul...@infidyne.com wrote: For one thing, what does ReadStage's pending look like if you repeatedly run nodetool tpstats on these nodes? If you're simply bottlenecking on I/O on reads, that is the easiest and most direct way to observe this empirically. If you're saturated, you'll see active close to maximum at all times, and pending racking up consistently. If you're just close, you'll likely see spikes sometimes. Yep, the ReadStage is backlogging consistently - but the thing I am trying to explain is why it is good sometimes in an environment that is pretty well controlled - other than being on EC2 -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com) -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
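A small sketch of the repeated tpstats check Peter describes. It assumes nodetool is on the PATH and that ReadStage's Active and Pending values are the second and third columns of the 1.0.x tpstats output; adjust the column indexes if your version prints a different layout.

    # Repeated tpstats check: print ReadStage active/pending once a second so
    # saturation shows up as pending racking up. Assumes nodetool is on the PATH
    # and that Active and Pending are the 2nd and 3rd columns of tpstats output.
    import subprocess
    import time

    def read_stage(host="localhost"):
        out = subprocess.check_output(["nodetool", "-h", host, "tpstats"],
                                      universal_newlines=True)
        for line in out.splitlines():
            if line.startswith("ReadStage"):
                cols = line.split()
                return "active=%s pending=%s" % (cols[1], cols[2])
        return "ReadStage not found"

    for _ in range(60):
        print("%s %s" % (time.strftime("%H:%M:%S"), read_stage()))
        time.sleep(1)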
Re: keycache persisted to disk ?
On Mon, Feb 13, 2012 at 8:00 PM, Peter Schuller peter.schul...@infidyne.com wrote: What is your total data size (nodetool info/nodetool ring) per node, your heap size, and the amount of memory on the system? 2 node cluster, 7.9GB of RAM (EC2 m1.large) RF=2 11GB per node Quorum reads 122 million keys heap size is 1867M (default from the AMI I am running) I'm reading about 900k keys As I was just going through cfstats - I noticed something I don't understand Key cache capacity: 906897 Key cache size: 906897 I set the key cache to 2 million, it's somehow got to a rather odd number -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com) -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
Re: keycache persisted to disk ?
On Mon, Feb 13, 2012 at 8:09 PM, Peter Schuller peter.schul...@infidyne.com wrote: the servers spending 50% of the time in io-wait Note that I/O wait is not necessarily a good indicator, depending on situation. In particular if you have multiple drives, I/O wait can mostly be ignored. Similarly if you have non-trivial CPU usage in addition to disk I/O, it is also not a good indicator. I/O wait is essentially giving you the amount of time CPU:s spend doing nothing because the only processes that would otherwise be runnable are waiting on disk I/O. But even a single process waiting on disk I/O - lots of I/O wait even if you have 24 drives. Yep - user space cpu is 20% or much worse when the io-wait goes in to the 90's - looks a great deal like IO bottleknecks The per-disk % utilization is generally a much better indicator (assuming no hardware raid device, and assuming no SSD), along with the average queue size. I doubt that figure is available sensibly in an ec2 instance In general, if you have queries that come in at some rate that is determined by outside sources (rather than by the time the last query took to execute), That's an interesting approach - is that likely to give close to optimal performance ? I just mean that it all depends on the situation. If you have, for example, some N number of clients that are doing work as fast as they can, bottlenecking only on Cassandra, you're essentially saturating the Cassandra cluster no matter what (until the client/network becomes a bottleneck). Under such conditions (saturation) you generally never should expect good latencies. For most non-batch job production use-cases, you tend to have incoming requests driven by something external such as user behavior or automated systems not related to the Cassandra cluster. In this cases, you tend to have a certain amount of incoming requests at any given time that you must serve within a reasonable time frame, and that's where the question comes in of how much I/O you're doing in relation to maximum. For good latencies, you always want to be significantly below maximum - particularly when platter based disk I/O is involved. That may well explain it - I'll have to think about what that means for our use case as load will be extremely bursty To be clear though, even your typical un-bursty load is still bursty once you look at it at sufficient resolution, unless you have something specifically ensuring that it is entirely smooth. A completely random distribution over time for example would look very even on almost any graph you can imagine unless you have sub-second resolution, but would still exhibit un-evenness and have an affect on latency. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com) -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
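For the per-disk view Peter recommends, a small helper along these lines can be used. It shells out to sysstat's iostat; the column names %util and avgqu-sz come from the extended output and may differ between sysstat versions, so the parsing is an assumption rather than a guarantee.

    # Per-disk view: run one interval of iostat -x -k and report %util and
    # avgqu-sz per device. Column names come from sysstat's extended output and
    # can differ between versions.
    import subprocess

    def disk_pressure(interval=5):
        out = subprocess.check_output(["iostat", "-x", "-k", str(interval), "2"],
                                      universal_newlines=True)
        lines = out.strip().splitlines()
        start = max(i for i, l in enumerate(lines) if l.startswith("Device"))
        header = lines[start].split()
        util = header.index("%util")
        queue = next(i for i, h in enumerate(header) if h.startswith("avgqu"))
        for line in lines[start + 1:]:
            cols = line.split()
            if cols:
                print("%s util=%s%% avgqu-sz=%s" % (cols[0], cols[util], cols[queue]))

    disk_pressure()

The second report from iostat (the one after the interval) is the interesting one; the first report averages over the whole uptime, which is why the sketch skips to the last Device header.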
Re: keycache persisted to disk ?
On Mon, Feb 13, 2012 at 8:15 PM, Peter Schuller peter.schul...@infidyne.com wrote: 2 node cluster, 7.9GB of RAM (EC2 m1.large) RF=2 11GB per node Quorum reads 122 million keys heap size is 1867M (default from the AMI I am running) I'm reading about 900k keys Ok, so basically a very significant portion of the data fits in page cache, but not all. yep As I was just going through cfstats - I noticed something I don't understand Key cache capacity: 906897 Key cache size: 906897 I set the key cache to 2 million, it's somehow got to a rather odd number You're on 1.0+? yep, 1.0.7 Nowadays there is code to actively make caches smaller if Cassandra detects that you seem to be running low on heap. Watch cassandra.log for messages to that effect (don't remember the exact message right now). I just grep'd the logs and couldn't see anything that looked like that -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com) -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
Re: active/pending queue lengths
On Tue, Feb 14, 2012 at 6:06 AM, aaron morton aa...@thelastpickle.comwrote: What CL are you reading at ? Quorum Write ops go to RF number of nodes, read ops go to RF number of nodes 10% (the default probability that Read Repair will be running) of the time and CL number of nodes 90% of the time. With 2 nodes and RF 2 the QUOURM is 2, every request will involve all nodes. Yep, the thing tat confuses is the different behaviour for reading from one node versus two As to why the pending list gets longer, do you have some more info ? What process are you using to measure ? It's hard to guess why. In this setup every node will have the data and should be able to do a local read and then on the other node. I have four pycassa clients, two making requests to one server and two making requests to the other (or all four making requests to the same server). The requested keys don't overlap and I would expect/assume the keys are in the keycache I am looking at the output of nodetool -h tpstats cheers Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 14/02/2012, at 12:47 AM, Franc Carter wrote: Hi, I've been looking at tpstats as various test queries run and I noticed something I don't understand. I have a two node cluster with RF=2 on which I run 4 parallel queries, each job goes through a list of keys doing a multiget for 2 keys at a time. If two of the queries go to one node and the other two go to a different node then the pending queue on the node gets much longer than if they all go to the one node. I'm clearly missing something here as I would have expected the opposite cheers -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
keycache persisted to disk ?
Hi, I am testing Cassandra on Amazon and finding performance can vary fairly wildly. I'm leaning towards it being an artifact of the AWS I/O system but have one other possibility. Are keycaches persisted to disk and restored on a clean shutdown and restart ? cheers -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
Re: keycache persisted to disk ?
On Mon, Feb 13, 2012 at 5:03 PM, zhangcheng zhangch...@jike.com wrote: ** I think the keycaches and row caches are both persisted to disk when shut down, and restored from disk when restarted, which then improves the performance. Thanks - that would explain at least some of what I am seeing cheers 2012-02-13 -- zhangcheng -- *From:* Franc Carter *Sent:* 2012-02-13 13:53:56 *To:* user *Cc:* *Subject:* keycache persisted to disk ? Hi, I am testing Cassandra on Amazon and finding performance can vary fairly wildly. I'm leaning towards it being an artifact of the AWS I/O system but have one other possibility. Are keycaches persisted to disk and restored on a clean shutdown and restart ? cheers -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
Re: sensible data model ?
On Wed, Feb 8, 2012 at 6:05 AM, aaron morton aa...@thelastpickle.comwrote: None of those jump out at me as horrible for my case. If I modelled with Super Columns I would have less than 10,000 Super Columns with an average of 50 columns - big but no insane ? I would still try to do it without super columns. The common belief is they are about 10% slower, and they are a lot clunkier. There are some query and delete cases where they do things composite columns cannot, but in general I try to model things without using them first. Ok - it seems cleaner to model without them to me as well. Because of request overhead ? I'm currently using the batch interface of pycassa to do bulk reads. Is the same problem going to bite me if I have many clients reading (using bulk reads) ? In production we will have ~50 clients. pycassa has support for chunking requests to the server https://github.com/pycassa/pycassa/blob/master/pycassa/columnfamily.py#L633 It's because each row requested becomes a read task on the server and is placed into the read thread pool. There are only 32 (default) read thread in the pool. If one query comes along and requests 100 rows, it places 100 tasks in the thread pool where only 32 can be processed at a time. Some will back up as pending tasks and eventually be processed. If row reads reads take 1ms (just to pick a number, may be better) to read 100 rows we are talking about 3 or 4ms for that query. During that time any read requests received will have to wait for read threads. To that client this is excellent, it's has a high row throughput. To the other clients this is not, overall query throughput will drop. More is not always better. Note that as the number of nodes increases and this effect is may be reduced as reading 100 rows may result in the coordinator sending 25 row requests to 4 nodes. And there is also overhead involved in very big requests, see… http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Read-Latency-td5636553.html#a5652476 thanks Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 7/02/2012, at 2:28 PM, Franc Carter wrote: On Tue, Feb 7, 2012 at 6:39 AM, aaron morton aa...@thelastpickle.comwrote: Sounds like a good start. Super columns are not a great fit for modeling time series data for a few reasons, here is one http://wiki.apache.org/cassandra/CassandraLimitations None of those jump out at me as horrible for my case. If I modelled with Super Columns I would have less than 10,000 Super Columns with an average of 50 columns - big but no insane ? It's also a good idea to partition time series data so that the rows do not grow too big. You can have 2 billion columns in a row, but big rows have operational down sides. You could go with either: rows: entity_id:date column: property_name Which would mean each time your query for a date range you need to query multiple rows. But it is possible to get a range of columns / properties. Or rows: entity_id:time_partition column: date:property_name That's an interesting idea - I'll talk to the data experts to see if we have a sensible range. Where time_partition is something that makes sense in your problem domain, e.g. a calendar month. If you often query for days in a month you can then get all the columns for the days you are interested in (using a column range). 
If you only want to get a subset of the entity properties you will need to get them all and filter them client side; depending on the number and size of the properties this may be more efficient than multiple calls. I'm fine with doing work on the client side - I have a bias in that direction as it tends to scale better. One word of warning, avoid sending read requests for lots (i.e. 100's) of rows at once - it will reduce overall query throughput. Some clients like pycassa take care of this for you. Because of request overhead ? I'm currently using the batch interface of pycassa to do bulk reads. Is the same problem going to bite me if I have many clients reading (using bulk reads) ? In production we will have ~50 clients. thanks Good luck. - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 5/02/2012, at 12:12 AM, Franc Carter wrote: Hi, I'm pretty new to Cassandra and am currently doing a proof of concept, and thought it would be a good idea to ask if my data model is sane . . . The data I have, and need to query, is reasonably simple. It consists of about 10 million entities, each of which have a set of key/value properties for each day for about 10 years. The number of keys is in the 50-100 range and there will be a lot of overlap for keys in entity,days The queries I need to make are for sets of key/value properties for an entity on a day, e.g key1,keys2,key3 for 10 entities on 20 days. The number
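Putting Aaron's second layout and the chunking advice together, a hedged pycassa sketch might look like the following. All names, the monthly time partition and the chunk size are illustrative choices rather than anything fixed by the thread.

    # Sketch of the second layout (row key = entity_id:time_partition, column
    # name = date:property_name) plus client-side chunking of big multigets.
    # All names, the monthly partition and the chunk size are illustrative.
    import pycassa

    pool = pycassa.ConnectionPool("MyKeyspace", server_list=["node1:9160"])
    cf = pycassa.ColumnFamily(pool, "Properties")

    def row_key(entity_id, date):                # date as "YYYY-MM-DD"
        return "%s:%s" % (entity_id, date[:7])   # partition by calendar month

    def write_day(entity_id, date, props):
        cf.insert(row_key(entity_id, date),
                  dict(("%s:%s" % (date, k), v) for k, v in props.items()))

    def read_days(entity_id, dates, chunk=20):
        keys = sorted(set(row_key(entity_id, d) for d in dates))
        rows = {}
        for i in range(0, len(keys), chunk):     # keep per-request row counts modest
            rows.update(cf.multiget(keys[i:i + chunk]))
        return rows

Keeping each multiget to a modest number of rows matches the warning above: every requested row becomes a task in the read stage, so one huge request can starve other clients.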
Re: sensible data model ?
On Tue, Feb 7, 2012 at 6:39 AM, aaron morton aa...@thelastpickle.comwrote: Sounds like a good start. Super columns are not a great fit for modeling time series data for a few reasons, here is one http://wiki.apache.org/cassandra/CassandraLimitations None of those jump out at me as horrible for my case. If I modelled with Super Columns I would have less than 10,000 Super Columns with an average of 50 columns - big but no insane ? It's also a good idea to partition time series data so that the rows do not grow too big. You can have 2 billion columns in a row, but big rows have operational down sides. You could go with either: rows: entity_id:date column: property_name Which would mean each time your query for a date range you need to query multiple rows. But it is possible to get a range of columns / properties. Or rows: entity_id:time_partition column: date:property_name That's an interesting idea - I'll talk to the data experts to see if we have a sensible range. Where time_partition is something that makes sense in your problem domain, e.g. a calendar month. If you often query for days in a month you can then get all the columns for the days you are interested in (using a column range). If you only want to get a sub set of the entity properties you will need to get them all and filter them client side, depending on the number and size of the properties this may be more efficient than multiple calls. I'm find with doing work on the client side - I have a bias in that direction as it tends to scale better. One word of warning, avoid sending read requests for lots (i.e. 100's) of rows at once it will reduce overall query throughput. Some clients like pycassa take care of this for you. Because of request overhead ? I'm currently using the batch interface of pycassa to do bulk reads. Is the same problem going to bite me if I have many clients reading (using bulk reads) ? In production we will have ~50 clients. thanks Good luck. - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 5/02/2012, at 12:12 AM, Franc Carter wrote: Hi, I'm pretty new to Cassandra and am currently doing a proof of concept, and thought it would be a good idea to ask if my data model is sane . . . The data I have, and need to query, is reasonably simple. It consists of about 10 million entities, each of which have a set of key/value properties for each day for about 10 years. The number of keys is in the 50-100 range and there will be a lot of overlap for keys in entity,days The queries I need to make are for sets of key/value properties for an entity on a day, e.g key1,keys2,key3 for 10 entities on 20 days. The number of entities and/or days in the query could be either very small or very large. I've modeled this with a simple column family for the keys with the row key being the concatenation of the entity and date. My first go, used only the entity as the row key and then used a supercolumn for each date. I decided against this mostly because it seemed more complex for a gain I didn't really understand. Does this seem sensible ? 
thanks -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215
sensible data model ?
Hi, I'm pretty new to Cassandra and am currently doing a proof of concept, and thought it would be a good idea to ask if my data model is sane . . . The data I have, and need to query, is reasonably simple. It consists of about 10 million entities, each of which have a set of key/value properties for each day for about 10 years. The number of keys is in the 50-100 range and there will be a lot of overlap for keys in entity,days The queries I need to make are for sets of key/value properties for an entity on a day, e.g key1,keys2,key3 for 10 entities on 20 days. The number of entities and/or days in the query could be either very small or very large. I've modeled this with a simple column family for the keys with the row key being the concatenation of the entity and date. My first go, used only the entity as the row key and then used a supercolumn for each date. I decided against this mostly because it seemed more complex for a gain I didn't really understand. Does this seem sensible ? thanks -- *Franc Carter* | Systems architect | Sirca Ltd marc.zianideferra...@sirca.org.au franc.car...@sirca.org.au | www.sirca.org.au Tel: +61 2 9236 9118 Level 9, 80 Clarence St, Sydney NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215