Re: General question regarding bootstrap and nodetool repair
On Thu, Jan 31, 2013 at 12:19 PM, Wei Zhu wz1...@yahoo.com wrote:

> But I am still not sure about my first question regarding the bootstrap, anyone?

As I understand it, bootstrap occurs from a single replica. Which replica is chosen is based on some internal estimation of which is closest/least loaded/etc. But because it streams from only a single replica, with RF=3 you still have to run a repair in order to be consistent with both of the other replicas.

=Rob
--
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb
Re: initial_token
On Thu, Jan 31, 2013 at 12:17 PM, Edward Capriolo edlinuxg...@gmail.com wrote:

> Now by default a new partitioner is chosen, Murmur3.

"Now" = as of 1.2, to be unambiguous.

=Rob
Re: General question regarding bootstrap and nodetool repair
On Thu, Jan 31, 2013 at 3:31 PM, Wei Zhu wz1...@yahoo.com wrote:

> The only reason I can think of is that the new node has the same IP as the dead node we tried to replace? After reading the bootstrap code, it shouldn't be the case. Is it a bug? Or has anyone tried to replace a dead node with the same IP?

You can use the replace_token property to accomplish this. I would expect Cassandra to get confused by having two nodes with the same IP.

=Rob
Re: Suggestion: Move some threads to the client-dev mailing list
On Wed, Jan 30, 2013 at 7:21 AM, Edward Capriolo edlinuxg...@gmail.com wrote:

> My suggestion: At minimum we should re-route these questions to client-dev or simply say, "If it is not part of core Cassandra, you are looking in the wrong place for support."

+1. I find myself scanning past all those questions in order to find questions I am able to answer based solely on my operational knowledge of the Cassandra daemon.

=Rob
Re: Cassandra timeout whereas it is not much busy
On Wed, Jan 16, 2013 at 1:30 PM, Nicolas Lalevée nicolas.lale...@hibnet.org wrote:

> Here is the long story. After some long useless staring at the monitoring graphs, I gave a try to using the openjdk 6b24 rather than openjdk 7u9

OpenJDK 6 and 7 are both counter-recommended for use with Cassandra. I've heard reports of mysterious behavior like the behavior you describe when using OpenJDK 7. Try using the Sun/Oracle JVM? Is your JNA working?

=Rob
Re: node down = log explosion?
On Tue, Jan 22, 2013 at 5:03 AM, Sergey Olefir solf.li...@gmail.com wrote:

> I am load-testing counter increments at the rate of about 10k per second.

Do you need highly performant counters that count accurately, without a meaningful chance of over-count? If so, Cassandra's counters are probably not ideal.

> We wanted to test what happens if one node goes down, so we brought one node down in DC1 (i.e. the node that was handling half of the incoming writes). ... This led to a complete explosion of logs on the remaining alive node in DC1.

I agree, this level of exception logging during replicateOnWrite (which is called every time a counter is incremented) seems like a bug. I would file a bug at the Apache JIRA.

=Rob
Re: node down = log explosion?
On Tue, Jan 22, 2013 at 2:57 PM, Sergey Olefir solf.li...@gmail.com wrote:

> Do you have a suggestion as to what could be a better fit for counters? Something that can also replicate across DCs and survive link breakdown between nodes (across DCs)? (and no, I don't need 100.00% precision (although it would be nice obviously), I just need to be pretty close for the values of "pretty")

In that case, Cassandra counters are probably fine.

> On the subject of bug report -- I probably will -- but I'll wait a bit for more info here, perhaps there's some configuration or something that I just don't know about.

Throwing exceptions in the replicateOnWrite stage seems pretty unambiguous to me, and unexpected. YMMV?

=Rob
Re: is there a way to list who is connected to my cluster?
On Fri, Jan 11, 2013 at 10:32 AM, Brian Tarbox tar...@cabotresearch.com wrote:

> I'd like to be able to find out which processes are connected to my cluster... is there a way to do this?

No, not internally to Cassandra, short of enabling DEBUG logging for the associated classes. Use netstat or similar. If you are interested in such a feature, please log into Cassandra's JIRA and vote for this issue:

https://issues.apache.org/jira/browse/CASSANDRA-5084 ("Cassandra should expose connected client state via JMX")

=Rob
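As a rough sketch of the netstat approach: the snippet below counts the distinct remote hosts holding established connections to the Thrift port (9160). The `sample_netstat` function simulates `netstat -tn` output so the pipeline is runnable anywhere; on a live node you would pipe the real `netstat -tn` (or `ss -tn`) instead. IPs are invented for illustration.

```shell
# Simulated `netstat -tn` output (Proto Recv-Q Send-Q Local Foreign State).
# On a real node, replace this with:  netstat -tn
sample_netstat() {
cat <<'EOF'
tcp 0 0 10.0.0.1:9160 10.0.0.7:53211 ESTABLISHED
tcp 0 0 10.0.0.1:9160 10.0.0.8:53984 ESTABLISHED
tcp 0 0 10.0.0.1:22   10.0.0.9:40112 ESTABLISHED
EOF
}

# Keep only ESTABLISHED connections whose local port is 9160,
# then print the distinct remote hosts.
sample_netstat \
  | awk '$6 == "ESTABLISHED" && $4 ~ /:9160$/ {split($5, a, ":"); print a[1]}' \
  | sort -u
```

This won't tell you *which process* on the remote host is connected, only the host; for that you would need to run `lsof`/`netstat -p` on the client side.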
Re: Script to load sstables from v1.0.x to v 1.1.x
On Tue, Jan 8, 2013 at 8:41 AM, Todd Nine todd.n...@gmail.com wrote:

> I have recently been trying to restore backups from a v1.0.x cluster we have into a 1.1.7 cluster. This has not been as trivial as I expected, and I've had a lot of help from the IRC channel in tackling this problem. As a way of saying thanks, I'd like to contribute the updated ruby script I was originally given for accomplishing this task. Here it is.

While I laud your contribution, I am still not fully understanding why this is not working automagically, as it should:

http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-1-flexible-data-file-placement

"What about upgrading? Do you need to manually move all pre-1.1 data files to the new directory structure before upgrading to 1.1? No. Immediately after Cassandra 1.1 starts, it checks to see whether it has the old directory structure and migrates all data files (including backups and snapshots) to the new directory structure if needed. So, just upgrade as you always do (don't forget to read NEWS.txt first), and you will get more control over data files for free."

Is it possible that, for example, the installation of the debian package results in your 1.1.x node starting up before you intend it to.. and then when you start it again with the 1.0 paths, it doesn't try to change the paths?

"* To check if sstables needs migration, we look at the System directory. If it contains a directory for the status cf, we'll attempt a sstable migrating."

This quote from Directories.java (thx driftx!) suggests that any starting of a 1.1 node, which would result in a Status columnfamily being created, would make sstablesNeedsMigration return false. If this is your case, due to the use of the debian package or similar which auto-starts, your input is welcomed at:

https://issues.apache.org/jira/browse/CASSANDRA-2356

=Rob
Re: Script to load sstables from v1.0.x to v 1.1.x
On Tue, Jan 8, 2013 at 11:56 AM, Todd Nine todd.n...@gmail.com wrote:

> Our current production cluster is still on 1.0.x, so we can either fire up a 1.0.x cluster, then upgrade every node to accomplish this, or just use the script.

No 1.0 cluster is required: restore the 1.0 directory structure to a 1.1 cluster and the tables will be migrated by Cassandra. The 1.1 node should look at the 1.0 directory structure you just restored and migrate it automagically.

> We also have a different number of nodes in stage vs production, so we'd still need to run a repair if we did a straight sstable copy.

This is a compelling reason to bulk load. My commentary merely points out that if you *aren't* changing cluster size/topology, Cassandra 1.1 should be migrating the sstables for you. :)

=Rob
Re: replace_token versus nodetool repair
On Mon, Jan 7, 2013 at 9:05 AM, DE VITO Dominique dominique.dev...@thalesgroup.com wrote:

> Is nodetool repair only usable if the node to repair has a valid (= up-to-date with its neighbors) schema?

If the node is in the cluster, it should have the correct schema. If it doesn't have the correct schema, you should either wait until the schema is received, or (if it's stuck) wipe the schema on that node and re-join the node.

> If the data records are completely broken on a node with token, is it valid to clean the (data) records and to execute replace_token=token on the *same* node?

Yes.

=Rob
Re: Inter-Cluster Communication
On Wed, Jan 2, 2013 at 4:33 AM, Everton Lima peitin.inu...@gmail.com wrote:

> I would like to know if it is possible to create 2 clusters, the first containing just meta-data and the second just the real data. How would the system communicate with both clusters, and one cluster communicate with the other? Could anyone help me?

Cassandra does not have a mechanism for clusters to talk to each other. Your application could talk to both clusters, but they would be independent of each other.

=Rob
Re: Very large HintsColumnFamily
Before we start.. what version of Cassandra?

On Fri, Dec 21, 2012 at 4:25 PM, Keith Wright kwri...@nanigans.com wrote:

> This behavior seems to occur if I do a large amount of data loading using that node as the coordinator node.

In general you want to use all nodes to coordinate, not a single one.

> Nodetool netstats never seems to show any streaming data. With past nodes it seemed like the node eventually fixed itself.

That node is storing hints for other nodes it believes are, or were at some point, in DOWN state. The first step to preventing this problem from recurring is to understand why it believes/d other nodes are down. My conjecture is that you are overloading the coordinating node and/or other nodes with the large volume of writes.

> Note that I am seeing severely degraded performance on this node when it attempts to compact the HintsColumnFamily, to the point where I had to set setcompactionthroughput to 999 to ensure it doesn't run again (after which the node started serving requests much faster).

Depending on version, your 40gb of hints could be in one 40gb wide row. Look at nodetool cfstats for HintsColumnFamily to determine whether this is the case. Do you see "Timed out replaying hint" messages, or are the hints being successfully delivered?

You have two broad options:

1) purge your hints, and then either reload the data (if reloading it will be idempotent) or repair -pr on every node in the cluster.
2) reduce load enough that hints will be successfully delivered, reduce gc_grace_seconds on the hints cf to 0, and then do a major compaction.

If I were you, I would probably do 1). The easiest way is to stop the node and remove all sstables in the HintsColumnFamily.

=Rob
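A minimal sketch of option 1), the "stop the node and remove all sstables in the HintsColumnFamily" step. The directory layout and filenames below are illustrative stand-ins built in a temp directory so the commands are safe to run anywhere; on a real node the hints CF lives under the configured data_file_directories (commonly /var/lib/cassandra/data/system/HintsColumnFamily/), and the node must be stopped first.

```shell
# Stand-in for the real data directory (illustrative only).
DATA_DIR=$(mktemp -d)
mkdir -p "$DATA_DIR/system/HintsColumnFamily"

# Fake sstable components for the hints CF, mimicking real naming.
touch "$DATA_DIR/system/HintsColumnFamily/system-HintsColumnFamily-hd-1-Data.db"
touch "$DATA_DIR/system/HintsColumnFamily/system-HintsColumnFamily-hd-1-Index.db"

# With the node STOPPED, remove every sstable component for the hints CF:
rm -f "$DATA_DIR"/system/HintsColumnFamily/*.db

# Directory should now be empty; restart the node, then repair -pr
# everywhere (or reload the data) as described above.
ls -A "$DATA_DIR/system/HintsColumnFamily"
```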
Re: question about config leading to an unbalanced ring
On Thu, Dec 20, 2012 at 10:18 AM, DE VITO Dominique dominique.dev...@thalesgroup.com wrote:

> With RF=3 and NetworkTopologyStrategy, "The first replica per data center is placed according to the partitioner (same as with SimpleStrategy). Additional replicas in the same data center are then determined by walking the ring clockwise until a node in a different rack from the previous replica is found." So, if I understand correctly, the data of rack1's 5 nodes will be replicated on the single node of rack2. And then, the node of rack1 will host all the data of the cluster.

https://issues.apache.org/jira/browse/CASSANDRA-3810

=Rob
Re: Monitoring the number of client connections
On Wed, Dec 19, 2012 at 4:20 PM, Hiller, Dean dean.hil...@nrel.gov wrote:

> What? I thought cassandra was using nio so thread per connection is not true?

Here's the monkey test I used to verify my conjecture:

1) ps -eLf | grep jsvc | grep cassandra | wc -l  # note number of threads
2) for name in {1..300}; do cassandra-cli -h `hostname` -k validkeyspace & done
3) ps -eLf | grep jsvc | grep cassandra | wc -l  # note much higher number of threads
4) for name in {1..300}; do kill %$name; done
5) ps -eLf | grep jsvc | grep cassandra | wc -l  # note that thread count drops like a rock as connections are GCed

Via aaron_morton, here's the relevant chunk of cassandra.yaml:

https://github.com/apache/cassandra/blob/trunk/conf/cassandra.yaml#L347

# sync - One thread per thrift connection. ..
# hsha - Stands for "half synchronous, half asynchronous." ..
# The default is sync because on Windows hsha is about 30% slower. On Linux,
# sync/hsha performance is about the same, with hsha of course using less memory.

So, by default Cassandra does in fact use one thread per thrift connection.

=Rob
Re: Monitoring the number of client connections
On Thu, Dec 20, 2012 at 12:41 PM, Rob Coli rc...@palominodb.com wrote:

> So, by default Cassandra does in fact use one thread per thrift connection.

Also of note: even with hsha, an *active* connection (where the synchronous storage backend is doing something) consumes a thread. Some more background at:

https://issues.apache.org/jira/browse/CASSANDRA-4277

=Rob
Re: State of Cassandra and Java 7
On Thu, Dec 13, 2012 at 11:43 AM, Drew Kutcharian d...@venarc.com wrote:

> With Java 6 being EOL-ed soon (https://blogs.oracle.com/java/entry/end_of_public_updates_for), what's the status of Cassandra's Java 7 support? Anyone using it in production? Any outstanding *known* issues?

I'd love to see an official statement from the project, due to the sort of EOL issues you're referring to. Unfortunately, previous requests on this list for such a statement have gone unanswered. The non-official response is that various people run in production with Java 7 and it seems to work. :)

=Rob
Re: Help on MMap of SSTables
On Thu, Dec 6, 2012 at 7:36 PM, aaron morton aa...@thelastpickle.com wrote:

> So for memory mapped files, compaction can do a madvise SEQUENTIAL instead of the current DONTNEED flag after detecting appropriate OS versions. Will this help?

AFAIK compaction does use memory mapped file access. The history:

https://issues.apache.org/jira/browse/CASSANDRA-1470

=Rob
Re: reversed=true for CQL 3
On Thu, Dec 6, 2012 at 5:26 PM, Shahryar Sedghi shsed...@gmail.com wrote:

> I am on 1.1.4 now (I can go to 1.1.6 if needed) and apparently it is broken. I defined the table like this:

In general, people on 1.1.x below 1.1.6 should upgrade to at least 1.1.6 ASAP, because all versions of 1.1.x before 1.1.6 have broken Hinted Handoff.

> please let me know if I need to open a bug.

I unfortunately don't know whether what you are doing should work in 1.1.4 or not. If I were you, I'd give 1.1.6/7 a shot, and if it still doesn't work, probably file a bug. :)

=Rob
Re: What is substituting keys_cached column family argument
On Wed, Dec 5, 2012 at 9:06 AM, Roman Yankin ro...@cognitivematch.com wrote:

> In Cassandra v0.7 there was a column family property called keys_cached; now it's gone and I'm struggling to understand which of the below properties substitutes for it (if it is substituted at all)?

Key and row caches are global in modern Cassandra. You opt CFs out of the key cache, not opt in, because the default caching setting is keys_only on a per-CF basis.

http://www.datastax.com/docs/1.1/configuration/node_configuration#row-cache-keys-to-save
http://www.datastax.com/docs/1.1/configuration/node_configuration#key-cache-keys-to-save
http://www.datastax.com/docs/1.1/configuration/storage_configuration#caching

=Rob
Re: Rename cluster
On Thu, Nov 29, 2012 at 11:56 AM, Wei Zhu wz1...@yahoo.com wrote:

> I am trying to rename a cluster by following the instructions on the Wiki: [...] I have to remove the data directory in order to change the cluster name. Luckily it's my testing box, so no harm. Just wondering, what has changed to not allow the modification through cli? What is the way of changing the cluster name without wiping out all the data now?

The instructions on the wiki are wrong, because clients are now not able to update the system keyspace. The below ticket is an expansion of the previous behavior, which must mean the behavior in question dates to 0.8 or early 1.0.

https://issues.apache.org/jira/browse/CASSANDRA-3759

You don't need to remove the data directory to change the cluster name; you only need to remove the contents of the system keyspace's LocationInfo columnfamily, which is where the Cluster Name is stored.

Full, safe, offline process to rename a cluster:

1) put the new cluster name in the conf files
2) stop the cluster (including draining and removing the commitlog if you are on a version below 1.1.6 and cannot rely on drain to prevent log replay)
3) move the LocationInfo files out of the system keyspace on all nodes
4) start the cluster

I do not believe there is currently any online way to do the above operation. The check which prevents you from writing to the system keyspace is a keyspace-level check.

=Rob
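Step 3) of the offline rename can be sketched as below. The paths and sstable filenames are illustrative stand-ins built in a temp directory so the commands are harmless to run; on a real node the system keyspace lives under the configured data directory, the whole cluster must be stopped, and the new cluster_name must already be in cassandra.yaml on every node.

```shell
# Stand-in data directory with a fake system keyspace (illustrative only).
DATA_DIR=$(mktemp -d)
mkdir -p "$DATA_DIR/system" "$DATA_DIR/system.old"
touch "$DATA_DIR/system/LocationInfo-hd-1-Data.db" \
      "$DATA_DIR/system/LocationInfo-hd-1-Index.db" \
      "$DATA_DIR/system/Schema-hd-1-Data.db"   # unrelated system CF, untouched

# Move only the LocationInfo components aside (do this on every node):
mv "$DATA_DIR"/system/LocationInfo* "$DATA_DIR/system.old/"

# The rest of the system keyspace is left in place.
ls "$DATA_DIR/system"
```

Moving the files aside rather than deleting them outright leaves a trivial rollback path if the rename goes wrong.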
Re: counters + replication = awful performance?
On Tue, Nov 27, 2012 at 3:21 PM, Edward Capriolo edlinuxg...@gmail.com wrote:

> I misspoke really. It is not dangerous, you just have to understand what it means. This jira discusses it. https://issues.apache.org/jira/browse/CASSANDRA-3868

Per Sylvain on the referenced ticket:

"I don't disagree about the efficiency of the valve, but at what price? 'Bootstrapping a node will make you lose increments (you don't know which ones, you don't know how many and this even if nothing goes wrong)' is a pretty bad drawback. That is pretty much why that option makes me uncomfortable: it does give you better performance, so people may be tempted to use it. Now if it was only a matter of replicating writes only through read-repair/repair, then ok, it's pretty dangerous but it's rather easy to explain/understand the drawback (if you don't lose a disk, you don't lose increments, and you'd better use CL.ALL or have read_repair_chance to 1). But the fact that it doesn't work with bootstrap/move makes me wonder if having the option at all is not making a disservice to users."

To me, anything that can be described as "will make you lose increments (you don't know which ones, you don't know how many and this even if nothing goes wrong)" and which therefore doesn't work with bootstrap/move is correctly described as dangerous. :D

=Rob
Re: counters + replication = awful performance?
On Wed, Nov 28, 2012 at 7:15 AM, Edward Capriolo edlinuxg...@gmail.com wrote:

> I may be wrong, but during a bootstrap hints can be silently discarded if the node they are destined for leaves the ring.

Yeah: https://issues.apache.org/jira/browse/CASSANDRA-2434

> A user like this might benefit from DANGER counters. They are not looking for perfection, only better performance, and the counter row keys themselves roll over in 5 minutes anyway.

Yep, I agree that if you don't care about accurate counting, Cassandra counters may be for you. Cassandra counters in mongo mode are even more web scale! The unfortunate thing is that people seem to assume that software does what it is supposed to do, and probably do not get a great impression of said software when it doesn't. :D

=Rob
Re: Booting up a Datacenter replication
On Fri, Nov 23, 2012 at 11:33 AM, Darvin Denmian darvin.denm...@gmail.com wrote:

> But right now I need to increase the level of data redundancy ... and to accomplish that I'll configure 3 new Cassandra nodes in another Data Center.

https://issues.apache.org/jira/browse/CASSANDRA-3483

=Rob
Re: Changing placement stratgy?
On Fri, Nov 23, 2012 at 3:33 AM, Thomas Stets thomas.st...@gmail.com wrote:

> Is there any advantage in using a different placement strategy, considering that each node has all of the data anyway?

No. In your case there is no advantage to NetworkTopologyStrategy. It is somewhat odd that you have one logical cluster in two physical datacenters, however. That's not usually the way people do it. Of course, people don't often do RF=N either. :)

> If so, is it possible to change the placement strategy in an existing cluster?

Yes. The only practical way is to change it such that it is a NOOP. In your case (RF=N), all changes will be a NOOP. Once changed, however, you can use the features of the new Strategy to your advantage. Although in your case it doesn't matter.. If you decide to try to change your Strategy in general, be sure to design a test that uses nodetool getendpoints to verify that replica sets stay the same.

=Rob
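The getendpoints verification test can be sketched like this: capture `nodetool getendpoints <keyspace> <cf> <key>` for a sample of keys before and after the strategy change, then compare. Here the two nodetool calls are simulated with canned output so the comparison logic is runnable anywhere; the IPs are invented for illustration.

```shell
# Simulated `nodetool getendpoints ks cf key` output captured BEFORE
# and AFTER the strategy change. On a real cluster, replace these with
# actual nodetool invocations over a representative sample of keys.
get_endpoints_before() { printf '10.0.0.1\n10.0.0.2\n10.0.0.3\n'; }
get_endpoints_after()  { printf '10.0.0.3\n10.0.0.1\n10.0.0.2\n'; }

# Compare replica sets order-insensitively; ordering may differ, but the
# set of replicas must be identical or the change moved data and is unsafe.
before=$(get_endpoints_before | sort)
after=$(get_endpoints_after | sort)
if [ "$before" = "$after" ]; then
  echo "replica sets unchanged"
else
  echo "REPLICAS MOVED - change is not a NOOP"
fi
```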
Re: Looking for a good Ruby client
On Tue, Nov 20, 2012 at 11:40 PM, Timmy Turner timm.t...@gmail.com wrote:

> I thought you were going to expose the internals of CQL3 features like (wide rows with) complex keys and collections to CQL2 clients (which is something that should generally be possible, if Datastax' blog posts are accurate, i.e. an actual description of how things were implemented and not just a conceptual one).

https://issues.apache.org/jira/browse/CASSANDRA-4377
https://issues.apache.org/jira/browse/CASSANDRA-4924

=Rob
Re: Upgrade 1.1.2 - 1.1.6
On Mon, Nov 19, 2012 at 7:18 PM, Mike Heffner m...@librato.com wrote:

> We performed a 1.1.3 - 1.1.6 upgrade and found that all the logs replayed regardless of the drain.

Your experience, and desire for different (expected) behavior, is welcomed on:

https://issues.apache.org/jira/browse/CASSANDRA-4446 ("nodetool drain sometimes doesn't mark commitlog fully flushed")

If every production operator who experiences this issue shares their experience on this bug, perhaps the project will acknowledge and address it.

=Rob
Re: Invalid argument
On Tue, Nov 20, 2012 at 2:03 AM, Alain RODRIGUEZ arodr...@gmail.com wrote:

> Thanks for the workaround; setting disk_access_mode: standard worked.

Do you have working JNA, for reference?

=Rob
Re: Upgrade 1.1.2 - 1.1.6
On Thu, Nov 15, 2012 at 6:21 AM, Alain RODRIGUEZ arodr...@gmail.com wrote:

> We had an issue with counters over-counting even using the nodetool drain command before upgrading...

Are you sure the over-count was caused by the upgrade? Counters can be counted on (heh) to over-count. What is the scale of the over-count? Also, do you realize that drain doesn't actually prevent over-replay, though it should?

https://issues.apache.org/jira/browse/CASSANDRA-4446

Check for replayed operations in the system log:

grep -i replay /path/to/system.log

(I see further down thread that your logs do in fact contain replay. Please comment to this effect, and how negative the experience is for counts, on 4446?)

> I saw that "sudo apt-get install cassandra" stopped the server and restarted it automatically. So it updated without draining and restarting before I had the time to reconfigure the conf files. Is this normal? Is there a way to avoid it?

https://issues.apache.org/jira/browse/CASSANDRA-2356

Perhaps if you comment there, you can help convince the debian packager that auto-starting a distributed database when you install or upgrade its package has negative operational characteristics.

> After both of these updates I saw my current counters increase without any reason. Did I do anything wrong?

Expecting counters to not over-count may qualify as "wrong." But your process seems reasonable.

=Rob
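The replay check can be illustrated concretely. The log lines below are a fabricated sample in a temp file (real commitlog replay messages vary by version); on a real node you would grep the actual system.log (commonly /var/log/cassandra/system.log) instead.

```shell
# Fabricated sample of a system.log after a restart (illustrative only).
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
 INFO 10:00:01 Finished reading /var/lib/cassandra/commitlog/CommitLog-123.log
 INFO 10:00:02 Replaying /var/lib/cassandra/commitlog/CommitLog-124.log
 INFO 10:00:05 Log replay complete, 1042 replayed mutations
EOF

# Count log lines mentioning replay, case-insensitively. Any hits after a
# supposedly clean `nodetool drain` + restart suggest CASSANDRA-4446.
grep -ci replay "$LOG"
```

A nonzero count on a counter workload is exactly the over-replay scenario that produces over-counts.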
Re: Query regarding SSTable timestamps and counts
On Sun, Nov 18, 2012 at 7:57 PM, Ananth Gundabattula agundabatt...@gmail.com wrote:

> As per the above url, "After running a major compaction, automatic minor compactions are no longer triggered, frequently requiring you to manually run major compactions on a routine basis." (Just before the heading "Tuning Column Family compression" in the above link)

This inaccurate statement has been questioned a few times on the mailing list. Generally what happens is people discuss it for about 10 emails and then give up because they can't really make sense of it. If you google for cassandra-user and that sentence above, you should find the threads. I suggest mailing d...@datastax.com, explaining your confusion, and asking them to fix it.

=Rob
Re: row cache re-fill very slow
On Mon, Nov 19, 2012 at 6:17 AM, Andras Szerdahelyi andras.szerdahe...@ignitionone.com wrote:

> How is the saved row cache file processed? Are the cached row keys simply iterated over and their respective rows read from SSTables - possibly creating random reads with small enough sstable files, if the keys were not stored in a manner optimised for a quick re-fill? - or is there a smarter algorithm (i.e. scan through one sstable at a time, filter rows that should be in row cache) at work, and this operation is purely disk i/o bound?

Nope, that's it. I am quite confident that in the version you are running, it just assembles the row from disk, from the relevant SSTables, via the more or less normal read path. The more fragmented your sstables, the more random the i/o.

These two 1.2.x era JIRAs relate to the row cache startup penalty:

https://issues.apache.org/jira/browse/CASSANDRA-4282  # multi-threaded row cache loading at startup
https://issues.apache.org/jira/browse/CASSANDRA-3762  # improvements to the AutoSavingCache (which is the base class of AutoSavingRowCache)

=Rob
Re: Question regarding the need to run nodetool repair
On Thu, Nov 15, 2012 at 4:12 PM, Dwight Smith dwight.sm...@genesyslab.com wrote:

> I have a 4 node cluster, version 1.1.2, replication factor of 4, read/write consistency of 3, level compaction. Several questions.

Hinted Handoff is broken in your version [1] (and all versions between 1.0.0 and 1.0.3 [2]). Upgrade to 1.1.6 ASAP so that the answers below actually apply, because working Hinted Handoff is involved.

> 1) Should nodetool repair be run regularly to assure it has completed before gc_grace? If it is not run, what are the exposures?

If you do DELETE logical operations, yes. If not, no. gc_grace_seconds only applies to tombstones, and if you do not delete, you have no tombstones. If you only DELETE in one columnfamily, that is the only one you have to repair within gc_grace. The exposure is zombie data, where a node missed a DELETE (and the associated tombstone) but had a previous value for that column or row, and this zombie value is resurrected and propagated by read repair.

> 2) If a node goes down, and is brought back up prior to the 1 hour hinted handoff expiration, should repair be run immediately?

In theory, if hinted handoff is working, no. This is a good thing, because otherwise simply restarting a node would trigger the need for repair. In practice, I would be shocked if anyone has scientifically tested it to the degree required to be certain all edge cases are covered, so I'm not sure I would rely on this being true. Especially as key components of this guarantee, such as Hinted Handoff, can be broken for 3-5 point releases before anyone notices. It is because of this uncertainty that I recommend periodic repair even in clusters that don't do DELETE.

> 3) If the hinted handoff has expired, the plan is to remove the node and start a fresh node in its place. Does this approach cause problems?

Yes.

1) You've lost any data that was only ever replicated to this node. With RF=3, this should be relatively rare, even with CL.ONE, because writes are much more likely to succeed-but-report-they-failed than vice versa. If you run periodic repair, you cover the case where something gets under-replicated and then even less replicated as nodes are replaced.

2) When you replace the node in its place (presumably using replace_token), you will only stream the relevant data from a single other replica. This means that, given 3 nodes A, B, and C, where datum X is on A and B, and B fails, it might be bootstrapped using C as a source, decreasing your replica count of X by 1.

In order to deal with these issues, you need to run a repair of the affected node after bootstrapping/replace_token-ing. Until this repair completes, CL.ONE reads might be stale or missing.

I think what operators really want is a path by which they can bootstrap and then repair, before returning the node to the cluster. Unfortunately there are significant technical reasons which prevent this from being trivial. As such, I suggest increasing gc_grace_seconds and max_hint_window_in_ms to reduce the amount of repair you need to run. The negative to increasing gc_grace is that you store tombstones for longer before purging them. The negative to increasing max_hint_window_in_ms is that hints for a given token are stored in one row.. and very wide rows can exhibit pathological behavior. Also, if you set max_hint_window_in_ms too high, you could cause cascading failure as nodes fill with hints and become less performant... thereby increasing the cluster-wide hint rate. Unless you have a very high write rate or really lazy ops people who leave nodes down for very long times, the cascading failure case is relatively unlikely.

=Rob

[1] https://issues.apache.org/jira/browse/CASSANDRA-4772
[2] https://issues.apache.org/jira/browse/CASSANDRA-3466
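A back-of-the-envelope sketch of "repair within gc_grace": derive a per-node repair interval from gc_grace_seconds, leaving a safety margin for slow or failed repairs. The 864000-second value is Cassandra's default gc_grace_seconds (10 days); the 3-day margin is an arbitrary illustrative choice, not a project recommendation.

```shell
# Default gc_grace_seconds: 864000 s = 10 days.
GC_GRACE_SECONDS=864000
GC_GRACE_DAYS=$(( GC_GRACE_SECONDS / 86400 ))

# Leave a few days of margin so a missed or slow repair run still
# finishes before tombstones become eligible for purge:
MARGIN_DAYS=3
REPAIR_INTERVAL_DAYS=$(( GC_GRACE_DAYS - MARGIN_DAYS ))

echo "repair every node at least every $REPAIR_INTERVAL_DAYS days"

# e.g. a crontab entry per node, staggered across the cluster so repairs
# don't all run at once (illustrative):
#   0 2 */7 * *  nodetool repair -pr
```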
Re: backup/restore from sstable files ?
On Sat, Nov 10, 2012 at 3:00 PM, Tyler Hobbs ty...@datastax.com wrote:

> For an alternative that doesn't require the same ring topology, you can use the bulkloader, which will take care of distributing the data to the correct nodes automatically.

For more details on which cases are best for the different bulk loading techniques:

http://palominodb.com/blog/2012/09/25/bulk-loading-options-cassandra

=Rob
Re: Single Node Cassandra Installation
On Sat, Nov 10, 2012 at 6:16 PM, Drew Kutcharian d...@venarc.com wrote:

> Thanks Rob, this makes sense. We only have one rack at this point, so I think it'd be better to start with PropertyFileSnitch to make Cassandra think that these nodes each are in a different rack without having to put them on different subnets. And I will have more flexibility (at the cost of keeping the property file in sync) when it comes to growth.

Many people run successfully with PFS, and as you say, it provides flexibility if you get a second rack. The overhead versus a non-rack-aware snitch is not significant. However, if you are careful, you should be able to switch to PFS or another rack-aware snitch with no problems when you actually need it... :)

=Rob
Re: Counter column families (pending replicate on write stage tasks)
On Mon, Nov 12, 2012 at 3:35 PM, cem cayiro...@gmail.com wrote: We are currently facing a performance issue with counter column families. I see lots of pending ReplicateOnWriteStage tasks in tpstats. Then I disabled replicate_on_write. It helped a lot. I want to use like that but I am not sure how to use it. Quoting Sylvain, one of the primary authors/maintainers of the Counters support... https://issues.apache.org/jira/browse/CASSANDRA-3868 I don't disagree about the efficiency of the valve, but at what price? 'Bootstrapping a node will make you lose increments (you don't know which ones, you don't know how many and this even if nothing goes wrong)' is a pretty bad drawback. That is pretty much why that option makes me uncomfortable: it does give you better performance, so people may be tempted to use it. Now if it was only a matter of replicating writes only through read-repair/repair, then ok, it's pretty dangerous but it's rather easy to explain/understand the drawback (if you don't lose a disk, you don't lose increments, and you'd better use CL.ALL or have read_repair_chance to 1). But the fact that it doesn't work with bootstrap/move makes me wonder if having the option at all is not making a disservice to users. IOW, don't be tempted to turn off replicate_on_write. It breaks counters. If you are under capacity at a steady state, increase capacity. =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: leveled compaction and tombstoned data
On Thu, Nov 8, 2012 at 10:12 AM, B. Todd Burruss bto...@gmail.com wrote: my question is would leveled compaction help to get rid of the tombstoned data faster than size tiered, and therefore reduce the disk space usage? You could also... 1) run a major compaction 2) code up sstablesplit 3) profit! This method incurs a management penalty if not automated, but is otherwise the most efficient way to deal with tombstones and obsolete data.. :D =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: backup/restore from sstable files ?
On Thu, Nov 8, 2012 at 5:15 PM, Yang tedd...@gmail.com wrote: some of my colleagues seem to use this method to backup/restore a cluster, successfully: on each of the node, save entire /cassandra/data/ dir to S3, then on a new set of nodes, with exactly the same number of nodes, copy back each of the data/ dir. then boot up cluster. Yep, that works as long as the two clusters have the same tokens and replication strategies. but I wonder how it worked: doesn't the system keyspace store information specific to the current cluster, such as my sibling nodes in the cluster, my IP ?? all these would change once you copy the frozen data files onto a new set of nodes. Yes, for this reason you should not restore the system keyspace files (except, optionally, Schema.). Definitely you should not restore LocationInfo. LocationInfo contains ip-to-token mappings. Also you should make your target cluster have a unique cluster name, and the old cluster name is also stored in LocationInfo... =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: unsubscribe
On Thu, Nov 8, 2012 at 4:57 PM, Jeremy McKay jeremy.mc...@ntrepidcorp.com wrote: http://wiki.apache.org/cassandra/FAQ#unsubscribe -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: Single Node Cassandra Installation
On Mon, Nov 5, 2012 at 12:23 PM, Drew Kutcharian d...@venarc.com wrote: Switching from SimpleStrategy to RackAware can be a pain. Can you elaborate a bit? What would be the pain point? If you don't maintain the same replica placement vis-a-vis nodes on your cluster, you have to dump and reload. Simple example, 6 node cluster, RF=3 :

SimpleSnitch :
nodes : A B C D E F
Data for the natural range of A is also on B and C, the next nodes in the ring.

Rack-aware snitches :
nodes : A B C D E F
racks : 1 1 2 2 3 3
Data for the natural range of A is also on C and E, because despite not being the next nodes in the RING, they are the first nodes in the next rack.

If however you go from simple to rack aware and put your nodes in racks like :
nodes : A B C D E F
racks : 1 2 3 1 2 3
then you have the same replica placement that SimpleStrategy gives you and can safely switch strategies/snitches on an existing cluster. Data for A is on B and C, on the same hosts, but for different reasons. Use nodetool getendpoints to test. =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
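The placement rules in that example can be sketched as a small simulation (not Cassandra's actual code; it ignores multi-DC and the same-rack fallback, so it only holds when RF is at most the number of racks):

```python
# Sketch of SimpleStrategy vs rack-aware replica placement for the
# 6-node, RF=3 example. Toy code, not Cassandra's implementation.

RING = ["A", "B", "C", "D", "E", "F"]

def simple_replicas(primary, rf):
    """SimpleStrategy: the primary plus the next rf-1 nodes on the ring."""
    i = RING.index(primary)
    return [RING[(i + k) % len(RING)] for k in range(rf)]

def rack_aware_replicas(primary, racks, rf):
    """Rack-aware: walk the ring, taking only nodes in racks not yet used."""
    i = RING.index(primary)
    replicas, used_racks = [primary], {racks[primary]}
    for k in range(1, len(RING)):
        node = RING[(i + k) % len(RING)]
        if racks[node] not in used_racks:
            replicas.append(node)
            used_racks.add(racks[node])
        if len(replicas) == rf:
            break
    return replicas

# Racks 1 1 2 2 3 3: placement changes, so switching needs dump/reload.
racks_bad = dict(zip(RING, "112233"))
assert rack_aware_replicas("A", racks_bad, 3) == ["A", "C", "E"]

# Racks 1 2 3 1 2 3: placement matches SimpleStrategy, safe to switch.
racks_ok = dict(zip(RING, "123123"))
assert rack_aware_replicas("A", racks_ok, 3) == simple_replicas("A", 3)
```

The two asserts reproduce the two layouts from the email: same hosts hold the data in the second layout, so the snitch can be swapped in place.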
Re: repair, compaction, and tombstone rows
On Fri, Nov 2, 2012 at 2:46 AM, horschi hors...@gmail.com wrote: might I ask why repair cannot simply ignore anything that is older than gc-grace? (like Aaron proposed) I agree that repair should not process any tombstones or anything. But in my mind it sounds reasonable to make repair ignore timed-out data. Because the timestamp is created on the client, there is no reason to repair these, right? IIRC, tombstone timestamps are written by the server, at compaction time. Therefore if you have RF=X, you have X different timestamps relative to GCGraceSeconds. I believe there was another thread about two weeks ago in which Sylvain detailed the problems with what you are proposing, when someone else asked approximately the same question. I even noticed an increase when running two repairs directly after each other. So even when data was just repaired, there is still data being transferred. I assume this is due some columns timing out within that timeframe and the entire row being repaired. Merkle trees are an optimization, what they trade for this optimization is over-repair. (FWIW, I agree that, if possible, this particular case of over-repair would be nice to eliminate.) =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: repair, compaction, and tombstone rows
On Thu, Nov 1, 2012 at 1:43 AM, Sylvain Lebresne sylv...@datastax.com wrote: on all your columns), you may want to force a compaction (using the JMX call forceUserDefinedCompaction()) of that sstable. The goal being to get rid of a maximum of outdated tombstones before running the repair (you could also alternatively run a major compaction prior to the repair, but major compactions have a lot of nasty effects so I wouldn't recommend that a priori). If sstablesplit (reverse compaction) existed, major compaction would be a simple solution to this case. You'd major compact and then split your One Giant SSTable With No Tombstones into a number of smaller ones. :) https://issues.apache.org/jira/browse/CASSANDRA-4766 =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: Data migration between clusters
On Tue, Oct 30, 2012 at 4:18 AM, 張 睿 chou...@cyberagent.co.jp wrote: Does anyone here know if there is an efficient way to migrate multiple cassandra clusters' data to a single cassandra cluster without any data loss. Yes.

1) create a schema which is a superset of all columnfamilies and all keyspaces
2) if all source clusters were the same fixed number of nodes, create a new cluster with the same fixed number of nodes
3) nodetool drain and shut down all nodes on all participating clusters
4) copy sstables from old clusters, maintaining that data from source node [x] ends up on target node [x]
5) start cassandra

However without more details as to your old clusters, new clusters, and availability requirements, I can't give you a more useful answer. Here's some background on bulk loading, including copy-the-sstables. http://palominodb.com/blog/2012/09/25/bulk-loading-options-cassandra =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
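The position-preserving copy in step 4 can be sketched as follows (node names are hypothetical; this only plans the mapping, it does not copy anything):

```python
# Sketch of step 4: copying sstables so that data from source node [x]
# of each old cluster lands on target node [x]. Illustrative names only.

def copy_plan(old_clusters, new_nodes):
    """Return (src, dst) pairs preserving node position across clusters."""
    plan = []
    for cluster in old_clusters:
        if len(cluster) != len(new_nodes):
            raise ValueError("old and new clusters must have the same node count")
        for src, dst in zip(cluster, new_nodes):
            plan.append((src, dst))
    return plan

old = [["old1-node1", "old1-node2"], ["old2-node1", "old2-node2"]]
new = ["new-node1", "new-node2"]
assert copy_plan(old, new) == [
    ("old1-node1", "new-node1"), ("old1-node2", "new-node2"),
    ("old2-node1", "new-node1"), ("old2-node2", "new-node2"),
]
```

The node-count check is why step 2 above insists the new cluster match the old clusters' size: with identical tokens per position, each node's sstables remain in their natural ranges.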
Re: Wrong data after rolling restart
On Mon, May 21, 2012 at 7:08 AM, Alain RODRIGUEZ arodr...@gmail.com wrote: Here are my 2 nodes starting logs, I hope it can help... https://gist.github.com/2762493 https://gist.github.com/2762495 I see in these logs that you replay 2 mutations per node, despite doing nodetool drain before restarting. However 2 replayed mutations per node are unlikely to corrupt a significant number of counters. As nodetool drain is supposed to drain the commitlog entirely, you are encountering : https://issues.apache.org/jira/browse/CASSANDRA-4446 I also see that you are running 1.0.7. You are unlikely to receive any useful response from the project if you file this behavior as a bug against 1.0.7. If you restore your backup, you might wish to upgrade to 1.0.12 before doing so, in case this is an issue fixed in the interim. I wanted to try a new config. After doing a rolling restart I have all my counters false, with wrong values. I stopped my servers with the following : [ snip ] And after restarting the second one I have lost all the consistency of my data. All my statistics since September are totally false now in production. What does totally false mean? The most common inaccuracy of Cassandra Counters is that they slightly overcount, not that they are totally false in other ways. Did you repair this cluster at any time? 1 - How to fix it ? (I have a backup from this morning, but I will lose all the data after this date if I restore this backup) Restoring this backup is the only way you are likely to fix this. Once counters are corrupt/wrong you have no chance to survive make your time. Restoring this backup may not even fix it permanently, depending on what unknown cause is to blame. 2 - What happened ? How to avoid it ? Distributed counting has meaningful edge cases, and Cassandra Counters do not cover 100% of them. As such, I recommend not using them if accuracy is critically important. 
=Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: constant CMS GC using CPU time
On Mon, Oct 22, 2012 at 8:38 AM, Bryan Talbot btal...@aeriagames.com wrote: The nodes with the most data used the most memory. All nodes are affected eventually, not just one. The GC was on-going even when the nodes were not compacting or running a heavy application load -- even when the main app was paused, the constant GC continued. This sounds very much like my heap is so consumed by (mostly) bloom filters that I am in steady state GC thrash. Do you have heap graphs which show a healthy sawtooth GC cycle which then more or less flatlines? =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: Java 7 support?
On Tue, Oct 16, 2012 at 4:45 PM, Edward Sargisson edward.sargis...@globalrelay.net wrote: The Datastax documentation says that Java 7 is not recommended[1]. However, Java 6 is due to EOL in Feb 2013 so what is the reasoning behind that comment? I've asked this approximate question here a few times, with no official response. The reason I ask is that in addition to Java 7 not being recommended, in Java 7 OpenJDK becomes the reference JVM, and OpenJDK is also not recommended. From other channels, I have conjectured that the current advice on Java 7 is it 'works' but is not as extensively tested (and definitely not as commonly deployed) as Java 6. =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: Repair Failing due to bad network
https://issues.apache.org/jira/browse/CASSANDRA-3483 Is directly on point for the use case in question, and introduces rebuild concept.. https://issues.apache.org/jira/browse/CASSANDRA-3487 https://issues.apache.org/jira/browse/CASSANDRA-3112 Are for improvements in repair sessions.. https://issues.apache.org/jira/browse/CASSANDRA-4767 Is for unambiguous indication of repair session status. =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: cassandra 1.0.8 memory usage
On Fri, Oct 12, 2012 at 1:26 AM, Daniel Woo daniel.y@gmail.com wrote: What version of Cassandra? What JVM? Are JNA and Jamm working? cassandra 1.0.8. Sun JDK 1.7.0_05-b06, JNA memlock enabled, jamm works. The unusual aspect here is Sun JDK 1.7. Can you use 1.6 on an affected node and see if the problem disappears? https://issues.apache.org/jira/browse/CASSANDRA-4571 Exists in 1.1.x (not your case) and is for leaking descriptors and not memory, but affects both 1.6 and 1.7. JMAP shows that the perm gen is only 40% used. What is the usage of the other gens? I have very few column families, maybe 30-50. The nodetool shows each node has 5 GB load. Most of your heap being consumed by 30-50 columnfamilies' MBeans seems excessive. Disable swap for cassandra node I am gonna change swappiness to 20% Even setting swappiness to 0% does not prevent the kernel from swapping if swap is defined/enabled. I re-iterate my suggestion that you de-define/disable swap on any node running Cassandra. :) =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: unnecessary tombstone's transmission during repair process
On Thu, Oct 11, 2012 at 8:41 AM, Alexey Zotov azo...@griddynamics.com wrote: Value of DeletedColumn is a serialized local deletion time. We know that local deletion time can be different on different nodes for the same tombstone. So hashes of the same tombstone on different nodes will be different. Is it true? Yes, this seems correct based on my understanding of the process of writing tombstones. I think that local deletion time shouldn't be considered in hash's calculation. I think you are correct; the only thing that matters is whether the tombstone exists or not. There may be something I am missing about why the very-unlikely-to-be-identical value should be considered a merkle tree failure. https://issues.apache.org/jira/browse/CASSANDRA-2279 Seems related to this issue, fwiw. Is transmission of the equals tombstones during repair process a feature? :) or is it a bug? I think it's a bug. If it's a bug, I'll create ticket and attach patch to it. Yay! =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
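Alexey's point can be illustrated with a toy hash (not Cassandra's Merkle tree code; field names and values are made up): folding the node-local deletion time into the digest makes identical logical tombstones hash differently on different replicas.

```python
# Toy illustration of why hashing the local deletion time causes identical
# tombstones on two replicas to mismatch and be needlessly streamed.
import hashlib

def tombstone_hash(name, marked_for_delete_at, local_deletion_time,
                   include_local=True):
    parts = [name, str(marked_for_delete_at)]
    if include_local:
        parts.append(str(local_deletion_time))  # node-local, replica-specific
    return hashlib.md5("|".join(parts).encode()).hexdigest()

# The same logical tombstone, recorded at slightly different local times.
on_node_a = tombstone_hash("col", 1350000000, local_deletion_time=1350000001)
on_node_b = tombstone_hash("col", 1350000000, local_deletion_time=1350000002)
assert on_node_a != on_node_b  # looks like a mismatch -> over-repair

# Excluding the local deletion time makes the replicas agree.
a = tombstone_hash("col", 1350000000, 1350000001, include_local=False)
b = tombstone_hash("col", 1350000000, 1350000002, include_local=False)
assert a == b
```

The second pair of hashes is the behavior the proposed patch would aim for: only the logical content of the tombstone participates in the comparison.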
Re: cassandra 1.0.8 memory usage
On Wed, Oct 10, 2012 at 11:04 PM, Daniel Woo daniel.y@gmail.com wrote: I am running a mini cluster with 6 nodes, recently we see very frequent ParNewGC on two nodes. It takes 200 - 800 ms on average, sometimes it takes 5 seconds. You know, the ParNewGC is a stop-the-world GC and our client throws SocketTimeoutException every 3 minutes. What version of Cassandra? What JVM? Are JNA and Jamm working? I checked the load, it seems well balanced, and the two nodes are running on the same hardware: 2 * 4 cores xeon with 16G RAM, we give cassandra 4G heap, including 800MB young generation. We did not see any swap usage during the GC, any idea about this? It sounds like the two nodes that are pathological right now have exhausted the perm gen with actual non-garbage, probably mostly the Bloom filters and the JMX MBeans. Then I took a heap dump, it shows that 5 instances of JmxMBeanServer hold 500MB memory and most of the referenced objects are JMX mbean related, it's kind of weird to me and looks like a memory leak. Do you have a large number of ColumnFamilies? How large is the data stored per node? =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: cassandra 1.0.8 memory usage
On Thu, Oct 11, 2012 at 11:02 AM, Rob Coli rc...@palominodb.com wrote: On Wed, Oct 10, 2012 at 11:04 PM, Daniel Woo daniel.y@gmail.com wrote: We did not see any swap usage during the GC, any idea about this? As an aside.. you shouldn't have swap enabled on a Cassandra node, generally. As a simple example, if you have swap enabled and use the off-heap row cache, the kernel might swap your row cache. =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: 1.1.1 is repair still needed ?
On Tue, Oct 9, 2012 at 12:56 PM, Oleg Dulin oleg.du...@gmail.com wrote: My understanding is that the repair has to happen within gc_grace period. [ snip ] So the question is, is this still needed ? Do we even need to run nodetool repair ? If Hinted Handoff works in your version of Cassandra, and that version is 1.0, you should not need to repair if no node has crashed or been down for longer than max_hint_window_in_ms. This is because after 1.0, any failed write to a remote replica results in a hint, so any DELETE should eventually be fully replicated. However hinted handoff is meaningfully broken between 1.1.0 and 1.1.6 (unreleased) so you cannot rely on the above heuristic for consistency. In these versions, you have to repair (or read repair 100% of keys) once every GCGraceSeconds to prevent the possibility of zombie data. If it were possible to repair on a per-columnfamily basis, you could get a significant win by only repairing columnfamilies which take DELETE traffic. https://issues.apache.org/jira/browse/CASSANDRA-4772 =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: 1.1.5 Missing Insert! Strange Problem
On Thu, Sep 27, 2012 at 3:25 PM, Arya Goudarzi gouda...@gmail.com wrote: rcoli helped me investigate this issue. The mystery was that the segment of commit log was probably not fsynced to disk since the setting was set to periodic with 10 second delay and CRC32 checksum validation failed skipping the replay, so what happened in my scenario can be explained by this. I am going to change our settings to batch mode. To be clear, I conjectured that this behavior is the cause of the issue. As there is no logging when Cassandra encounters a corrupt log segment [1] during replay, I was unable to verify this conjecture. Calling nodetool drain as part of a restart process should [2] eliminate any chance of unsynced writes being lost, and is likely to be more performant overall than changing to batch mode. =Rob [1] I plan to submit a patch for this.. [2] But doesn't necessarily in 1.0.x, CASSANDRA-4446 ... -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: downgrade from 1.1.4 to 1.0.X
On Thu, Sep 27, 2012 at 2:46 AM, Віталій Тимчишин tiv...@gmail.com wrote: I suppose the way is to convert all SST to json, then install previous version, convert back and load Only files flushed in the new version will need to be dumped/reloaded. Files which have not been scrubbed/upgraded (ie, have the 1.0 -h[x]- version) get renamed to different names in 1.1. You can revert all of these files back to 1.0 as long as you change their names back to 1.0 style names, which is presumably what your snapshots contain... =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: any ways to have compaction use less disk space?
On Wed, Sep 26, 2012 at 6:05 AM, Sylvain Lebresne sylv...@datastax.com wrote: On Wed, Sep 26, 2012 at 2:35 AM, Rob Coli rc...@palominodb.com wrote: 150,000 sstables seem highly unlikely to be performant. As a simple example of why, on the read path the bloom filter for every sstable must be consulted... Unfortunately that's a bad example since that's not true. You learn something new every day. Thanks for the clarification. I reduce my claim to "a huge number of SSTables are unlikely to be performant." :) =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: Why data tripled in size after repair?
On Wed, Sep 26, 2012 at 9:30 AM, Andrey Ilinykh ailin...@gmail.com wrote: [ repair ballooned my data size ] 1. Why repair almost triples data size? You didn't mention what version of cassandra you're running. In some old versions of cassandra (prior to 1.0), repair often creates even more extraneous data than it should by design. However, by design, Repair repairs differing ranges based on merkle trees. Merkle trees are an optimization, what you trade for the optimization is over-repair. When you have multiple replicas, each over-repairs. If you are running repair on your whole cluster, this is why you should use repair -pr, as it reduces the per-replica over-repair. 2. How to compact my data back to 100G? 1) do a major compaction, one CF at a time. if you only have one CF, you're out of luck because you don't have enough headroom. 2) then convince someone to write sstablesplit so you can turn your 100G sstable into [n] smaller sstables and/or learn to live with your giant sstable Or add a new data directory with more space in it, to allow you to compact. I mention the latter in case it is trivial to attach additional storage in your env. The other alternative is to wait. Most space will be reclaimed over time by minor compaction. =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
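The over-repair trade-off mentioned above can be shown with a toy sketch (not Cassandra's implementation; keys and ranges are invented): Merkle trees compare hashed ranges, so a single differing key causes the whole covering range to be streamed.

```python
# Toy sketch of Merkle-tree style comparison: trees hash ranges of keys,
# so one stale key marks its entire range for streaming (over-repair).
import hashlib

def range_hash(data, keys):
    h = hashlib.md5()
    for k in keys:
        h.update(f"{k}={data.get(k)}".encode())
    return h.hexdigest()

def ranges_to_stream(local, remote, ranges):
    """Return every range whose hashes differ, even if only one key does."""
    return [r for r in ranges if range_hash(local, r) != range_hash(remote, r)]

ranges = [("k1", "k2", "k3"), ("k4", "k5", "k6")]
local = {"k1": "a", "k2": "b", "k3": "c", "k4": "d", "k5": "e", "k6": "f"}
remote = dict(local, k2="STALE")  # a single key differs between replicas

# All three keys in the first range get streamed, not just k2.
assert ranges_to_stream(local, remote, ranges) == [("k1", "k2", "k3")]
```

Coarser ranges mean cheaper comparison but more data streamed per mismatch, which is the optimization/over-repair trade Rob describes, multiplied across replicas.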
Re: Can't change replication factor in Cassandra 1.1.2
On Wed, Jul 18, 2012 at 10:27 AM, Douglas Muth doug.m...@gmail.com wrote: Even though keyspace test1 had a replication_factor of 1 to start with, each of the above UPDATE KEYSPACE commands caused a new UUID to be generated for the schema, which I assume is normal and expected. I believe the actual issue you have is stuck schema for this keyspace, not anything to do with replication factor. To test this, try adding a ColumnFamily and see if it works. I bet it won't. There are anecdotal reports in the 1.0.8-1.1.5 timeframe of this happening. One of the causes is the one aaron pasted, but I do not believe that is the only cause of this edge case. As far as I know, however, there is no JIRA ticket open for stuck schema for keyspace ... perhaps you might want to look for and/or open one? =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: any ways to have compaction use less disk space?
On Sun, Sep 23, 2012 at 12:24 PM, Aaron Turner synfina...@gmail.com wrote: Leveled compaction has tamed space for us. Note that you should set sstable_size_in_mb to a reasonably high value (it is 512 for us with ~700GB per node) to prevent creating a lot of small files. 512MB per sstable? Wow, that's freaking huge. From my conversations with various developers 5-10MB seems far more reasonable. I guess it really depends on your usage patterns, but that seems excessive to me- especially as sstables are promoted. 700gb = 716800mb / 5mb = 143360 150,000 sstables seem highly unlikely to be performant. As a simple example of why, on the read path the bloom filter for every sstable must be consulted... =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: Is it possible to create a schema before a Cassandra node starts up ?
On Fri, Sep 14, 2012 at 7:05 AM, Xu, Zaili z...@pershing.com wrote: I am pretty new to Cassandra. I have a script that needs to set up a schema first before starting up the cassandra node. Is this possible ? Can I create the schema directly on cassandra storage and then when the node starts up it will pick up the schema ? Aaron gave you the scientific answer, which is that you can't load schema without starting a node. However if you : 1) start a node for the first time 2) load schema 3) call nodetool drain so all system keyspace CFs are guaranteed to be flushed to sstables 4) then, from your script, start that node (or a node with identical configuration) using the flushed system sstables (directly on the storage) You can set up a schema before starting up the cassandra node or having a cassandra node or cluster running all the time. This might be useful in for example testing contexts... =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: Using the commit log for external synchronization
On Fri, Sep 21, 2012 at 4:31 AM, Ben Hood 0x6e6...@gmail.com wrote: So if I understand you correctly, one shouldn't code against what is essentially an internal artefact that could be subject to change as the Cassandra code base evolves and furthermore may not contain the information an application thinks it should contain. Pretty much. So in summary, given that there is no out of the box way of saying to Cassandra give me all mutations since timestamp X, I would either have to go for an event driven approach or reconsider the layout of the Cassandra store such that I could reconcile it in an efficient fashion. With : https://issues.apache.org/jira/browse/CASSANDRA-3690 - Streaming CommitLog backup You can stream your commitlog off-node as you write it. You can then restore this commitlog and tell cassandra to replay the commit log until a certain time by using restore_point_in_time. But... without : https://issues.apache.org/jira/browse/CASSANDRA-4392 - Create a tool that will convert a commit log into a series of readable CQL statements You are unable to skip bad transactions, so if you want to roll-forward but skip a TRUNCATE, you are out of luck. The above gets you most of the way there, but Aaron's point about the commitlog not reflecting whether the app met its CL remains true. The possibility that Cassandra might coalesce to a value that the application does not know was successfully written is one of its known edge cases... =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: How to replace a dead *seed* node while keeping quorum
On Tue, Sep 11, 2012 at 4:21 PM, Edward Sargisson edward.sargis...@globalrelay.net wrote: If the downed node is a seed node then neither of the replace a dead node procedures work (-Dcassandra.replace_token and taking initial_token-1). The ring remains split. [...] In other words, if the host name is on the seeds list then it appears that the rest of the ring refuses to bootstrap it. Close, but not exactly... ./src/java/org/apache/cassandra/service/StorageService.java line 559 of 3090 :

if (DatabaseDescriptor.isAutoBootstrap()
    && DatabaseDescriptor.getSeeds().contains(FBUtilities.getBroadcastAddress())
    && !SystemTable.isBootstrapped())
    logger_.info("This node will not auto bootstrap because it is configured to be a seed node.");

getSeeds asks your seed provider for a list of seeds. If you are using the SimpleSeedProvider, this basically turns the list from seeds in cassandra.yaml on the local node into a list of hosts. So it isn't that the other nodes have this node in their seed list.. it's that the node you are replacing has itself in its own seed list, and shouldn't. I understand that it can be tricky in conf management tools to make seed nodes' seed lists not contain themselves, but I believe it is currently necessary in this case. FWIW, it's unclear to me (and Aaron Morton, whose curiosity was apparently equally piqued and is looking into it further..) why exactly seed nodes shouldn't bootstrap. It's possible that they only shouldn't bootstrap without being in hibernate mode, and that the code just hasn't been re-written post replace_token/hibernate to say that it's ok for seed nodes to bootstrap as long as they hibernate... =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: Node-tool drain on Cassandra 1.0
On Sun, Sep 9, 2012 at 12:01 PM, Robin Verlangen ro...@us2.nl wrote: Deleting the commitlog files is harmless. It's just a tool that tries to keep Cassandra more in-sync with the other nodes. A standard repair will fix all problems that a commitlog replay might do too. This is not really true.. imagine a RF=2 cluster. 1) take a replica node down 2) write at CL.ONE to another replica node 3) replication fails to other replica due to it being down, hint is queued locally, this means both writes are only in memtables/mirrored in the commitlog 4) don't nodetool flush 5) stop node 6) delete commitlog You have now lost data, and repair can't fix it, because the data you've lost has not been written to any other node. This is one of the edge cases that makes CL.ONE pretty risky if you care about your data and use a RF under 3. =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
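The six-step data-loss scenario above can be modeled as a toy simulation (plain Python, not Cassandra code; names like ToyNode are invented for illustration):

```python
# Toy model of the RF=2 scenario: deleting the commitlog before a flush
# discards writes that exist on no other node, so repair cannot fix it.

class ToyNode:
    def __init__(self):
        self.memtable, self.commitlog, self.sstables = {}, [], {}

    def write(self, k, v):
        self.commitlog.append((k, v))  # the only durable record of the write
        self.memtable[k] = v

    def flush(self):  # what `nodetool flush`/drain would have done
        self.sstables.update(self.memtable)
        self.memtable.clear()
        self.commitlog.clear()

    def restart(self, replay_commitlog=True):
        self.memtable.clear()  # memory contents are gone on stop
        if replay_commitlog:
            for k, v in self.commitlog:
                self.sstables[k] = v  # replay recovers unflushed writes
        self.commitlog.clear()

    def read(self, k):
        return self.memtable.get(k, self.sstables.get(k))

# Write lands only on this replica (the other replica is down, hint queued).
node = ToyNode()
node.write("row1", "value1")

# Stop the node and delete the commitlog without flushing first:
node.restart(replay_commitlog=False)
assert node.read("row1") is None  # lost; no other node ever had it

# With the commitlog intact, replay recovers the write.
node2 = ToyNode()
node2.write("row1", "value1")
node2.restart(replay_commitlog=True)
assert node2.read("row1") == "value1"
```

The first assert is the unrecoverable case: repair streams data between replicas, and here no replica holds the write anymore.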
Re: Node-tool drain on Cassandra 1.0
On Fri, Sep 7, 2012 at 6:38 AM, Rene Kochen rene.koc...@schange.com wrote: If I use node-tool drain, it does stop accepting writes and flushes the tables. However, is it normal that the commit log files are not deleted and that it gets replayed? It's not expected by design, but it does seem to be normal in cassandra 1.0.x. I've spoken with other operators and they anecdotally report the same behavior when doing the same operation you describe. https://issues.apache.org/jira/browse/CASSANDRA-4446 The more people who report that they have the issue, the greater chance of a response or fix, so I suggest commenting me too! on that ticket.. :) =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: unsubscribe
http://wiki.apache.org/cassandra/FAQ#unsubscribe On Wed, Aug 29, 2012 at 3:57 PM, Juan Antonio Gomez Moriano mori...@exciteholidays.com wrote: -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: Practical node size limits
On Sun, Jul 29, 2012 at 7:40 PM, Dustin Wenz dustinw...@ebureau.com wrote: We've just set up a new 7-node cluster with Cassandra 1.1.2 running under OpenJDK6. It's worth noting that Cassandra project recommends Sun JRE. Without the Sun JRE, you might not be able to use JAMM to determine the live ratio. Very few people use OpenJDK in production, so using it also increases the likelihood that you might be the first to encounter a given issue. FWIW! =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: adding node to cluster
On Thu, Aug 30, 2012 at 10:39 PM, Casey Deccio ca...@deccio.net wrote: In what way are the lookups failing? Is there an exception? No exception--just failing in that the data should be there, but isn't. At ConsistencyLevel.ONE or QUORUM? If you are bootstrapping the node, I would expect there to be no chance of serving blank reads like this. As auto_bootstrap is set to true by default, I presume you are bootstrapping. Which node are you querying to get the no data response? =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: adding node to cluster
On Thu, Aug 30, 2012 at 10:18 AM, Casey Deccio ca...@deccio.net wrote: I'm adding a new node to an existing cluster that uses ByteOrderedPartitioner. The documentation says that if I don't configure a token, then one will be automatically generated to take load from an existing node. What I'm finding is that when I add a new node, (super) column lookups begin failing (not sure if it was the row lookup failing or the supercolumn lookup failing), and I'm not sure why. 1) You almost never actually want BOP. 2) You never want Cassandra to pick a token for you. IMO and the opinion of many others, the fact that it does this is a bug. Specify a token with initial_token. 3) You never want to use Supercolumns. The project does not support them but currently has no plan to deprecate them. Use composite row keys. 4) Unless your existing cluster consists of one node, you almost never want to add only a single new node to a cluster. In general you want to double it. In summary, you are Doing It just about as Wrong as possible... but on to your actual question ... ! :) In what way are the lookups failing? Is there an exception? =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
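Point 3 can be sketched with plain dicts (illustrative data, not an actual client API): the supercolumn layout folds into composite keys without losing any access pattern.

```python
# Sketch of modeling supercolumn-style data with composite keys instead.

# Supercolumn style (unsupported): row key -> supercolumn -> column -> value
super_style = {"user1": {"address": {"city": "SF", "zip": "94110"}}}

# Composite style: fold the supercolumn name into the row key.
composite_style = {
    ("user1", "address"): {"city": "SF", "zip": "94110"},
}

# Reading one former supercolumn is now a single-row lookup...
assert composite_style[("user1", "address")] == super_style["user1"]["address"]

# ...and "all supercolumns for user1" becomes a key-prefix scan.
def row_group(data, prefix):
    return {k[1]: v for k, v in data.items() if k[0] == prefix}

assert row_group(composite_style, "user1") == super_style["user1"]
```

With ByteOrderedPartitioner (or composite column names under a partitioner you actually want), the prefix scan maps to a contiguous range, so nothing is lost by dropping supercolumns.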
Re: one node with very high loads
On Mon, Aug 27, 2012 at 9:25 AM, Senthilvel Rangaswamy senthil...@gmail.com wrote: We are running 1.1.2 on m1.xlarge with ephemeral store for data. We are seeing very high loads on one of the nodes in the ring, 30+. My first hunch would be that you are sending all client requests to this one node, so it is coordinating 30x as many requests as it should. If that's not the case, if I were you I would attempt to determine if the high i/o is high read or write on the node, via a tool like iotop. You can also compare the tpstats of two nodes with similar uptimes to see if your node is performing more of any stage than other members of its cohort. Once you determine whether it's read or write, determine which files are being read or written.. :) =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: Node forgets about most of its column families
On Thu, Aug 23, 2012 at 11:47 AM, Edward Sargisson edward.sargis...@globalrelay.net wrote: I was wondering if anybody had seen the following behaviour before and how we might detect it and keep the application running. I don't know the answer to your problem, but anyone who does will want to know in what version of Cassandra you are encountering this issue. :) =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: nodetool repair - when is it not needed ?
On Wed, Aug 22, 2012 at 8:37 AM, Senthilvel Rangaswamy senthil...@gmail.com wrote: We are running Cassandra 1.1.2 on EC2. Our database is primarily all counters and we don't do any deletes. Does nodetool repair do anything for such a database? All the docs I read for nodetool repair suggest that it is needed only if there are deletes. Since 1.0, repair is only needed if a node crashes. If a node crashes, my understanding is that a cluster-wide repair (with -pr on each node) is required, because the crashed node could have lost a hint for any other node. https://issues.apache.org/jira/browse/CASSANDRA-2034 =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
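Spelled out, the cluster-wide repair amounts to the loop below. The node list is a placeholder, and the leading echo makes this a dry run that just prints the commands; remove it to actually invoke nodetool against each host.

```shell
# Dry-run sketch of a cluster-wide primary-range (-pr) repair.
# NODES is a placeholder list; replace with your own hosts, and drop
# the "echo" to really run nodetool (one node at a time, in sequence).
NODES="node1 node2 node3"
for host in $NODES; do
  echo nodetool -h "$host" repair -pr
done
```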
Re: Why so slow?
On Sun, Aug 19, 2012 at 11:09 AM, Peter Morris mrpmor...@gmail.com wrote: Is the Windows community edition crippled for network use perhaps, or could the problem be something else? It's not crippled, but it underperforms Cassandra on Linux. Cassandra contains various Linux-specific optimizations which result in improved performance if they can be used. I'm not sure anyone has shiny graphs comparing the two, but I would expect Windows Cassandra to be discernibly less performant. That said, this is not the issue in your OP. :) =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: A few questions about Cassandra's native protocol
On Wed, Aug 22, 2012 at 2:12 AM, Christoph Hack christ...@tux21b.org wrote: 4. Prepared Statements FWIW, while I suppose a client author is technically a user of Cassandra, you appear to be making suggestions related to the development of Cassandra. As I understand the conceptual separation between lists, you probably want to send such mails to cassandra-dev. :D =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: change cluster name
On Wed, Aug 8, 2012 at 10:28 PM, rajesh.ba...@orkash.com rajesh.ba...@orkash.com wrote: i would suggest you delete the files in your system keyspace folder except files like Schema*.*. This thread could have been much shorter with a judicious use of grep, heh ...
user@hostname# grep -i name /etc/cassandra/cassandra.yaml
cluster_name: 'QA Cass Cluster'
user@hostname# grep 'Cass Cluster' /mnt/cassandra/data/system/*
Binary file /mnt/cassandra/data/system/LocationInfo-hd-5-Data.db matches
Just remove the LocationInfo files from the system keyspace when changing cluster names. Nuking the other stuff is not necessary. =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: SSTable format
On Fri, Jul 13, 2012 at 5:18 PM, Dave Brosius dbros...@baybroadband.net wrote: It depends on what partitioner you use. You should be using the RandomPartitioner, and if so, the rows are sorted by the hash of the row key. There are partitioners that sort based on the raw key value, but these partitioners shouldn't be used, as they have problems due to uneven partitioning of data. The formal way this works in the code is that SSTables are ordered by decorated row key, where decoration is only a transformation when you are not using OrderedPartitioner. FWIW, in case you see that DecoratedKey syntax while reading code... =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: cassandra on re-Start
On Mon, Jul 2, 2012 at 5:43 AM, puneet loya puneetl...@gmail.com wrote: When I restarted the system, it is showing that the keyspace does not exist, and it is not even letting me create the keyspace with the same name again. Paste the error you get. =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: Snapshot failing on JSON files in 1.1.0
On Tue, Jun 19, 2012 at 2:55 AM, Alain RODRIGUEZ arodr...@gmail.com wrote: Unable to create hard link from /raid0/cassandra/data/cassa_teads/stats_product-hc-233-Data.db to /raid0/cassandra/data/cassa_teads/snapshots/1340099026781/stats_product-hc-233-Data.db Are you able to create this hard link via the filesystem? I am conjecturing not. Is the snapshots directory perhaps on a different mountpoint than the directory containing the files you are trying to hardlink? =Rob PS - boy, 9 emails in the thread... full of log output... sure don't miss them not being bottom-quoted to every email... :) -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
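One way to test the conjecture by hand: try creating a hard link yourself and check which filesystem each path is on, since hard links cannot cross mountpoints (ln fails with EXDEV, "Invalid cross-device link"). The paths below are stand-ins for the real data and snapshot directories, and the link-count check uses GNU stat.

```shell
# Manually verify that a hard link can be created where the snapshot
# directory lives. Paths are illustrative stand-ins for the real ones.
rm -rf /tmp/hardlink_check
mkdir -p /tmp/hardlink_check/snapshots
echo "data" > /tmp/hardlink_check/test-Data.db
# Both paths must report the same filesystem for ln to succeed:
df -P /tmp/hardlink_check /tmp/hardlink_check/snapshots | awk 'NR>1 {print $NF}'
# Create the hard link; across mountpoints this would fail with EXDEV.
ln /tmp/hardlink_check/test-Data.db /tmp/hardlink_check/snapshots/test-Data.db
# A successful link leaves both names with a link count of 2 (GNU stat):
stat -c %h /tmp/hardlink_check/test-Data.db
```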
Re: Snapshot failing on JSON files in 1.1.0
On Tue, Jun 19, 2012 at 8:55 PM, Rob Coli rc...@palominodb.com wrote: On Tue, Jun 19, 2012 at 2:55 AM, Alain RODRIGUEZ arodr...@gmail.com wrote: Unable to create hard link from /raid0/cassandra/data/cassa_teads/stats_product-hc-233-Data.db to /raid0/cassandra/data/cassa_teads/snapshots/1340099026781/stats_product-hc-233-Data.db Are you able to create this hard link via the filesystem? I am conjecturing not. FWIW, the errno given by the OS and passed through Java is 1: http://freespace.sourceforge.net/errno/linux.html 1 EPERM: Operation not permitted =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: GCInspector works every 10 seconds!
On Mon, Jun 18, 2012 at 12:07 AM, Jason Tang ares.t...@gmail.com wrote: After I enabled the key cache and row cache, the problem was gone. I guess it is because we have lots of data in SSTables, and it takes more time, memory and cpu to search the data. The Key Cache is usually a win if added like this. The Row Cache is less likely to be. If I were you I would check your row cache hit rates to make sure you are actually getting a win. :) =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: 1.1 not removing commit log files?
On Thu, May 31, 2012 at 7:01 PM, aaron morton aa...@thelastpickle.com wrote: But that talks about segments not being cleared at startup. Does not explain why they were allowed to get past the limit in the first place. Perhaps the commit log size tracking for this limit does not, for some reason, track hints? That seems like the obvious explanation, given the state that appears to trigger it. It doesn't explain why the files aren't getting deleted after the hints are delivered, of course... =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: Moving to 1.1
On Wed, May 30, 2012 at 4:08 AM, Vanger disc...@gmail.com wrote: 3) Java 7 is now recommended for use by Oracle. We have several developers running local cassandra instances on it for a while without problems. Anybody tried it in production? Some time ago java 7 wasn't recommended for use with cassandra; what about now? I have a variation of this question, which goes: Now that OpenJDK is the official Java reference implementation, are there plans to make Cassandra support it? https://blogs.oracle.com/henrik/entry/moving_to_openjdk_as_the Cassandra has (had?) a slightly passive-aggressive log message where it refers to any JDK other than Sun's as buggy and suggests that you should upgrade to the Sun JDK. I'm fine with using whatever JDK is technically best, but within the enterprise, using something other than the official reference implementation can be a tough sell. Wondering if people have a view as to the importance and/or feasibility of making OpenJDK supported. =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: commitlog_sync_batch_window_in_ms change in 0.7
On Tue, May 29, 2012 at 10:29 PM, Pierre Chalamet pie...@chalamet.net wrote: You'd better use version 1.0.9 (using this one in production) or 1.0.10. 1.1 is still a bit young to be ready for prod unfortunately. OP described himself as experimenting which I inferred to mean not-production. I agree with others, 1.0.x is what I'd currently recommend for production. :) =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: commitlog_sync_batch_window_in_ms change in 0.7
On Mon, May 28, 2012 at 6:53 AM, osishkin osishkin osish...@gmail.com wrote: I'm experimenting with Cassandra 0.7 for some time now. I feel obligated to recommend that you upgrade to Cassandra 1.1. Cassandra 0.7 is better than 0.6, but I definitely still wouldn't be experimenting with this old version in 2012. =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: nodetool repair taking forever
On Sat, May 19, 2012 at 8:14 AM, Raj N raj.cassan...@gmail.com wrote: Hi experts, [ repair seems to be hanging forever ] https://issues.apache.org/jira/browse/CASSANDRA-2433 Affects 0.8.4. I also believe there is a contemporaneous bug (reported by Stu Hood?) regarding failed repair resulting in extra disk usage, but I can't currently find it in JIRA. =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: Migrating from a windows cluster to a linux cluster.
On Thu, May 24, 2012 at 12:44 PM, Steve Neely sne...@rallydev.com wrote: It also seems like a dark deployment of your new cluster is a great method for testing the Linux-based systems before switching your mission critical traffic over. Monitor them for a while with real traffic and you can have confidence that they'll function correctly when you perform the switchover. FWIW, I would love to see graphs which show their compared performance under identical write load and then show the cut-over point for reads between the two clusters. My hypothesis is that your linux cluster will magically be much more performant/less loaded due to the many linux-specific optimizations in Cassandra, but I'd dig seeing this illustrated in an apples-to-apples sense with real app traffic. =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: Number of keyspaces
On Tue, May 22, 2012 at 4:56 AM, samal samalgo...@gmail.com wrote: Not ideally; cass now has global memtable tuning. Each cf corresponds to memory in RAM. A year-wise cf means it will be in a read-only state for the next year, but the memtable will still consume RAM. An empty memtable seems unlikely to consume a meaningful amount of RAM. I'm sure by reading the code I could estimate how little memory is involved, but I'd be surprised if it is over a few megabytes. This is independent of the other overhead associated with a CF being defined, of course. =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: exception when cleaning up...
On Tue, May 22, 2012 at 3:00 AM, aaron morton aa...@thelastpickle.com wrote: 1) Isolating the node from the cluster to stop write activity. You can either start the node with the -Dcassandra.join_ring=false JVM option or use nodetool disablethrift and disablegossip to stop writes. Note that this will not stop existing Hinted Handoff sessions which target the node. As a result of the last caveat here, I recommend either restarting the node with join_ring set to false or using iptables to firewall off ports 7000 and 9160. If you want to be sure that you have stopped write activity right now, nuking these ports from orbit is the only way to be sure. disablethrift/disablegossip as currently implemented are not sufficient for this goal. https://issues.apache.org/jira/browse/CASSANDRA-4162 =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
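For the iptables route, the rules amount to dropping inbound TCP on the storage and thrift ports. The commands below are printed rather than executed (a dry run; applying them needs root) and assume the default port numbers.

```shell
# Dry-run sketch: block inbound traffic to Cassandra's storage (7000)
# and thrift (9160) ports. Remove "echo" and run as root to apply.
for port in 7000 9160; do
  echo iptables -A INPUT -p tcp --dport "$port" -j DROP
done
```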
Re: Migrating a column family from one cluster to another
On Thu, May 17, 2012 at 9:37 AM, Bryan Fernandez bfernande...@gmail.com wrote: What would be the recommended approach to migrating a few column families from a six node cluster to a three node cluster? The easiest way (if you are not using counters) is:
1) make sure all filenames of sstables are unique [1]
2) copy all sstable files from the 6 nodes to all 3 nodes
3) run a cleanup compaction on the 3 nodes
=Rob [1] https://issues.apache.org/jira/browse/CASSANDRA-1983 -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: Migrating a column family from one cluster to another
On Fri, May 18, 2012 at 1:41 PM, Poziombka, Wade L wade.l.poziom...@intel.com wrote: How do counters affect this? Why would it be different? Oh, actually this is an obsolete caution as of Cassandra 0.8beta1: https://issues.apache.org/jira/browse/CASSANDRA-1938 Sorry! :) =Rob PS - for historical reference, before this ticket the counts were based on the IP addresses of the nodes, and things would be hosed if you did the copy-all-the-sstables operation. It is easy for me to forget that almost no one was using cassandra counters before 0.8, heh. -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: Inconsistent dependencies
On Tue, Apr 24, 2012 at 12:56 PM, Matthias Pfau p...@l3s.de wrote: we just noticed that cassandra is currently published with inconsistent dependencies. The inconsistencies exist between the published pom and the published distribution (tar.gz). I compared hashes of the libs of several versions and the inconsistencies are different each time. However, I have not found a single cassandra release without inconsistencies. Was there ever any answer to this question or resolution to this issue? If not, I suggest that Matthias file a ticket on the Apache Cassandra JIRA. =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: Adding a second datacenter
On Tue, Apr 24, 2012 at 3:24 PM, Bill Au bill.w...@gmail.com wrote: Everything went smoothly until I ran the last step, which is to run nodetool repair on all the nodes in the new data center. Repair is hanging on all the new nodes. I had to hit control-C to break out of it. [ snip ] Did I miss anything or do something wrong? How do I recover from this? http://wiki.apache.org/cassandra/Operations Running nodetool repair: Like all nodetool operations in 0.7, repair is blocking: it will wait for the repair to finish and then exit. This may take a long time on large data sets. Since 0.7, all nodetool operations are blocking. While repair does in fact have bugs which make it possible for it to hang in all extant release versions, the fact that nodetool repair (hopefully you were using the -pr option?) takes a long time to return does not indicate that it is hanging. If you see repair and AES messages in system.log, it is probably not in fact hung. If you don't see said messages for a long time, it might be hung, in which case the only remedy currently available to you is to restart the affected nodes. =Rob PS - I know this is a reply on a relatively old thread, and I think you may have received assistance on another thread after this one. If so, apologies! -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: getting status of long running repair
On Fri, May 4, 2012 at 10:30 AM, Bill Au bill.w...@gmail.com wrote: I know repair may take a long time to run. I am running repair on a node with about 15 GB of data and it is taking more than 24 hours. Is that normal? Is there any way to get status of the repair? tpstats does show 2 active and 2 pending AntiEntropySessions. But netstats and compactionstats show no activity. As indicated by various recent threads to this effect, many versions of cassandra (including current 1.0.x release) contain bugs which sometimes prevent repair from completing. The other threads suggest that some of these bugs result in the state you are in now, where you do not see anything that looks like appropriate activity. Unfortunately the only solution offered on these other threads is the one I will now offer, which is to restart the participating nodes and re-start the repair. I am unaware of any JIRA tickets tracking these bugs (which doesn't mean they don't exist, of course) so you might want to file one. :) =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
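A quick way to check for recent repair activity is to grep system.log for AntiEntropy messages. The sample log below is fabricated to show the shape of the check; on a real node you would grep the actual system.log (its path varies by install, e.g. /var/log/cassandra/system.log).

```shell
# Look for repair / AntiEntropy activity; a canned sample log stands in
# for the real system.log so the check itself is visible.
cat > /tmp/system.log.sample <<'EOF'
 INFO [AntiEntropySessions:2] 2012-05-04 10:31:02,100 AntiEntropyService.java Starting repair session
 INFO [CompactionExecutor:5] 2012-05-04 10:32:10,200 CompactionTask.java Compacted sstables
 INFO [AntiEntropyStage:1] 2012-05-04 10:35:44,300 AntiEntropyService.java Received merkle tree
EOF
# On a real node: grep -ci antientropy /var/log/cassandra/system.log
grep -c -i antientropy /tmp/system.log.sample
```

A count that stops growing over a long window is the "no said messages" case described above.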
Re: JNA + Cassandra security
On Mon, Apr 30, 2012 at 6:48 PM, Jonathan Ellis jbel...@gmail.com wrote: On Mon, Apr 30, 2012 at 7:49 PM, Cord MacLeod cordmacl...@gmail.com wrote: Hello group, I'm a new Cassandra and Java user so I'm still trying to get my head around a few things. If you've disabled swap on a machine what is the reason to use JNA? Faster snapshots, giving hints to the page cache with fadvise. If you are running in Linux, you really do want this enabled. Otherwise, for example, compaction blows out your page cache. (FWIW, in case it is not immediately apparent what sort of hints Cassandra might give to the page cache with fadvise..) https://issues.apache.org/jira/browse/CASSANDRA-1470 =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: Question regarding major compaction.
On Tue, May 1, 2012 at 4:31 AM, Henrik Schröder skro...@gmail.com wrote: But what's the difference between doing an extra read from that One Big File, than doing an extra read from whatever SSTable happens to be largest in the course of automatic minor compaction? The primary differences, as I understand it, are that the index performance and bloom filter false positive rate for your One Big File are worse. First, you are more likely to get a bloom filter false positive due to the intrinsic degradation of bloom filter performance as the number of keys increases. Next, after traversing the SSTable index to get to the closest indexed key, you will be forced to scan past more keys which are not your key in order to get to the key which is your key. So I'm still confused. I don't see a significant difference between doing the occasional major compaction or leaving it to do automatic minor compactions. What am I missing? Reads will continually degrade with automatic minor compactions as well, won't they? I still don't really understand what precisely continually degrade means here either, FWIW, or the two operating paradigms being compared under what sort of workloads. As a simple example, I don't believe performance will continually do anything if your workload does not issue logical UPDATE or DELETE to rows. The documentation statement seems confusingly-vaguely-yet-strongly phrased, even if true. =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: Taking a Cluster Wide Snapshot
I copied all the snapshots from each individual node, where the snapshot data size was around 12Gb per node, to a common folder (one folder alone). Strangely I found duplicate file names in multiple snapshots, and more strangely the data size of each duplicate file was different, which led to a total data size close to 13Gb (the rest must have been overwritten), whereas the expectation was 12*6 = 72Gb. You have detected via experimentation that sstable filenames are namespaced per CF per node, and so are not globally unique. In order to do the operation you are doing, you have to rename them to be globally unique. Inflating the integer part is the easiest way. https://issues.apache.org/jira/browse/CASSANDRA-1983 =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
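A sketch of the renaming, assuming the 1.1-era naming scheme name-hc-<generation>-Component.db and adding a distinct per-node offset to the generation number. Only the -Data.db component is shown; in practice every component of an sstable (Index, Filter, etc.) must be renamed consistently. The paths and offset scheme are illustrative.

```shell
# Make sstable filenames globally unique before merging per-node
# snapshots into one directory, by inflating the generation number
# with a per-node offset. Only -Data.db files are shown here; rename
# all components of each sstable the same way.
base=/tmp/sstable_merge_demo
rm -rf "$base"; mkdir -p "$base/merged" "$base/node1" "$base/node2"
touch "$base/node1/stats_product-hc-233-Data.db" \
      "$base/node2/stats_product-hc-233-Data.db"
offset=0
for node in node1 node2; do
  offset=$((offset + 10000))   # a distinct generation range per node
  for f in "$base/$node"/*-Data.db; do
    name=$(basename "$f")
    gen=$(echo "$name" | sed -E 's/.*-hc-([0-9]+)-Data\.db/\1/')
    cp "$f" "$base/merged/$(echo "$name" | sed "s/-hc-$gen-/-hc-$((gen + offset))-/")"
  done
done
ls "$base/merged"
```

The two colliding stats_product-hc-233-Data.db files end up as distinct generations (10233 and 20233) in the merged folder.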
Re: Taking a Cluster Wide Snapshot
On Thu, Apr 26, 2012 at 10:38 PM, Shubham Srivastava shubham.srivast...@makemytrip.com wrote: On another thought I could also try copying the data of my keyspace alone from one node to another node in the new cluster (I have both the old and new clusters having same nodes DC1:6,DC2:6 with same tokens) with the same tokens. Would there be any risk of the new cluster getting joined to the old cluster probably if the data inside keyspace is aware of the original IP's etc. As a result of this very concern while @ Digg... https://issues.apache.org/jira/browse/CASSANDRA-769 tl;dr : as long as your cluster names are unique in your cluster config (**and you do not copy the System keyspace, letting the new cluster initialize with the new cluster name**), nodes are at no risk of joining the wrong cluster. =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: Resident size growth
On Tue, Apr 10, 2012 at 8:40 AM, ruslan usifov ruslan.usi...@gmail.com wrote: mmap doesn't depend on jna FWIW, this confusion is as a result of the use of *mlockall*, which is used to prevent mmapped files from being swapped, which does depend on JNA. =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: Off-heap row cache and mmapped sstables
On 4/12/12, Omid Aladini omidalad...@gmail.com wrote: Cassandra issues an mlockall [1] before mmap-ing sstables to prevent the kernel from paging out heap space in favor of memory-mapped sstables. I was wondering, what happens to the off-heap row cache (saved or unsaved)? Is it possible that the kernel pages out off-heap row cache in favor of resident mmap-ed sstable pages? For what it's worth, I find this conjecture plausible given my understanding of the Cassandra ticket which resulted in the use of JNA+mlockall. I'd love to hear an opinion from someone from the project with more in-depth knowledge. :) =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: Will Cassandra balance load across replicas?
On Thu, Apr 5, 2012 at 9:22 AM, zhiming shen zhiming.s...@gmail.com wrote: Thanks for your reply. My question is about the impact of replication on load balancing. Say we have nodes ABCD... in the ring. ReplicationFactor is 3 so the data on A will also have replicas on B and C. If we are reading data own by A, and A is already very busy, will the requests be forwarded to B and C? How about update requests? Google cassandra dynamic snitch. -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: Linux Filesystem for Cassandra
On Wed, Apr 4, 2012 at 1:15 PM, Michael Widmann michael.widm...@gmail.com wrote: If you wanna use - ZFS - use smartos / openindiana and cassandra on top dont work around with a FUSE FS. Maybe BSD (not knowing their version of zfs / zpool) http://zfsonlinux.org/ (I can't vouch for it, but FYI this is non-FUSE ZFS for linux, seems actively developed etc.) =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: multi region EC2
On Mon, Mar 26, 2012 at 3:31 PM, Deno Vichas wrote: but what if i already have a bunch (8g per node) of data that i need and i don't have a way to re-create it. Note that the below may have unintended consequences if using Counter column families. It actually can be done with the cluster running; below is the least tricky version of this process.
a) stop writing to your cluster
b) do a major compaction and then stop the cluster
c) ensure globally unique filenames for all sstable files, for all cfs, on all nodes
d) copy all sstables to all new nodes
e) start the cluster, join the new nodes, run cleanup compactions
=Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: Nodetool snapshot, consistency and replication
On Mon, Apr 2, 2012 at 9:19 AM, R. Verlangen ro...@us2.nl wrote: - 3 node cluster - RF = 3 - fully consistent (not measured, but let's say it is) Is it true that when I take a snaphot at only one of the 3 nodes this contains all the data in the cluster (at least 1 replica)? Yes. =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb