Re: DELETE does not delete :)
On 07.10.2013 08:02, Alexander Shutyaev wrote:
* We have not modified any *consistency settings* in our app, so I assume we have the *default QUORUM* (2 out of 3 in our case) consistency *for reads and writes*.

cqlsh uses ONE by default, and pycassa uses ONE by default too. I have no experience with DataStax's Java driver, but I'd assume it uses ONE by default as well. A quick grep in the source tells me this too:

./driver-core/src/main/java/com/datastax/driver/core/QueryOptions.java:
 * The default consistency level for queries: {@code ConsistencyLevel.ONE}.
public static final ConsistencyLevel DEFAULT_CONSISTENCY_LEVEL = ConsistencyLevel.ONE;

M.
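To make the quorum arithmetic above concrete, here's a small Python sketch (illustrative only, not driver code): QUORUM reads and writes are guaranteed to overlap in at least one replica, while ONE gives no such guarantee.

```python
# Toy sketch of consistency-level arithmetic (not Cassandra code).
# quorum(rf) = rf // 2 + 1, e.g. 2 out of 3 as mentioned above.

def quorum(rf):
    return rf // 2 + 1

rf = 3
q = quorum(rf)         # 2 out of 3
print(q)               # 2

# Overlap guarantee: writes touch q replicas, reads touch q replicas,
# and q + q > rf means the two sets must share at least one replica.
print(q + q > rf)      # True  -> a QUORUM read always sees a QUORUM write
print(1 + 1 > rf)      # False -> CL.ONE writes + CL.ONE reads may miss each other
```

This is why an app writing at the driver's default ONE and reading at ONE can observe "deleted" data reappearing on some coordinators.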
Re: Disappearing index data.
I had a similar issue (reported many times here; there's also a JIRA issue, but the people reporting this problem were unable to reproduce it). What I can say is that for me the solution was to run a major compaction on the index CF via JMX. To be clear - we're not talking about compacting the CF that IS indexed (your CF), but the internal Cassandra one responsible for storing the index data. The MBean you should look for looks like this: org.apache.cassandra.db:type=IndexColumnFamilies,keyspace=KS,columnfamily=CF.IDX

M.

On 07.10.2013 15:22, Tom van den Berge wrote:
On a 2-node cluster with replication factor 2, I have a column family with an index on one of the columns. Every now and then, I notice that a lookup of the record through the index on node 1 produces the record, but the same lookup on node 2 does not! If I do a lookup by row key, the record is found, and the indexed value is there. So as far as I can tell, the index on one of the nodes loses values and is no longer in sync with the other node, even though the replication factor requires it. I typically repair these issues by storing the indexed column value again. The indexed data is static data; it doesn't change. I'm running Cassandra 1.2.3, and I'm running a nodetool repair on each node every day (although this does not fix this problem). This problem worries me a lot; I don't have a clue about the cause of it. Any help would be greatly appreciated.

Tom
Re: Cassandra Heap Size for data more than 1 TB
I was experimenting with 128 vs. 512 some time ago and I was unable to see any difference in terms of performance. I'd probably have checked 1024 too, but we migrated to 1.2 and heap space was not an issue anymore.

M.

On 02.10.2013 16:32, srmore wrote:
I changed my index_interval from 128 to 512; does it make sense to increase it more than this?

On Wed, Oct 2, 2013 at 9:30 AM, cem cayiro...@gmail.com wrote:
Have a look at index_interval. Cem.

On Wed, Oct 2, 2013 at 2:25 PM, srmore comom...@gmail.com wrote:
The version of Cassandra I am using is 1.0.11; we are migrating to 1.2.X though. We had tuned bloom filters (0.1) and AFAIK making it lower than this won't matter. Thanks!

On Tue, Oct 1, 2013 at 11:54 PM, Mohit Anchlia mohitanch...@gmail.com wrote:
Which Cassandra version are you on? Essentially heap size is a function of the number of keys/metadata. In Cassandra 1.2 a lot of the metadata, like bloom filters, was moved off heap.

On Tue, Oct 1, 2013 at 9:34 PM, srmore comom...@gmail.com wrote:
Does anyone know what would roughly be the heap size for Cassandra with 1 TB of data? We started with about 200 G and now on one of the nodes we are already at 1 TB. We were using 8G of heap and that served us well up until we reached 700 G, where we started seeing failures and nodes flipping. With 1 TB of data the node refuses to come back due to lack of memory. Needless to say, repairs and compactions take a lot of time. We upped the heap from 8 G to 12 G and suddenly everything started moving rapidly, i.e. the repair tasks and the compaction tasks. But soon (in about 9-10 hrs) we started seeing the same symptoms as with 8 G. So my question is: how do I determine the optimal heap size for around 1 TB of data? Following are some of my JVM settings: -Xms8G -Xmx8G -Xmn800m -XX:NewSize=1200M -XX:MaxTenuringThreshold=2 -XX:SurvivorRatio=4 Thanks!
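To illustrate what index_interval trades off, here's a toy Python sketch (not Cassandra's implementation; key names and sizes are made up): keep every Nth key of the sorted on-disk index in memory, binary-search the sample, then scan at most N index entries on disk. A larger interval means a smaller in-memory sample (less heap) at the cost of longer scans.

```python
import bisect

def build_sample(sorted_keys, interval):
    """Keep every `interval`-th key in memory, like an index sample."""
    return [(k, i) for i, k in enumerate(sorted_keys) if i % interval == 0]

def lookup(sorted_keys, sample, key):
    """Binary-search the in-memory sample, then scan at most `interval` entries."""
    keys_only = [k for k, _ in sample]
    pos = bisect.bisect_right(keys_only, key) - 1
    if pos < 0:
        return None
    _, start = sample[pos]
    for i in range(start, len(sorted_keys)):
        if sorted_keys[i] == key:
            return i
        if sorted_keys[i] > key:
            return None
    return None

keys = ["k%05d" % i for i in range(10000)]
small = build_sample(keys, 128)   # more sample entries -> more heap, shorter scans
large = build_sample(keys, 512)   # ~4x fewer entries -> less heap, scans up to 4x longer
print(len(small), len(large))     # 79 20
print(lookup(keys, large, "k01234"))  # 1234
```

The thread above matches this: bumping the interval from 128 to 512 shrinks the sample roughly 4x, which mostly matters for heap, not read latency.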
Re: Cassandra Heap Size for data more than 1 TB
Currently we have 480-520 GB of data per node, so it's not even close to 1 TB, but I'd bet that reaching 700-800 GB shouldn't be a problem in terms of everyday performance - heap usage is quite low, no GC issues etc. (To give you a comparison: when working on 1.1 and having ~300-400 GB per node we had a huge problem with bloom filters and heap space, so we had to bump the heap to 12-16 GB; on 1.2 it's not an issue anymore.) However, our main concern is the time we'd need to rebuild a broken node, so we are going to extend the cluster soon to avoid such problems and keep our nodes about 50% smaller.

M.

On 03.10.2013 15:02, srmore wrote:
Thanks Mohit and Michael, that's what I thought. I have tried all the avenues and will give ParNew a try. With 1.0.xx I have issues when data sizes go up; hopefully that will not be the case with 1.2. Just curious, has anyone tried 1.2 with a large data set, around 1 TB? Thanks!

On Thu, Oct 3, 2013 at 7:20 AM, Michał Michalski mich...@opera.com wrote:
I was experimenting with 128 vs. 512 some time ago and I was unable to see any difference in terms of performance. I'd probably have checked 1024 too, but we migrated to 1.2 and heap space was not an issue anymore. M.

On 02.10.2013 16:32, srmore wrote:
I changed my index_interval from 128 to 512; does it make sense to increase it more than this?

On Wed, Oct 2, 2013 at 9:30 AM, cem cayiro...@gmail.com wrote:
Have a look at index_interval. Cem.

On Wed, Oct 2, 2013 at 2:25 PM, srmore comom...@gmail.com wrote:
The version of Cassandra I am using is 1.0.11; we are migrating to 1.2.X though. We had tuned bloom filters (0.1) and AFAIK making it lower than this won't matter. Thanks!

On Tue, Oct 1, 2013 at 11:54 PM, Mohit Anchlia mohitanch...@gmail.com wrote:
Which Cassandra version are you on? Essentially heap size is a function of the number of keys/metadata. In Cassandra 1.2 a lot of the metadata, like bloom filters, was moved off heap.
On Tue, Oct 1, 2013 at 9:34 PM, srmore comom...@gmail.com wrote:
Does anyone know what would roughly be the heap size for Cassandra with 1 TB of data? We started with about 200 G and now on one of the nodes we are already at 1 TB. We were using 8G of heap and that served us well up until we reached 700 G, where we started seeing failures and nodes flipping. With 1 TB of data the node refuses to come back due to lack of memory. Needless to say, repairs and compactions take a lot of time. We upped the heap from 8 G to 12 G and suddenly everything started moving rapidly, i.e. the repair tasks and the compaction tasks. But soon (in about 9-10 hrs) we started seeing the same symptoms as with 8 G. So my question is: how do I determine the optimal heap size for around 1 TB of data? Following are some of my JVM settings: -Xms8G -Xmx8G -Xmn800m -XX:NewSize=1200M -XX:MaxTenuringThreshold=2 -XX:SurvivorRatio=4 Thanks!
Re: Recommended hardware
Hi Tim, Not sure if you've seen this, but I'd start with DataStax's documentation: http://www.datastax.com/documentation/cassandra/2.0/webhelp/index.html#cassandra/architecture/architecturePlanningAbout_c.html?pagename=docsversion=1.2file=cluster_architecture/cluster_planning

Taking a look at the mailing list archive might be useful too.

M.

On 23.09.2013 18:17, Tim Dunphy wrote:
Hello, I am running Cassandra 2.0 on a 2 GB memory / 10 GB HD instance in a virtual cloud environment. It's supporting a PHP application running on the same node. Mostly this instance runs smoothly, but it runs low on memory. Depending on how much the site is used, the VM will sometimes swap excessively. I realize this setup may not be enough to support a Cassandra instance. I was wondering if there were any recommended hardware specs someone could point me to, for both physical and virtual (cloud) environments. Thank you, Tim. Sent from my iPhone
Re: Row size in cfstats vs cfhistograms
I believe the reason is that cfhistograms tells you about the sizes of the rows returned by a given node in response to read requests, while cfstats tracks the largest row stored on a given node.

M.

On 19.09.2013 11:31, Rene Kochen wrote:
Hi all, I use Cassandra 1.0.11. If I do cfstats for a particular column family, I see a Compacted row maximum size of 43388628. However, when I do a cfhistograms I do not see such a big row in the Row Size column; the biggest row there is 126934. Can someone explain this? Thanks! Rene
Re: Why don't you start off with a “single small” Cassandra server as you usually do it with MySQL?
You might be interested in this: http://mail-archives.apache.org/mod_mbox/cassandra-user/201308.mbox/%3ccaeqobhpav25pcgjfwbkmd1rzxvrif94e6lpybpj3mu_bqn9...@mail.gmail.com%3E

M.

On 18.09.2013 15:34, Ertio Lew wrote:
For any website just starting out, the load is minimal initially and grows at a slow pace. People usually start their MySQL-based sites with a single server (that too a VPS, not a dedicated server) running as both the app server and the DB server, and usually get quite far with this setup; only when they feel the need do they separate the DB from the app server, giving it a separate VPS. This is how a startup expects things to be while planning resource procurement. But so far, what I have seen with Cassandra is something very different. People usually recommend starting out with at least a 3-node cluster (on dedicated servers) with lots and lots of RAM; 4 GB or 8 GB is what they suggest to start with. So is it that Cassandra requires more hardware resources than MySQL for a website to deliver similar performance, serve a similar load/traffic and the same amount of data? I understand the higher storage requirements of Cassandra due to replication, but what about other hardware resources? Can't we start off with Cassandra-based apps just like MySQL, beginning with 1 or 2 VPSes and adding more whenever there's a need? I don't want to compare apples with oranges. I just want to know how much more dangerous a situation I may be in when I start out with a single-node VPS-based Cassandra installation vs. a single-node VPS-based MySQL installation - the difference between these two situations. Are Cassandra servers more prone to being unavailable than MySQL servers? What is bad about putting Tomcat along with Cassandra on a single server, as people do with the LAMP stack?
- *This question is also posted at StackOverflow (http://stackoverflow.com/questions/18462530/why-dont-you-start-off-with-a-single-small-cassandra-server-as-you-usually) and has an open bounty worth +50 rep.*
Re: cassandra disk access
2. when Cassandra looks up a key in an sstable (assuming the bloom filter and other structures failed, and assuming the key is located in this single sstable), Cassandra does NOT use sequential I/O. It will probably read a hash-table slot or similar structure, then do another disk seek to get the value (and probably the key). Another seek will probably be needed too, and if there is a key collision, additional seeks will be needed.

It will use the Index Sample (RAM) first, then it will use the full Index (disk), and finally it will read the data from the SSTable (disk). There's no such thing as a collision in this case.

3. once the data (e.g. the row) is located, a sequential read of the entire row will occur. (Once again I assume there is a single, well-compacted sstable.) Also, if the disk is not fragmented, the data will be placed on disk sectors one after the other.

Yes, this is how I understand it too.

M.
Re: cassandra disk access
I'm not sure how accurate it is (it's from 2011, and one of its sources is from 2010), but I'm pretty sure it's more or less OK: http://blog.csdn.net/firecoder/article/details/7019435

M.

On 07.08.2013 10:34, Nikolay Mihaylov wrote:
Thanks. "It will use the Index Sample (RAM) first, then it will use the full Index (disk), and finally it will read the data from the SSTable (disk). There's no such thing as a collision in this case." So it still has 2 seeks :) Where can I see the internal structure of the sstable? I tried to find it documented but was unable to find anything.

On Wed, Aug 7, 2013 at 11:27 AM, Michał Michalski mich...@opera.com wrote:
2. when Cassandra looks up a key in an sstable (assuming the bloom filter and other structures failed, and assuming the key is located in this single sstable), Cassandra does NOT use sequential I/O. It will probably read a hash-table slot or similar structure, then do another disk seek to get the value (and probably the key). Another seek will probably be needed too, and if there is a key collision, additional seeks will be needed.

It will use the Index Sample (RAM) first, then it will use the full Index (disk), and finally it will read the data from the SSTable (disk). There's no such thing as a collision in this case.

3. once the data (e.g. the row) is located, a sequential read of the entire row will occur. (Once again I assume there is a single, well-compacted sstable.) Also, if the disk is not fragmented, the data will be placed on disk sectors one after the other.

Yes, this is how I understand it too.

M.
Re: memtable overhead
Not sure how up-to-date this info is, but from some discussions that happened here a long time ago I remember that a minimum of 1 MB per Memtable needs to be allocated. The other constraint here is the memtable_total_space_in_mb setting in cassandra.yaml, which you might wish to tune when having a lot of CFs.

M.

On 23.07.2013 07:12, Darren Smythe wrote:
The way we've gone about our data models has resulted in lots of column families, and I'm just looking for guidelines about how much space each column family adds. TIA

On Sun, Jul 21, 2013 at 11:19 PM, Darren Smythe darren1...@gmail.com wrote:
Hi, how much overhead (in heap MB) does an empty memtable use? If I have many column families that aren't written to often, how much memory do these take up? TIA -- Darren
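A quick back-of-envelope sketch of the point above (the 1 MB floor is the figure from the discussion, and the CF count and yaml value here are made-up examples):

```python
# Rough arithmetic only: if each memtable has a ~1 MB minimum footprint,
# many mostly-idle column families still cost real heap.
MIN_MEMTABLE_MB = 1                  # assumed per-CF floor discussed above
memtable_total_space_in_mb = 2048    # example cassandra.yaml value

def idle_overhead_mb(n_column_families):
    """Heap consumed by empty memtables alone, before any writes land."""
    return n_column_families * MIN_MEMTABLE_MB

print(idle_overhead_mb(500))   # 500 -> 500 MB of heap for 500 idle CFs
print(idle_overhead_mb(500) < memtable_total_space_in_mb)  # True, but ~1/4 of the budget is gone
```

So with hundreds of CFs it's worth checking memtable_total_space_in_mb against this floor.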
Re: Cassandra 2 vs Java 1.6
I believe it won't run on 1.6. Java 1.7 is required to compile C* 2.0+, and once it's compiled you cannot run it using Java 1.6 (this is what the Unsupported major.minor version error tells you; class file version 50 is Java 1.6 and 51 is Java 1.7).

M.

On 22.07.2013 10:06, Andrew Cobley wrote:
I know it was decided to drop the requirement for Java 1.6 for Cassandra some time ago, but my question is: should 2.0.0-beta1 (http://www.apache.org/dyn/closer.cgi?path=/cassandra/2.0.0/apache-cassandra-2.0.0-beta1-bin.tar.gz) run under Java 1.6 at all? I tried and got the following error:

macaroon:bin administrator$ Exception in thread main java.lang.UnsupportedClassVersionError: org/apache/cassandra/service/CassandraDaemon : Unsupported major.minor version 51.0
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)

macaroon:bin administrator$ java -version
java version 1.6.0_43
Java(TM) SE Runtime Environment (build 1.6.0_43-b01-447-10M4203)
Java HotSpot(TM) 64-Bit Server VM (build 20.14-b01-447, mixed mode)

It's fine by me if that's the case! Andy

The University of Dundee is a registered Scottish Charity, No: SC015096
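The version number in that error comes straight from the class file header, which is easy to inspect yourself. A small Python sketch (the byte layout is from the JVM class file format: u4 magic, u2 minor, u2 major):

```python
import struct

def class_file_version(data: bytes) -> int:
    """Read the magic and major version from the first 8 bytes of a .class file."""
    magic, _minor, major = struct.unpack(">IHH", data[:8])
    if magic != 0xCAFEBABE:
        raise ValueError("not a class file")
    return major

# A class compiled for Java 7 starts like this (0x33 == 51):
header = bytes([0xCA, 0xFE, 0xBA, 0xBE, 0x00, 0x00, 0x00, 0x33])
print(class_file_version(header))                     # 51
print({50: "1.6", 51: "1.7"}[class_file_version(header)])  # 1.7
```

So "Unsupported major.minor version 51.0" on a 1.6 JVM means exactly what's said above: the jar was compiled for Java 7.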
Re: is there a key to sstable index file?
SSTables are immutable - once they're written to disk, they cannot be changed. On read, C* checks *all* SSTables [1], but to make this faster it uses Bloom Filters, which can tell you that a row is *not* in a specific SSTable, so you don't have to read that SSTable at all. However, *if* you do have to read it, you don't read the whole SSTable - there's an in-memory Index Sample that is used for a binary search, returning only a (relatively) small block of the real (full, on-disk) index, which you then scan to find the place in the SSTable to retrieve the data from. Additionally, there's a KeyCache to make reads faster - it points to the location of the data in the SSTable, so you don't have to touch the Index Sample and Index at all. Once C* retrieves all the data parts (including the Memtable part), timestamps are used to find the most recent version of the data.

[1] I believe this is not true in all cases, as I saw a piece of code somewhere in the source that starts checking SSTables in order from the newest to the oldest one (in terms of data timestamps - AFAIR SSTable metadata stores info about the smallest and largest timestamp in the SSTable), and once the newest data for all columns has been retrieved (assuming the schema is defined), retrieval stops and older SSTables are not checked. If someone could confirm that it works this way, and that it's not something I saw in a dream and now believe is real, I'd be glad ;-)

On 17.07.2013 22:58, S Ahmed wrote:
Since SSTables are mutable, and they are ordered, does this mean that there is an index of the key ranges that each SSTable holds, and the value could be in 1 or more sstables that have to be scanned, with the latest one chosen? E.g. say I write a value abc to CF1. This gets stored in an sstable. Then I write def to CF1; this eventually gets stored in another sstable. Now when I go to fetch the value, it has to scan 2 sstables and then figure out which is the latest entry, correct?
So is there an index of keys to sstables, and can there be 1 or more sstables per key? (This is assuming compaction hasn't occurred yet.)
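The bloom-filter step described above can be sketched with a toy filter, one per sstable (illustrative only; Cassandra's real filters use different hashing and sizing):

```python
import hashlib

class BloomFilter:
    """Toy bloom filter: may answer 'maybe present', never a false negative."""
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, key):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        return all(self.bits & (1 << p) for p in self._positions(key))

# One filter per sstable: skip any sstable whose filter rules the key out.
sstable_filters = [BloomFilter(), BloomFilter()]
sstable_filters[0].add("abc")
sstable_filters[1].add("def")
to_read = [i for i, bf in enumerate(sstable_filters) if bf.might_contain("abc")]
print(to_read)  # almost certainly [0]; false positives are possible, misses are not
```

So a key present in one sstable typically costs one sstable read, not a scan of all of them; only on (rare) false positives is an extra sstable consulted and found empty.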
Re: is there a key to sstable index file?
Thanks! :-)

M.

On 18.07.2013 08:42, Jean-Armel Luce wrote:
@Michal: look at this for the improvement of read performance: https://issues.apache.org/jira/browse/CASSANDRA-2498 Best regards. Jean Armel

2013/7/18 Michał Michalski mich...@opera.com
SSTables are immutable - once they're written to disk, they cannot be changed. On read, C* checks *all* SSTables [1], but to make this faster it uses Bloom Filters, which can tell you that a row is *not* in a specific SSTable, so you don't have to read that SSTable at all. However, *if* you do have to read it, you don't read the whole SSTable - there's an in-memory Index Sample that is used for a binary search, returning only a (relatively) small block of the real (full, on-disk) index, which you then scan to find the place in the SSTable to retrieve the data from. Additionally, there's a KeyCache to make reads faster - it points to the location of the data in the SSTable, so you don't have to touch the Index Sample and Index at all. Once C* retrieves all the data parts (including the Memtable part), timestamps are used to find the most recent version of the data.

[1] I believe this is not true in all cases, as I saw a piece of code somewhere in the source that starts checking SSTables in order from the newest to the oldest one (in terms of data timestamps - AFAIR SSTable metadata stores info about the smallest and largest timestamp in the SSTable), and once the newest data for all columns has been retrieved (assuming the schema is defined), retrieval stops and older SSTables are not checked. If someone could confirm that it works this way, and that it's not something I saw in a dream and now believe is real, I'd be glad ;-)

On 17.07.2013 22:58, S Ahmed wrote:
Since SSTables are mutable, and they are ordered, does this mean that there is an index of the key ranges that each SSTable holds, and the value could be in 1 or more sstables that have to be scanned, with the latest one chosen? E.g. say I write a value abc to CF1. This gets stored in an sstable.
Then I write def to CF1; this eventually gets stored in another sstable. Now when I go to fetch the value, it has to scan 2 sstables and then figure out which is the latest entry, correct? So is there an index of keys to sstables, and can there be 1 or more sstables per key? (This is assuming compaction hasn't occurred yet.)
Re: manually removing sstable
Hi Aaron,

* Tombstones will only be purged if all fragments of a row are in the SSTable(s) being compacted.

To my knowledge, that's not necessarily true. In a specific case this patch comes into play: https://issues.apache.org/jira/browse/CASSANDRA-4671 - "We could however purge tombstones if we know that the non-compacted sstables don't have any info that is older than the tombstones we're about to purge (since then we know that the tombstones we'll consider can't delete data in non-compacted sstables)."

M.

On 12.07.2013 10:25, aaron morton wrote:
That sounds sane to me. A couple of caveats:
* Remember that Expiring Columns turn into Tombstones and can only be purged after TTL and gc_grace.
* Tombstones will only be purged if all fragments of a row are in the SSTable(s) being compacted.
Cheers
- Aaron Morton, Cassandra Consultant, New Zealand, @aaronmorton, http://www.thelastpickle.com

On 11/07/2013, at 10:17 PM, Theo Hultberg t...@iconara.net wrote:
A colleague of mine came up with an alternative solution that also seems to work, and I'd just like your opinion on whether it's sound. We run find to list all old sstables, and then use cmdline-jmxclient to run the forceUserDefinedCompaction function on each of them. This is roughly what we do (but with find and xargs to orchestrate it):

java -jar cmdline-jmxclient-0.10.3.jar - localhost:7199 org.apache.cassandra.db:type=CompactionManager forceUserDefinedCompaction=the_keyspace,db_file_name

The downside is that C* needs to read the file and do disk I/O, but the upside is that it doesn't require a restart. C* does a little more work, but we can schedule that during off-peak hours. Another upside is that it feels like we're pretty safe from screwups: we won't accidentally remove an sstable with live data; the worst case is that we ask C* to compact an sstable with live data and end up with an identical sstable.
If anyone else wants to do the same thing, this is the full cron command:

0 4 * * * find /path/to/cassandra/data/the_keyspace_name -maxdepth 1 -type f -name '*-Data.db' -mtime +8 -printf forceUserDefinedCompaction=the_keyspace_name,\%P\n | xargs -t --no-run-if-empty java -jar /usr/local/share/java/cmdline-jmxclient-0.10.3.jar - localhost:7199 org.apache.cassandra.db:type=CompactionManager

Just change the keyspace name and the path to the data directory. T#

On Thu, Jul 11, 2013 at 7:09 AM, Theo Hultberg t...@iconara.net wrote:
Thanks a lot. I can confirm that it solved our problem too. Looks like the C* 2.0 feature is perfect for us. T#

On Wed, Jul 10, 2013 at 7:28 PM, Marcus Eriksson krum...@gmail.com wrote:
Yep, that works; you need to remove all components of the sstable though, not just -Data.db. And in 2.0 there is this: https://issues.apache.org/jira/browse/CASSANDRA-5228 /Marcus

On Wed, Jul 10, 2013 at 2:09 PM, Theo Hultberg t...@iconara.net wrote:
Hi, I think I remember reading that if you have sstables that you know contain only data whose TTL has expired, it's safe to remove them manually by stopping C*, removing the *-Data.db files and then starting C* up again. Is this correct? We have a cluster where everything is written with a TTL, and sometimes C* needs to compact over 100 GB of sstables that we know have entirely expired, and we'd rather just get rid of those manually. T#
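The CASSANDRA-4671 purge rule mentioned above can be sketched in a few lines of Python (timestamps are illustrative; this is the logical condition, not Cassandra's code):

```python
# A tombstone can be dropped during a compaction if every sstable NOT taking
# part in that compaction only holds data newer than the tombstone - then
# the tombstone cannot possibly shadow anything outside the compaction set.

def can_purge(tombstone_ts, other_sstable_min_timestamps):
    return all(min_ts > tombstone_ts for min_ts in other_sstable_min_timestamps)

print(can_purge(100, [150, 200]))  # True: nothing older elsewhere, safe to purge
print(can_purge(100, [50, 200]))   # False: an older fragment may exist outside
print(can_purge(100, []))          # True: nothing outside the compaction at all
```

This is why "all fragments of a row in the compacted sstable(s)" is sufficient but not necessary: sstable-level min timestamps can prove safety without row-level checks.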
Re: Deletion use more space.
Deletion does not really remove data; rather, it adds tombstones (deletion markers). They'll later be merged with the existing data during compaction and - in the end (see: gc_grace_seconds) - removed, but until then they take up some space. http://wiki.apache.org/cassandra/DistributedDeletes

M.

On 16.07.2013 11:46, 杨辉强 wrote:
Hi all, I use Cassandra 1.2.4; I have a 4-node ring and use the byte order partitioner. I had inserted about 200 GB of data into the ring over the previous days. Today I wrote a program to scan the ring and, at the same time, delete the items that are scanned. To my surprise, Cassandra now uses more disk space. Can anybody tell me why? Thanks.
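The distributed-deletes mechanism above can be sketched as follows (timestamps and the gc_grace value are illustrative; gc_grace_seconds defaults to 10 days = 864000 seconds):

```python
# A delete writes a tombstone cell; on read, reconciliation picks the cell
# with the highest timestamp, so the tombstone "wins" over older live data.
# The tombstone itself stays on disk until gc_grace_seconds have passed.

def reconcile(cells):
    """cells: list of (timestamp, value); value None marks a tombstone."""
    return max(cells, key=lambda c: c[0])

def purgeable(tombstone_ts, now, gc_grace_seconds):
    return now >= tombstone_ts + gc_grace_seconds

live = (100, "hello")
tomb = (200, None)                            # the delete: newer, so it wins
print(reconcile([live, tomb]))                # (200, None) -> read sees 'deleted'
print(purgeable(200, 300, 864000))            # False: still on disk, taking space
print(purgeable(200, 200 + 864001, 864000))   # True: compaction may now drop it
```

This is exactly why a scan-and-delete pass *increases* disk usage at first: every delete is an extra write that coexists with the data it shadows until compaction runs after gc_grace.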
Re: too many open files
A file name ending with ic-### doesn't tell you anything by itself, except the SSTable version it uses (ic in this case). Files related to a secondary index contain something like this in the filename: KS-CF.IDX-NAME, while files for regular CFs do not contain any dots except the one just before the file extension.

M.

On 15.07.2013 09:38, Paul Ingalls wrote:
Also, looking through the log, it appears a lot of the files end with ic-, which I assume is associated with a secondary index I have on the table. Are secondary indexes really expensive from a file descriptor standpoint? That particular table uses the default compaction scheme...

On Jul 15, 2013, at 12:00 AM, Paul Ingalls paulinga...@gmail.com wrote:
I have one table that is using leveled. It was set to 10MB; I will try changing it to 256MB. Is there a good way to merge the existing sstables?

On Jul 14, 2013, at 5:32 PM, Jonathan Haddad j...@jonhaddad.com wrote:
Are you using leveled compaction? If so, what do you have the file size set at? If you're using the defaults, you'll have a ton of really small files. I believe Albert Tobey recommended using 256MB for the table's sstable_size_in_mb to avoid this problem.

On Sun, Jul 14, 2013 at 5:10 PM, Paul Ingalls paulinga...@gmail.com wrote:
I'm running into a problem where instances of my cluster are hitting over 450K open files. Is this normal for a 4-node 1.2.6 cluster with a replication factor of 3 and about 50GB of data on each node? I can push the file descriptor limit up, but I plan on having a much larger load, so I'm wondering if I should be looking at something else... Let me know if you need more info... Paul

-- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade
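The dot rule described above is easy to check mechanically; a small sketch (the example filenames are hypothetical, following the KS-CF.IDX-NAME pattern):

```python
# Secondary-index sstables embed the index name after a dot (KS-CF.IDX-NAME-...),
# while regular CF files have no dot other than the one before the extension.

def is_secondary_index_file(filename):
    base = filename.rsplit(".", 1)[0]  # strip the .db extension
    return "." in base                 # any remaining dot -> index CF file

print(is_secondary_index_file("ks-users.users_email_idx-ic-12-Data.db"))  # True
print(is_secondary_index_file("ks-users-ic-12-Data.db"))                  # False
```

Running something like this over `lsof` output would show how many of those 450K descriptors actually belong to index CFs.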
Re: Restart node = hinted handoff flood
My blind guess is: https://issues.apache.org/jira/browse/CASSANDRA-5179

In our case the only sensible solution was to pause hints delivery and disable storing them (both done with nodetool: pausehandoff and disablehandoff). Once they TTL'd (3 hours by default, I believe?) I turned HH on again and started a repair. However, the problem returned the next day, so I had to do a quick C* upgrade to a version with this patch applied (we use a self-built 1.2.1 with a few additional patches).

M.

On 04.07.2013 18:41, Alain RODRIGUEZ wrote:
The point is that there is no way, AFAIK, to limit the speed of these Hinted Handoffs, since it's not a stream like repair or bootstrap; nor is there a way to keep the node out of the ring while it is receiving hints, since hints and normal traffic both go through the gossip protocol on port 7000. How can we avoid this Hinted Handoff flood on returning nodes? Alain

2013/7/4 Alain RODRIGUEZ arodr...@gmail.com
Hi, we're using C* 1.2.2 on a 12-node EC2 xLarge cluster. When I restart a node that has spent a few minutes down, all the CPUs are blocked at 100% when it comes back up, even once compactions are disabled, inducing a very big and intolerable latency in my app. I suspect Hinted Handoff to be the cause of this. Disabling gossip fixes the problem; enabling it again brings the latency back (with a lot of GC, dropped messages...). Is there a way to disable HH? Is it responsible for this issue? I currently have this node down; any fast insight would be appreciated. Alain
Re: CorruptBlockException - recover?
I think I'd try removing the broken SSTables (while the node is down) and then running repair.

M.

On 05.07.2013 09:10, Jan Kesten wrote:
Hi, I tried to scrub the keyspace - but with no success either; the process threw an exception when hitting the corrupt block and then stopped. I will re-bootstrap the node :-) Thanks anyway, Jan

On 03.07.2013 19:10, Glenn Thompson wrote:
For what it's worth, I did this when I had this problem. It didn't work out for me. Perhaps I did something wrong.

On Wed, Jul 3, 2013 at 11:06 AM, Robert Coli rc...@eventbrite.com wrote:
On Wed, Jul 3, 2013 at 7:04 AM, ifjke j.kes...@enercast.de wrote:
I found that one of my Cassandra nodes died recently (the machine hangs). I restarted the node and ran a nodetool repair; while running, it threw an org.apache.cassandra.io.compress.CorruptBlockException. Is there any way to recover from this? Or would it be best to delete the node's contents and bootstrap it again?

If you scrub this SSTable (either with the online or offline version of scrub), it will remove the corrupt data and re-write the rest of the SSTable, which isn't corrupt, into a new SSTable. That is probably safer for your data than deleting the entire set of data on this replica. When that's done, restart the repair. =Rob
Re: going down from RF=3 to RF=2, repair constantly falls over with JVM OOM
I don't think you need to run repair if you decrease RF. At least I wouldn't do it. In the case of *decreasing* RF you have 3 nodes containing the data, but only 2 of them should store it from now on, so you should run cleanup rather than repair, to get rid of the data on the 3rd replica. And I guess it should work (in terms of disk space and memory), given that you've been able to perform compaction. Repair makes sense if you *increase* RF, so that the data is streamed to the new replicas.

M.

On 04.07.2013 12:20, Evan Dandrea wrote:
Hi, we've made the mistake of letting our nodes get too large; they now hold about 3 TB each. We ran out of enough free space to have a successful compaction, and because we're on 1.0.7, enabling compression to get out of the mess wasn't feasible. We tried adding another node, but we think this may have put too much pressure on the existing nodes it was replicating from, so we backed out. So we decided to drop RF down to 2 from 3 to relieve the disk pressure, and started building a secondary cluster with lots of 1 TB nodes. We ran repair -pr on each node, but it's failing with a JVM OOM on one node while another node is streaming from it for the final repair. Does anyone know what we can tune to get the cluster stable enough to put it in a multi-DC setup with the secondary cluster? Do we actually need to wait for these RF3-to-RF2 repairs to stabilize, or could we point it at the secondary cluster without worrying about data loss? We've set the heap on these two problematic nodes to 20 GB, up from the (equally too high) 12 GB, but we're still hitting OOM.
I had seen in other threads that tuning down compaction might help, so we're trying the following:

in_memory_compaction_limit_in_mb 32 (down from 64)
compaction_throughput_mb_per_sec 8 (down from 16)
concurrent_compactors 2 (the nodes have 24 cores)
flush_largest_memtables_at 0.45 (down from 0.50)
stream_throughput_outbound_megabits_per_sec 300 (down from 400)
reduce_cache_sizes_at 0.5 (down from 0.6)
reduce_cache_capacity_to 0.35 (down from 0.4)
-XX:CMSInitiatingOccupancyFraction=30

Here's the log from the most recent repair failure: http://paste.ubuntu.com/5843017/ The OOM starts at line 13401. Thanks for whatever insight you can provide.
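The cleanup-vs-repair distinction above can be sketched with a toy token ring (tokens and node names are made up; real replica placement also depends on the snitch and strategy): a key's replicas are the next RF nodes clockwise, so dropping RF from 3 to 2 removes the last replica, and that node keeps orphaned data until cleanup deletes what it no longer owns.

```python
import bisect

ring = [(0, "A"), (100, "B"), (200, "C")]   # (token, node), sorted by token

def replicas(key_token, rf):
    """The next rf distinct nodes clockwise from the key's token."""
    tokens = [t for t, _ in ring]
    start = bisect.bisect_right(tokens, key_token) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(rf)]

print(replicas(150, 3))  # ['C', 'A', 'B']
print(replicas(150, 2))  # ['C', 'A'] -> B is no longer a replica
orphans = set(replicas(150, 3)) - set(replicas(150, 2))
print(orphans)           # {'B'}: stale data sits on B until cleanup removes it
```

Repair, by contrast, would only make sense in the opposite direction (RF 2 to 3), where the *new* replica is missing data and needs it streamed in.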