Re: Any use-case about a migration from SQL Server to Cassandra?
https://labs.spotify.com/2015/06/23/user-database-switch/ On Wed, Jun 24, 2015 at 5:57 PM, Marcos Ortiz mlor...@uci.cu wrote: Where is the link, Carlos? On 24/06/15 07:18, Carlos Alonso wrote: This article from Spotify Labs is a really nice write up of migrating SQL (Postgres in this case) to Cassandra Carlos Alonso | Software Engineer | @calonso https://twitter.com/calonso On 23 June 2015 at 20:23, Alex Popescu al...@datastax.com wrote: On Tue, Jun 23, 2015 at 12:13 PM, Marcos Ortiz mlor...@uci.cu wrote: 2- They used heavily C# in a Microsoft-based environment, so I need to know if the .Net driver is ready to use for production The DataStax C# driver has been used in production for quite a while by numerous users. It is the most up-to-date, feature rich, and tunable C# driver for Apache Cassandra and DataStax Enterprise. Anyways, if there's anything missing we are always happy to improve it. (as you can see from my sig, I do work for DataStax, but the above is very true) -- Bests, Alex Popescu | @al3xandru Sen. Product Manager @ DataStax -- Marcos Ortiz http://about.me/marcosortiz, Sr. Product Manager (Data Infrastructure) at UCI @marcosluis2186 http://twitter.com/marcosluis2186 -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200
Re: connections remain on CLOSE_WAIT state after process is killed after upgrade to 2.0.15
For the record: https://issues.apache.org/jira/browse/CASSANDRA-9630 On Mon, Jun 15, 2015 at 7:19 PM, Paulo Ricardo Motta Gomes paulo.mo...@chaordicsystems.com wrote: Just a quick update: I was able to fix the problem by reverting the patch CASSANDRA-8336 in our custom Cassandra build. I don't know the root cause yet though. I will open a JIRA ticket and post it here for reference later. On Fri, Jun 12, 2015 at 11:31 AM, Paulo Ricardo Motta Gomes paulo.mo...@chaordicsystems.com wrote: Hello, We recently upgraded a cluster from 2.0.12 to 2.0.15, and now whenever we stop/kill a Cassandra process, some other nodes keep a connection to the dead node in the CLOSE_WAIT state on port 7000 for about 5-20 minutes. So, if I start the killed node again, it cannot handshake with the nodes that have a connection in the CLOSE_WAIT state until that connection is closed, so they remain in the down state to each other for 5-20 minutes, until they can handshake again. I believe this is somehow related to the fixes CASSANDRA-8336 and CASSANDRA-9238, and it could also be a duplicate of CASSANDRA-8072. I will continue to investigate to see if I find more evidence, but any help at this point would be appreciated, or at least a confirmation that it could be related to any of these tickets. Cheers, -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200
connections remain on CLOSE_WAIT state after process is killed after upgrade to 2.0.15
Hello, We recently upgraded a cluster from 2.0.12 to 2.0.15, and now whenever we stop/kill a Cassandra process, some other nodes keep a connection to the dead node in the CLOSE_WAIT state on port 7000 for about 5-20 minutes. So, if I start the killed node again, it cannot handshake with the nodes that have a connection in the CLOSE_WAIT state until that connection is closed, so they remain in the down state to each other for 5-20 minutes, until they can handshake again. I believe this is somehow related to the fixes CASSANDRA-8336 and CASSANDRA-9238, and it could also be a duplicate of CASSANDRA-8072. I will continue to investigate to see if I find more evidence, but any help at this point would be appreciated, or at least a confirmation that it could be related to any of these tickets. Cheers, -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200
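To watch for this condition while it lasts, a small monitoring sketch like the following can count stuck internode sockets. It assumes `ss -tan`-style output is fed in (the exact column layout is an assumption — adjust for your ss/netstat version):

```python
def count_close_wait(ss_output: str, port: int = 7000) -> int:
    """Count CLOSE_WAIT connections involving the given port.

    Expects `ss -tan`-style lines: State Recv-Q Send-Q Local Peer.
    """
    count = 0
    for line in ss_output.splitlines():
        fields = line.split()
        if len(fields) >= 5 and fields[0] == "CLOSE-WAIT":
            local, peer = fields[3], fields[4]
            # Match the storage port on either endpoint of the socket.
            if local.endswith(f":{port}") or peer.endswith(f":{port}"):
                count += 1
    return count

sample = """State      Recv-Q Send-Q Local Peer
CLOSE-WAIT 1      0      10.0.0.1:7000 10.0.0.2:51034
ESTAB      0      0      10.0.0.1:7000 10.0.0.3:51035
CLOSE-WAIT 1      0      10.0.0.1:9042 10.0.0.4:51036
"""
print(count_close_wait(sample))  # 1
```

In practice you would pipe `ss -tan` (or `netstat -tan`) into this and alert when the count stays non-zero for several minutes.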
Re: Why select returns tombstoned results?
What version of Cassandra are you running? Are you by any chance running repairs on your data? On Mon, Mar 30, 2015 at 5:39 PM, Benyi Wang bewang.t...@gmail.com wrote: Thanks for replying. In cqlsh, if I change to quorum (CONSISTENCY QUORUM), sometimes the select returns the deleted row, sometimes not. I have two virtual data centers: service (3 nodes) and analytics (4 nodes collocated with Hadoop data nodes). The table has 3 replicas in service and 2 in analytics. When I wrote, I wrote into analytics using LOCAL_ONE, so I guess the data may not have been replicated to all nodes yet. I will try to use strong consistency for writes. On Mon, Mar 30, 2015 at 11:59 AM, Prem Yadav ipremya...@gmail.com wrote: Increase the read CL to quorum and you should get correct results. How many nodes do you have in the cluster and what is the replication factor for the keyspace? On Mon, Mar 30, 2015 at 7:41 PM, Benyi Wang bewang.t...@gmail.com wrote:

CREATE TABLE tomb_test (
  guid text,
  content text,
  range text,
  rank int,
  id text,
  cnt int,
  PRIMARY KEY (guid, content, range, rank)
)

Sometimes I delete rows using the Cassandra Java driver with this query: DELETE FROM tomb_test WHERE guid=? AND content=? AND range=? in a batch statement with UNLOGGED. The consistency level is LOCAL_ONE. But if I run SELECT * FROM tomb_test WHERE guid='guid-1' AND content='content-1' AND range='week' or SELECT * FROM tomb_test WHERE guid='guid-1' AND content='content-1' AND range='week' AND rank = 1, the result shows the deleted rows. If I run this select, the deleted rows are not shown: SELECT * FROM tomb_test WHERE guid='guid-1' AND content='content-1'. If I run the delete statement in cqlsh, the deleted rows won't show up. How can I fix this? -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200
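The behavior described in this thread is consistent with replica-overlap math: with 3 replicas in service and 2 in analytics (total RF = 5), a LOCAL_ONE delete plus a QUORUM read is not guaranteed to hit a replica that already has the tombstone. A rough sketch of the rule (simplified — real semantics also depend on per-DC consistency levels and hinted handoff):

```python
def overlaps(write_replicas: int, read_replicas: int, rf: int) -> bool:
    """True if every read replica set must intersect every write replica set.

    Classic strong-consistency condition: R + W > RF.
    """
    return write_replicas + read_replicas > rf

rf = 5            # 3 replicas in 'service' + 2 in 'analytics', as in the thread
quorum = rf // 2 + 1  # 3

print(overlaps(1, quorum, rf))       # LOCAL_ONE write + QUORUM read -> False
print(overlaps(quorum, quorum, rf))  # QUORUM write + QUORUM read   -> True
```

So deleting at LOCAL_ONE and reading at QUORUM can still return the "deleted" row until the tombstone propagates; writing deletes at QUORUM (or running repair) closes the gap.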
Re: Should a node that is bootstrapping be receiving writes in addition to the streams it is receiving?
I'm also facing a similar issue while bootstrapping a replacement node via the -Dreplace_address flag. The node is streaming data from neighbors, but cfstats shows 0 counts for all metrics of all CFs in the bootstrapping node:

SSTable count: 0
SSTables in each level: [0, 0, 0, 0, 0, 0, 0, 0, 0]
Space used (live), bytes: 0
Space used (total), bytes: 0
SSTable Compression Ratio: 0.0
Number of keys (estimate): 0
Memtable cell count: 0
Memtable data size, bytes: 0
Memtable switch count: 0
Local read count: 0
Local read latency: 0.000 ms
Local write count: 0
Local write latency: 0.000 ms
Pending tasks: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.0
Bloom filter space used, bytes: 0
Compacted partition minimum bytes: 0
Compacted partition maximum bytes: 0
Compacted partition mean bytes: 0
Average live cells per slice (last five minutes): 0.0
Average tombstones per slice (last five minutes): 0.0

I also checked via JMX and all the write counts are zero. Is the node supposed to receive writes during bootstrap? The other funny thing during bootstrap is that nodetool status shows the bootstrapping node as Up/Normal (UN) instead of Up/Joining (UJ). Is this expected or is it a bug? The bootstrapping node does not even appear in the nodetool status of the other nodes.

UN X.Y.Z.244 15.9 GB 1 3.7% 52fb21e-4621-4533-b201-8c1a7adbe818 rack

If I do a nodetool netstats, I see:

Mode: JOINING
Bootstrap 647d4b30-c11e-11e4-9249-173e73521fb44

Cheers, Paulo On Thu, Oct 16, 2014 at 3:53 PM, Robert Coli rc...@eventbrite.com wrote: On Wed, Oct 15, 2014 at 10:07 PM, Peter Haggerty peter.hagge...@librato.com wrote: The node wrote gigs of data to various CFs during the bootstrap so it was clearly writing in some sense and it has the expected behavior after the bootstrap. Is cfstats correct when it reports that there were no writes during a bootstrap? 
As I understand it : Writes (extra writes, from the perspective of replication factor, f/e a RF=3 cluster has effective RF=4 during bootstrap, but not relevant for consistency purposes until end of bootstrap) occur via the storage protocol during bootstrap, so I would expect to see those reflected in cfstats. I'm relatively confident it is in fact receiving those writes, so your confusion might just be a result of how it's reported? =Rob http://twitter.com/rcolidba -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200
Re: best supported spark connector for Cassandra
I used to use Calliope, which was really awesome before DataStax's native integration with Spark. Now I'm quite happy with the official DataStax Spark connector; it's very straightforward to use. I never tried to use these drivers with Java though; I'd suggest using them with Scala, which is the best option for writing Spark jobs. On Fri, Feb 13, 2015 at 12:12 PM, Carlos Rolo r...@pythian.com wrote: Not for sure ;) If you need Cassandra support I can forward you to someone to talk to at Pythian. Regards, Carlos Juzarte Rolo Cassandra Consultant Pythian - Love your data rolo@pythian | Twitter: cjrolo | Linkedin: *linkedin.com/in/carlosjuzarterolo http://linkedin.com/in/carlosjuzarterolo* Tel: 1649 www.pythian.com On Fri, Feb 13, 2015 at 3:05 PM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote: Actually, I am not the one looking for support, but I thank you a lot anyway. But from your message I guess the answer is yes, DataStax is not the only Cassandra vendor offering support and changing the official Cassandra source at this moment, is this right? From: user@cassandra.apache.org Subject: Re: best supported spark connector for Cassandra Of course, Stratio Deep and Stratio Cassandra are licensed Apache 2.0. Regarding the Cassandra support, I can introduce you to someone in Stratio who can help you. 2015-02-12 15:05 GMT+01:00 Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net: Thanks for the hint Gaspar. Do you know if Stratio Deep / Stratio Cassandra are also licensed Apache 2.0? I was interested in knowing more about Stratio when I was working at a startup. Now, at a blue-chip company, it seems one of the hardest obstacles to using Cassandra in a project is the need for a team supporting it, and it seems people are especially concerned about how many vendors an open source solution has to provide support. 
This seems to be kind of an advantage of HBase, as there are many vendors supporting it, but I wonder if Stratio can be considered an alternative to DataStax regarding Cassandra support? It's not my call here to decide anything, but as part of the community it helps to have this business scenario clear. I could say Cassandra could be the best-fit technical solution for some projects, but sometimes non-technical factors are in the game, like this need for having more than one vendor available... From: gmu...@stratio.com Subject: Re: best supported spark connector for Cassandra My suggestion is to use Java or Scala instead of Python. For Java/Scala both the DataStax and Stratio drivers are valid and similar options. As far as I know they both take care of data locality and are not based on the Hadoop interface. The advantage of Stratio Deep is that it allows you to integrate Spark not only with Cassandra but with MongoDB, Elasticsearch, Aerospike and others as well. Stratio has forked Cassandra to include some additional features such as Lucene-based secondary indexes. So the Stratio driver works fine with Apache Cassandra and also with their fork. You can find some examples of using Deep here: https://github.com/Stratio/deep-examples Please, if you need some help with Stratio Deep, do not hesitate to contact us. 2015-02-11 17:18 GMT+01:00 shahab shahab.mok...@gmail.com: I am using the Calliope cassandra-spark connector (http://tuplejump.github.io/calliope/), which is quite handy and easy to use! The only problem is that it is a bit outdated; it works with Spark 1.1.0. Hopefully a new version comes soon. best, /Shahab On Wed, Feb 11, 2015 at 2:51 PM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote: I just finished a Scala course, nice exercise to check what I learned :D Thanks for the answer! 
From: user@cassandra.apache.org Subject: Re: best supported spark connector for Cassandra Start looking at the Spark/Cassandra connector here (in Scala): https://github.com/datastax/spark-cassandra-connector/tree/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector Data locality is provided by this method: https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/rdd/CassandraRDD.scala#L329-L336 Start digging from this all the way down the code. As for Stratio Deep, I can't tell how they did the integration with Spark. Take some time to dig down into their code to understand the logic. On Wed, Feb 11, 2015 at 2:25 PM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote: Taking the opportunity that Spark was being discussed in another thread, I decided to start a new one as I have interest in using Spark + Cassandra in the future. About 3 years ago, Spark was not an existing option and we tried to use Hadoop to process Cassandra data. My experience was horrible and we reached the conclusion it was faster to develop an internal tool than insist on
Re: Database schema migration
Hello José, There isn't yet an officially supported way to perform schema migrations afaik, but there are quite a few tools on GitHub that perform migrations either from within the application or as external tools. We currently use this tool to perform migrations embedded in the application: https://github.com/fromanator/mutagen-cassandra You may find other options in the mailing list archives. Cheers, On Thu, Jan 29, 2015 at 8:31 AM, José Guilherme Vanz guilherme@gmail.com wrote: Hello I have been studying Cassandra for a while, and to practice the libraries and concepts I will implement a simple Cassandra client. During my research I ran into a doubt about schema migrations. What is the common/best practice in production clusters? I mean, who actually performs the schema migration? Does the application or the cluster manager have to update the schema before updating the application? All the best Vanz -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200
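For illustration, the core idea behind tools like mutagen-cassandra is ordered migration scripts plus a record of which versions have already been applied. A minimal sketch of that bookkeeping (the names and in-memory representation are hypothetical; a real tool would read scripts from disk and track applied versions in a Cassandra table):

```python
def pending(applied_versions, migrations):
    """Return migrations not yet applied, in ascending version order.

    `migrations` is a list of (version, cql_statement) pairs;
    `applied_versions` is the set of versions already run.
    """
    return [m for m in sorted(migrations) if m[0] not in applied_versions]

migrations = [
    (2, "ALTER TABLE users ADD email text"),
    (1, "CREATE TABLE users (id uuid PRIMARY KEY, name text)"),
]
applied = {1}  # version 1 already ran on this cluster

for version, cql in pending(applied, migrations):
    # A real tool would execute the statement and then record the version:
    # session.execute(cql)
    print(version, cql)  # 2 ALTER TABLE users ADD email text
```

Running migrations from the application at startup (mutagen's model) keeps schema and code in lockstep; running them from an external tool keeps schema changes out of the deploy path. Either way, only one agent should apply them to avoid concurrent schema changes.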
Re: Database schema migration
This might be of interest (you probably have already found it): http://grokbase.com/t/cassandra/user/14bs9zvasf/cassandra-schema-migrator On Thu, Jan 29, 2015 at 9:16 AM, José Guilherme Vanz guilherme@gmail.com wrote: Hi, Ricardo Thank you for your quick reply. =] I'll take a look at mutagen-cassandra and the others I find in the archives All the best On Thu, Jan 29, 2015 at 8:38 AM, Paulo Ricardo Motta Gomes paulo.mo...@chaordicsystems.com wrote: Hello José, There isn't yet an officially supported way to perform schema migrations afaik, but there are quite a few tools on GitHub that perform migrations either from within the application or as external tools. We currently use this tool to perform migrations embedded in the application: https://github.com/fromanator/mutagen-cassandra You may find other options in the mailing list archives. Cheers, On Thu, Jan 29, 2015 at 8:31 AM, José Guilherme Vanz guilherme@gmail.com wrote: Hello I have been studying Cassandra for a while, and to practice the libraries and concepts I will implement a simple Cassandra client. During my research I ran into a doubt about schema migrations. What is the common/best practice in production clusters? I mean, who actually performs the schema migration? Does the application or the cluster manager have to update the schema before updating the application? All the best Vanz -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200 -- Att. José Guilherme Vanz br.linkedin.com/pub/josé-guilherme-vanz/51/b27/58b/ http://br.linkedin.com/pub/jos%C3%A9-guilherme-vanz/51/b27/58b/ "Suffering is temporary; giving up is forever" - Bernardo Fonseca, record holder of the Antarctic Ice Marathon. -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200
Re: get partition key from tombstone warnings?
Yep, you may register and log into the Apache JIRA and click Vote for this issue, in the upper right-side of the ticket. On Wed, Jan 21, 2015 at 11:30 PM, Ian Rose ianr...@fullstory.com wrote: Ah, thanks for the pointer Philip. Is there any kind of formal way to vote up issues? I'm assuming that adding a comment of +1 or the like is more likely to be *counter*productive. - Ian On Wed, Jan 21, 2015 at 5:02 PM, Philip Thompson philip.thomp...@datastax.com wrote: There is an open ticket for this improvement at https://issues.apache.org/jira/browse/CASSANDRA-8561 On Wed, Jan 21, 2015 at 4:55 PM, Ian Rose ianr...@fullstory.com wrote: When I see a warning like Read 9 live and 5769 tombstoned cells in ... etc is there a way for me to see the partition key that this query was operating on? The description in the original JIRA ticket ( https://issues.apache.org/jira/browse/CASSANDRA-6042) reads as though exposing this information was one of the original goals, but it isn't obvious to me in the logs... Cheers! - Ian -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200
Reload/resync system.peers table
Hello, Due to CASSANDRA-6053 there are lots of ghost nodes in the system.peers table, because decommissioned nodes were not properly removed from this table. Is there any automatic way of reloading/resyncing the system.peers table? Or is the only way to remove the ghost nodes manually? I tried to restart the node with -Dcassandra.load_ring_state=false, but it didn't work. Cheers, Paulo -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200
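In case it helps others hitting the same thing: a manual workaround is to compare system.peers against the live ring and delete the stale rows on each affected node. A hedged sketch of generating those statements (the helper and its inputs are hypothetical; double-check the live list with nodetool status before deleting anything from a system table):

```python
def ghost_peer_deletes(peers_in_table, live_nodes):
    """CQL DELETEs for system.peers rows whose IP is no longer in the ring.

    `peers_in_table` comes from `SELECT peer FROM system.peers;`,
    `live_nodes` from the current ring membership.
    """
    ghosts = sorted(set(peers_in_table) - set(live_nodes))
    return [f"DELETE FROM system.peers WHERE peer = '{ip}';" for ip in ghosts]

peers = ["10.0.0.1", "10.0.0.2", "10.0.0.9"]   # rows found in system.peers
live = ["10.0.0.1", "10.0.0.2"]                 # nodes actually in the ring
for stmt in ghost_peer_deletes(peers, live):
    print(stmt)  # DELETE FROM system.peers WHERE peer = '10.0.0.9';
```

Since system.peers is local to each node, the cleanup has to be run against every node that still lists the decommissioned peers.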
Re: Nodes get stuck in crazy GC loop after some time, leading to timeouts
Thanks a lot for the help Graham and Robert! Will try increasing the heap and see how it goes. Here are my GC settings, if they're still helpful (they're mostly the defaults): -Xms6G -Xmx6G -Xmn400M -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseTLAB -XX:+CMSParallelInitialMarkEnabled -XX:+CMSEdenChunksRecordAlways On Wed, Dec 3, 2014 at 2:17 AM, Jason Wee peich...@gmail.com wrote: ack and many thanks for the tips and help.. jason On Wed, Dec 3, 2014 at 4:49 AM, Robert Coli rc...@eventbrite.com wrote: On Mon, Dec 1, 2014 at 11:07 PM, Jason Wee peich...@gmail.com wrote: Hi Rob, any recommended documentation describing the explanation/configuration of the JVM heap and permanent generation? We're stuck in this same situation too. :( The archives of this list are chock full of explorations of various cases. Your best bet is to look for a good Aaron Morton reference where he breaks down the math between generations. I swear there was a blog post of his on this subject, but the best I can find is this slide deck: http://www.slideshare.net/aaronmorton/cassandra-tk-2014-large-nodes =Rob -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200
Nodes get stuck in crazy GC loop after some time, leading to timeouts
Hello, This is a recurrent behavior of the JVM GC in Cassandra that I never completely understood: when a node is UP for many days (or even months), or receives a very high load spike (3x-5x the normal load), CMS GC pauses start becoming very frequent and slow, causing periodic timeouts in Cassandra. Trying to run GC manually doesn't free up memory. The only solution when a node reaches this state is to restart it. We restart the whole cluster every 1 or 2 months, to avoid machines getting into this crazy state. We tried tuning GC sizes and parameters and different Cassandra versions (1.1, 1.2, 2.0), but this behavior keeps happening. More recently, during Black Friday, we received about 5x our normal load, and some machines started presenting this behavior. Once again, we restarted the nodes and the GC behaved normally again. I'm attaching a few pictures comparing the heap of healthy and sick nodes: http://imgur.com/a/Tcr3w You can clearly notice that some memory is actually reclaimed during GC in healthy nodes, while in sick machines very little memory is reclaimed. Also, since GC is executed more frequently in sick machines, it uses about 2x more CPU than in non-sick nodes. Have you ever observed this behavior in your cluster? Could this be related to heap fragmentation? Would using the G1 collector help in this case? Any GC tuning or monitoring advice to troubleshoot this issue? Any advice or pointers will be kindly appreciated. Cheers, -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200
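One way to quantify the healthy-vs-sick difference visible in the heap graphs is to compute how much memory each GC cycle actually reclaims from the GC log. A small sketch, assuming `-verbose:gc`-style records of the form `before->after(total)` (the exact log format varies with JVM flags, so treat the regex as an assumption):

```python
import re

# Matches the "usedBefore K -> usedAfter K (totalHeap K)" heap transition.
HEAP = re.compile(r"(\d+)K->(\d+)K\(\d+K\)")

def reclaimed_kb(gc_log_line: str) -> int:
    """Heap freed by one GC cycle, in KB; 0 if the line has no heap record."""
    m = HEAP.search(gc_log_line)
    if not m:
        return 0
    before, after = int(m.group(1)), int(m.group(2))
    return before - after

healthy = "[GC 5242880K->1048576K(6291456K), 0.2104380 secs]"
sick    = "[GC 5242880K->5190000K(6291456K), 1.8032150 secs]"
print(reclaimed_kb(healthy))  # 4194304
print(reclaimed_kb(sick))     # 52880
```

Plotting reclaimed bytes per cycle over time makes the "GC runs constantly but frees almost nothing" failure mode easy to alert on, instead of waiting for timeouts.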
Re: Cassandra COPY to CSV and DateTieredCompactionStrategy
Regarding the first question, you need to configure your application to write to both CFs (old and new) during the migration phase. I'm not sure about the second question, but my guess is that only the writeTime will be taken into account. On Thu, Nov 27, 2014 at 10:54 AM, Batranut Bogdan batra...@yahoo.com wrote: Hello all, I have a few things that I need to understand. 1. Here is the scenario: we have a HUGE cf with daily writes; it is like a time series. Now we want to change the type of a column in the primary key. What I think we can do is export to CSV, create the new table, and write back the transformed data. But here is the catch... the constant writes to the cf. I assume that by the time the export finishes, new data will have been inserted into the source cf. So is there a tool that will export data without having to stop the writes? 2. I have seen that there is a new compaction strategy, DTCS, that better fits historical data. Will this compaction strategy take into account the writeTime() of an entry, or will it be smart enough to detect that the column family is a time series and take into account those timestamps when creating the time windows? I am asking this since when we write to the cf, the time for a particular record is 00:00h of a given day, so basically all entries have the same timestamp value in the cf but of course different writeTime(). -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200
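To illustrate the answer to question 2: DTCS buckets SSTables by the cells' write timestamps (what writeTime() returns), not by any time-like value stored in the row itself. A toy sketch of that bucketing (a deliberate simplification of the real strategy, which also grows window sizes for older data):

```python
DAY_US = 24 * 3600 * 1_000_000  # one day in microseconds

def window(write_ts_us: int, window_us: int = DAY_US) -> int:
    """DTCS-style bucketing: data written in the same window compacts together."""
    return write_ts_us // window_us

# Two rows that both store "00:00h of a given day" as their value timestamp,
# but were actually written on different days:
value_ts = 0  # the stored column value is irrelevant to DTCS
w1 = window(100 * DAY_US)  # written on day 100
w2 = window(101 * DAY_US)  # written on day 101
print(w1, w2)  # 100 101
```

So rows that share the same stored timestamp but different writeTime() land in different windows; only the write timestamp drives compaction grouping.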
Re: Repair/Compaction Completion Confirmation
Hey guys, Just reviving this thread. In case anyone is using the cassandra_range_repair tool (https://github.com/BrianGallew/cassandra_range_repair), please sync your repositories, because the tool was not working before due to a critical bug in the token range definition method. For more information on the bug please check here: https://github.com/BrianGallew/cassandra_range_repair/pull/18 Cheers, On Tue, Oct 28, 2014 at 7:53 AM, Colin co...@clark.ws wrote: When I use virtual nodes, I typically use a much smaller number - usually in the range of 10. This gives me the ability to add nodes more easily without the performance hit. -- *Colin Clark* +1-320-221-9531 On Oct 28, 2014, at 10:46 AM, Alain RODRIGUEZ arodr...@gmail.com wrote: I was trying this yesterday too. https://github.com/BrianGallew/cassandra_range_repair Not 100% bullet proof -- Indeed I found that operations are done multiple times, so it is not very optimised. Though it is open sourced, so I guess you can improve things as much as you want and contribute. Here is the issue I raised yesterday: https://github.com/BrianGallew/cassandra_range_repair/issues/14. I am also trying to improve our repair automation since we now have multiple DCs and up to 800 GB per node. Repairs are quite heavy right now. Good luck, Alain 2014-10-28 4:59 GMT+01:00 Ben Bromhead b...@instaclustr.com: https://github.com/BrianGallew/cassandra_range_repair This breaks down the repair operation into very small portions of the ring as a way to try and work around the current fragile nature of repair. Leveraging range repair should go some way towards automating repair (this is how the automatic repair service in DataStax OpsCenter works, and this is how we perform repairs). We have had a lot of success running repairs in a similar manner against vnode-enabled clusters. 
Not 100% bullet proof, but way better than nodetool repair On 28 October 2014 08:32, Tim Heckman t...@pagerduty.com wrote: On Mon, Oct 27, 2014 at 1:44 PM, Robert Coli rc...@eventbrite.com wrote: On Mon, Oct 27, 2014 at 1:33 PM, Tim Heckman t...@pagerduty.com wrote: I know that when issuing some operations via nodetool, the command blocks until the operation is finished. However, is there a way to reliably determine whether or not the operation has finished without monitoring that invocation of nodetool? In other words, when I run 'nodetool repair' what is the best way to reliably determine that the repair is finished without running something equivalent to a 'pgrep' against the command I invoked? I am curious about trying to do the same for major compactions too. This is beyond a FAQ at this point, unfortunately; non-incremental repair is awkward to deal with and probably impossible to automate. In The Future [1] the correct solution will be to use incremental repair, which mitigates but does not solve this challenge entirely. As brief meta commentary, it would have been nice if the project had spent more time optimizing the operability of the critically important thing you must do once a week [2]. https://issues.apache.org/jira/browse/CASSANDRA-5483 =Rob [1] http://www.datastax.com/dev/blog/anticompaction-in-cassandra-2-1 [2] Or, more sensibly, once a month with gc_grace_seconds set to 34 days. Thank you for getting back to me so quickly. Not the answer that I was secretly hoping for, but it is nice to have confirmation. :) Cheers! -Tim -- Ben Bromhead Instaclustr | www.instaclustr.com | @instaclustr http://twitter.com/instaclustr | +61 415 936 359 -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200
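The core trick of cassandra_range_repair is splitting the ring into many small token ranges and repairing each with `nodetool repair -st <start> -et <end>`. A minimal sketch of the splitting step (simplified; the real tool also handles vnodes and ranges that wrap around the ring):

```python
def split_range(start: int, end: int, parts: int):
    """Split a token range (start, end] into `parts` contiguous subranges."""
    width = (end - start) // parts
    edges = [start + i * width for i in range(parts)] + [end]
    return list(zip(edges[:-1], edges[1:]))

subs = split_range(0, 1000, 4)
print(subs)  # [(0, 250), (250, 500), (500, 750), (750, 1000)]

# Each subrange would then be repaired separately, e.g.:
for st, et in subs:
    cmd = f"nodetool repair -st {st} -et {et}"
    # subprocess.run(cmd.split())  # would invoke the actual repair
```

Small subranges mean a failed repair only has to redo a tiny slice of the ring, which is exactly what makes this approach more robust than a single full `nodetool repair`.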
Increase in dropped mutations after major upgrade from 1.2.18 to 2.0.10
Hey, We've seen a considerable increase in the number of dropped mutations after a major upgrade from 1.2.18 to 2.0.10. I initially thought it was due to the extra load incurred by upgradesstables, but the dropped mutations continue even after all sstables are upgraded. Additional info: Overall (read, write and range) latency improved with the upgrade, which is great, but I don't understand why dropped mutations have increased. I/O and CPU load are pretty much the same; the number of completed tasks is the only metric that increased together with dropped mutations. I also noticed that the number of all-time blocked FlushWriter operations is about 5% of completed operations; I don't know if this is related, but just in case it helps... Does anyone have a clue what this could be? Or what should we monitor to find out? Any help or JIRA pointers would be kindly appreciated. Cheers, -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200
Re: Increase in dropped mutations after major upgrade from 1.2.18 to 2.0.10
On Mon, Nov 10, 2014 at 12:46 PM, Duncan Sands duncan.sa...@gmail.com wrote: Hi Paulo, On 10/11/14 15:18, Paulo Ricardo Motta Gomes wrote: Hey, We've seen a considerable increase in the number of dropped mutations after a major upgrade from 1.2.18 to 2.0.10. I initially thought it was due to the extra load incurred by upgradesstables, but the dropped mutations continue even after all sstables are upgraded. are the clocks on all your nodes synchronized with each other? Ciao, Duncan. Yes, the servers are synchronized via NTP. Cheers! Additional info: Overall (read, write and range) latency improved with the upgrade, which is great, but I don't understand why dropped mutations has increased. I/O and CPU load is pretty much the same, number of completed tasks is the only metric that increased together with dropped mutations. I also noticed that the number of all time blocked FlushWriter operations is about 5% of completed operations, don't know if this is related, but in case it helps out... Anyone has a clue on what could that be? Or what should we monitor to find out? Any help or JIRA pointers would be kindly appreciated. Cheers, -- *Paulo Motta* Chaordic | /Platform/ _www.chaordic.com.br http://www.chaordic.com.br/_ +55 48 3232.3200 -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200
Re: efficiently generate complete database dump in text format
The best way to generate dumps from Cassandra is via the Hadoop integration (or Spark). You can find more info here: http://www.datastax.com/documentation/cassandra/2.1/cassandra/configuration/configHadoop.html http://wiki.apache.org/cassandra/HadoopSupport On Thu, Oct 9, 2014 at 4:19 AM, Gaurav Bhatnagar gbhatna...@gmail.com wrote: Hi, We have a Cassandra database column family containing 320 million rows, and each row contains about 15 columns. We want to take a monthly dump of this single column family in text format. We are planning the following approach to implement this functionality: 1. Take a snapshot of the Cassandra database using the nodetool utility. We specify the -cf flag with the column family name so that the snapshot contains data corresponding to a single column family. 2. We take a backup of this snapshot and move the backup to a separate physical machine. 3. We use the SSTable-to-JSON conversion utility to convert all the data files into JSON format. We have the following questions/doubts regarding the above approach: a) Generated JSON records contain a d (IS_MARKED_FOR_DELETE) flag; can I safely ignore all such JSON records? b) If I ignore all records marked by the d flag, can the JSON files generated in step 3 contain duplicate records, i.e. multiple entries for the same key? Is there any other, better approach to generate data dumps in text format? Regards, Gaurav -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200
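On questions (a) and (b): yes, per-SSTable dumps can contain multiple versions of the same key, and a record carrying the d flag may shadow an older live copy in another file, so you cannot simply drop deleted records before merging. A simplified model of the required merge (the row dict layout here is hypothetical, not the actual sstable2json schema — it only illustrates the last-timestamp-wins rule):

```python
def merge_rows(rows):
    """Merge rows dumped from multiple sstables.

    Keeps the newest version of each key (highest timestamp) and only then
    drops keys whose winning version is a deletion marker.
    """
    latest = {}
    for row in rows:
        key, ts = row["key"], row["ts"]
        if key not in latest or ts > latest[key]["ts"]:
            latest[key] = {"ts": ts,
                           "deleted": row.get("deleted", False),
                           "value": row.get("value")}
    return {k: v["value"] for k, v in latest.items() if not v["deleted"]}

rows = [
    {"key": "a", "ts": 1, "value": "old"},
    {"key": "a", "ts": 2, "deleted": True},  # newer tombstone shadows ts=1
    {"key": "b", "ts": 1, "value": "kept"},
]
print(merge_rows(rows))  # {'b': 'kept'}
```

Filtering d-flagged records *before* this merge would wrongly resurrect key "a" with its old value — which is why the Hadoop/Spark paths, which merge replicas properly, are the safer route for full dumps.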
Re: Multi-DC Repairs and Token Questions
This related issue might be of interest: https://issues.apache.org/jira/browse/CASSANDRA-7450 In 1.2 the -pr option does make cross-DC repairs, but you must ensure that all nodes from all datacenters execute repair, otherwise some ranges will be missed. This fix enables -pr and -local together, which was disabled in 2.0 because it didn't work (it also does not work in 1.2). On Tue, Oct 7, 2014 at 5:46 AM, Alain RODRIGUEZ arodr...@gmail.com wrote: Hi guys, sorry about digging this up, but is this bug also affecting 1.2.x versions? I can't see this being backported to 1.2 on the Jira. Was this bug introduced in 2.0? Anyway, how does nodetool repair -pr behave in a multi-DC environment, does it make cross-DC repairs or not? Should we remove the -pr option in a multi-DC context to remove entropy between DCs? I mean, a repair -pr is supposed to repair the primary range of the current node; does it also repair the corresponding primary range in other DCs? Thanks for any insight around this. 2014-06-03 8:06 GMT+02:00 Nick Bailey n...@datastax.com: See https://issues.apache.org/jira/browse/CASSANDRA-7317 On Mon, Jun 2, 2014 at 8:57 PM, Matthew Allen matthew.j.al...@gmail.com wrote: Hi Rameez, Chovatia, (sorry I initially replied to Dwight individually) SN_KEYSPACE and MY_KEYSPACE are just typos (I was trying to mask out identifiable information); they are the same keyspace. 
Keyspace: SN_KEYSPACE:
  Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy
  Durable Writes: true
  Options: [DC_VIC:2, DC_NSW:2]

In a nutshell, replication is working as expected; I'm just confused about token range assignments in a multi-DC environment and how repairs should work. From http://www.datastax.com/documentation/cassandra/1.2/cassandra/configuration/configGenTokens_c.html, it specifies *Multiple data center deployments: calculate the tokens for each data center so that the hash range is evenly divided for the nodes in each data center* Given that nodetool repair isn't multi-DC aware, in our production 18-node cluster (9 nodes in each DC), which of the following token ranges should be used (Murmur3 Partitioner)?

Token range divided evenly over the 2 DCs/18 nodes, as below?

Node  DC_NSW                 DC_VIC
1     -9223372036854775808   -8198552921648689608
2     -7173733806442603408   -6148914691236517208
3     -5124095576030431008   -4099276460824344808
4     -3074457345618258608   -2049638230412172408
5     -1024819115206086208   -8
6     1024819115206086192    2049638230412172392
7     3074457345618258592    4099276460824344792
8     5124095576030430992    6148914691236517192
9     7173733806442603392    8198552921648689592

Or an offset used for DC_VIC (i.e. DC_NSW + 100)?

Node  DC_NSW                 DC_VIC
1     -9223372036854775808   -9223372036854775708
2     -7173733806442603407   -7173733806442603307
3     -5124095576030431006   -5124095576030430906
4     -3074457345618258605   -3074457345618258505
5     -1024819115206086204   -1024819115206086104
6     1024819115206086197    1024819115206086297
7     3074457345618258598    3074457345618258698
8     5124095576030430999    5124095576030431099
9     7173733806442603400    7173733806442603500

It's too late for me to switch to vnodes. Hope that makes sense, thanks Matt On Thu, May 29, 2014 at 12:01 AM, Rameez Thonnakkal ssram...@gmail.com wrote: As Chovatia mentioned, the keyspaces seem to be different. 
Try DESCRIBE KEYSPACE SN_KEYSPACE and DESCRIBE KEYSPACE MY_KEYSPACE from CQL. This will give you an idea of how many replicas there are for these keyspaces. On Wed, May 28, 2014 at 11:49 AM, chovatia jaydeep chovatia_jayd...@yahoo.co.in wrote: What is your partitioner type? Is it org.apache.cassandra.dht.Murmur3Partitioner? In your repair command I do see two different keyspaces, MY_KEYSPACE and SN_KEYSPACE; are these two separate keyspaces or a typo? -jaydeep On Tuesday, 27 May 2014 10:26 PM, Matthew Allen matthew.j.al...@gmail.com wrote: Hi, I'm a bit confused regarding data ownership in a multi-DC environment. I have the following setup in a test cluster, with a keyspace with (placement_strategy = 'NetworkTopologyStrategy' and strategy_options = {'DC_NSW':2,'DC_VIC':2};)

Datacenter: DC_NSW
Replicas: 2
Address  Rack   Status  State   Load        Owns      Token
                                                      0
nsw1     rack1  Up      Normal  1007.43 MB  100.00%   -9223372036854775808
nsw2     rack1  Up      Normal  1008.08 MB  100.00%   0

Datacenter: DC_VIC
Replicas: 2
Address  Rack   Status  State   Load        Owns      Token
                                                      100
vic1     rack1  Up      Normal  1015.1 MB   100.00%   -9223372036854775708
vic2     rack1  Up      Normal  1015.13 MB  100.00%   100

My understanding is that both
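The two assignment schemes Matthew lists can be reproduced with a few lines of arithmetic. This is an illustrative sketch (not from the thread): it evenly divides the Murmur3 range among the nodes of one DC and offsets the second DC by a small constant, which corresponds to the "DC_NSW + 100" option.

```python
# Sketch: generate balanced Murmur3 initial_token values per DC, offsetting
# the second DC by a small constant so tokens never collide across DCs.
# Node counts (9 per DC) and the offset (100) mirror the example above.

def balanced_tokens(num_nodes, offset=0):
    """Evenly divide the Murmur3 range [-2**63, 2**63) into num_nodes tokens."""
    return [-2**63 + (2**64 // num_nodes) * i + offset for i in range(num_nodes)]

dc_nsw = balanced_tokens(9)              # DC_NSW tokens
dc_vic = balanced_tokens(9, offset=100)  # DC_VIC tokens, shifted by 100

for node, (nsw, vic) in enumerate(zip(dc_nsw, dc_vic), start=1):
    print(node, nsw, vic)
```

The values this produces match the offset assignment in the thread, e.g. node 2 gets -7173733806442603407 in DC_NSW and -7173733806442603307 in DC_VIC.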
backport of CASSANDRA-6916
Hello, Has anyone backported incremental replacement of compacted SSTables (CASSANDRA-6916) to 2.0? Is it doable, or are there too many dependencies introduced in 2.1? I haven't checked the ticket details yet, but just in case anyone has interesting info to share. Cheers, -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200
Re: backport of CASSANDRA-6916
own purposes but wouldn't mind making it public so people could patch it themselves if they want to (if nobody has already done so) :) On Tue, Sep 16, 2014 at 8:13 PM, Robert Coli rc...@eventbrite.com wrote: On Tue, Sep 16, 2014 at 2:56 PM, Paulo Ricardo Motta Gomes paulo.mo...@chaordicsystems.com wrote: Has anyone backported incremental replacement of compacted SSTables (CASSANDRA-6916) to 2.0? Is it doable, or are there many dependencies introduced in 2.1? Haven't checked the ticket details yet, but just in case anyone has interesting info to share. Are you looking to patch for public consumption, or for your own purposes? I just took the temperature of #cassandra-dev and they were cold on the idea as a public patch, because of potential impact on stability. =Rob -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200
Re: backport of CASSANDRA-6916
Because I want this specific feature, and not all 2.1 features, even though this is probably one of the most significant changes in 2.1. Upgrading would be nice, but I want to wait a little longer before fully jumping into 2.1 :) We're having sudden peaks in read latency some time after a massive batch write, which is most likely caused by the cold page cache of newly compacted sstables, and will hopefully be solved by this. On Tue, Sep 16, 2014 at 8:25 PM, James Briggs james.bri...@yahoo.com wrote: Paulo: Out of curiosity, why not just upgrade to 2.1 if you want the new features? You know you want to! :) Thanks, James Briggs -- Cassandra/MySQL DBA. Available in San Jose area or remote. -- *From:* Robert Coli rc...@eventbrite.com *To:* user@cassandra.apache.org user@cassandra.apache.org *Sent:* Tuesday, September 16, 2014 4:13 PM *Subject:* Re: backport of CASSANDRA-6916 On Tue, Sep 16, 2014 at 2:56 PM, Paulo Ricardo Motta Gomes paulo.mo...@chaordicsystems.com wrote: Has anyone backported incremental replacement of compacted SSTables (CASSANDRA-6916) to 2.0? Is it doable, or are there many dependencies introduced in 2.1? Haven't checked the ticket details yet, but just in case anyone has interesting info to share. Are you looking to patch for public consumption, or for your own purposes? I just took the temperature of #cassandra-dev and they were cold on the idea as a public patch, because of potential impact on stability. =Rob -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200
Re: Quickly loading C* dataset into memory (row cache)
Apparently Apple is using Cassandra as a massive multi-DC cache, as per their announcement during the summit, but probably DSE with in-memory enabled option. Would love to hear about similar use cases. On Fri, Sep 12, 2014 at 12:20 PM, Ken Hancock ken.hanc...@schange.com wrote: +1 for Redis. It's really nice, good primitives, and then you can do some really cool stuff chaining multiple atomic operations to create larger atomics through the lua scripting. On Thu, Sep 11, 2014 at 12:26 PM, Robert Coli rc...@eventbrite.com wrote: On Thu, Sep 11, 2014 at 8:30 AM, Danny Chan tofuda...@gmail.com wrote: What are you referring to when you say memory store? RAM disk? memcached? In 2014, probably Redis? =Rob -- *Ken Hancock *| System Architect, Advanced Advertising SeaChange International 50 Nagog Park Acton, Massachusetts 01720 ken.hanc...@schange.com | www.schange.com | NASDAQ:SEAC http://www.schange.com/en-US/Company/InvestorRelations.aspx Office: +1 (978) 889-3329 | [image: Google Talk:] ken.hanc...@schange.com | [image: Skype:]hancockks | [image: Yahoo IM:]hancockks[image: LinkedIn] http://www.linkedin.com/in/kenhancock [image: SeaChange International] http://www.schange.com/This e-mail and any attachments may contain information which is SeaChange International confidential. The information enclosed is intended only for the addressees herein and may not be copied or forwarded without permission from SeaChange International. -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200
Re: Too many SSTables after rebalancing cluster (LCS)
Deleting the json manifest worked like a charm. After 2 days of compactions I've got 50GB of extra space! :) Just a quick addendum: after deleting the json metadata file, I needed to restart the node, otherwise it just recreates the file from its in-memory state. Version: 1.2.16 On Wed, Aug 27, 2014 at 8:13 PM, Robert Coli rc...@eventbrite.com wrote: On Wed, Aug 27, 2014 at 3:27 PM, Nate McCall n...@thelastpickle.com wrote: Another option to force things - deleting the json metadata file for that table will cause LCS to put all SSTables in level 0 and begin recompacting them. That's possible in versions where the level is in a JSON file, which is versions before 2.0. In 2.0+ you can use nodetool for the same purpose. https://issues.apache.org/jira/browse/CASSANDRA-5271 (Fixed; 2.0 beta 1): Create tool to drop sstables to level 0 =Rob -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200
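For reference, the 1.2.x procedure described above looks roughly like the following. This is an operational sketch, not exact commands: the data path and keyspace/table names are placeholder assumptions, and the node must be stopped first (as noted, a running node will recreate the manifest). On 2.0+ there is no manifest file; the offline sstablelevelreset tool from CASSANDRA-5271 serves the same purpose.

```
# 1.2.x only -- sketch with assumed paths:
nodetool drain                    # flush memtables before stopping
sudo service cassandra stop
rm /var/lib/cassandra/data/<keyspace>/<table>/<table>.json   # LCS manifest
sudo service cassandra start      # all SSTables re-enter level 0 and recompact
```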
Re: Too many SSTables after rebalancing cluster (LCS)
Great idea, will try that (right now it is 10%, but being more aggressive should hopefully work). Cheers! On Wed, Aug 27, 2014 at 7:02 PM, Nate McCall n...@thelastpickle.com wrote: Try turning down 'tombstone_threshold' to something like '0.05' from its default of '0.2'. This will cause the SSTable to be considered for tombstone-only compactions more frequently (if 5% of the columns are tombstones instead of 20%). For a bit more info, see: http://www.datastax.com/documentation/cql/3.0/cql/cql_reference/compactSubprop.html On Tue, Aug 26, 2014 at 1:38 PM, Paulo Ricardo Motta Gomes paulo.mo...@chaordicsystems.com wrote: Hey folks, After adding more nodes and moving tokens of old nodes to rebalance the ring, I noticed that the old nodes had significantly more data than the newly bootstrapped nodes, even after cleanup. I noticed that the old nodes had a much larger number of SSTables on LCS CFs, with most of them located on the last level:

Node N-1 (old node): [1, 10, 102/100, 173, 2403, 0, 0, 0, 0] (total: 2695)
Node N (new node): [1, 10, 108/100, 214, 0, 0, 0, 0, 0] (total: 339)
Node N+1 (old node): [1, 10, 87, 113, 1076, 0, 0, 0, 0] (total: 1287)

Since these sstables have a lot of tombstones, and they're not updated frequently, they remain in the last level forever and are never cleaned. What is the solution here? The good old change to STCS and then back to LCS, or is there something less brute force? Environment: Cassandra 1.2.16 - non-vnodes Any help would be very much appreciated. Cheers, -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200 -- - Nate McCall Austin, TX @zznate Co-Founder Sr. Technical Consultant Apache Cassandra Consulting http://www.thelastpickle.com -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200
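The setting Nate mentions is adjusted per table. A minimal sketch, assuming a CQL3 table on 1.2.x (keyspace/table names are placeholders, and 'class' must repeat the compaction strategy already in use):

```sql
ALTER TABLE my_keyspace.my_cf
  WITH compaction = {'class': 'LeveledCompactionStrategy',
                     'tombstone_threshold': '0.05'};
```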
Too many SSTables after rebalancing cluster (LCS)
Hey folks, After adding more nodes and moving tokens of old nodes to rebalance the ring, I noticed that the old nodes had significantly more data than the newly bootstrapped nodes, even after cleanup. I noticed that the old nodes had a much larger number of SSTables on LCS CFs, with most of them located on the last level:

Node N-1 (old node): [1, 10, 102/100, 173, 2403, 0, 0, 0, 0] (total: 2695)
Node N (new node): [1, 10, 108/100, 214, 0, 0, 0, 0, 0] (total: 339)
Node N+1 (old node): [1, 10, 87, 113, 1076, 0, 0, 0, 0] (total: 1287)

Since these sstables have a lot of tombstones, and they're not updated frequently, they remain in the last level forever and are never cleaned. What is the solution here? The good old change to STCS and then back to LCS, or is there something less brute force? Environment: Cassandra 1.2.16 - non-vnodes Any help would be very much appreciated. Cheers, -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200
Re: EC2 SSD cluster costs
Still using good ol' m1.xlarge here + external caching (memcached). Trying to adapt our use case to have different clusters for different use cases so we can leverage SSD at an acceptable cost in some of them. On Tue, Aug 19, 2014 at 1:05 PM, Shane Hansen shanemhan...@gmail.com wrote: Again, depends on your use case. But we wanted to keep the data per node below 500gb, and we found raided ssds to be the best bang for the buck for our cluster. I think we moved from the i2 to c3 because our bottleneck tended to be CPU utilization (from parsing requests). (Disclaimer, we're not Cassandra veterans but we're not part of the RF=N=3 club) On Tue, Aug 19, 2014 at 10:00 AM, Russell Bradberry rbradbe...@gmail.com wrote: Short answer, it depends on your use-case. We migrated to i2.xlarge nodes and saw an immediate increase in performance. If you just need plain ole raw disk space and don’t have a performance requirement to meet then the m1 machines would work, or hell even SSD EBS volumes may work for you. The problem we were having is that we couldn’t fill the m1 machines because we needed to add more nodes for performance. Now we have much more power and just the right amount of disk space. Basically saying, these are not apples-to-apples comparisons On August 19, 2014 at 11:57:10 AM, Jeremy Jongsma (jer...@barchart.com) wrote: The latest consensus around the web for running Cassandra on EC2 seems to be "use the new SSD instances." I've not seen any mention of the elephant in the room - using the new SSD instances significantly raises the cluster cost per TB. With Cassandra's strength being linear scalability to many terabytes of data, it strikes me as odd that everyone is recommending such a large storage cost hike almost without reservation. Monthly cost comparison for a 100TB cluster (non-reserved instances):

m1.xlarge (2x420 non-SSD): $30,000 (120 nodes)
m3.xlarge (2x40 SSD): $250,000 (1250 nodes!
Clearly not an option) i2.xlarge (1x800 SSD): $76,000 (125 nodes) Best case, the cost goes up 150%. How are others approaching these new instances? Have you migrated and eaten the costs, or are you staying on previous generation until prices come down? -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200
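Jeremy's comparison can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is not from the thread: the hourly prices are approximate 2014 on-demand figures, a TB is treated as 1000 GB of raw disk, and replication factor and disk headroom are ignored, so treat the dollar outputs as rough estimates only.

```python
# Rough reproduction of the 100TB cluster cost comparison. All prices are
# assumed approximations of 2014 us-east on-demand rates; RF and free-space
# headroom are deliberately ignored to match the thread's simple math.
import math

def cluster_cost(target_tb, disk_gb_per_node, price_per_hour, hours=730):
    """Nodes needed to hold target_tb of raw data, and rough monthly cost."""
    nodes = math.ceil(target_tb * 1000 / disk_gb_per_node)
    return nodes, nodes * price_per_hour * hours

for name, disk_gb, price in [("m1.xlarge", 840, 0.35),   # 2x420 spinning
                             ("m3.xlarge", 80, 0.28),    # 2x40 SSD
                             ("i2.xlarge", 800, 0.853)]: # 1x800 SSD
    nodes, monthly = cluster_cost(100, disk_gb, price)
    print(f"{name}: {nodes} nodes, ~${monthly:,.0f}/month")
```

The node counts reproduce the thread's figures (120, 1250, and 125), and the monthly totals land in the same ballpark as the quoted $30,000 / $250,000 / $76,000.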
Re: How to maintain the N-most-recent versions of a value?
You might be interested in the following ticket: https://issues.apache.org/jira/browse/CASSANDRA-3929 There's a patch available that was not integrated because it's not possible to guarantee exactly N values will be kept, and there are some other problems with deletions, but it may be useful depending on your usage characteristics. On Fri, Jul 18, 2014 at 7:58 AM, Laing, Michael michael.la...@nytimes.com wrote: The cql you provided is invalid. You probably meant something like:

CREATE TABLE foo (
  rowkey text,
  family text,
  qualifier text,
  version int,
  value blob,
  PRIMARY KEY ((rowkey, family, qualifier), version)
) WITH CLUSTERING ORDER BY (version DESC);

We use ttl's and LIMIT for structures like these, paying attention to the construction of the partition key so that partition sizes are reasonable. If the blob might be large, store it somewhere else. We use S3 but you could also put it in another C* table. In 2.1 the row cache may help as it will store N rows per recently accessed partition, starting at the beginning of the partition. ml On Fri, Jul 18, 2014 at 6:30 AM, Benedict Elliott Smith belliottsm...@datastax.com wrote: If the versions can be guaranteed to be adjacent (i.e. if the latest version is V, the prior version is V-1) you could issue a delete at the same time as an insert for V-N-(buffer), where buffer >= 0. In general guaranteeing that is probably hard, so this seems like something that would be nice to have C* manage for you. Unfortunately we don't have anything on the roadmap to help with this. A custom compaction strategy might do the trick, or permitting some filter during compaction that can omit/tombstone certain records based on the input data.
This latter option probably wouldn't be too hard to implement, although it might not offer any guarantees about expiring records in order without incurring extra compaction cost (you could reasonably easily guarantee the most recent N are present, but the cleaning up of older records might happen haphazardly, in no particular order, and without any promptness guarantees, if you want to do it cheaply). Feel free to file a ticket, or submit a patch! On Fri, Jul 18, 2014 at 1:32 AM, Clint Kelly clint.ke...@gmail.com wrote: Hi everyone, I am trying to design a schema that will keep the N-most-recent versions of a value. Currently my table looks like the following: CREATE TABLE foo ( rowkey text, family text, qualifier text, version long, value blob, PRIMARY KEY (rowkey, family, qualifier, version)) WITH CLUSTER ORDER BY (rowkey ASC, family ASC, qualifier ASC, version DESC)); Is there any standard design pattern for updating such a layout such that I keep the N-most-recent (version, value) pairs for every unique (rowkey, family, qualifier)? I can't think of any way to do this without doing a read-modify-write. The best thing I can think of is to use TTL to approximate the desired behavior (which will work if I know how often we are writing new data to the table). I could also use LIMIT N in my queries to limit myself to only N items, but that does not address any of the storage-size issues. In case anyone is curious, this question is related to some work that I am doing translating a system built on HBase (which provides this keep the N-most-recent-version-of-a-cell behavior) to Cassandra while providing the user with as-similar-as-possible an interface. Best regards, Clint -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200
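Benedict's "delete V-N-buffer on every insert" idea can be illustrated with a toy in-memory model. This is a sketch of the logic only, not Cassandra code; in practice the INSERT and the DELETE would be issued as separate statements (e.g. together in a batch), and N/BUFFER here are arbitrary example values.

```python
# Toy model: versions are adjacent integers, so inserting version V can be
# paired with deleting version V - N - BUFFER, keeping roughly the newest
# N (+ BUFFER) versions per key without a read-modify-write.

N, BUFFER = 3, 1

def insert(store, key, version, value):
    store.setdefault(key, {})[version] = value
    # "tombstone" the version that just fell out of the retention window
    store[key].pop(version - N - BUFFER, None)

store = {}
for v in range(10):
    insert(store, ("row1", "fam", "qual"), v, f"blob-{v}")

# Only the newest N + BUFFER versions remain
print(sorted(store[("row1", "fam", "qual")]))  # → [6, 7, 8, 9]
```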
Re: unable to find sufficient sources for streaming range
Are you using the -Dcassandra.replace_address=address_of_dead_node flag to replace the removed node, according to http://www.datastax.com/documentation/cassandra/1.2/cassandra/operations/ops_replace_node_t.html ? If yes and the new node has the same address as the replaced node, you might be hitting CASSANDRA-6622 ( https://issues.apache.org/jira/browse/CASSANDRA-6622), that was fixed only in 1.2.16. Cheers, On Wed, Jul 2, 2014 at 8:14 PM, Daning Wang dan...@netseer.com wrote: We are running Cassandra 1.2.5 We have 8 nodes cluster, and we removed one machine from cluster and try to add it back(the purpose is we are using vnodes, some node has more tokens so by rejoining this machine we hope it could get some loads from the busy machines). But we got following exception and the node cannot add to the ring anymore. Please help, Thanks in advance, INFO 16:01:56,260 JOINING: Starting to bootstrap... ERROR 16:01:56,514 Exception encountered during startup java.lang.IllegalStateException: unable to find sufficient sources for streaming range (131921530760098415548184818173535242096,132123583169200197961735373586277861750] at org.apache.cassandra.dht.RangeStreamer.getRangeFetchMap(RangeStreamer.java:205) at org.apache.cassandra.dht.RangeStreamer.addRanges(RangeStreamer.java:129) at org.apache.cassandra.dht.BootStrapper.bootstrap(BootStrapper.java:81) at org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:924) at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:693) at org.apache.cassandra.service.StorageService.initServer(StorageService.java:548) at org.apache.cassandra.service.StorageService.initServer(StorageService.java:445) at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:325) at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:413) at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:456) java.lang.IllegalStateException: unable to find 
sufficient sources for streaming range (131921530760098415548184818173535242096,132123583169200197961735373586277861750] at org.apache.cassandra.dht.RangeStreamer.getRangeFetchMap(RangeStreamer.java:205) at org.apache.cassandra.dht.RangeStreamer.addRanges(RangeStreamer.java:129) at org.apache.cassandra.dht.BootStrapper.bootstrap(BootStrapper.java:81) at org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:924) at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:693) at org.apache.cassandra.service.StorageService.initServer(StorageService.java:548) at org.apache.cassandra.service.StorageService.initServer(StorageService.java:445) at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:325) at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:413) at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:456) Exception encountered during startup: unable to find sufficient sources for streaming range (131921530760098415548184818173535242096,132123583169200197961735373586277861750] ERROR 16:01:56,518 Exception in thread Thread[StorageServiceShutdownHook,5,main] java.lang.NullPointerException at org.apache.cassandra.service.StorageService.stopRPCServer(StorageService.java:321) at org.apache.cassandra.service.StorageService.shutdownClientServers(StorageService.java:362) at org.apache.cassandra.service.StorageService.access$000(StorageService.java:88) at org.apache.cassandra.service.StorageService$1.runMayThrow(StorageService.java:513) Daning -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200
Re: nodetool repair -snapshot option?
If you find it useful, I created a tool where you input the node IP, keyspace, column family, and optionally the number of partitions (default: 32K), and it outputs the list of subranges for that node, CF, and partition size: https://github.com/pauloricardomg/cassandra-list-subranges So you can basically iterate over the output of that and do subrange repair for each node and CF, maybe in parallel. :) On Mon, Jun 30, 2014 at 10:26 PM, Phil Burress philburress...@gmail.com wrote: One last question. Any tips on scripting a subrange repair? On Mon, Jun 30, 2014 at 7:12 PM, Phil Burress philburress...@gmail.com wrote: We are running repair -pr. We've tried subrange manually and that seems to work ok. I guess we'll go with that going forward. Thanks for all the info! On Mon, Jun 30, 2014 at 6:52 PM, Jaydeep Chovatia chovatia.jayd...@gmail.com wrote: Are you running a full repair or on a subset? If you are running a full repair, then try running on a subset of ranges, which means less data to worry about during repair, and that would help the Java heap in general. You will have to do multiple iterations to complete the entire range, but at least it will work. -jaydeep On Mon, Jun 30, 2014 at 3:22 PM, Robert Coli rc...@eventbrite.com wrote: On Mon, Jun 30, 2014 at 3:08 PM, Yuki Morishita mor.y...@gmail.com wrote: Repair uses the snapshot option by default since 2.0.2 (see NEWS.txt). As a general meta comment, the process by which operationally important defaults change in Cassandra seems ad-hoc and sub-optimal. For the record, my view was that this change, which makes repair even slower than it previously was, was probably overly optimistic. It's also weird in that it changes default behavior which has been unchanged since the start of Cassandra time and is therefore probably automated against. Why was it so critically important to switch to snapshot repair that it needed to be shotgunned as a new default in 2.0.2?
=Rob -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200
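To answer Phil's scripting question concretely: a subrange-repair driver boils down to splitting each node's primary range into equal Murmur3 subranges and issuing one repair per subrange. The sketch below mirrors the idea behind the linked cassandra-list-subranges tool, not its actual code, and assumes a nodetool version that supports the -st/-et flags; keyspace/CF names are placeholders.

```python
# Sketch: split a token range into K contiguous subranges and emit one
# `nodetool repair -st <start> -et <end>` command per subrange.

def split_range(start, end, parts):
    """Split the token range (start, end] into `parts` contiguous subranges."""
    width = (end - start) // parts
    bounds = [start + width * i for i in range(parts)] + [end]
    return list(zip(bounds[:-1], bounds[1:]))

# Example: a node's primary range, split 4 ways (tokens are illustrative)
for st, et in split_range(-9223372036854775808, -7173733806442603408, 4):
    print(f"nodetool repair -st {st} -et {et} my_keyspace my_cf")
```

Running the emitted commands sequentially (or a few in parallel) repairs the full range in small pieces, which keeps validation compactions and streaming sessions short.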
repair takes 10x more time in one DC compared to the other
Hello, I'm running repair on a large CF with the --local flag in 2 different DCs. In one of the DCs the operation takes about 1 hour per node, while in the other it takes 10 hours per node. I would expect the times to differ, but not so much. The writes on that CF all come from the DC where it takes 10 hours per node, could this be the cause why it takes so long on this DC? Additional info: C* 1.2.16, both DCs have the same replication factor. Cheers, -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200
Re: repair takes 10x more time in one DC compared to the other
Thanks for the explanation, but I got slightly confused: From my understanding, you just described the behavior of the -pr/--partitioner-range option: "Repair only the first range returned by the partitioner for the node", so I would understand that repairs of the same CFs in different DCs with only the -pr option could take different times. However, according to the description of the -local/--in-local-dc option, it only repairs against nodes in the same data center, but you said that the range will be repaired for all replicas in all data centers, even with the -local option. Or did you confuse it with the -pr option? In any case, I'm using both the -local and -pr options; what is the expected behavior in that case? Cheers, On Wed, Jun 25, 2014 at 12:46 PM, Sylvain Lebresne sylv...@datastax.com wrote: TL;DR, this is not unexpected and this is perfectly fine. For every node, 'repair --local' will repair the primary (where primary means the first range on the ring picked by the consistent hashing for this node given its token, nothing more) range of the node in the ring. And that range will be repaired for all replicas in all data centers. When you assign tokens to multiple DCs, it's actually pretty common to offset the tokens of one DC slightly compared to the other one. This will result in the primary ranges being always small in one DC but not the other. But please note that this is perfectly ok; it does not imply any imbalance between data centers. It also doesn't really mean that the nodes of one DC actually do a lot more work than the others: all nodes most likely contribute roughly the same amount of work to the repair. It only means that the nodes of one DC coordinate more repair work than those of the other DC. Which is not really a big deal, since coordinating a repair is cheap. -- Sylvain On Wed, Jun 25, 2014 at 4:43 PM, Paulo Ricardo Motta Gomes paulo.mo...@chaordicsystems.com wrote: Hello, I'm running repair on a large CF with the --local flag in 2 different DCs.
In one of the DCs the operation takes about 1 hour per node, while in the other it takes 10 hours per node. I would expect the times to differ, but not so much. The writes on that CF all come from the DC where it takes 10 hours per node, could this be the cause why it takes so long on this DC? Additional info: C* 1.2.16, both DCs have the same replication factor. Cheers, -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200 -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200
Re: repair takes 10x more time in one DC compared to the other
Hmm.. good to find out, thanks for the reference! This explains the time differences between repairs in different DCs. But I think using -local and -pr should still be supported simultaneously, since you may want to repair nodes sequentially in the local DC (-local) without re-repairing ranges of neighbor nodes (-pr). On Wed, Jun 25, 2014 at 1:48 PM, Sylvain Lebresne sylv...@datastax.com wrote: I see. Well, you shouldn't use both -local and -pr together, they don't make sense together. Which is the reason why their combination will be rejected in 2.0.9 (you can check https://issues.apache.org/jira/browse/CASSANDRA-7317 for details). Basically, the result of using both is that lots of stuffs don't get repaired. On Wed, Jun 25, 2014 at 6:11 PM, Paulo Ricardo Motta Gomes paulo.mo...@chaordicsystems.com wrote: Thanks for the explanation, but I got slightly confused: From my understanding, you just described the behavior of the -pr/--partitioner-range option: Repair only the first range returned by the partitioner for the node. , so I would understand that repairs in the same CFs in different DCs with only the -pr option could take different times. However according to the description of the -local/--in-local-dc option, it only repairs against nodes in the same data center, but you said that the range will be repaired for all replica in all data-centers, even with the -local option, or did you confuse it with -pr option? In any case, I'm using both -local and -pr options, what is the expected behavior in that case? Cheers, On Wed, Jun 25, 2014 at 12:46 PM, Sylvain Lebresne sylv...@datastax.com wrote: TL;DR, this is not unexpected and this is perfectly fine. For every node, 'repair --local' will repair the primary (where primary means the first range on the ring picked by the consistent hashing for this node given its token, nothing more) range of the node in the ring. And that range will be repaired for all replica in all data-centers. 
When you assign tokens to multiple DCs, it's actually pretty common to offset the tokens of one DC slightly compared to the other one. This will result in the primary ranges being always small in one DC but not the other. But please note that this is perfectly ok; it does not imply any imbalance between data centers. It also doesn't really mean that the nodes of one DC actually do a lot more work than the others: all nodes most likely contribute roughly the same amount of work to the repair. It only means that the nodes of one DC coordinate more repair work than those of the other DC. Which is not really a big deal, since coordinating a repair is cheap. -- Sylvain On Wed, Jun 25, 2014 at 4:43 PM, Paulo Ricardo Motta Gomes paulo.mo...@chaordicsystems.com wrote: Hello, I'm running repair on a large CF with the --local flag in 2 different DCs. In one of the DCs the operation takes about 1 hour per node, while in the other it takes 10 hours per node. I would expect the times to differ, but not so much. The writes on that CF all come from the DC where it takes 10 hours per node; could this be why it takes so long in this DC? Additional info: C* 1.2.16, both DCs have the same replication factor. Cheers, -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200 -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200 -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200
Re: Best practices for repair
Hello Paolo, I just published an open source version of the dsetool list_subranges command, which will enable you to perform subrange repair as described in the post. You can find the code and usage instructions here: https://github.com/pauloricardomg/cassandra-list-subranges Currently available for 1.2.16, but I guess that just changing the version on the pom.xml and recompiling it will make it work on 2.0.x. Cheers, Paulo On Thu, Jun 19, 2014 at 4:40 PM, Jack Krupansky j...@basetechnology.com wrote: The DataStax doc should be current best practices: http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_repair_nodes_c.html If you or anybody else finds it inadequate, speak up. -- Jack Krupansky -Original Message- From: Paolo Crosato Sent: Thursday, June 19, 2014 10:13 AM To: user@cassandra.apache.org Subject: Best practices for repair Hi everybody, we have some problems running repairs on a timely schedule. We have a three-node deployment, and we start repair on one node every week, repairing one column family at a time. However, when we run into the big column families, repair sessions usually hang indefinitely, and we have to restart them manually. The script runs commands like: nodetool repair keyspace columnfamily, one by one. This has not been a major issue for some time, since we never delete data; however, we would like to sort the issue out once and for all. Reading resources on the net, I came to the conclusion that we could: 1) either run a repair session like the one above, but with the -pr switch, and run it on every node, not just on one 2) or run sub-range repair as described here http://www.datastax.com/dev/blog/advanced-repair-techniques , which would be the best option. However, the latter procedure would require us to write some Java program that calls describe_splits to get the tokens to feed nodetool repair with. Is it true that the second procedure is available out of the box only in the commercial version of OpsCenter?
I would like to know if these are the current best practices for repairs or if there is some other option that makes repair easier to perform, and more reliable that it is now. Regards, Paolo Crosato -- Paolo Crosato Software engineer/Custom Solutions e-mail: paolo.cros...@targaubiest.com -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200
Re: Cannot query secondary index
Our approach for this scenario is to run a hadoop job that periodically cleans old entries, but I admit it's far from ideal. Would be nice to have a more native way to perform these kinds of tasks. There's a legend about a compaction strategy that keeps only the N first entries of a partition key, but I don't think it was implemented yet, but if I remember correctly there's a JIRA ticket about it. On Tue, Jun 10, 2014 at 3:39 PM, Redmumba redmu...@gmail.com wrote: Honestly, this has been by far my single biggest obstacle with Cassandra for time-based data--cleaning up the old data when the deletion criteria (i.e., date) isn't the primary key. I've asked about a few different approaches, but I haven't really seen any feasible options that can be implemented easily. I've seen the following: 1. Use date-based tables, then drop old tables, ala audit_table_20140610, audit_table_20140609, etc.. But then I run into the issue of having to query every table--I would have to execute queries against every day to get the data, and then merge the data myself. Unless, there's something in the binary driver I'm missing, it doesn't sound like this would be practical. 2. Use a TTL But then I have to basically decide on a value that works for everything and, if it ever turns out I overestimated, I'm basically SOL, because my cluster will be out of space. 3. Maintain a separate index of days to keys, and use this index as the reference for which keys to delete. But then this requires maintaining another index and a relatively manual delete. I can't help but feel that I am just way over-engineering this, or that I'm missing something basic in my data model. Except for the last approach, I can't help but feel that I'm overlooking something obvious. Andrew Of course, Jonathan, I'll do my best! It's an auditing table that, right now, uses a primary key consisting of a combination of a combined partition id of the region and the object id, the date, and the process ID. 
Each event in our system will create anywhere from 1-20 rows, for example, and multiple parts of the system might be working on the same object ID. So the CF is constantly being appended to, but reads are rare. CREATE TABLE audit ( id bigint, region ascii, date timestamp, pid int, PRIMARY KEY ((id, region), date, pid) ); Data is queried on a specific object ID and region. Optionally, users can restrict their query to a specific date range, which the above data model provides. However, we generate quite a bit of data, and we want a convenient way to get rid of the oldest data. Since our system scales with the time of year, we might get 50GB a day during peak, and 5GB of data off peak. We could pick the safest number--let's say, 30 days--and set the TTL using that. The problem there is that, for most of the year, we'll be using only a very small percentage of our available space. What I'd like to be able to do is drop old tables as needed--i.e., let's say when we hit 80% load across the cluster (or some such metric that takes the cluster-wide load into account), I want to drop the oldest day's records until we're under 80%. That way, we're always using the maximum amount of space we can, without having to worry about getting to the point where we run out of space cluster-wide. My thoughts are--we could always make the date part of the primary key, but then we'd either a) have to query the entire range of dates, or b) have to force a small date range when querying. What are the penalties? Do you have any other suggestions? On Mon, Jun 9, 2014 at 5:15 PM, Jonathan Lacefield jlacefi...@datastax.com wrote: Hello, Will you please describe the use case and what you are trying to model? What are some questions/queries that you would like to serve via Cassandra? This will help the community help you a little better.
Jonathan Lacefield Solutions Architect, DataStax (404) 822 3487 http://www.linkedin.com/in/jlacefield http://www.datastax.com/cassandrasummit14 On Mon, Jun 9, 2014 at 7:51 PM, Redmumba redmu...@gmail.com wrote: I've been trying to work around using date-based tables because I'd like to avoid the overhead. It seems, however, that this is just not going to work. So here's a question--for these date-based tables (i.e., a table per day/week/month/whatever), how are they queried? If I keep 60 days worth of auditing data, for example, I'd need to query all 60 tables--can I do that smoothly? Or do I have to have 60 different select statements? Is there a way for me to run the same query against all the tables? On Mon, Jun 9, 2014 at 3:42 PM, Redmumba redmu...@gmail.com wrote: Ah, so the secondary indices are really secondary against the primary key. That makes sense. I'm beginning to see why the whole date-based table approach is the only one I've been able to
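On the "can I query 60 tables smoothly?" question above: the usual answer is client-side fan-out, one statement per daily table, merging the already-sorted results on the client. A toy sketch in Python with in-memory stand-ins for the per-day tables (table names, schema, and data here are hypothetical; in real code each per-table read would be a driver `session.execute()` call):

```python
from datetime import date, timedelta
from heapq import merge

# Toy stand-in for per-day audit tables: table name -> rows sorted by date.
tables = {
    "audit_20140608": [("obj1", "2014-06-08T10:00", 11)],
    "audit_20140609": [("obj1", "2014-06-09T09:30", 12)],
    "audit_20140610": [("obj1", "2014-06-10T14:45", 13)],
}

def table_names(start: date, days: int):
    """Generate the per-day table names covering the retention window."""
    return ["audit_%s" % (start + timedelta(n)).strftime("%Y%m%d")
            for n in range(days)]

def query_all(start: date, days: int):
    # One SELECT per daily table; because each table returns rows already
    # sorted by the date column, a k-way merge yields one ordered stream.
    per_table = [tables.get(name, []) for name in table_names(start, days)]
    return list(merge(*per_table, key=lambda row: row[1]))

rows = query_all(date(2014, 6, 8), 3)
```

In practice the per-table queries can be issued asynchronously so the fan-out costs roughly one round trip rather than sixty.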
Re: I have a deaf node?
This post should definitely make it to the hall of fame!! :) On Mon, Jun 2, 2014 at 12:05 AM, Tim Dunphy bluethu...@gmail.com wrote: That made my day. Not to worry though unless you start seeing the number 23 in your host ids. Yeah man, glad to provide some comic relief to the list! ;) On Sun, Jun 1, 2014 at 11:01 PM, Apostolis Xekoukoulotakis xekou...@gmail.com wrote: That made my day. Not to worry though unless you start seeing the number 23 in your host ids. On Jun 2, 2014 12:40 AM, Kevin Burton bur...@spinn3r.com wrote: could be worse… it could be under-caffeinated and say decafbad … On Sat, May 31, 2014 at 10:45 AM, Tim Dunphy bluethu...@gmail.com wrote: I think the deaf thing is just the ending of the host ID in hexadecimal. It's an extraordinary coincidence that it ends with DEAF :D Hah.. yeah that thought did cross my mind. :) On Sat, May 31, 2014 at 1:35 PM, DuyHai Doan doanduy...@gmail.com wrote: I think the deaf thing is just the ending of the host ID in hexadecimal. It's an extraordinary coincidence that it ends with DEAF :D On Sat, May 31, 2014 at 6:38 PM, Tim Dunphy bluethu...@gmail.com wrote: I didn't realize cassandra nodes could develop hearing problems. :) But I have a dead node in my cluster I would like to get rid of. [root@beta:~] #nodetool status Datacenter: datacenter1 === Status=Up/Down |/ State=Normal/Leaving/Joining/Moving -- Address Load Tokens Owns Host ID Rack UN 10.10.1.94 199.6 KB 256 49.4% fd2f76ae-8dcf-4e93-a37f-bf1e9088696e rack1 DN 10.10.1.64 ? 256 50.6% f2a48fc7-a362-43f5-9061-4bb3739f*deaf* rack1 I was just wondering what this could indicate and if that might mean that I will have some more trouble than I would be bargaining for in getting rid of it. I've made a couple of attempts to get rid of this so far. I'm about to try again. Thanks Tim -- GPG me!! gpg --keyserver pool.sks-keyservers.net --recv-keys F186197B -- GPG me!!
gpg --keyserver pool.sks-keyservers.net --recv-keys F186197B -- Founder/CEO Spinn3r.com Location: *San Francisco, CA* Skype: *burtonator* blog: http://burtonator.wordpress.com … or check out my Google+ profile https://plus.google.com/102718274791889610666/posts http://spinn3r.com War is peace. Freedom is slavery. Ignorance is strength. Corporations are people. -- GPG me!! gpg --keyserver pool.sks-keyservers.net --recv-keys F186197B -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200
Re: How does cassandra page through low cardinality indexes?
Really informative thread, thank you! We had a secondary index trauma a while ago, and since then we knew it was not a good idea for most of the cases, but now it's even more clear why. On Thu, May 29, 2014 at 5:31 PM, Robert Coli rc...@eventbrite.com wrote: On Thu, May 29, 2014 at 1:08 PM, DuyHai Doan doanduy...@gmail.com wrote: Hello Robert There are some maths involved when considering the performance of secondary index in C* Yes, these are the maths which are behind my FIXMEs in the original post. I merely have not had time to explicitly describe them in the context of that draft post. Thank you for doing so! When I reference them in my eventual post, I will be sure to credit you. Because of its distributed nature, finding a *good* use-case for 2nd index is quite tricky, partly because it depends on the query pattern but also on the cluster size and data distribution. Yep, and if you're doing this tricky thing, you probably want less opacity and more explicit understanding of what is happening under the hood and you want to be sure you won't run into a bug in the implementation, hence manual secondary index CFs. Apart from the performance aspect, secondary index column families use SizeTiered compaction so for an use case with a lot of update you'll have plenty of tombstones... I'm not sure how end user can switch to Leveled Compaction for 2nd index... Per Aleksey, secondary index column families actually use the compaction strategy of the column family they index. I agree that this seems weird, and is likely just another implementation detail you relinquish control of for the convenience of 2i. =Rob -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200
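The "manual secondary index CF" Rob mentions is just a second table keyed by the indexed value, maintained by the application on every write. A minimal in-memory model of that double-write pattern (Python; dicts stand in for the base and index tables, not driver code):

```python
# Base "table": user_id -> row. Index "table": the indexed value (country)
# is the partition key, with the matching user_ids inside the partition.
users = {}              # user_id -> {"country": ...}
users_by_country = {}   # country -> set of user_ids

def upsert_user(user_id, country):
    old = users.get(user_id)
    if old is not None and old["country"] != country:
        # Remove the stale index entry (a delete against the index table);
        # this is the cleanup that built-in 2i hides from you.
        users_by_country[old["country"]].discard(user_id)
    users[user_id] = {"country": country}
    users_by_country.setdefault(country, set()).add(user_id)

def find_by_country(country):
    # A single-partition read on the index table, followed (in real code)
    # by point reads on the base table for each matching key.
    return sorted(users_by_country.get(country, set()))

upsert_user("u1", "BR")
upsert_user("u2", "BR")
upsert_user("u1", "US")   # the update moves u1's index entry
```

The trade-off matches the thread: you get explicit control over the index table's schema and compaction strategy, at the cost of doing the index maintenance (and handling partial-write races) yourself.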
Re: Number of rows under one partition key
Hey, We are considering upgrading from 1.2 to 2.0; why don't you consider 2.0 ready for production yet, Robert? Have you written about this somewhere already? A bit off-topic in this discussion, but it would be interesting to know; your posts are generally very enlightening. Cheers, On Thu, May 29, 2014 at 8:51 PM, Robert Coli rc...@eventbrite.com wrote: On Thu, May 15, 2014 at 6:10 AM, Vegard Berget p...@fantasista.no wrote: I know this has been discussed before, and I know there are limitations to how many rows one partition key in practice can handle. But I am not sure if number of rows or total data is the deciding factor. Both. In terms of data size, partitions containing over a small number of hundreds of Megabytes begin to see diminishing returns in some cases. Partitions over 64 megabytes are compacted on disk, which should give you a rough sense of what Cassandra considers a large partition. Should we add another partition key to avoid 1 000 000 rows in the same thrift-row (which is how I understand it is actually stored)? Or is 1 000 000 rows okay? Depending on row size and access patterns, 1Mn rows is not extremely large. There are, however, some row sizes and operations where this order of magnitude of columns might be slow. Other considerations, for example compaction strategy and if we should do an upgrade to 2.0 because of this (we will upgrade anyway, but if it is recommended we will continue to use 2.0 in development and upgrade the production environment sooner) You should not upgrade to 2.0 in order to address this concern. You should upgrade to 2.0 when it is stable enough to run in production, which IMO is not yet. YMMV. I have done some testing, inserting a million rows and selecting them all, counting them and selecting individual rows (with both clientid and id) and it seems fine, but I want to ask to be sure that I am on the right track.
If the access patterns you are using perform the way you would like with representative size data, sounds reasonable to me? If you are able to select all million rows within a reasonable percentage of the relevant timeout, I presume they cannot be too huge in terms of data size! :D =Rob -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200
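If a million-plus rows per partition does become a problem, the standard fix is the extra partition-key component Vegard mentions: a bucket derived from the row number (or a time window), which caps partition size at the cost of a small read fan-out. Sketch (Python; the bucket size is an arbitrary example):

```python
ROWS_PER_BUCKET = 100_000  # cap so no single partition holds millions of rows

def partition_key(client_id: str, seq: int):
    # Composite partition key (client_id, bucket): the row's sequence
    # number within the client decides which bucket/partition it lands in.
    return (client_id, seq // ROWS_PER_BUCKET)

def buckets_for(client_id: str, total_rows: int):
    # To read everything back, the client fans out one query per bucket
    # and concatenates the results.
    last = (total_rows - 1) // ROWS_PER_BUCKET
    return [(client_id, b) for b in range(last + 1)]
```

In CQL terms this corresponds to something like PRIMARY KEY ((clientid, bucket), id), with the bucket computed by the application on both the write and the read path.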
Re: Suggestions for upgrading cassandra
I've written a bit about upgrading from 1.1 to 1.2, non-vnodes: http://monkeys.chaordic.com.br/operation/zero-downtime-cassandra-upgrade/ Some tips may be valid for a more recent upgrade, but I'm sure the community has more specific tips regarding the upgrade from 1.2 to 2.0. On Tue, May 27, 2014 at 2:57 PM, Eric Plowe eric.pl...@gmail.com wrote: i have a cluster that is running 1.2.6. I'd like to upgrade that cluster to 2.0.7 Any suggestions/tips that would make the upgrade process smooth? -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200
Re: cassandra boot is stuck in hint compaction.
What is the Cassandra version? Are the same sstables being compacted over and over? Please post a sample of the compaction log and the output of DESCRIBE TABLE system.hints; on cqlsh. Cheers, On Sun, May 25, 2014 at 6:12 AM, Igor Shprukh i...@newage.co.il wrote: -- hi guys, we have a 6 node cluster, consisting of 5 linux machines and a windows one. after a hard shutdown of the windows machine, the node is stuck on hints compaction for more than half an hour and cassandra won't start. must say that it is a strong machine with 16gb of ram and 250 gb of space dedicated to the node. all other nodes are up. what could be the problem causing this? thank you in advance. -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200
Disable FS journaling
Hello, Has anyone disabled file system journaling on Cassandra nodes? Does it make any difference on write performance? Cheers, -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200
Re: Disable FS journaling
Thanks for the links! Forgot to mention: we're using XFS here, as suggested by the Cassandra wiki. But I just double-checked and it's apparently not possible to disable journaling on XFS. One of our sysadmins suggested disabling journaling, since it's mostly for recovery purposes, and Cassandra already does that pretty well with commitlog, replication and anti-entropy. It would still be nice to know if there could be any performance benefits from it. But I personally don't think it would help much, due to the append-only nature of Cassandra writes. On Tue, May 20, 2014 at 12:43 PM, Michael Shuler mich...@pbandjelly.org wrote: On 05/20/2014 09:54 AM, Samir Faci wrote: I'm not sure you'd be gaining much by doing this. This is probably dependent on the file system you're referring to when you say journaling. There are a few of them around; you could opt to use ext2 instead of ext3/4 in the unix world. A quick google search linked me to this: ext2/3 is not a good choice for file size limitation and performance reasons. I started to search for a couple links, and a quick check of the links I posted a couple years ago seems to still be interesting ;) http://mail-archives.apache.org/mod_mbox/cassandra-user/201204.mbox/%3c4f7c5c16.1020...@pbandjelly.org%3E (repost from above) Hopefully this is some good reading on the topic: https://www.google.com/search?q=xfs+site%3Ahttp%3A%2F%2Fmail-archives.apache.org%2Fmod_mbox%2Fcassandra-user one of the more interesting considerations: http://mail-archives.apache.org/mod_mbox/cassandra-user/201004.mbox/%3ch2y96b607d1004131614k5382b3a5ie899989d62921...@mail.gmail.com%3E http://wiki.apache.org/cassandra/CassandraHardware http://wiki.apache.org/cassandra/LargeDataSetConsiderations http://www.datastax.com/dev/blog/questions-from-the-tokyo-cassandra-conference -- Kind regards, Michael
Re: Disable FS journaling
On Tue, May 20, 2014 at 1:24 PM, Terje Marthinussen tmarthinus...@gmail.com wrote: Journal enabled is faster on almost all operations. Good to know, thanks! Recovery here is more about saving you from waiting 1/2 hour from a traditional full file system check. On an EC2 environment you normally lose the machine anyway on failures, so that's not of much use in that case. Feel free to wait if you want though! :) Regards, Terje On 21 May 2014, at 01:11, Paulo Ricardo Motta Gomes paulo.mo...@chaordicsystems.com wrote: Thanks for the links! Forgot to mention, using XFS here, as suggested by the Cassandra wiki. But just double checked and it's apparently not possible to disable journaling on XFS. One of ours sysadmin just suggested disabling journaling, since it's mostly for recovery purposes, and Cassandra already does that pretty well with commitlog, replication and anti-entropy. It would anyway be nice to know if there could be any performance benefits from it. But I personally don't think it would help much, due to the append-only nature of cassandra writes. On Tue, May 20, 2014 at 12:43 PM, Michael Shuler mich...@pbandjelly.orgwrote: On 05/20/2014 09:54 AM, Samir Faci wrote: I'm not sure you'd be gaining much by doing this. This is probably dependent on the file system you're referring to when you say journaling. There's a few of them around, You could opt to use ext2 instead of ext3/4 in the unix world. A quick google search linked me to this: ext2/3 is not a good choice for file size limitation and performance reasons. 
I started to search for a couple links, and a quick check of the links I posted a couple years ago seem to still be interesting ;) http://mail-archives.apache.org/mod_mbox/cassandra-user/ 201204.mbox/%3c4f7c5c16.1020...@pbandjelly.org%3E (repost from above) Hopefully this is some good reading on the topic: https://www.google.com/search?q=xfs+site%3Ahttp%3A%2F% 2Fmail-archives.apache.org%2Fmod_mbox%2Fcassandra-user one of the more interesting considerations: http://mail-archives.apache.org/mod_mbox/cassandra-user/201004.mbox/% 3ch2y96b607d1004131614k5382b3a5ie899989d62921...@mail.gmail.com%3E http://wiki.apache.org/cassandra/CassandraHardware http://wiki.apache.org/cassandra/LargeDataSetConsiderations http://www.datastax.com/dev/blog/questions-from-the-tokyo- cassandra-conference -- Kind regards, Michael -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200 -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200
Re: Disable reads during node rebuild
That'll be really useful, thanks!! On Wed, May 14, 2014 at 7:47 PM, Aaron Morton aa...@thelastpickle.com wrote: As of 2.0.7, driftx has added this long-requested feature. Thanks A - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 13/05/2014, at 9:36 am, Robert Coli rc...@eventbrite.com wrote: On Mon, May 12, 2014 at 10:18 AM, Paulo Ricardo Motta Gomes paulo.mo...@chaordicsystems.com wrote: Is there a way to disable reads from a node while performing rebuild from another datacenter? I tried starting the node in write survey mode, but the nodetool rebuild command does not work in this mode. As of 2.0.7, driftx has added this long-requested feature. https://issues.apache.org/jira/browse/CASSANDRA-6961 Note that it is impossible to completely close the race window here as long as writes are incoming; this functionality just dramatically shortens it. =Rob
Re: Mutation messages dropped
It means asynchronous write mutations were dropped, but if the writes are completing without TimedOutException, then at least ConsistencyLevel replicas were correctly written. The remaining replicas will eventually be fixed by hinted handoff, anti-entropy (repair) or read repair. More info: http://wiki.apache.org/cassandra/FAQ#dropped_messages Please note that 1 mutation != 1 record. For instance, if 1 row has N columns, then a record write for that row will have N mutations AFAIK (please correct me if I'm wrong). On Fri, May 9, 2014 at 8:52 AM, Raveendran, Varsha IN BLR STS varsha.raveend...@siemens.com wrote: Hello, I am writing around 10 million records continuously into a single node Cassandra (2.0.5). In the Cassandra log file I see an entry “*272 MUTATION messages dropped in last 5000ms*”. Does this mean that 272 records were not written successfully? Thanks, Varsha
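The relationship between dropped mutations and client-visible failures can be condensed into a toy model (Python, purely conceptual, not Cassandra code): a write at consistency level CL with replication factor RF succeeds once CL replicas acknowledge, and any of the remaining RF - CL replica mutations may be dropped without the client noticing.

```python
RF = 3  # replication factor (hypothetical cluster)
CL = 1  # ConsistencyLevel.ONE: acks required before the client sees success

def write_outcome(acked_replicas: int):
    """Toy model: the client only sees a TimedOutException when fewer than
    CL replicas acknowledge. Replicas whose mutations were dropped are
    counted as needing repair, and are later fixed by hinted handoff,
    anti-entropy repair, or read repair."""
    needs_repair = RF - acked_replicas
    if acked_replicas >= CL:
        return ("success", needs_repair)
    return ("TimedOutException", needs_repair)
```

So with RF=3 and CL=ONE, up to two replica mutations per write can be dropped while the client still sees success, which is exactly the situation the dropped-messages log line describes.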
Re: Cassandra token range support for Hadoop (ColumnFamilyInputFormat)
Hello Anton, What version of Cassandra are you using? If it's between 1.2.6 and 2.0.6, setInputRange(startToken, endToken) is not working. This was fixed in 2.0.7: https://issues.apache.org/jira/browse/CASSANDRA-6436 If you can't upgrade you can copy AbstractCFIF and CFIF to your project and apply the patch there. Cheers, Paulo On Wed, May 14, 2014 at 10:29 PM, Anton Brazhnyk anton.brazh...@genesys.com wrote: Greetings, I'm reading data from C* with Spark (via ColumnFamilyInputFormat) and I'd like to read just part of it - something like Spark's sample() function. Cassandra's API seems to allow this with its ConfigHelper.setInputRange(jobConfiguration, startToken, endToken) method, but it doesn't work. The limit is just ignored and the entire column family is scanned. It seems this kind of feature is just not supported, and the sources of AbstractColumnFamilyInputFormat.getSplits confirm that (IMO). Questions: 1. Am I right that there is no way to get some data limited by token range with ColumnFamilyInputFormat? 2. Is there another way to limit the amount of data read from Cassandra with Spark and ColumnFamilyInputFormat, so that this amount is predictable (like 5% of the entire dataset)? WBR, Anton
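For question 2, one approach (on 2.0.7+, where CASSANDRA-6436 is fixed) is to compute a sub-range of the Murmur3 token space and feed it to setInputRange. Since Murmur3 spreads rows roughly uniformly over tokens, a 5% token span approximates 5% of the data. A sketch in Python (the helper and its parameters are illustrative, not part of any driver):

```python
MIN_TOKEN = -2**63       # Murmur3Partitioner token space: [-2**63, 2**63 - 1]
MAX_TOKEN = 2**63 - 1

def token_subrange(fraction, offset=0.0):
    """Start/end tokens covering `fraction` of the ring starting at
    `offset` (both in [0, 1]); the returned strings are what you would pass
    to ConfigHelper.setInputRange(conf, start, end). Assumes the
    Murmur3Partitioner distributes rows uniformly over the token space."""
    span = MAX_TOKEN - MIN_TOKEN
    start = MIN_TOKEN + int(span * offset)
    end = MIN_TOKEN + int(span * (offset + fraction))
    return str(start), str(end)

start, end = token_subrange(0.05)  # roughly 5% of the dataset
```

Varying `offset` between runs lets a job sample a different 5% slice each time without scanning the whole column family.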
Re: Efficient bulk range deletions without compactions by dropping SSTables.
Hello Kevin, In 2.0.X an SSTable is automatically dropped if it contains only tombstones: https://issues.apache.org/jira/browse/CASSANDRA-5228. However, this will most likely happen if you use LCS. STCS will create sstables of larger size that will probably have mixed expired and unexpired data. This could be solved by the single-sstable tombstone compaction, which unfortunately is not working well (https://issues.apache.org/jira/browse/CASSANDRA-6563). I don't know of a way to manually drop specific sstables safely; you could try implementing a script that compares sstable timestamps to check if an sstable is safely droppable, as done in CASSANDRA-5228. There are proposals to create a compaction strategy optimized for log-only data that only deletes old sstables, but it's not ready yet AFAIK. Cheers, Paulo On Mon, May 12, 2014 at 8:53 PM, Kevin Burton bur...@spinn3r.com wrote: We have a log only data structure… everything is appended and nothing is ever updated. We should be totally fine with having lots of SSTables sitting on disk because even if we did a major compaction the data would still look the same. By 'lots' I mean maybe 1000 max. Maybe 1GB each. However, I would like a way to delete older data. One way to solve this could be to just drop an entire SSTable if all the records inside have tombstones. Is this possible, to just drop a specific SSTable? -- Founder/CEO Spinn3r.com Location: *San Francisco, CA* Skype: *burtonator* blog: http://burtonator.wordpress.com … or check out my Google+ profile https://plus.google.com/102718274791889610666/posts http://spinn3r.com War is peace. Freedom is slavery. Ignorance is strength. Corporations are people.
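The "safely droppable" check from CASSANDRA-5228 that such a script would approximate looks roughly like this (Python sketch; the metadata field names are hypothetical, and the real logic also considers memtables and per-column expiry):

```python
import time

def fully_expired(candidate, overlapping, gc_grace_seconds, now=None):
    """Rough sketch of the CASSANDRA-5228 rule: an SSTable whose contents
    are entirely expired/tombstoned can be dropped wholesale only if its
    tombstones can no longer shadow live data in any overlapping SSTable.
    `candidate` and each entry of `overlapping` are dicts of SSTable
    metadata: max_deletion_time and min/max write timestamps (seconds)."""
    if now is None:
        now = time.time()
    gc_before = now - gc_grace_seconds
    if candidate["max_deletion_time"] >= gc_before:
        return False  # some tombstones have not yet passed gc_grace
    # A tombstone with timestamp T shadows data written before T, so the
    # candidate is only droppable if every overlapping SSTable is newer
    # than everything (data and tombstones) the candidate contains.
    return all(o["min_timestamp"] > candidate["max_timestamp"]
               for o in overlapping)
```

For a strictly append-only, time-ordered workload like the one described above, older sstables naturally satisfy the timestamp condition, which is why the "compare sstable timestamps" script is feasible there.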
Re: Automatic tombstone removal issue (STCS)
I just updated CASSANDRA-6563 with more details and proposed a patch to solve the issue, in case anyone else is interested. https://issues.apache.org/jira/browse/CASSANDRA-6563 On Tue, May 6, 2014 at 10:00 PM, Paulo Ricardo Motta Gomes paulo.mo...@chaordicsystems.com wrote: Robert: thanks for the support, you are right, this belonged more to the dev list but I didn't think of it. Yuki: thanks a lot for the clarification, this is what I suspected. I understand it's costly to check row by row overlap in order to decide if an SSTable is a candidate for compaction, but doesn't the compaction process already perform this check when removing tombstones? So, couldn't this check be dropped during decision time and let the compaction run anyway? This optimization is especially interesting with large STCS sstables, where the token range will very likely overlap with all other sstables, so it's a pity it's almost never being triggered in these cases. On Tue, May 6, 2014 at 9:32 PM, Yuki Morishita mor.y...@gmail.com wrote: Hi Paulo, The reason we check overlap is to avoid resurrecting deleted data by dropping the tombstone marker from only a single SSTable. And we don't want to check row by row to determine if an SSTable is droppable since it takes time, so we use token ranges to determine if it MAY have droppable columns. On Tue, May 6, 2014 at 7:14 PM, Paulo Ricardo Motta Gomes paulo.mo...@chaordicsystems.com wrote: Hello, Sorry for being persistent, but I'd love to clear my understanding on this. Has anyone seen single sstable compaction being triggered for STCS sstables with high tombstone ratio? Because if the above understanding is correct, the current implementation almost never triggers this kind of compaction, since the token ranges of a node's sstables almost always overlap. Could this be a bug or is it expected behavior?
Thank you, On Mon, May 5, 2014 at 8:59 AM, Paulo Ricardo Motta Gomes paulo.mo...@chaordicsystems.com wrote: Hello, After noticing that automatic tombstone removal (CASSANDRA-3442) was not working in an append-only STCS CF with 40% of droppable tombstone ratio I investigated why the compaction was not being triggered in the largest SSTable with 16GB and about 70% droppable tombstone ratio. When the code goes to check if the SSTable is candidate to be compacted (AbstractCompactionStrategy.worthDroppingTombstones), it verifies if all the others SSTables overlap with the current SSTable by checking if the start and end tokens overlap. The problem is that all SSTables contain pretty much the whole node token range, so all of them overlap nearly all the time, so the automatic tombstone removal never happens. Is there any case in STCS where all sstables token ranges DO NOT overlap? I understand during the tombstone removal process it's necessary to verify if the compacted row exists in any other SSTable, but I don't understand why it's necessary to verify if the token ranges overlap to decide if a tombstone compaction must be executed on a single SSTable with high droppable tombstone ratio. Any clarification would be kindly appreciated. PS: Cassandra version: 1.2.16 -- Paulo Motta Chaordic | Platform www.chaordic.com.br +55 48 3232.3200 -- Paulo Motta Chaordic | Platform www.chaordic.com.br +55 48 3232.3200 -- Yuki Morishita t:yukim (http://twitter.com/yukim) -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200 -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200
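For reference, the decision being debated in this thread can be condensed into a small sketch (Python; heavily simplified from the actual AbstractCompactionStrategy.worthDroppingTombstones Java, with illustrative parameter names):

```python
def ranges_overlap(a, b):
    # Two (first_token, last_token) intervals overlap unless one ends
    # before the other begins (token-ring wraparound ignored for brevity).
    return not (a[1] < b[0] or b[1] < a[0])

def worth_dropping_tombstones(candidate, others, ratio, threshold=0.2):
    """Sketch of the pre-CASSANDRA-6563 decision: schedule a single-SSTable
    tombstone compaction only when the droppable-tombstone ratio is high
    AND no other SSTable's token range overlaps the candidate's."""
    if ratio <= threshold:
        return False
    return not any(ranges_overlap(candidate, o) for o in others)

# On a typical node each STCS SSTable spans nearly the whole token range,
# so the overlap test fires and the compaction is never scheduled:
stcs_case = worth_dropping_tombstones((-100, 100), [(-90, 95)], ratio=0.7)
```

This is exactly the behavior Paulo reports: with full-range STCS sstables the overlap test almost always returns True, so `stcs_case` is False despite a 70% droppable ratio.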
Re: Bootstrap failure on C* 1.2.13
Hello, After about 3 months I was able to solve this issue, which happened again after another node died. The problem is the datastax 1.2 node replacement docs [1] said that "This procedure applies to clusters using vnodes. If not using vnodes, use the instructions in the Cassandra 1.1 documentation." However, the 1.1 docs did not mention the property -Dcassandra.replace_address=address_of_dead_node, which was only introduced in 1.2. So, what happens without this flag is that the replacement node tries to stream data from the dead node, failing the bootstrap process. Adding this flag solves the problem. Big thanks to driftx from #cassandra who helped troubleshoot the issue. The docs were already updated to mention the property even for non-vnodes clusters. [1] http://www.datastax.com/documentation/cassandra/1.2/cassandra/operations/ops_replace_node_t.html Cheers, On Sat, Feb 15, 2014 at 3:31 PM, Alain RODRIGUEZ arodr...@gmail.com wrote: Hi Rob, I don't understand how setting those initial_token values might solve this issue. Even more since we cannot set them before bootstrapping... Plus, once those tokens are set, we would have to modify them after any new bootstrap / decommission, which would also imply a rolling restart for the new configuration (cassandra.yaml) to be taken into account. This is quite a heavy process to perform a NOOP... What did I miss? Thanks for getting involved and trying to help anyway :). Alain 2014-02-15 1:13 GMT+01:00 Robert Coli rc...@eventbrite.com: On Fri, Feb 14, 2014 at 10:08 AM, Paulo Ricardo Motta Gomes paulo.mo...@chaordicsystems.com wrote: But in our case, our cluster was not using VNodes, so this workaround will probably not work with VNodes, since you cannot specify the 256 tokens from the old node. Sure you can, in a comma delimited list. I plan to write a short blog post about this, but... I recommend that anyone using Cassandra, vnodes or not, always explicitly populate their initial_token line in cassandra.yaml.
There are a number of cases where you will lose if you do not do so, and AFAICT no cases where you lose by doing so. If one is using vnodes and wants to do this, the process goes like : 1) set num_tokens to the desired number of vnodes 2) start node/bootstrap 3) use a one liner like jeffj's : nodetool info -T | grep ^Token | awk '{ print $3 }' | tr \\n , | sed -e 's/,$/\n/' to get a comma delimited list of the vnode tokens 4) insert this comma delimited list in initial_token, and comment out num_tokens (though it is a NOOP) =Rob -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200
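jeffj's shell one-liner can also be expressed in Python, assuming `nodetool info -T` prints one "Token : <value>" line per vnode (the sample output below is made up):

```python
# Hypothetical sample of `nodetool info -T` output; in practice you would
# capture this from subprocess output instead of a literal.
sample = """ID     : 3d266fb1-0000-0000-0000-000000000000
Token  : -9196740398802596380
Token  : -3074457345618258603
Token  : 3074457345618258602
"""

def initial_token_line(nodetool_info: str) -> str:
    """Collect the node's vnode tokens into the comma-delimited list
    expected by the initial_token setting in cassandra.yaml."""
    tokens = [line.split(":")[1].strip()
              for line in nodetool_info.splitlines()
              if line.startswith("Token")]
    return ",".join(tokens)
```

The resulting string is pasted into initial_token, after which num_tokens can be commented out as Rob describes.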
Cassandra hadoop job fails if any node is DOWN
Hello, One of the nodes of our Analytics DC is dead, but ColumnFamilyInputFormat (CFIF) still assigns Hadoop input splits to it. This leads to many failed tasks and consequently a failed job. * Tasks fail with: java.lang.RuntimeException: org.apache.thrift.transport.TTransportException: Failed to open a transport to XX.75:9160. (obviously, the node is dead) * Job fails with: Job Failed: # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201404180250_4207_m_79 We use RF=2 and CL=LOCAL_ONE for hadoop jobs, C* 1.2.16. Is this expected behavior? I checked the CFIF code, but it always assigns input splits to all the ring nodes, whether the node is dead or alive. What we do to fix it is patch CFIF to blacklist the dead node, but this is not a very automatic procedure. Am I missing something here? Cheers,
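The blacklisting patch described above amounts to routing each split to a live replica instead of the dead node. Roughly (Python sketch with made-up endpoints and a generic liveness predicate, not the actual CFIF Java):

```python
def assign_splits(splits, is_alive):
    """Route each input split to its first live replica. With RF=2, every
    token range should still have one live replica after a single node
    failure. `splits` maps a token range to its replica endpoints;
    `is_alive` is any liveness check (gossip state, a TCP probe, etc.)."""
    assignments = {}
    for token_range, replicas in splits.items():
        live = [r for r in replicas if is_alive(r)]
        if not live:
            raise RuntimeError("no live replica for range %s" % (token_range,))
        assignments[token_range] = live[0]
    return assignments

dead = {"10.0.0.75"}
splits = {("a", "b"): ["10.0.0.75", "10.0.0.76"],
          ("b", "c"): ["10.0.0.76", "10.0.0.77"]}
result = assign_splits(splits, lambda host: host not in dead)
```

Automating the `is_alive` check (rather than hardcoding a blacklist) is what would make the workaround self-maintaining.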
Re: Disable reads during node rebuild
That's a nice workaround, will be really helpful in emergency situations like this. Thanks, On Mon, May 12, 2014 at 6:58 PM, Aaron Morton aa...@thelastpickle.com wrote: I'm not able to replace a dead node using the ordinary procedure (bootstrap+join), and would like to rebuild the replacement node from another DC. Normally when you want to add a new DC to the cluster the command to use is nodetool rebuild $DC_NAME (with auto_bootstrap: false). That will get the node to stream data from the $DC_NAME. The problem is that if I start a node with auto_bootstrap=false to perform the rebuild, it automatically starts serving empty reads (CL=LOCAL_ONE). When adding a new DC the nodes won't be processing reads; that is not the case for you. You should disable the client APIs to prevent the clients from calling the new nodes: use -Dcassandra.start_rpc=false and -Dcassandra.start_native_transport=false in cassandra-env.sh or the appropriate settings in cassandra.yaml. Disabling reads from other nodes will be harder. IIRC during bootstrap a different timeout (based on ring_delay) is used to detect if the bootstrapping node is down. However if the node is running and you use nodetool rebuild I'm pretty sure the normal gossip failure detectors will kick in, which means you cannot disable gossip to prevent reads. Also we would want the node to be up for writes. But what you can do is artificially set the severity of the node high so the dynamic snitch will route around it. See https://github.com/apache/cassandra/blob/cassandra-2.0/src/java/org/apache/cassandra/locator/DynamicEndpointSnitchMBean.java#L37 * Set the value to something high on the node you will be rebuilding; the number of cores on the system should do. (jmxterm is handy for this: http://wiki.cyclopsgroup.org/jmxterm) * Check nodetool gossipinfo on the other nodes to see the SEVERITY app state has propagated. * Watch completed ReadStage tasks on the node you want to rebuild.
If you have read repair enabled it will still get some traffic. * Do rebuild * Reset severity to 0 Hope that helps. Aaron - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 13/05/2014, at 5:18 am, Paulo Ricardo Motta Gomes paulo.mo...@chaordicsystems.com wrote: Hello, I'm not able to replace a dead node using the ordinary procedure (boostrap+join), and would like to rebuild the replacement node from another DC. The problem is that if I start a node with auto_bootstrap=false to perform the rebuild, it automatically starts serving empty reads (CL=LOCAL_ONE). Is there a way to disable reads from a node while performing rebuild from another datacenter? I tried starting the node in write survery mode, but the nodetool rebuild command does not work in this mode. Thanks, -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200 -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200
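The effect Aaron describes can be illustrated with a toy scoring model (Python; the formula below is purely illustrative, not the actual DynamicEndpointSnitch arithmetic):

```python
def snitch_score(latency_ms: float, severity: float) -> float:
    # Conceptual model only: the dynamic snitch folds a node's reported
    # severity into its score, so inflating severity makes the node look
    # slow and coordinators route reads to better-scoring replicas.
    return latency_ms * (1.0 + severity)

def pick_replica(nodes: dict) -> str:
    """nodes: host -> (latency_ms, severity); the lowest score wins the read."""
    return min(nodes, key=lambda h: snitch_score(*nodes[h]))

nodes = {
    "rebuilding": (1.0, 8.0),   # severity set high, e.g. to the core count
    "healthy":    (2.0, 0.0),
}
```

Even though the rebuilding node has the lower raw latency here, its inflated severity pushes its score above the healthy replica's, which is exactly why setting SEVERITY high steers reads away during the rebuild.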
Re: Automatic tombstone removal issue (STCS)
Robert: thanks for the support, you are right, this belonged more to the dev list but I didn't think of it. Yuki: thanks a lot for the clarification, this is what I suspected. I understand it's costly to check row by row overlap in order to decide if an SSTable is a candidate for compaction, but doesn't the compaction process already perform this check when removing tombstones? So, couldn't this check be dropped during decision time and let the compaction run anyway? This optimization is especially interesting with large STCS sstables, where the token range will very likely overlap with all other sstables, so it's a pity it's almost never being triggered in these cases. On Tue, May 6, 2014 at 9:32 PM, Yuki Morishita mor.y...@gmail.com wrote: Hi Paulo, The reason we check overlap is to avoid resurrecting deleted data by dropping the tombstone marker from only a single SSTable. And we don't want to check row by row to determine if an SSTable is droppable since it takes time, so we use token ranges to determine if it MAY have droppable columns. On Tue, May 6, 2014 at 7:14 PM, Paulo Ricardo Motta Gomes paulo.mo...@chaordicsystems.com wrote: Hello, Sorry for being persistent, but I'd love to clear my understanding on this. Has anyone seen single sstable compaction being triggered for STCS sstables with high tombstone ratio? Because if the above understanding is correct, the current implementation almost never triggers this kind of compaction, since the token ranges of a node's sstables almost always overlap. Could this be a bug or is it expected behavior? Thank you, On Mon, May 5, 2014 at 8:59 AM, Paulo Ricardo Motta Gomes paulo.mo...@chaordicsystems.com wrote: Hello, After noticing that automatic tombstone removal (CASSANDRA-3442) was not working in an append-only STCS CF with 40% of droppable tombstone ratio I investigated why the compaction was not being triggered in the largest SSTable with 16GB and about 70% droppable tombstone ratio.
When the code goes to check whether the SSTable is a candidate for compaction (AbstractCompactionStrategy.worthDroppingTombstones), it verifies whether all the other SSTables overlap with the current SSTable by checking if their start and end tokens overlap. The problem is that all SSTables contain pretty much the node's whole token range, so all of them overlap nearly all the time, and the automatic tombstone removal never happens. Is there any case in STCS where all sstables' token ranges DO NOT overlap? I understand that during the tombstone removal process it's necessary to verify whether the compacted row exists in any other SSTable, but I don't understand why it's necessary to verify that the token ranges overlap in order to decide whether a tombstone compaction should be executed on a single SSTable with a high droppable tombstone ratio. Any clarification would be kindly appreciated. PS: Cassandra version: 1.2.16 -- Paulo Motta Chaordic | Platform www.chaordic.com.br +55 48 3232.3200 -- Paulo Motta Chaordic | Platform www.chaordic.com.br +55 48 3232.3200 -- Yuki Morishita t:yukim (http://twitter.com/yukim) -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200
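To make the discussion concrete, here is a toy Python sketch (not Cassandra's actual Java code) of the check described above; the names `SSTable`, `ranges_overlap`, and `worth_dropping_tombstones` are invented for illustration, and the real `AbstractCompactionStrategy.worthDroppingTombstones` also estimates the droppable ratio from sstable metadata:

```python
# Illustrative sketch of the single-sstable tombstone compaction check.
from collections import namedtuple

SSTable = namedtuple("SSTable", "token_range droppable_ratio")

def ranges_overlap(a, b):
    """True if token ranges a=(first, last) and b=(first, last) intersect."""
    return a[0] <= b[1] and b[0] <= a[1]

def worth_dropping_tombstones(sstable, others, tombstone_threshold=0.2):
    if sstable.droppable_ratio < tombstone_threshold:
        return False
    # With STCS, every sstable tends to span the node's whole token
    # range, so this condition almost always fails -- which is why the
    # compaction is almost never triggered, as reported in this thread.
    return not any(ranges_overlap(sstable.token_range, o.token_range)
                   for o in others)
```

Under this model a 70%-droppable sstable whose range overlaps every other sstable is never selected, matching the behavior observed on the 16GB sstable.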
Re: Automatic tombstone removal issue (STCS)
Hello, Sorry for being persistent, but I'd love to clear my understanding on this. Has anyone seen single sstable compaction being triggered for STCS sstables with a high tombstone ratio? Because if the above understanding is correct, the current implementation almost never triggers this kind of compaction, since the token ranges of a node's sstables almost always overlap. Could this be a bug or is it expected behavior? Thank you, On Mon, May 5, 2014 at 8:59 AM, Paulo Ricardo Motta Gomes paulo.mo...@chaordicsystems.com wrote: Hello, After noticing that automatic tombstone removal (CASSANDRA-3442) was not working in an append-only STCS CF with a 40% droppable tombstone ratio, I investigated why the compaction was not being triggered in the largest SSTable, with 16GB and about 70% droppable tombstone ratio. When the code goes to check whether the SSTable is a candidate for compaction (AbstractCompactionStrategy.worthDroppingTombstones), it verifies whether all the other SSTables overlap with the current SSTable by checking if their start and end tokens overlap. The problem is that all SSTables contain pretty much the node's whole token range, so all of them overlap nearly all the time, and the automatic tombstone removal never happens. Is there any case in STCS where all sstables' token ranges DO NOT overlap? I understand that during the tombstone removal process it's necessary to verify whether the compacted row exists in any other SSTable, but I don't understand why it's necessary to verify that the token ranges overlap in order to decide whether a tombstone compaction should be executed on a single SSTable with a high droppable tombstone ratio. Any clarification would be kindly appreciated. PS: Cassandra version: 1.2.16 -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200 -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200
Automatic tombstone removal issue (STCS)
Hello, After noticing that automatic tombstone removal (CASSANDRA-3442) was not working in an append-only STCS CF with a 40% droppable tombstone ratio, I investigated why the compaction was not being triggered in the largest SSTable, with 16GB and about 70% droppable tombstone ratio. When the code goes to check whether the SSTable is a candidate for compaction (AbstractCompactionStrategy.worthDroppingTombstones), it verifies whether all the other SSTables overlap with the current SSTable by checking if their start and end tokens overlap. The problem is that all SSTables contain pretty much the node's whole token range, so all of them overlap nearly all the time, and the automatic tombstone removal never happens. Is there any case in STCS where all sstables' token ranges DO NOT overlap? I understand that during the tombstone removal process it's necessary to verify whether the compacted row exists in any other SSTable, but I don't understand why it's necessary to verify that the token ranges overlap in order to decide whether a tombstone compaction should be executed on a single SSTable with a high droppable tombstone ratio. Any clarification would be kindly appreciated. PS: Cassandra version: 1.2.16 -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200
Is a hint stored when a mutation is dropped?
The official docs say that dropped mutations are only fixed by Read Repair and Anti-Entropy (http://wiki.apache.org/cassandra/FAQ#dropped_messages). However, in this thread ( http://grokbase.com/t/cassandra/user/1235ctdbca/mutation-dropped-messages) Aaron Morton says that Hinted Handoff also repairs dropped mutations, but I couldn't find more info on that. Is this still the behavior on 1.2+? To illustrate: if I write with RF=2, CL=ONE, one mutation is accepted, the write returns, and the other mutation is dropped. Does the coordinator store a hint for the dropped replica? Even without running repair, will I be able to read that write from the dropped replica in 30 minutes? Cheers, -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200
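The scenario in the question can be modeled with a toy Python sketch of the coordinator write path (illustrative only; whether a hint is actually stored for a mutation *dropped by a replica*, as opposed to one that never got an ack, is exactly the open question in this thread, and the function name and shape here are invented):

```python
# Toy model of a coordinator write at RF=2, CL=ONE.
def coordinate_write(replica_acks, consistency_level, hinted_handoff=True):
    """replica_acks maps replica name -> True if it acked within the
    write timeout. Returns (client_result, replicas_hinted)."""
    acked = {r for r, ok in replica_acks.items() if ok}
    # In this model a hint is queued for every replica that never acked.
    hints = set(replica_acks) - acked if hinted_handoff else set()
    result = "success" if len(acked) >= consistency_level else "timeout"
    return result, hints
```

With CL=ONE and one silent replica, the client still sees success, and in this model a hint would be queued for the replica that dropped the write.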
Re: clearing tombstones?
I have a similar problem here: I deleted about 30% of a very large CF using LCS (about 80GB per node), but my data still hasn't shrunk, even though I set gc_grace_seconds to 1 day. Would nodetool scrub help? Does nodetool scrub force a minor compaction? Cheers, Paulo On Fri, Apr 11, 2014 at 12:12 PM, Mark Reddy mark.re...@boxever.com wrote: Yes, running nodetool compact (major compaction) creates one large SSTable. This will mess up the heuristics of the SizeTiered strategy (is this the compaction strategy you are using?) leading to multiple 'small' SSTables alongside the single large SSTable, which results in increased read latency. You will incur the operational overhead of having to manage compactions if you wish to compact these smaller SSTables. For all these reasons it is generally advised to stay away from running compactions manually. Assuming that this is a production environment and you want to keep everything running as smoothly as possible, I would reduce the gc_grace on the CF, allow automatic minor compactions to kick in, and then increase the gc_grace once again after the tombstones have been removed. On Fri, Apr 11, 2014 at 3:44 PM, William Oberman ober...@civicscience.com wrote: So, if I was impatient and just wanted to make this happen now, I could: 1.) Change GCGraceSeconds of the CF to 0 2.) run nodetool compact (*) 3.) Change GCGraceSeconds of the CF back to 10 days Since I have ~900M tombstones, even if I miss a few due to impatience, I don't care *that* much, as I could re-run my cleanup tool against the now much smaller CF. (*) A long long time ago I seem to recall reading advice about "don't ever run nodetool compact", but I can't remember why. Is there any bad long-term consequence? Short term there are several: -a heavy operation -temporary 2x disk space -one big SSTable afterwards But moving forward, everything is ok right? CommitLog → MemTable → SSTables, minor compactions that merge SSTables, etc... 
The only flaw I can think of is that it will take forever until the SSTable minor compactions build up enough to consider including the big SSTable in a compaction, making it likely I'll have to self-manage compactions. On Fri, Apr 11, 2014 at 10:31 AM, Mark Reddy mark.re...@boxever.com wrote: Correct, a tombstone will only be removed after the gc_grace period has elapsed. The default value is set to 10 days, which allows a great deal of time for consistency to be achieved prior to deletion. If you are operationally confident that you can achieve consistency via anti-entropy repairs within a shorter period, you can always reduce that 10-day interval. Mark On Fri, Apr 11, 2014 at 3:16 PM, William Oberman ober...@civicscience.com wrote: I'm seeing a lot of articles about a dependency between removing tombstones and GCGraceSeconds, which might be my problem (I just checked, and this CF has GCGraceSeconds of 10 days). On Fri, Apr 11, 2014 at 10:10 AM, tommaso barbugli tbarbu...@gmail.com wrote: compaction should take care of it; for me it never worked so I run nodetool compact on every node; that does it. 2014-04-11 16:05 GMT+02:00 William Oberman ober...@civicscience.com: I'm wondering what will clear tombstoned rows? nodetool cleanup, nodetool repair, or time (as in just wait)? I had a CF that was more or less storing session information. After some time, we decided that one piece of this information was pointless to track (and was 90%+ of the columns, and in 99% of those cases was ALL columns for a row). I wrote a process to remove all of those columns (which again in a vast majority of cases had the effect of removing the whole row). This CF had ~1 billion rows, so I expect to be left with ~100M rows. After I did this mass delete, everything was the same size on disk (which I expected, knowing how tombstoning works). It wasn't 100% clear to me what to poke to cause compactions to clear the tombstones. First I tried nodetool cleanup on a candidate node. 
But, afterwards the disk usage was the same. Then I tried nodetool repair on that same node. But again, disk usage is still the same. The CF has no snapshots. So, am I misunderstanding something? Is there another operation to try? Do I have to just wait? I've only done cleanup/repair on one node. Do I have to run one or the other over all nodes to clear tombstones? Cassandra 1.2.15 if it matters, Thanks! will -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200
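The gc_grace rule discussed throughout this thread can be sketched in a few lines of Python (a simplification; `tombstone_purgeable` is an invented name, and the real compaction code additionally requires that the row not exist in sstables outside the compaction set):

```python
# A tombstone only becomes purgeable by compaction once gc_grace_seconds
# have elapsed since the deletion was written.
import time

def tombstone_purgeable(deletion_time, gc_grace_seconds, now=None):
    now = time.time() if now is None else now
    return deletion_time + gc_grace_seconds <= now
```

This is why Will's impatient plan works: setting GCGraceSeconds to 0 makes every existing tombstone immediately purgeable by the next compaction, at the cost of losing the consistency safety window.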
Re: clearing tombstones?
This thread is really informative, thanks for the good feedback. My question is: is there a way to force tombstones to be cleared with LCS? Does scrub help in any case? Or would the only solution be to create a new CF and migrate all the data if you intend to do a large CF cleanup? Cheers, On Fri, Apr 11, 2014 at 2:02 PM, Mark Reddy mark.re...@boxever.com wrote: Thats great Will, if you could update the thread with the actions you decide to take and the results that would be great. Mark On Fri, Apr 11, 2014 at 5:53 PM, William Oberman ober...@civicscience.com wrote: I've learned a *lot* from this thread. My thanks to all of the contributors! Paulo: Good luck with LCS. I wish I could help there, but all of my CFs are SizeTiered (mostly as I'm on the same schema/same settings since 0.7...) will On Fri, Apr 11, 2014 at 12:14 PM, Mina Naguib mina.nag...@adgear.com wrote: Levelled Compaction is a wholly different beast when it comes to tombstones. The tombstones are inserted, like any other write really, at the lower levels in the leveldb hierarchy. They are only removed after they have had the chance to naturally migrate upwards in the leveldb hierarchy to the highest level in your data store. How long that takes depends on: 1. The amount of data in your store and the number of levels your LCS strategy has 2. The amount of new writes entering the bottom funnel of your leveldb, forcing upwards compaction and combining To give you an idea, I had a similar scenario and ran a (slow, throttled) delete job on my cluster around December-January. Here's a graph of the disk space usage on one node. Notice the still-declining usage long after the cleanup job has finished (sometime in January). 
I tend to think of tombstones in LCS as little bombs that get to explode much later in time: http://mina.naguib.ca/images/tombstones-cassandra-LCS.jpg On 2014-04-11, at 11:20 AM, Paulo Ricardo Motta Gomes paulo.mo...@chaordicsystems.com wrote: I have a similar problem here: I deleted about 30% of a very large CF using LCS (about 80GB per node), but my data still hasn't shrunk, even though I set gc_grace_seconds to 1 day. Would nodetool scrub help? Does nodetool scrub force a minor compaction? Cheers, Paulo On Fri, Apr 11, 2014 at 12:12 PM, Mark Reddy mark.re...@boxever.com wrote: Yes, running nodetool compact (major compaction) creates one large SSTable. This will mess up the heuristics of the SizeTiered strategy (is this the compaction strategy you are using?) leading to multiple 'small' SSTables alongside the single large SSTable, which results in increased read latency. You will incur the operational overhead of having to manage compactions if you wish to compact these smaller SSTables. For all these reasons it is generally advised to stay away from running compactions manually. Assuming that this is a production environment and you want to keep everything running as smoothly as possible, I would reduce the gc_grace on the CF, allow automatic minor compactions to kick in, and then increase the gc_grace once again after the tombstones have been removed. On Fri, Apr 11, 2014 at 3:44 PM, William Oberman ober...@civicscience.com wrote: So, if I was impatient and just wanted to make this happen now, I could: 1.) Change GCGraceSeconds of the CF to 0 2.) run nodetool compact (*) 3.) Change GCGraceSeconds of the CF back to 10 days Since I have ~900M tombstones, even if I miss a few due to impatience, I don't care *that* much, as I could re-run my cleanup tool against the now much smaller CF. (*) A long long time ago I seem to recall reading advice about "don't ever run nodetool compact", but I can't remember why. Is there any bad long-term consequence? 
Short term there are several: -a heavy operation -temporary 2x disk space -one big SSTable afterwards But moving forward, everything is ok right? CommitLog → MemTable → SSTables, minor compactions that merge SSTables, etc... The only flaw I can think of is that it will take forever until the SSTable minor compactions build up enough to consider including the big SSTable in a compaction, making it likely I'll have to self-manage compactions. On Fri, Apr 11, 2014 at 10:31 AM, Mark Reddy mark.re...@boxever.com wrote: Correct, a tombstone will only be removed after the gc_grace period has elapsed. The default value is set to 10 days, which allows a great deal of time for consistency to be achieved prior to deletion. If you are operationally confident that you can achieve consistency via anti-entropy repairs within a shorter period, you can always reduce that 10-day interval. Mark On Fri, Apr 11, 2014 at 3:16 PM, William Oberman ober...@civicscience.com wrote: I'm seeing a lot of articles about a dependency between removing tombstones and GCGraceSeconds, which might be my problem (I just checked, and this CF has GCGraceSeconds of 10 days). On Fri, Apr 11
Blog post with Cassandra upgrade tips
Hey, Some months ago (last year!!), during our previous major upgrade from 1.1 to 1.2, I started writing a blog post with some tips for a smooth rolling upgrade, but for some reason I forgot to finish it. I found it recently and decided to publish it anyway, as some of the info may be helpful for future major upgrades: http://monkeys.chaordic.com.br/operation/zero-downtime-cassandra-upgrade/ Cheers, -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200
Re: using hadoop + cassandra for CF mutations (delete)
You said you have tried the Pig URL split_size, but have you actually tried decreasing the value of the cassandra.input.split.size hadoop property? The default is 65536, so you may want to decrease that to see if the number of mappers increases. But at some point, lowering that value further will stop increasing the number of mappers; I don't know exactly why, probably because it hits the minimum number of rows per token. Another suggestion is to decrease the number of simultaneous mappers of your job, so it doesn't hit cassandra too hard, and you'll get fewer TimedOutExceptions, but your job will take longer to complete. On Fri, Apr 4, 2014 at 1:24 PM, William Oberman ober...@civicscience.com wrote: Hi, I have some history with cassandra + hadoop: 1.) Single DC + integrated hadoop = Was ok until I needed steady performance (the single DC was used in a production environment) 2.) Two DCs + integrated hadoop on 1 of 2 DCs = Was ok until my data grew, and in AWS compute is expensive compared to data storage... e.g. running a 24x7 DC was a lot more expensive than the following solution... 3.) Single DC + a constant ETL to S3 = Is still ok; I can spawn an arbitrarily large EMR cluster, and 24x7 data storage + transient EMR is cost effective. But, one of my CFs has had a change of usage pattern, making a large percentage of the data (but not all of it) fairly pointless to store. I thought I'd write a Pig UDF that could peek at a row of data and delete it if it fails my criteria. And it works in terms of logic, but not in terms of practical execution. The CF in question has O(billion) keys, and afterwards it will have ~10% of that at most. 
I basically keep losing the jobs due to too many task failures, all rooted in: Caused by: TimedOutException() at org.apache.cassandra.thrift.Cassandra$get_range_slices_result.read(Cassandra.java:13020) And yes, I've messed around with: -Number of failures for map/reduce/tracker (in the hadoop confs) -split_size (on the URL) -cassandra.range.batch.size But it hasn't helped. My failsafe is to roll my own distributed process, rather than falling into a pit of internal hadoop settings. But I feel like I'm close. The problem, in my opinion, watching how things are going, is the correlation of splits to tasks. I'm obviously using Pig, so this part of the process is fairly opaque to me at the moment. But, something somewhere is picking 20 tasks for my job, and this is fairly independent of the # of task slots (I've booted EMR clusters with different #'s and always get 20). Why does this matter? When a task fails, it retries from the start, which is a killer for me as I delete as I go, making that pointless work and massively increasing the odds of an overall job failure. If hadoop/pig chose a larger number of tasks, the retries would be much less of a burden. But, I don't see where/what lets me mess with that logic. Pig gives the ability to mess with reducers (PARALLEL), but I'm in the load path, which is all mappers. I've never jumped to the lower, raw hadoop level before. But, I'm worried that would be the "falling into a pit" issue... I'm using Cassandra 1.2.15. will -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200 +55 83 9690-1314
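The splits-to-tasks relation discussed above amounts to rough division: one map task per split of about split-size rows. A back-of-the-envelope Python sketch (assuming cassandra.input.split.size roughly means "rows per split"; `approx_mapper_count` and `min_rows_per_split` are invented names, the latter modeling the suspected floor Paulo mentions):

```python
# Rough estimate of Hadoop map tasks for a Cassandra input job:
# one task per split of ~split_size rows.
import math

def approx_mapper_count(total_rows, split_size=65536, min_rows_per_split=0):
    effective = max(split_size, min_rows_per_split)  # floor on split size
    return math.ceil(total_rows / effective)
```

Under this model, more and smaller splits mean each task retry redoes less work, which is why a larger task count would make Will's delete-as-you-go job far more resilient.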
sstable partitioner converter tool
Hello, We wanted to migrate our data from a RandomPartitioner cluster to a Murmur3Partitioner cluster via sstableloader, but it does not support directly loading sstables to a cluster with a different partitioner. We didn't find any tool that performs the conversion between sstables from different partitioners, so we put together some C* code and built our own. After the sstable conversion is done it's possible to bulk load the data into the new cluster with sstableloader. The tool supports sstables from C* 1.2 and 2.0 and is available on github, so feel free to use it and contribute: https://github.com/chaordic/sstableconverter Cheers, -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/*
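The reason sstables can't simply be streamed between partitioners is that rows are sorted on disk by token, and each partitioner derives a different token from the same key. A minimal illustration of the RandomPartitioner token (MD5-based, per Cassandra's documented scheme); Murmur3 is omitted here only because it isn't in the Python stdlib, and this is a sketch rather than the converter's actual code:

```python
# RandomPartitioner token: abs() of the key's MD5 digest interpreted as
# a signed 128-bit big-endian integer. A Murmur3Partitioner cluster
# hashes the same key to a completely different value, so sstables
# sorted by one partitioner's tokens are in the wrong order for the
# other -- hence the conversion/re-sort step before sstableloader.
import hashlib

def random_partitioner_token(key: bytes) -> int:
    return abs(int.from_bytes(hashlib.md5(key).digest(), "big", signed=True))
```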
Re: Dead node seen as UP by replacement node
Hmm, we considered that option, but if the old node is assassinated, its range will be assigned to a neighbor that doesn't have the data, which will cause empty reads. What we did to solve the problem was a safe removal via nodetool removenode deadNodeId, waiting some hours for neighbors to stream that node's data, and then bootstrapping the replacement node. However this procedure takes double the time, because data needs to be streamed twice, which is not really optimal. It would be really nice to know if this is expected behavior or if I should file a bug report. On Fri, Mar 14, 2014 at 11:59 AM, Rahul Menon ra...@apigee.com wrote: Since the older node is not available I would ask you to assassinate the old node and then get the new node to bootstrap. On Thu, Mar 13, 2014 at 10:56 PM, Paulo Ricardo Motta Gomes paulo.mo...@chaordicsystems.com wrote: Yes, exactly. On Thu, Mar 13, 2014 at 1:27 PM, Rahul Menon ra...@apigee.com wrote: And the token value as suggested is the token value of the dead node - 1 ? On Thu, Mar 13, 2014 at 9:29 PM, Paulo Ricardo Motta Gomes paulo.mo...@chaordicsystems.com wrote: Nope, they have different IPs. I'm using the procedure described here to replace a dead node: http://www.datastax.com/docs/1.1/cluster_management#replacing-a-dead-node Dead node token: X (IP: Y) Replacement node token: X-1 (IP: Z) So, as soon as the replacement node (Z) is started, it sees the dead node (Y) as UP, and tries to stream data from it during the join process. About 10 minutes later, the failure detector of Z detects Y as down, but since it was trying to fetch data from it, it fails the join/bootstrap process altogether. -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200 +55 83 9690-1314 -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200 +55 83 9690-1314
Cannot bootstrap replacement node
Hello, I'm having some trouble during bootstrap of a replacement node and I suspect it could be a bug in Cassandra. I'm using C* 1.2.13, RF=2, with Vnodes disabled. Below is a simplified version of my ring:
* n1 : token 100
* n2 : token 200 (DEAD)
* n3 : token 300
* n4 : token 0
n2 has died, so I tried bootstrapping a new replacement node:
* x : token 199 (n2.token - 1)
Even though n2 was terminated and was being seen as DOWN by n1, n3 and n4, the replacement node x was seeing n2 as UP, immediately trying to stream data from it during bootstrap. After about 10 minutes, when x detected n2 as DOWN, the bootstrap failed for obvious reasons. Since the previous procedure did not work, I tried the next procedure for replacing n2: - Remove n2 from the ring. This makes n3 stream n2's data to n1. - After the leave is complete, try to bootstrap x again. Ideally, x would stream data from n1 and n3, but it always streams data only from n3. The problem is that at some point n3 is seen as DOWN by x, failing the bootstrap process again. I suspect there is some kind of inconsistency in the gossip information of n2 that is preventing x from streaming data from both n1 and n3. I tried purging n2 from gossip, using Gossiper.unsafeAssassinateEndpoint() via JMX, but I'm getting the following error: *Problem invoking unsafeAssassinateEndpoint : java.lang.IndexOutOfBoundsException: Index: 0, Size: 0* My next and last approach is to manually copy the sstables via rsync from n3 and start x with auto_bootstrap=false, but I really didn't want to use this approach. Is it so hard to bootstrap a new node when not using Vnodes in C* 1.2, or could this be hiding some kind of bug? Any feedback would be greatly appreciated. Cheers, -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/*
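The choice of token 199 for the replacement follows from how the ring maps keys to nodes: a key belongs to the first node whose token is greater than or equal to the key's token (wrapping around), so a node at X-1 inherits all of the dead node's range except the single token X. A small Python sketch using the simplified ring above (`replica_for` is an invented helper, not Cassandra code):

```python
# Primary replica lookup on a token ring: first node token >= key token,
# wrapping around the ring.
from bisect import bisect_left

def replica_for(ring_tokens, key_token):
    i = bisect_left(ring_tokens, key_token)
    return ring_tokens[i % len(ring_tokens)]

# Ring after x (token 199) replaces dead n2 (token 200):
ring = [0, 100, 199, 300]  # n4, n1, x (replacement), n3
```

In this model, a key at token 150 that previously belonged to n2 now maps to the replacement at 199; only a key at exactly token 200 falls through to n3.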
Re: Dead node seen as UP by replacement node
Yes, exactly. On Thu, Mar 13, 2014 at 1:27 PM, Rahul Menon ra...@apigee.com wrote: And the token value as suggested is the token value of the dead node - 1 ? On Thu, Mar 13, 2014 at 9:29 PM, Paulo Ricardo Motta Gomes paulo.mo...@chaordicsystems.com wrote: Nope, they have different IPs. I'm using the procedure described here to replace a dead node: http://www.datastax.com/docs/1.1/cluster_management#replacing-a-dead-node Dead node token: X (IP: Y) Replacement node token: X-1 (IP: Z) So, as soon as the replacement node (Z) is started, it sees the dead node (Y) as UP, and tries to stream data from it during the join process. About 10 minutes later, the failure detector of Z detects Y as down, but since it was trying to fetch data from it, it fails the join/bootstrap process altogether. -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200 +55 83 9690-1314
Dead node seen as UP by replacement node
Hello, I'm trying to replace a dead node using the procedure in [1], but the replacement node initially sees the dead node as UP, and after a few minutes the node is marked as DOWN again, failing the streaming/bootstrap procedure of the replacement node. This dead node is always seen as DOWN by the rest of the cluster. Could this be a bug? I can easily reproduce it in our production environment, but don't know if it's reproducible in a clean environment. Version: 1.2.13 Here is the log from the replacement node (192.168.1.10 is the dead node):
INFO [GossipStage:1] 2014-03-12 20:25:41,089 Gossiper.java (line 843) Node /192.168.1.10 is now part of the cluster
INFO [GossipStage:1] 2014-03-12 20:25:41,090 Gossiper.java (line 809) InetAddress /192.168.1.10 is now UP
INFO [GossipTasks:1] 2014-03-12 20:34:54,238 Gossiper.java (line 823) InetAddress /192.168.1.10 is now DOWN
ERROR [GossipTasks:1] 2014-03-12 20:34:54,240 AbstractStreamSession.java (line 110) Stream failed because /192.168.1.10 died or was restarted/removed (streams may still be active in background, but further streams won't be started)
WARN [GossipTasks:1] 2014-03-12 20:34:54,240 RangeStreamer.java (line 246) Streaming from /192.168.1.10 failed
ERROR [GossipTasks:1] 2014-03-12 20:34:54,240 AbstractStreamSession.java (line 110) Stream failed because /192.168.1.10 died or was restarted/removed (streams may still be active in background, but further streams won't be started)
WARN [GossipTasks:1] 2014-03-12 20:34:54,241 RangeStreamer.java (line 246) Streaming from /192.168.1.10 failed
[1] http://www.datastax.com/docs/1.1/cluster_management#replacing-a-dead-node Cheers, Paulo -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200 +55 83 9690-1314
Re: Dead node seen as UP by replacement node
Some further info: I'm not using Vnodes, so I'm using the 1.1 replace-node trick of setting the initial_token in the cassandra.yaml file to the value of the dead node's token - 1, with auto_bootstrap=true. However, according to the Apache wiki ( https://wiki.apache.org/cassandra/Operations#For_versions_1.2.0_and_above), on 1.2 you should actually remove the dead node from the ring before adding a replacement node. Does that mean the trick of setting the initial token to the dead node's token - 1 (described in http://www.datastax.com/docs/1.1/cluster_management#replacing-a-dead-node) is no longer valid in 1.2 without vnodes? On Wed, Mar 12, 2014 at 5:57 PM, Paulo Ricardo Motta Gomes paulo.mo...@chaordicsystems.com wrote: Hello, I'm trying to replace a dead node using the procedure in [1], but the replacement node initially sees the dead node as UP, and after a few minutes the node is marked as DOWN again, failing the streaming/bootstrap procedure of the replacement node. This dead node is always seen as DOWN by the rest of the cluster. Could this be a bug? I can easily reproduce it in our production environment, but don't know if it's reproducible in a clean environment. 
Version: 1.2.13 Here is the log from the replacement node (192.168.1.10 is the dead node):
INFO [GossipStage:1] 2014-03-12 20:25:41,089 Gossiper.java (line 843) Node /192.168.1.10 is now part of the cluster
INFO [GossipStage:1] 2014-03-12 20:25:41,090 Gossiper.java (line 809) InetAddress /192.168.1.10 is now UP
INFO [GossipTasks:1] 2014-03-12 20:34:54,238 Gossiper.java (line 823) InetAddress /192.168.1.10 is now DOWN
ERROR [GossipTasks:1] 2014-03-12 20:34:54,240 AbstractStreamSession.java (line 110) Stream failed because /192.168.1.10 died or was restarted/removed (streams may still be active in background, but further streams won't be started)
WARN [GossipTasks:1] 2014-03-12 20:34:54,240 RangeStreamer.java (line 246) Streaming from /192.168.1.10 failed
ERROR [GossipTasks:1] 2014-03-12 20:34:54,240 AbstractStreamSession.java (line 110) Stream failed because /192.168.1.10 died or was restarted/removed (streams may still be active in background, but further streams won't be started)
WARN [GossipTasks:1] 2014-03-12 20:34:54,241 RangeStreamer.java (line 246) Streaming from /192.168.1.10 failed
[1] http://www.datastax.com/docs/1.1/cluster_management#replacing-a-dead-node Cheers, Paulo -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200 +55 83 9690-1314 -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200 +55 83 9690-1314
Re: Bootstrap failure on C* 1.2.13
Hello Alain, I solved this with a brute-force solution, but didn't understand exactly what happened behind the scenes. What I did was:
a) removed the failed node from the ring with the unsafeAssassinate JMX option.
b) this caused requests to that node to be routed to the following node, which didn't have the data, so in order to fix the problem I inserted a new dummy node with the same token as the failed node, but with auto_bootstrap=false.
c) after the node joined the ring again, I did a clean shutdown with:
nodetool -h localhost disablethrift
nodetool -h localhost disablegossip
sleep 10
nodetool -h localhost drain
d) restarted the bootstrap process again in the new node.
But in our case, our cluster was not using VNodes, so this workaround will probably not work with VNodes, since you cannot specify the 256 tokens from the old node. This really seems like some kind of metadata inconsistency in gossip, so you should probably check whether nodetool gossipinfo shows a node that's not supposed to be in the ring and unsafeAssassinate it. This post has more info about it: http://nartax.com/2012/09/assassinate-cassandra-node/ But be careful and know what you're doing, as this can be a dangerous operation. Good luck! Cheers, Paulo On Fri, Feb 14, 2014 at 11:17 AM, Alain RODRIGUEZ arodr...@gmail.com wrote: Hi Paulo, Did you find out how to fix this issue? I am experiencing the exact same issue after trying to help you on this exact subject a few days ago :). Config: 32 C* 1.2.11 nodes, Vnodes enabled, RF=3, 1 DC, on AWS EC2 m1.xlarge. We added a few nodes (4) and it seems that this occurs on one node out of two... 
INFO 12:52:16,889 Finished streaming session d5e4d014-9558-11e3-950d-cd6aba92807e from /xxx.xxx.xxx.xxx
java.lang.RuntimeException: Unable to fetch range [(20078703525355016727168231761171377180,20105424945623564908585534414693308183], (129753652951782325468767616123724624016,129754698153613057562227134647005586420], (449910615740630024413140540076738,4524540663392564361402125588359485564], (122461441134035840782923349842361962551,122462803389597917496737056756119104930], (107970238065835199457922160357012606207,107987706615224138615506976884972465320], (129754698153613057562227134647005586420,129760990520285412763184172827801136526], (38338043252657275110873170917842646549,38368318768493907804399955985800320618], (42022774431506526693485667522039962965,42053289032932587102300879230918436885], (66836265760288088017242608238099612345,66844191330959602627129212011239690831], (52540232739182066369547232798226785314,52559117354438503565212218200939569114], (145046787539667961591986998676504957238,145057153206926436867917708334845130444], (108279691586280658015556401795266720050,108305470056478513440634738885678702409], (40039571254531814244837067525035822613,40053379084508254942645157728035688263], (132027653159543236812527609067336099062,132029648290617316887203744857701890860], (52516518106546460227349801041398186304,52540232739182066369547232798226785314], (151797253868519929321029931533765036527,151828244658375264200603444399788004805], (145057153206926436867917708334845130444,145084033851007428646660791831082771964], (107963567982152736714636832273817259428,107970238065835199457922160357012606207]] for keyspace foo_bar from any hosts
at org.apache.cassandra.dht.RangeStreamer.fetch(RangeStreamer.java:260)
at org.apache.cassandra.dht.BootStrapper.bootstrap(BootStrapper.java:84)
at org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:973)
at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:740)
at 
org.apache.cassandra.service.StorageService.initServer(StorageService.java:584) at org.apache.cassandra.service.StorageService.initServer(StorageService.java:481) at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:348) at org.apache.cassandra.service.CassandraDaemon.init(CassandraDaemon.java:381) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.commons.daemon.support.DaemonLoader.load(DaemonLoader.java:212) Cannot load daemon Service exit with a return value of 3

Hope you'll be able to help me on this one :)

2014-02-07 19:24 GMT+01:00 Robert Coli rc...@eventbrite.com: On Fri, Feb 7, 2014 at 4:41 AM, Alain RODRIGUEZ arodr...@gmail.com wrote: From the changelog: 1.2.15 * Move handling of migration event source to solve bootstrap race (CASSANDRA-6648) Maybe you should give this new version a try if you suspect your issue is related to CASSANDRA-6648. The bug fixed by 6648 appears to have been introduced in 1.2.14, by: https://issues.apache.org/jira/browse/CASSANDRA-6615 So it should only affect 1.2.14. =Rob
non-vnodes own 0.0% of the ring on nodetool status
Hello,

After adding a new datacenter with virtual nodes enabled, the output of nodetool status shows that the nodes from the non-vnodes datacenter own 0.0% of the data, as shown below:

Datacenter: NonVnodesDC
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load     Tokens  Owns  Host ID  Rack
UN  XX.XXX.XX.XX  many GB  1       0.1%           myrack
UN  YY.YYY.YY.YY  many GB  1       0.0%           myrack
UN  ZZ.ZZZ.ZZ.ZZ  many GB  1       0.0%           myrack

Datacenter: VnodesDC
====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load    Tokens  Owns  Host ID  Rack
UN  AA.AAA.AA.AA  few KB  256     5.8%           myrack
UN  BB.BBB.BB.BB  few KB  256     6.6%           myrack
UN  CC.CCC.CC.CC  few KB  256     6.9%           myrack

Is this just a presentation issue in nodetool, or could it indicate a more serious problem? I followed exactly the procedure described at http://www.datastax.com/documentation/cassandra/1.2/webhelp/cassandra/operations/ops_add_dc_to_cluster_t.html to add the new DC.

Thank you,

-- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200 +55 83 9690-1314
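[Editor's note] One way to see why single-token nodes can show near-0% next to vnode nodes is raw token ownership: each token owns the slice of the ring back to the previous token, so 3 single tokens sitting next to 3 x 256 vnode tokens cover a vanishing fraction of the ring. The toy model below (plain Python, not Cassandra's actual partitioner or ownership code; node names and ring size are illustrative) shows the arithmetic.

```python
import random

# Toy ring-ownership model: each token owns the range from the previous
# token (exclusive) up to itself, with wraparound. Illustrative only.
RING = 2**61

def ownership(tokens_by_node, ring_size=RING):
    ring = sorted((t, node) for node, ts in tokens_by_node.items() for t in ts)
    owns = {node: 0 for node in tokens_by_node}
    for i, (token, node) in enumerate(ring):
        prev = ring[i - 1][0]  # i == 0 wraps around to the last token
        owns[node] += (token - prev) % ring_size
    return {node: owned / ring_size for node, owned in owns.items()}

random.seed(42)
tokens = {
    # "NonVnodesDC": one evenly spaced token per node
    "nonvnode-1": [0],
    "nonvnode-2": [RING // 3],
    "nonvnode-3": [2 * RING // 3],
    # "VnodesDC": 256 random tokens per node
    "vnode-1": random.sample(range(RING), 256),
    "vnode-2": random.sample(range(RING), 256),
    "vnode-3": random.sample(range(RING), 256),
}
own = ownership(tokens)
for node, frac in own.items():
    print(f"{node}: {100 * frac:.2f}%")  # non-vnode nodes land near 0%
```

This only models raw token ownership; replication-aware ("effective") ownership, as reported when nodetool is given a keyspace, is computed differently.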
Re: Question: ConsistencyLevel.ONE with multiple datacenters
Cool. I actually changed the consistency level to LOCAL_ONE and things worked as expected. Cheers! On Thu, Feb 6, 2014 at 11:31 AM, Chris Burroughs chris.burrou...@gmail.com wrote: I think the scenario you outlined is correct. The DES handles multiple DCs poorly and the LOCAL_ONE hammer is the best bet. On 01/31/2014 12:40 PM, Paulo Ricardo Motta Gomes wrote: Hey, When adding a new datacenter to our production C* cluster using the procedure described in [1], some of our application requests were returning null/empty values. Rebuild was not complete in the new datacenter, so my guess is that some requests were being directed to the brand-new datacenter, which still didn't have the data. Our Hector client was connected only to the original nodes, with autoDiscoverHosts=false, and we use ConsistencyLevel.ONE for reads. The keyspace schema was already configured to use both datacenters. My question is: is it possible that the dynamic snitch is choosing nodes in the new (empty) datacenter when CL=ONE? In that case, it's mandatory to use CL=LOCAL_ONE during bootstrap/rebuild of a new datacenter, otherwise empty data might be returned, correct? Cheers, [1] http://www.datastax.com/documentation/cassandra/1.2/webhelp/cassandra/operations/ops_add_dc_to_cluster_t.html -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200 +55 83 9690-1314
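[Editor's note] The failure mode confirmed above can be illustrated with a toy model (plain Python, not Cassandra or Hector code; host names and scores are made up): with CL=ONE the coordinator may send the read to whichever live replica the dynamic snitch scores best, including a still-empty replica in the new DC, while LOCAL_ONE restricts candidates to the coordinator's own DC.

```python
# Toy model of replica selection under ONE vs LOCAL_ONE. Illustrative only.
replicas = [
    {"host": "orig-1", "dc": "DC1", "has_data": True,  "latency_score": 0.9},
    {"host": "orig-2", "dc": "DC1", "has_data": True,  "latency_score": 0.7},
    # New DC's replica: empty, but scored "closest" by the snitch.
    {"host": "new-1",  "dc": "DC2", "has_data": False, "latency_score": 0.1},
]

def pick_replica(consistency, local_dc="DC1"):
    candidates = replicas
    if consistency == "LOCAL_ONE":
        # LOCAL_ONE only ever considers replicas in the coordinator's DC.
        candidates = [r for r in replicas if r["dc"] == local_dc]
    # Dynamic-snitch-like behaviour: prefer the lowest latency score.
    return min(candidates, key=lambda r: r["latency_score"])

print(pick_replica("ONE")["host"])        # "new-1" -- the empty replica wins
print(pick_replica("LOCAL_ONE")["host"])  # "orig-2" -- stays in the old DC
```

The point is simply that ONE gives the snitch the whole cluster to choose from, so an empty-but-fast new DC can serve reads with no data until rebuild completes.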
Re: Adding datacenter for move to vnodes
We had a similar situation, and what we did was first migrate the 1.1 cluster to GossipingPropertyFileSnitch, making sure that for each node we specified the correct availability zone as the rack in cassandra-rackdc.properties. Done this way, the GossipingPropertyFileSnitch is equivalent to the EC2MultiRegionSnitch, so the data location does not change and no repair is needed afterwards. So, if your nodes are located in the us-east-1e AZ, your cassandra-rackdc.properties should look like:

dc=us-east
rack=1e

After this step is complete on all nodes, you can add a new datacenter, specifying a different dc and rack in the cassandra-rackdc.properties of the new DC. Make sure you upgrade your initial datacenter to 1.2 before adding a new datacenter with vnodes enabled (of course). Cheers

On Sun, Feb 2, 2014 at 6:37 AM, Katriel Traum katr...@google.com wrote: Hello list. I'm upgrading a 1.1 cassandra cluster to 1.2(.13). I've read here and in other places that the best way to migrate to vnodes is to add a new DC with the same number of nodes and run rebuild on each of them. However, I'm faced with the fact that I'm using the EC2MultiRegion snitch, which automagically creates the DC and RACK. Any ideas how I can go about adding a new DC with this kind of setup? I need these new machines to be in the same EC2 Region as the current ones, so adding to a new Region is not an option. TIA, Katriel

-- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200 +55 83 9690-1314
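[Editor's note] A sketch of the end state of the properties files. The old nodes keep a dc/rack matching the EC2MultiRegionSnitch naming so data placement doesn't move; the new vnodes DC just needs a distinct dc value (the name "us-east-vnodes" below is hypothetical, not from the original thread).

```
# cassandra-rackdc.properties on the EXISTING (non-vnodes) nodes,
# matching the old EC2MultiRegionSnitch naming for us-east-1e:
dc=us-east
rack=1e

# cassandra-rackdc.properties on the NEW vnodes datacenter's nodes
# ("us-east-vnodes" is a hypothetical name -- it only needs to differ):
dc=us-east-vnodes
rack=1e
```

The new nodes would additionally have num_tokens set (e.g. 256) in cassandra.yaml before joining, per the vnodes-migration procedure the thread follows.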
Question: ConsistencyLevel.ONE with multiple datacenters
Hey, When adding a new datacenter to our production C* cluster using the procedure described in [1], some of our application requests were returning null/empty values. Rebuild was not complete in the new datacenter, so my guess is that some requests were being directed to the brand-new datacenter, which still didn't have the data. Our Hector client was connected only to the original nodes, with autoDiscoverHosts=false, and we use ConsistencyLevel.ONE for reads. The keyspace schema was already configured to use both datacenters. My question is: is it possible that the dynamic snitch is choosing nodes in the new (empty) datacenter when CL=ONE? In that case, it's mandatory to use CL=LOCAL_ONE during bootstrap/rebuild of a new datacenter, otherwise empty data might be returned, correct? Cheers, [1] http://www.datastax.com/documentation/cassandra/1.2/webhelp/cassandra/operations/ops_add_dc_to_cluster_t.html -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200 +55 83 9690-1314