ApacheCon Europe 2019
Hi, Do we have any plans for a dedicated Apache Cassandra track or sessions at ApacheCon Berlin in Oct 2019? The CFP closes 26 May 2019. Thanks, Anuj Wadehra
Re: Upgrade to v3.11.3
Hi Shalom, Just a suggestion: before upgrading to 3.11.3, make sure you are not impacted by any open critical defects, especially those related to RT (range tombstones) which may cause data loss, e.g. 14861. Please find my responses below:

Q: The upgrade process that I know of is from 2.0.14 to 2.1.x (higher than 2.1.9, I think) and then from 2.1.x to 3.x. Do I need to upgrade first to 3.0.x, or can I upgrade directly from 2.1.x to 3.11.3?
Response: Yes, you can upgrade from 2.0.14 to the latest stable 2.1.x release (2.1.9+ only) and then upgrade to 3.11.3.

Q: Can I run upgradesstables on several nodes in parallel? Is it crucial to run it one node at a time?
Response: Yes, you can run it in parallel.

Q: When running upgradesstables on a node, does that node still serve writes and reads?
Response: Yes.

Q: Can I use OpenJDK 8 (instead of Oracle JDK) with C* 3.11.3?
Response: We have not tried it, but it should be okay. See https://issues.apache.org/jira/plugins/servlet/mobile#issue/CASSANDRA-13916.

Q: Is there a way to speed up the upgradesstables process (besides compaction_throughput)?
Response: If clearing the pending compactions caused by rewriting SSTables is a concern, you can also try increasing concurrent_compactors.

Disclaimer: The information provided in the above response is my personal opinion based on the best of my knowledge and experience. We do not take any responsibility and are not liable for any damage caused by actions taken based on the above information.

Thanks, Anuj

On Wed, 16 Jan 2019 at 19:15, shalom sagges wrote:
Hi All, I'm about to start a rolling upgrade from version 2.0.14 to version 3.11.3. I have a few small questions:
- The upgrade process that I know of is from 2.0.14 to 2.1.x (higher than 2.1.9, I think) and then from 2.1.x to 3.x. Do I need to upgrade first to 3.0.x, or can I upgrade directly from 2.1.x to 3.11.3?
- Can I run upgradesstables on several nodes in parallel? Is it crucial to run it one node at a time?
- When running upgradesstables on a node, does that node still serve writes and reads?
- Can I use OpenJDK 8 (instead of Oracle JDK) with C* 3.11.3?
- Is there a way to speed up the upgradesstables process (besides compaction_throughput)?
Thanks!
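The per-node sequence discussed in this thread can be sketched as a dry run; package and service names below are assumptions (adapt them to your distro), and each echoed command would be run for real, one node at a time per hop of the upgrade path:

```shell
# Dry-run sketch of one node's upgrade cycle (e.g. 2.1.x -> 3.11.3).
# Package/service names are hypothetical; replace echo with real execution.
upgrade_node() {
  echo "nodetool drain                         # flush memtables, stop accepting writes"
  echo "sudo service cassandra stop"
  echo "sudo yum install -y cassandra-3.11.3   # hypothetical package name"
  echo "sudo service cassandra start"
  echo "nodetool upgradesstables               # rewrite SSTables into the new format"
}
upgrade_node
```

As discussed above, upgradesstables can run on several nodes in parallel once the binaries are upgraded everywhere.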
Cassandra Upgrade with Different Protocol Version
Hi, I would like to know how people do rolling upgrades of Cassandra clusters when there is a change in native protocol version, say from 2.1 to 3.11. During a rolling upgrade, if the client application is restarted on nodes, the client driver may first contact an upgraded Cassandra node with v4 and permanently mark all the old Cassandra nodes on v3 as down. This may lead to request failures. DataStax recommends two ways to deal with this:

1. Before the upgrade, set the protocol version to the lower version, and move to the higher version once the entire cluster is upgraded.
2. Make sure the driver only contacts upgraded Cassandra nodes during the rolling upgrade.

The second workaround will lead to failures, as you may not be able to meet the required consistency for some time. So let's consider the first workaround. Now imagine an application where the protocol version is not configurable and the code uses the default protocol version. You cannot apply the first workaround, because you would first have to upgrade your application on all nodes just to make the protocol version configurable. How would you upgrade such a cluster without downtime? Thoughts?

Thanks, Anuj
Re: [External] Re: Whch version is the best version to run now?
We evaluated both 3.0.x and 3.11.x. +1 for 3.11.2, as we faced major performance issues with 3.0.x. We have NOT evaluated the new features on 3.11.x. Anuj

Sent from Yahoo Mail on Android

On Tue, 6 Mar 2018 at 19:35, Alain RODRIGUEZ wrote:
Hello Tom, It's good to hear this kind of feedback, thanks for sharing.

"3.11.x seems to get more love from the community wrt patches. This is why I'd recommend 3.11.x for new projects."
I also agree with this analysis.

"Stay away from any of the 2.x series, they're going EOL soonish and the newer versions are very stable."
+1 here as well. Maybe add that 3.11.x, described as 'very stable' above, aims at stabilizing Cassandra after the tick-tock releases: it is a 'bug fix' series that also brings the features developed during that period, even though care is needed with some of the new features, even in the latest 3.11.x versions. I have not worked that much with it yet, but I think I would pick 3.11.2 as well for a new cluster at the moment.

C*heers,
---
Alain Rodriguez - @arodream - alain@thelastpickle.com
France / Spain
The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2018-03-05 12:39 GMT+00:00 Tom van der Woerdt:
We run on the order of a thousand Cassandra nodes in production. Most of that is 3.0.16, but new clusters default to 3.11.2 and some older clusters have been upgraded to it as well. All of the bugs I encountered in 3.11.x were also seen in 3.0.x, but 3.11.x seems to get more love from the community wrt patches. This is why I'd recommend 3.11.x for new projects. Stay away from any of the 2.x series, they're going EOL soonish and the newer versions are very stable.

Tom van der Woerdt
Site Reliability Engineer
Booking.com B.V.

On Sat, Mar 3, 2018 at 12:25 AM, Jeff Jirsa wrote:
I'd personally be willing to run 3.0.16. 3.11.2 or whatever should also be similar, but I haven't personally tested it at any meaningful scale.
--
Jeff Jirsa

On Mar 2, 2018, at 2:37 PM, Kenneth Brotman wrote:
Seems like a lot of people are running old versions of Cassandra. What is the best, most reliable stable version to use now? Kenneth Brotman
Re: LWT and non-LWT mixed
Hi Daniel,
What is the RF and CL for the DELETE? Are you using asynchronous writes? Are you firing both statements from the same node sequentially? Are you firing these queries in a loop, such that more than one DELETE and LWT is fired for the same partition?

I think that if the same client executes both statements sequentially in the same thread, i.e. one after another, and the delete is synchronous, it should work fine. The LWT will be executed after Cassandra has written to a quorum of nodes and will see the data; the Paxos round of the LWT shall only be initiated once the delete completes. I think LWTs should not be mixed with normal writes when such writes are fired from multiple nodes/threads on the same partition.

Thanks, Anuj

Sent from Yahoo Mail on Android

On Tue, 10 Oct 2017 at 14:10, Daniel Woo wrote:
The document explains you cannot mix them: http://docs.datastax.com/en/archived/cassandra/2.2/cassandra/dml/dmlLtwtTransactions.html
But what happens under the hood if I do? E.g.:
DELETE ...
INSERT ... IF NOT EXISTS
The coordinator has 4 steps to execute the second statement (the INSERT):
1. prepare/promise a ballot
2. read the current row from replicas
3. propose the new value along with the ballot to replicas
4. commit and wait for acks from replicas
My question is: once the row is DELETEd, the next INSERT LWT should be able to see that row's tombstone in step 2, then successfully insert the new value. But my tests show that this often fails; does anybody know why?
--
Thanks & Regards, Daniel
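For reference, a sketch of the statement pair under discussion, as it might be fed to cqlsh (the keyspace, table, and column names are hypothetical):

```shell
# The mixed sequence from the question, as cqlsh input (dry run only).
# ks.t, id, and val are hypothetical names.
cql='
DELETE FROM ks.t WHERE id = 1;
INSERT INTO ks.t (id, val) VALUES (1, 42) IF NOT EXISTS;'
# The DELETE takes the normal write path at the statement CL, while the
# conditional INSERT goes through Paxos; the Paxos read (step 2 above) is
# only guaranteed to see the tombstone if the DELETE already committed on
# enough replicas when the ballot was prepared.
echo "$cql"    # pipe into `cqlsh <host>` to execute; not done here
```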
Re: Re: Re: tolerate how many nodes down in the cluster
Hi Peng,
Racks can be logical (defined with the RAC attribute in the Cassandra configuration files) or physical (racks in server rooms). In my view, to leverage racks in your case, it's important to understand the implications of the following decisions:

1. Number of distinct logical RACs defined in Cassandra:
If you want to leverage RACs optimally for operational efficiency (as Brooke explained), make sure that the number of logical RACs is ALWAYS equal to RF, irrespective of whether the number of physical racks is equal to or greater than RF. Keeping logical racks = RF ensures that the nodes allocated to one logical rack hold exactly 1 replica of the entire 100% data set. So, if you have RF=3 and use QUORUM for reads/writes, you can bring down ALL nodes allocated to a logical rack for maintenance and still have 100% availability. This makes operations faster and cuts down the risk involved. For example, imagine restarting the entire Cassandra cluster. If one node takes 3 minutes, a rolling restart of 30 nodes would take 90 minutes. But if you use 3 logical RACs with RF=3 and assign 10 nodes to each logical RAC, you can restart the 10 nodes within a RAC simultaneously (in off-peak hours, of course, so that the remaining 20 nodes can take the load). Restarting Cassandra on the RACs one by one then takes just 9 minutes rather than 90. If there are any issues during the restart/maintenance, you can take all the nodes of one logical RAC down, fix them, and bring them back without affecting availability.

2. Number of physical racks:
Historical data shows instances where more than one node in a physical rack failed together. When you are using VMs, there are three levels instead of two, as VMs on a single physical machine are also likely to fail together due to hardware failure: physical racks > physical machines > VMs. Ensure that all VMs on a physical machine map to a single logical RAC.
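The restart arithmetic in point 1 above reduces to two multiplications (numbers taken from the example):

```shell
# 30-node cluster, 3 minutes per node restart, 3 logical RACs of 10 nodes each.
nodes=30; per_node_min=3; racs=3
serial=$(( nodes * per_node_min ))         # one node at a time
rac_at_a_time=$(( racs * per_node_min ))   # a whole RAC restarts in parallel
echo "serial restart: ${serial} min, RAC-at-a-time: ${rac_at_a_time} min"
```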
If you want to tolerate the failure of physical racks in the server room, you also need to ensure that all physical servers in a physical rack map to just one logical RAC. This way, you can tolerate the failure of ALL VMs on ALL physical machines mapped to a single logical RAC and still be 100% available. For example, with RF=3, 6 physical racks, 2 physical servers per physical rack, and 3 VMs per physical server, the setup would be:

Physical Rack1 = [Physical1 (3 VMs) + Physical2 (3 VMs)] = LogicalRAC1
Physical Rack2 = [Physical3 (3 VMs) + Physical4 (3 VMs)] = LogicalRAC1
Physical Rack3 = [Physical5 (3 VMs) + Physical6 (3 VMs)] = LogicalRAC2
Physical Rack4 = [Physical7 (3 VMs) + Physical8 (3 VMs)] = LogicalRAC2
Physical Rack5 = [Physical9 (3 VMs) + Physical10 (3 VMs)] = LogicalRAC3
Physical Rack6 = [Physical11 (3 VMs) + Physical12 (3 VMs)] = LogicalRAC3

The problem with this approach is scaling. What if you want to add a single physical server? If you do that and allocate it to one existing logical RAC, your cluster won't be balanced properly, because the logical RAC to which the server is added will have additional capacity for the same data as the other two logical RACs. To keep your cluster balanced, you need to add at least 3 physical servers in 3 different physical racks and assign each physical server to a different logical RAC. That wastes resources and is hard to digest.

If you have fewer physical machines than logical RACs, a physical machine may hold more than 1 replica. If an entire physical machine fails, you will NOT have 100% availability, as more than 1 replica may be unavailable. Similarly, if you have fewer physical racks than logical RACs, a physical rack may hold more than 1 replica, and the failure of an entire physical rack breaks 100% availability for the same reason.

Coming back to your example: RF=3 per DC (total RF=6), CL=QUORUM, 2 DCs, 6 physical machines, 8 VMs per physical machine. My recommendation:

1. In each DC, assign the 3 physical machines to 3 logical RACs in the Cassandra configuration. The 2 DCs can have the same RAC names, as RACs are uniquely identified by their DC names. So these are 6 different logical RACs (a multiple of RF), i.e. 1 physical machine per logical RAC.
2. Add 6 physical machines (3 per DC) to scale the cluster, assigning every machine to a different logical RAC within its DC.

This way, even with an Active-Passive DC setup, you can tolerate the failure of any physical machine or physical rack in the active DC and still ensure 100% availability. You would also get the operational benefits explained above. In a multi-DC setup, you can also choose to do away with RACs and achieve the operational benefits by doing maintenance on one entire DC at a time, leveraging the other DC to handle client requests during that time. That will make your life simpler.

Thanks, Anuj

Sent from Yahoo Mail on Android

On Thu, 27 Jul 2017 at 12:03, kurt greaves wrote:
Note that if you use more racks than RF you lose some of the
Re: Re: tolerate how many nodes down in the cluster
Hi Brooke,
Very nice presentation: https://www.youtube.com/watch?v=QrP7G1eeQTI !! Good to know that you are able to leverage racks for gaining operational efficiencies. I think vnodes have made life easier. I still see some concerns with racks:

1. Usually scaling needs are driven by business requirements. Customers want value for every penny they spend. Adding 3 or 5 servers (because you have RF=3 or 5) instead of 1 server costs them dearly. It's difficult to justify the additional cost, as fault tolerance can only be improved, not guaranteed, with racks.
2. You need to maintain mappings of logical racks (=RF) and physical racks (a multiple of RF) for large clusters.
3. Using racks tightly couples your hardware decisions (rack size, rack count) and virtualization decisions (VM size, VM count per physical node) with the application's RF.

Thanks, Anuj

On Tuesday, 25 July 2017 3:56 AM, Brooke Thorley <bro...@instaclustr.com> wrote:
Hello Peng. I think spending the time to set up your nodes into racks is worth it for the benefits that it brings. With RF3 and NTS you can tolerate the loss of a whole rack of nodes without losing QUORUM, as each rack will contain a full set of data. It makes ongoing cluster maintenance easier, as you can perform upgrades, repairs and restarts on a whole rack of nodes at once. Setting up racks or adding nodes is not difficult, particularly if you are using vnodes. You would simply add nodes in multiples of to keep the racks balanced. This is how we run all our managed clusters and it works very well. You may be interested to watch my Cassandra Summit presentation from last year, in which I discussed this very topic: https://www.youtube.com/watch?v=QrP7G1eeQTI (from 4:00). If you were to consider changing your rack topology, I would recommend that you do this by DC migration rather than "in place".
Kind Regards,
Brooke Thorley
VP Technical Operations & Customer Services
supp...@instaclustr.com | support.instaclustr.com

On 25 July 2017 at 03:06, Anuj Wadehra <anujw_2...@yahoo.co.in.invalid> wrote:
Hi Peng, Three things are important when you are evaluating fault tolerance and availability for your cluster:
1. RF
2. CL
3. Topology - how data is replicated across racks.

If you assume that N nodes from ANY rack may fail at the same time, then you can afford the failure of RF-CL nodes and still be 100% available. E.g. if you are reading at QUORUM and RF=3, you can only afford one (3-2) node failure. Thus, even if you have a 30-node cluster, a 10-node failure cannot give you 100% availability. RF impacts availability rather than the total number of nodes in the cluster. If you assume that N nodes failing together will ALWAYS be from the same rack, you can spread your servers across RF physical racks and use NetworkTopologyStrategy. While allocating replicas for any data, Cassandra will ensure that the 3 replicas are placed in 3 different racks. E.g. you can have 10 nodes in each of 3 racks, and then even a 10-node failure within the SAME rack still leaves you 100% available, as two replicas exist for 100% of the data and CL=QUORUM can be met. I have not tested this, but that's how the rack concept is expected to work. I agree, using racks generally makes operations tougher.

Thanks, Anuj

On Mon, 24 Jul 2017 at 20:10, Peng Xiao <2535...@qq.com> wrote:
Hi Bhuvan, from the following link, it doesn't suggest us to use racks, and it looks reasonable: http://www.datastax.com/dev/blog/multi-datacenter-replication
Defining one rack for the entire cluster is the simplest and most common implementation. Multiple racks should be avoided for the following reasons:
• Most users tend to ignore or forget rack requirements that state racks should be in an alternating order to allow the data to get distributed safely and appropriately.
• Many users are not using the rack information effectively by using a setup with as many racks as they have nodes, or similar non-beneficial scenarios.
• When using racks correctly, each rack should typically have the same number of nodes.
• In a scenario that requires a cluster expansion while using racks, the expansion procedure can be tedious since it typically involves several node moves and has to ensure that racks will be distributing data correctly and evenly. At times when clusters need immediate expansion, racks should be the last things to worry about.

-- Original Message --
From: "Bhuvan Rawal" <bhu1ra...@gmail.com>
Re: Re: tolerate how many nodes down in the cluster
Hi Peng, Three things are important when you are evaluating fault tolerance and availability for your cluster:
1. RF
2. CL
3. Topology - how data is replicated across racks.

If you assume that N nodes from ANY rack may fail at the same time, then you can afford the failure of RF-CL nodes and still be 100% available. E.g. if you are reading at QUORUM and RF=3, you can only afford one (3-2) node failure. Thus, even if you have a 30-node cluster, a 10-node failure cannot give you 100% availability. RF impacts availability rather than the total number of nodes in the cluster. If you assume that N nodes failing together will ALWAYS be from the same rack, you can spread your servers across RF physical racks and use NetworkTopologyStrategy. While allocating replicas for any data, Cassandra will ensure that the 3 replicas are placed in 3 different racks. E.g. you can have 10 nodes in each of 3 racks, and then even a 10-node failure within the SAME rack still leaves you 100% available, as two replicas exist for 100% of the data and CL=QUORUM can be met. I have not tested this, but that's how the rack concept is expected to work. I agree, using racks generally makes operations tougher.

Thanks, Anuj

On Mon, 24 Jul 2017 at 20:10, Peng Xiao <2535...@qq.com> wrote:
Hi Bhuvan, from the following link, it doesn't suggest us to use racks, and it looks reasonable: http://www.datastax.com/dev/blog/multi-datacenter-replication
Defining one rack for the entire cluster is the simplest and most common implementation. Multiple racks should be avoided for the following reasons:
• Most users tend to ignore or forget rack requirements that state racks should be in an alternating order to allow the data to get distributed safely and appropriately.
• Many users are not using the rack information effectively by using a setup with as many racks as they have nodes, or similar non-beneficial scenarios.
• When using racks correctly, each rack should typically have the same number of nodes.
• In a scenario that requires a cluster expansion while using racks, the expansion procedure can be tedious since it typically involves several node moves and has to ensure that racks will be distributing data correctly and evenly. At times when clusters need immediate expansion, racks should be the last things to worry about.

-- Original Message --
From: "Bhuvan Rawal" <bhu1ra...@gmail.com>
Sent: Monday, 24 July 2017, 7:17 PM
To: "user"
Subject: Re: tolerate how many nodes down in the cluster

Hi Peng, This really depends on how you have configured your topology. Say you have segregated your DC into 3 racks with 10 servers each. With an RF of 3, you can safely assume your data to be available if one rack goes down. But if different servers amongst the racks fail, then I guess you are not guaranteeing data integrity with RF of 3; in that case you can at most lose 2 servers and stay available. The best idea would be to plan failover modes appropriately and let Cassandra know of the same.

Regards, Bhuvan

On Mon, Jul 24, 2017 at 3:28 PM, Peng Xiao <2535...@qq.com> wrote:
Hi, Suppose we have a 30-node cluster in one DC with RF=3. How many nodes can be down? Can we tolerate 10 nodes down? It seems that we cannot prevent all 3 replicas of some data from landing within those 10 nodes, so can we only tolerate 1 node down even though we have 30 nodes? Could anyone please advise? Thanks
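The availability rule quoted in this thread (with failures from ANY rack, you can afford RF-CL node failures regardless of cluster size) reduces to a one-liner:

```shell
# Tolerable failures = RF - (replicas required by CL); cluster size is irrelevant.
rf=3
quorum=$(( rf / 2 + 1 ))       # QUORUM needs 2 of 3 replicas
tolerable=$(( rf - quorum ))   # 1 node, even in a 30-node cluster
echo "RF=${rf}, QUORUM=${quorum} -> tolerate ${tolerable} node(s) down"
```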
RE: MUTATION messages were dropped in last 5000 ms for cross node timeout
Hi Asad,
You can increase it by 2 at a time. For example, if it's currently 2, try increasing it to 4 and retest. We flush 5-6 tables at a time and use 3 memtable_flush_writers. It works great!! There were dropped mutations when it was set to one. The idea is to make sure that writes are not blocked.

Thanks, Anuj

Sent from Yahoo Mail on Android

On Fri, 21 Jul 2017 at 20:04, ZAIDI, ASAD A <az1...@att.com> wrote:
Thank you for your reply. I'll increase memtable_flush_writers and report back if it helps. Is there any formula that we can use to arrive at the correct number of memtable_flush_writers? Or would the exercise wind up being "trial and error", taking much time to arrive at some number that may not be optimal? Thank you again.

From: Anuj Wadehra [mailto:anujw_2...@yahoo.co.in]
Sent: Thursday, July 20, 2017 12:17 PM
To: ZAIDI, ASAD A <az1...@att.com>; user@cassandra.apache.org
Subject: Re: MUTATION messages were dropped in last 5000 ms for cross node timeout

Hi Asad, You can do the following things:
1. Increase memtable_flush_writers, especially if you have a write-heavy load.
2. Make sure there are no big GC pauses on your nodes. If there are, go for heap tuning.
Please let us know whether the above measures fixed your problem.

Thanks, Anuj

Sent from Yahoo Mail on Android

On Thu, 20 Jul 2017 at 20:57, ZAIDI, ASAD A <az1...@att.com> wrote:
Hello folks - I'm using apache-cassandra 2.2.8. I see many messages like the one below in my system.log file. In cassandra.yaml, cross_node_timeout: true is set, and an NTP server is also running, correcting clock drift on the 16-node cluster. I do not see pending or blocked HintedHandoff in tpstats output, though there are a bunch of dropped MUTATIONs observed.

INFO [ScheduledTasks:1] 2017-07-20 08:02:52,511 MessagingService.java:946 - MUTATION messages were dropped in last 5000 ms: 822 for internal timeout and 2152 for cross node timeout

I'm seeking help here; please let me know what I need to check in order to address these cross node timeouts. Thank you, Asad
Re: MUTATION messages were dropped in last 5000 ms for cross node timeout
Hi Asad, You can do the following things:
1. Increase memtable_flush_writers, especially if you have a write-heavy load.
2. Make sure there are no big GC pauses on your nodes. If there are, go for heap tuning.
Please let us know whether the above measures fixed your problem.

Thanks, Anuj

Sent from Yahoo Mail on Android

On Thu, 20 Jul 2017 at 20:57, ZAIDI, ASAD A wrote:
Hello folks - I'm using apache-cassandra 2.2.8. I see many messages like the one below in my system.log file. In cassandra.yaml, cross_node_timeout: true is set, and an NTP server is also running, correcting clock drift on the 16-node cluster. I do not see pending or blocked HintedHandoff in tpstats output, though there are a bunch of dropped MUTATIONs observed.

INFO [ScheduledTasks:1] 2017-07-20 08:02:52,511 MessagingService.java:946 - MUTATION messages were dropped in last 5000 ms: 822 for internal timeout and 2152 for cross node timeout

I'm seeking help here; please let me know what I need to check in order to address these cross node timeouts. Thank you, Asad
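A sketch of the suggested yaml change, shown against a throwaway copy of the file (the path and the target value are assumptions; tune iteratively as advised above, re-checking `nodetool tpstats` for dropped MUTATIONs after each rolling restart):

```shell
# Bump memtable_flush_writers in a stand-in for cassandra.yaml.
yaml=$(mktemp)
printf 'memtable_flush_writers: 1\n' > "$yaml"   # stand-in for the live file
sed -i 's/^memtable_flush_writers: .*/memtable_flush_writers: 2/' "$yaml"
new_setting=$(cat "$yaml")
echo "$new_setting"
rm -f "$yaml"
```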
Re: "nodetool repair -dc"
Hi,
I have not used DC-local repair specifically, but generally repair syncs all local tokens of the node with the other replicas (full repair), or a subset of the local tokens (-pr and subrange). A full repair with the -dc option should only sync the data, for all the tokens present on the node where the command is run, with the other replicas in the local DC. You should run a full repair on all nodes of the DC, unless the RF of all keyspaces in the local DC = number of nodes in the DC. E.g. if you have 3 nodes in dc1 and the RF is DC1:3, repairing a single node should sync all data within the DC. This doesn't hold true if you have 5 nodes and no node holds 100% of the data. Running a full repair on all nodes in a DC may lead to repairing every piece of data RF times. Inefficient!! And you can't use -pr with the -dc option; even if it were allowed, it wouldn't repair the entire ring, as a DC owns only a subset of the entire token ring.

Thanks, Anuj

On Tue, 11 Jul 2017 at 20:08, vasu gunja wrote:
Hi, my question is specific to the -dc option. Do we need to run this on all nodes that belong to that DC? Or only on one of the nodes that belongs to that DC, and it will then repair all nodes?

On Sat, Jul 8, 2017 at 10:56 PM, Varun Gupta wrote:
I do not see the need to run repair, as long as the cluster was in a healthy state when adding the new nodes.

On Fri, Jul 7, 2017 at 8:37 AM, vasu gunja wrote:
Hi, I have a question regarding the "nodetool repair -dc" option. Recently we added multiple nodes to one DC; we want to perform repair only on the current DC. Here is my question: do we need to perform "nodetool repair -dc" on all nodes belonging to that DC, or only one node of that DC?
Thanks, V
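The conclusion above (run a full, DC-scoped repair on every node of the target DC) can be sketched as a dry run; the host list is hypothetical, and flag spellings vary a little between versions, so check `nodetool repair --help` on your release:

```shell
# Print the repair command for each node in DC1 (dry run only).
dc1_nodes="node1 node2 node3"   # hypothetical host list for the local DC
cmds=$(for host in $dc1_nodes; do
  echo "ssh $host nodetool repair -full -dc DC1"
done)
echo "$cmds"
```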
Re: private interface for interdc messaging
Hi,
I am not sure why you would want to connect clients on the public interface. Are you making DB calls from clients outside the DC? Also, I am not sure why you expect two DCs to communicate on private networks, unless they are two logical DCs within the same physical DC. Generally, you configure a multi-DC setup in the yaml as follows:

- Use GossipingPropertyFileSnitch and set prefer_local to true in cassandra-rackdc.properties. This ensures that node-to-node communication within a DC happens on the private interface.
- Set rpc_address to the private IP, so that clients connect to the private interface.
- Set listen_address to the private IP. Cassandra will communicate with nodes in the local DC using this address.
- Set broadcast_address to the public IP. Cassandra will communicate with nodes in the other DC using this address.
- Set listen_on_broadcast_address to true.

Thanks, Anuj

On Fri, 7 Jul 2017 at 22:58, CPC wrote:
Hi, We are building 2 datacenters, where each machine has one public interface (for native client connections) and one private interface (for internode communication). What we noticed is that nodes in one datacenter try to communicate with nodes in the other DC over their public interfaces. I mean:
DC1 Node1 public interface -> DC2 Node1 private interface
But what we prefer is:
DC1 Node1 private interface -> DC2 Node1 private interface
Is there any configuration so that a node makes inter-DC connections over its private network? Thank you...
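The bullet points above, collected into config fragments (the IPs, DC, and rack names are placeholders):

```shell
# Print the combined settings as they would appear in the two files.
config=$(cat <<'EOF'
# cassandra.yaml
endpoint_snitch: GossipingPropertyFileSnitch
listen_address: 10.0.0.5          # private IP: local-DC internode traffic
broadcast_address: 203.0.113.5    # public IP: cross-DC internode traffic
listen_on_broadcast_address: true
rpc_address: 10.0.0.5             # clients connect on the private interface

# cassandra-rackdc.properties
dc=DC1
rack=RAC1
prefer_local=true
EOF
)
echo "$config"
```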
Re: Merkle trees requests hanging
Hi Jean,
Ensure that your firewall is not timing out idle connections. Nodes should time out idle connections first (using TCP keepalive settings) before the firewall does. Please refer to http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/troubleshooting/trblshootIdleFirewall.html.

Thanks, Anuj

Sent from Yahoo Mail on Android

On Tue, 4 Jul 2017 at 19:41, Jean Carlo wrote:
Hello. What if a node sends a Merkle tree to its replica, but it is never received due to network issues? Will the repair hang eternally? Or should I modify the parameter
# streaming_socket_timeout_in_ms: 0
to avoid this?
Saludos
Jean Carlo
"The best way to predict the future is to invent it" Alan Kay
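The linked page boils down to lowering the kernel's TCP keepalive timers below the firewall's idle timeout, so the OS refreshes idle internode connections before the firewall drops them. A sketch of the commonly cited values (verify against your firewall's actual idle timeout before applying):

```shell
# Candidate /etc/sysctl.d fragment; apply with `sudo sysctl -p <file>`.
keepalive=$(cat <<'EOF'
net.ipv4.tcp_keepalive_time=60
net.ipv4.tcp_keepalive_probes=3
net.ipv4.tcp_keepalive_intvl=10
EOF
)
echo "$keepalive"
```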
Re: Cassandra Cluster Expansion Criteria
Hi Asad,
First, you need to understand the factors impacting cluster capacity. Some of the important factors to consider while doing capacity planning for Cassandra are:
1. Compaction strategy: impacts disk space requirements and the IO/CPU/memory overhead of compactions.
2. Replication factor: impacts disk space.
3. Business SLAs and data access patterns (read/write).
4. Type of storage: SSDs will ensure that IO is rarely a problem; you may become CPU-bound first.

Some trigger points for expanding your cluster:
1. Disk crunch: unable to meet the free-disk requirements of the various compaction strategies.
2. Overloaded nodes: tpstats/logs show frequent dropped mutations; consistently high CPU load.
3. Business SLAs not being met due to an increase in reads/writes per second.

Please note that this is not an exhaustive list.

Thanks, Anuj

Sent from Yahoo Mail on Android

On Thu, Jun 29, 2017 at 7:15 PM, ZAIDI, ASAD A wrote:
Hello folks, I'm on a Cassandra 2.2.8 cluster with 14 nodes, each with around 2TB of data volume. I'm looking for criteria or data points that can help me decide when or if I should add more nodes to the cluster, and by how many nodes. I'll really appreciate it if you guys can share your insights. Thanks/Asad
Re: Restore Snapshot
Thanks Kurt. I think the main scenario which MUST be addressed by snapshots is backup/restore, so that a node can be restored in minimal time and the lengthy procedure of bootstrapping with join_ring=false followed by a full repair can be avoided. The plain restore snapshot + repair scenario seems to be broken. The situation is less critical when you use join_ring=false. Changing the consistency level to ALL is not an optimal solution or workaround, because it may impact performance. Moreover, it is an unreasonable and unstated assumption that Cassandra users can dynamically change the CL and then revert it after the repair. The ideal restore process should be:
1. Restore the snapshot.
2. Start the node with join_ring=false.
3. Cassandra should ACCEPT writes in this phase, just like a bootstrap with join_ring=false.
4. Repair the node.
5. Join the node.

Point 3 seems to be missing in the current implementation of join_ring. With it, when the node joins the ring at step 5, there would be NO inconsistent reads, as all the data updates made after the snapshot was taken and before it was restored would be consistent on all nodes. As it stands, the node misses the updates made while the repair was going on. So the full repair didn't sync the entire data: it fixed inconsistencies and prevented inconsistent reads, but leads to NEW inconsistencies, and you need another full repair on the node :(

I will conduct a test to be 100% sure that join_ring is not accepting writes, and if I get the same results, I will create a JIRA. We are updating the file system on the nodes and doing it one node at a time to avoid downtime. A snapshot cuts down on excessive streaming and the lengthy procedure (bootstrap + repair), so we were evaluating snapshot restore as an option.

Thanks, Anuj

On Wednesday, 28 June 2017 5:56 PM, kurt greaves wrote:
There are many scenarios where it can be useful, but to address what seems to be your main concern: you could simply restore and then only read at ALL until your repair completes. If you use snapshot restore with commitlog archiving, you're in a better state, but granted, the case you described can still occur. To some extent, if you have to restore a snapshot you will have to perform some kind of repair. It's not really possible to restore to an older point and expect strong consistency. Snapshots are also useful for creating a clone of a cluster/node. But really, why are you only restoring a snapshot on one node? If you lost all the data, it would be much simpler to just replace the node.
Re: Linux version update on DSE
Also, if you restore exactly the same data with a different IP, you may need to clear the gossip state on the node. Anuj Sent from Yahoo Mail on Android On Tue, Jun 27, 2017 at 11:56 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Hi Nitan, I asked for adding autobootstrap=false to avoid streaming. Generally, replace_address is used for bootstrapping a new node with a new IP but the same token range. As you already had the data, I asked you to try it with autobootstrap=false and see if that works for you. If you can bring it back without the replace_address option, good to go. Thanks, Anuj Sent from Yahoo Mail on Android On Tue, Jun 27, 2017 at 11:23 PM, Nitan Kainth <ni...@bamlabs.com> wrote: Anuj, We did that in the past; even when data was not removed, replace_node caused data streaming. So changing the IP is the simplest and safest option. On Jun 27, 2017, at 12:43 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: replace_address_first_boot
Re: Linux version update on DSE
Hi Nitan, I think it would be simpler to take one node down at a time and replace it: bring the new node up after the Linux upgrade, do the same Cassandra setup, use the replace_address option, and set autobootstrap=false (as the data is already there). No downtime, as it would be a rolling upgrade. No streaming, as the same tokens would be used. If you have a recent C*, use replace_address_first_boot. If that option is not available, use replace_address and make sure you remove it once the new node is up. Try it and let us know if it works for you. Thanks, Anuj On Tue, Jun 27, 2017 at 4:56 AM, Nitan Kainth wrote: Right, we are just upgrading Linux on AWS. C* will remain at the same version. On Jun 26, 2017, at 6:05 PM, Hannu Kröger wrote: I understood he is updating Linux, not C*. Hannu On 27 June 2017 at 02:04:34, Jonathan Haddad (j...@jonhaddad.com) wrote: It sounds like you're suggesting adding new nodes in to replace existing ones. You can't do that because it requires streaming between versions, which isn't supported. You need to take a node down, upgrade the C* version, then start it back up. Jon On Mon, Jun 26, 2017 at 3:56 PM Nitan Kainth wrote: It's vnodes. We will add to replace new ip in yaml as well. Thank you. Sent from my iPhone > On Jun 26, 2017, at 4:47 PM, Hannu Kröger wrote: > > Looks OK. Step 1.5 would be to stop Cassandra on the existing node, but apart from that it looks fine. Assuming you are using the same configs, and if you have hard-coded the token(s), you use the same. > > Hannu > >> On 26 Jun 2017, at 23.24, Nitan Kainth wrote: >> >> Hi, >> >> We are planning to update Linux for C* nodes, version 3.0. Does anybody have steps who did it in the recent past? >> >> Here are the draft steps we are thinking of: >> 1. Create new node. It might have a different IP address. >> 2. Detach mounts from existing node >> 3. Attach mounts to new node >> 4. Start C* - To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org For additional commands, e-mail: user-h...@cassandra.apache.org
Restore Snapshot
Hi, I am curious to know how people practically use snapshot restore, given that it may lead to inconsistent reads until a full repair is run on the node being restored (if you have dropped mutations in your cluster). Example:
9 am: snapshot taken on all 3 nodes
10 am: mutation dropped on node 3
11 am: snapshot restored on node 1
Now the data is only on node 2; if we are writing at QUORUM, we will observe inconsistent reads until we repair node 1. If you restore the snapshot with join_ring=false, repair the node, and then join it once the repair completes, the node will not serve inconsistent reads, but it will miss the writes made while it was being repaired: simply booting the node with join_ring=false also stops writes from being pushed to it (unlike bootstrap with join_ring=false, where writes are pushed to the node being bootstrapped). So you would need another full repair to bring the data of the node restored via snapshot in sync with the other nodes. It's hard to believe that a simple snapshot-restore scenario is still broken and people are not complaining. So I thought of asking the community members: how do you practically use snapshot restore while addressing the read-inconsistency issue? Thanks, Anuj
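The timeline above can be traced with a toy model (plain Python, no Cassandra involved; the node numbers, key, and values are purely illustrative):

```python
# Toy model of the snapshot-restore inconsistency described above.
# Three replicas, writes and reads at QUORUM (2 of 3).

SNAPSHOT = {"k": "v0"}  # state of every node at 9 am, when snapshots are taken

# 10 am: a QUORUM write of v1 lands on nodes 1 and 2; node 3 drops the mutation.
nodes = {1: {"k": "v1"}, 2: {"k": "v1"}, 3: dict(SNAPSHOT)}

# 11 am: node 1 is restored from the 9 am snapshot, losing v1.
nodes[1] = dict(SNAPSHOT)

def quorum_read(replicas, key):
    """Return the newest value seen on a quorum of replicas."""
    values = [nodes[r][key] for r in replicas]
    return max(values)  # "v1" > "v0" lexically; stands in for timestamp order

# A quorum read that happens to pick nodes 1 and 3 misses the write entirely:
assert quorum_read([1, 3], "k") == "v0"   # stale read
# Only node 2 still holds the 10 am write:
assert quorum_read([1, 2], "k") == "v1"
```

Until node 1 is repaired, whether a quorum read returns v1 depends on which two replicas are consulted, which is exactly the inconsistency described.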
Re: Hints files are not getting truncated
Hi Meg, max_hint_window_in_ms = 3 hrs means that if a node is down/unresponsive for more than 3 hours, hints will no longer be stored for it until it becomes responsive again. It does not mean that already stored hints are truncated after 3 hours. Regarding connection timeouts between DCs, please check your firewall settings and the TCP settings on the nodes. The firewall between the DCs must not kill an idle connection which Cassandra still considers usable. Please see http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/troubleshooting/trblshootIdleFirewall.html . In a multi-DC setup, the documentation recommends increasing the number of hint delivery threads. You can try increasing it and check whether it improves the situation. Thanks, Anuj On Tue, Jun 27, 2017 at 9:47 PM, Meg Mara wrote: Hello, I am facing an issue with hinted handoff files in Cassandra v3.0.10. A DC1 node is storing a large number of hints for DC2 nodes (we are facing connection timeout issues). The problem is that the hint files which are created on DC1 are not getting deleted after the 3-hour window. Hints are now being stored as flat files in the Cassandra home directory, and I can see that old hints are being deleted, but at a very slow pace. It still contains hints from May. max_hint_window_in_ms: 1080 max_hints_delivery_threads: 2 Why do you suppose this is happening? Any suggestions or recommendations would be much appreciated. Thanks for your time. Meg Mara
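As a sanity check on the units: max_hint_window_in_ms is expressed in milliseconds, so the 1080 shown in the quoted config (which may simply be a truncated paste) would be on the order of a second, not three hours:

```python
# max_hint_window_in_ms is in milliseconds; a 3-hour window is:
THREE_HOURS_MS = 3 * 60 * 60 * 1000
assert THREE_HOURS_MS == 10_800_000

# The value as quoted in the question:
configured = 1080
assert configured / 1000 == 1.08     # roughly one second, not three hours
assert configured < THREE_HOURS_MS
```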
Re: Re-adding Decommissioned node
Hi Mark, Please ensure that the node is not defined as a seed node in the yaml. Seed nodes don't bootstrap. Thanks, Anuj On Tue, Jun 27, 2017 at 9:56 PM, Mark Furlong wrote: I have a node that has been decommissioned and it showed 'UL'; the data volume and the commitlogs have been removed, and I now want to add that node back into my ring. When I add this node (bootstrap=true, start cassandra service), it comes back up in the ring as an existing node and shows as 'UN' instead of 'UJ'. Why is this? It has no data. Mark Furlong Sr. Database Administrator mfurl...@ancestry.com M: 801-859-7427 O: 801-705-7115 1300 W Traverse Pkwy Lehi, UT 84043
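A minimal sketch of the check Anuj suggests, assuming the stock SimpleSeedProvider layout in cassandra.yaml (the yaml snippet and addresses below are made up for illustration):

```python
# Minimal check: is this node's own address listed as a seed in cassandra.yaml?
# Seed nodes do not bootstrap, which would explain a node coming up UN
# instead of UJ. The snippet and addresses are illustrative.

yaml_snippet = """
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "10.0.0.1,10.0.0.2"
listen_address: 10.0.0.2
"""

def is_seed(yaml_text):
    """Naive string-based check; a real tool should parse the yaml properly."""
    seeds_line = next(l for l in yaml_text.splitlines() if "seeds:" in l)
    seeds = seeds_line.split('"')[1].split(",")
    listen = next(l for l in yaml_text.splitlines()
                  if l.startswith("listen_address"))
    address = listen.split(":")[1].strip()
    return address in seeds

assert is_seed(yaml_snippet)  # 10.0.0.2 is a seed -> it will not bootstrap
```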
Re: Partition range incremental repairs
Hi Chris, Can you share the following info:
1. The exact repair commands you use for incremental repair and pr repair.
2. Repair time should be measured at the cluster level for incremental repair. So, what's the total time it takes to run repair on all nodes for incremental vs pr repairs?
3. You are repairing one DC, DC3. How many DCs are there in total, and what's the RF for the keyspaces? Running pr on a specific DC would not repair the entire data.
4. 885 ranges? Where did you get this number, the logs? Can you share the number of ranges printed in the logs for both the incremental and pr cases?
Thanks, Anuj Sent from Yahoo Mail on Android On Tue, Jun 6, 2017 at 9:33 PM, Chris Stokesmore <chris.elsm...@demandlogic.co> wrote: Thank you for the excellent and clear description of the different versions of repair Anuj, that has cleared up what I expect to be happening. The problem now is that in our cluster we are running repairs with options (parallelism: parallel, primary range: false, incremental: true, job threads: 1, ColumnFamilies: [], dataCenters: [DC3], hosts: [], # of ranges: 885), and our repairs are taking over a day to complete, whereas previously, when running with the partition range option, they were taking more like 8-9 hours. As I understand it, using incremental repair should have sped this process up, as all three sets of data in each repair job should be marked as repaired; however this does not seem to be the case. Any ideas? Chris On 6 Jun 2017, at 16:08, Anuj Wadehra <anujw_2...@yahoo.co.in.INVALID> wrote: Hi Chris, Using pr with incremental repairs does not make sense. Primary range repair is an optimization over full repair. If you run a full repair on an n-node cluster with RF=3, you would be repairing each piece of data three times. E.g. in a 5-node cluster with RF=3, a range may exist on nodes A, B and C. When full repair is run on node A, the entire data in that range gets synced with the replicas on nodes B and C. 
Now, when you run full repair on nodes B and C, you are wasting resources on repairing data which is already repaired. Primary range repair ensures that when you run repair on a node, it ONLY repairs the data which is owned by that node. Thus, no node repairs data which is not owned by it and which must be repaired by another node; redundant work is eliminated. Even with pr, though, each time you run it on all nodes you repair 100% of the data. Why repair the complete data set in each cycle, even data which has not changed since the last repair cycle? This is where incremental repair comes in as an improvement. Once repaired, data is marked as repaired, so that the next repair cycle can focus on just the delta. Now, let's go back to the example of the 5-node cluster with RF=3. This time we run incremental repair on all nodes. When you repair the entire data on node A, all 3 replicas are marked as repaired. Even if you then run incremental repair on all ranges on the second node, you would not re-repair the already repaired data. Thus, there is no advantage in repairing only the data owned by the node (the node's primary range). You can run incremental repair on all the data present on a node, and Cassandra will make sure that when you repair data on the other nodes, you only repair unrepaired data. Thanks, Anuj Sent from Yahoo Mail on Android On Tue, Jun 6, 2017 at 4:27 PM, Chris Stokesmore <chris.elsm...@demandlogic.co> wrote: Hi all, Wondering if anyone had any thoughts on this? At the moment the long-running repairs cause us to be running them on two nodes at once for a bit of time, which obviously increases the cluster load. 
On 2017-05-25 16:18 (+0100), Chris Stokesmore <c...@demandlogic.co> wrote: > Hi, > > We are running a 7 node Cassandra 2.2.8 cluster, RF=3, and had been running repairs with the -pr option, via a cron job that runs on each node once per week. > > We changed that as some advice on the Cassandra IRC channel said it would cause more anticompaction, and http://docs.datastax.com/en/archived/cassandra/2.2/cassandra/tools/toolsRepair.html says 'Performing partitioner range repairs by using the -pr option is generally considered a good choice for doing manual repairs. However, this option cannot be used with incremental repairs (default for Cassandra 2.2 and later)' > > Only problem is our -pr repairs were taking about 8 hours, and now the non-pr repairs are taking 24+ - I guess this makes sense, repairing 1/7 of data increased to 3/7, except I was hoping to see a speed up after the first loop through the cluster, as each repair will be marking much more data as repaired, right? > > Is running -pr with incremental repairs really that bad?
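The counting argument in Anuj's reply can be made concrete with a toy ring (illustrative 5 nodes, RF=3; replica placement is simplified to "primary plus the next RF-1 nodes", which is only a sketch of real token ownership):

```python
# Count how often each range is repaired under full repair on every node
# versus primary-range (-pr) repair on every node. Toy ring: 5 nodes, RF=3.

NODES = ["A", "B", "C", "D", "E"]
RF = 3

def replicas(primary_index):
    """Range owned by node i is replicated on i and the next RF-1 nodes."""
    return [NODES[(primary_index + k) % len(NODES)] for k in range(RF)]

# Full repair on a node repairs every range that node holds a replica of.
full_repairs_per_range = {i: 0 for i in range(len(NODES))}
for node in NODES:
    for i in range(len(NODES)):
        if node in replicas(i):
            full_repairs_per_range[i] += 1

# -pr repair on a node repairs only the range it is primary for.
pr_repairs_per_range = {i: 1 for i in range(len(NODES))}

assert all(v == RF for v in full_repairs_per_range.values())  # each range 3x
assert all(v == 1 for v in pr_repairs_per_range.values())     # each range once
```

This is the redundancy that -pr eliminates for full repairs; incremental repair eliminates it differently, by skipping already-repaired sstables, which is why combining -pr with incremental repair buys nothing.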
Re: Weird error: InvalidQueryException: unconfigured table table2
Ensure that all the nodes are on the same schema version, so that the table2 schema is properly replicated to all the nodes. Thanks, Anuj Sent from Yahoo Mail on Android On Sat, Mar 25, 2017 at 3:19 AM, S G wrote: Hi, I have a keyspace with two tables. I run a different query for each table: Table 1: Select * from table1 where id = ? Table 2: Select * from table2 where id1 = ? and id = ? My code using the datastax driver fires the above two queries one after the other. While it never fails for table 1, it never succeeds for table 2, and gives an error: com.datastax.driver.core.exceptions.InvalidQueryException: unconfigured table table2 at com.datastax.driver.core.Responses$Error.asException(Responses.java:136) at com.datastax.driver.core.DefaultResultSetFuture.onSet(DefaultResultSetFuture.java:179) at com.datastax.driver.core.RequestHandler.setFinalResult(RequestHandler.java:177) at com.datastax.driver.core.RequestHandler.access$2500(RequestHandler.java:46) at com.datastax.driver.core.RequestHandler$SpeculativeExecution.setFinalResult(RequestHandler.java:799) at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onSet(RequestHandler.java:633) at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:1070) at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:993) at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:328) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:321) at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342) at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:328) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:321) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:328) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:321) at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293) at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:267) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:328) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:321) at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1280) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:328) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:890) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:564) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:505) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:419) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:391) Any idea what might be wrong? 
I have confirmed that all table names and column names are lowercase. Datastax Java driver versions tried: 3.1.2 and 3.1.4. Cassandra version: 3.10. Thanks, SG
Repair while upgradesstables is running
Hi, What is the implication of running incremental repair when all nodes have been upgraded to the new Cassandra rpm, but parallel upgradesstables is still running on one or more of the nodes? So the upgrade is:
1. Rolling upgrade of all nodes (rpm install).
2. Parallel upgradesstables on all nodes (no issues with IO, we can afford it).
3. Incremental repair while step 2 is still running??
Thanks, Anuj
Incremental Repair
Hi, Our setup is as follows: 2 DCs with N nodes, RF=DC1:3,DC2:3, hinted handoff = 3 hours, incremental repair scheduled once on every node (all DCs) within the gc grace period. I have the following queries regarding incremental repairs:
1. When a node is down for X hours (where X is greater than the hinted handoff window and less than gc grace time), I think incremental repair is sufficient rather than doing a full repair. Is this understanding correct?
2. DataStax recommends "Run incremental repair daily, run full repairs weekly to monthly". Does that mean that I have to run full repairs every week to month EVEN IF I do daily incremental repairs? If yes, what's the reasoning for running full repair when incremental repair is already run? Reference: https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsRepairNodesWhen.html
3. We run incremental repair at least once within gc grace, instead of the general recommendation that it be run daily. Do you see any problem with this approach? As per my understanding, if we run incremental repair less frequently, compaction between unrepaired and repaired data won't happen on a node until some node runs incremental repair on the unrepaired data range. Thus, there can be some impact on disk space and read performance, but immediate compaction is never guaranteed by Cassandra anyway. So I see minimal impact on performance, and that too just on reads of the delta data generated between repairs.
Thanks Anuj
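Point 3 reduces to a scheduling constraint: every range must be repaired at least once within gc_grace_seconds, including the time a repair run itself takes. A trivial check, assuming the default gc_grace_seconds of 864000 (10 days) and an illustrative schedule:

```python
GC_GRACE_SECONDS = 864_000            # Cassandra's default: 10 days
DAY = 86_400

repair_interval_days = 7              # illustrative schedule: weekly
longest_repair_run_days = 1           # illustrative worst-case run duration

# Data must be repaired before tombstones older than gc_grace are purged,
# so the interval plus the run duration must fit inside the grace window.
fits = (repair_interval_days + longest_repair_run_days) * DAY <= GC_GRACE_SECONDS
assert fits   # 7 + 1 days <= 10 days: the schedule is safe
```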
Re: Is it possible to recover a deleted-in-future record?
DISCLAIMER: This is only my personal opinion. Evaluate the situation carefully, and if you find the suggestions below useful, follow them at your own risk. If I have understood the problem correctly, malicious deletes would actually lead to deletion of data; I am not sure how everything is normal after the deletes. If the data is critical, you could: 1. Take a database snapshot immediately, so that you don't lose information if the delete entries in sstables are compacted together with the original data. 2. Transfer the snapshot to a suitable place and run a utility such as sstable2json to get the keys impacted by the deletes and the original data for those keys. The data has to be consolidated from all the nodes. 3. Devise a strategy to restore the deleted data. Thanks, Anuj On Tue, Mar 7, 2017 at 8:44 AM, Michael Fong wrote: Hi all, We recently encountered an issue in production where some records were mysteriously deleted with a timestamp 100+ years from now. Everything is normal as of now; how the deletion happened and the accuracy of the system timestamp at that moment are unknown. We were wondering if there is a general way to recover the mysteriously deleted data when the timestamp metadata is screwed up. Thanks in advance, Regards, Michael Fong
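Step 2 of the suggestion above amounts to scanning a dump for tombstones stamped far in the future. A toy sketch of that scan (the record layout below is a simplified stand-in, not an actual sstable dump format; keys and offsets are illustrative):

```python
import time

# Toy scan for deletes stamped far in the future.
NOW_US = int(time.time() * 1_000_000)            # microsecond timestamps
CENTURY_US = 100 * 365 * 86_400 * 1_000_000

records = [
    {"key": "user:1", "tombstone_ts": NOW_US + CENTURY_US},  # 100+ years ahead
    {"key": "user:2", "tombstone_ts": None},                 # live row
]

# Any tombstone with a timestamp later than "now" is suspicious: normal
# deletes carry (approximately) the wall-clock time they were issued.
suspicious = [r["key"] for r in records
              if r["tombstone_ts"] is not None and r["tombstone_ts"] > NOW_US]
assert suspicious == ["user:1"]
```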
Re: Read after Write inconsistent at times
Hi Charulata, Please share details on how the data is being inserted and read. Is the client which is reading the data the same as the one which inserted it? Is the read happening only when the insertion is successful? Are you using client timestamps? How did you verify that NTP is working properly? How is NTP configured in your cluster (a sample NTP conf)? Thanks, Anuj Sent from Yahoo Mail on Android On Sat, 25 Feb, 2017 at 2:02 AM, Charulata Sharma (charshar) wrote: Hi All, Thanks for your replies. I do not see an issue with NTP or with dropped messages. However, the tombstone count on the specific CF shows me this. This essentially indicates that there are as many tombstones as live cells in the CF, isn't it? Now, is that an issue, and can this cause inconsistent reads?
Average live cells per slice (last five minutes): 0.8631938498408056
Maximum live cells per slice (last five minutes): 1.0
Average tombstones per slice (last five minutes): 1.1477603751799115E-5
Maximum tombstones per slice (last five minutes): 1.0
Thanks, Charu From: Jonathan Haddad Reply-To: "user@cassandra.apache.org" Date: Friday, February 24, 2017 at 9:42 AM To: "user@cassandra.apache.org" Subject: Re: Read after Write inconsistent at times WRT NTP, I first encountered this issue on my first cluster. The problem with NTP isn't just if you're doing inserts; it's if you're doing inserts in combination with deletes, using server timestamps with a greater variance than the period between the delete and the insert. Basically, you end up with a delete in the future and an insert in the past, and the delete timestamp > insert timestamp. +1 to Jan's recommendation on checking for dropped messages. On Fri, Feb 24, 2017 at 9:35 AM Petrus Gomes wrote: Hi, Check the tombstone count; if it is too high, your query will be impacted. If tombstones are a problem, you can try reducing your gc_grace_seconds to lower the tombstone count (carefully, because you use cross-data-center replication). 
Tchau, Petrus Silva On Fri, Feb 24, 2017 at 12:07 AM, Jan Kesten wrote: Hi, are your nodes at high load? Are there any dropped messages (nodetool tpstats) on any node? Also have a look at your system clocks. C* needs them in tight sync - via ntp, for example. Side hint: if you use ntp, use the same set of upstreams on all of your nodes - ideally your own. Using pool.ntp.org might lead to small drifts in time across your cluster. Another thing that could help you out is using client-side timestamps: https://docs.datastax.com/en/developer/java-driver/3.1/manual/query_timestamps/ (of course only when you are using a single client, or all clients are in sync via ntp). Am 24.02.2017 um 07:29 schrieb Charulata Sharma (charshar): Hi All, In my application sometimes I cannot read data that just got inserted. This happens very intermittently. Both write and read use LOCAL_QUORUM. We have a cluster of 12 nodes which spans 2 data centers, and an RF of 3. Has anyone encountered this problem, and if yes, what steps have you taken to solve it? Thanks, Charu -- Jan Kesten, mailto:j.kes...@enercast.de Tel.: +49 561/4739664-0 FAX: -9 Mobil: +49 160 / 90 98 41 68 enercast GmbH Universitätsplatz 12 D-34127 Kassel HRB15471 http://www.enercast.de Online-Prognosen für erneuerbare Energien Geschäftsführung: Thomas Landgraf (CEO), Bernd Kratz (CTO), Philipp Rinder (CSO)
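Jon's delete-in-the-future scenario can be reproduced with last-write-wins reconciliation in isolation (plain Python; the timestamps model a coordinator clock that is 2 seconds fast and are purely illustrative):

```python
# Last-write-wins reconciliation, as Cassandra applies it per cell:
# the cell with the highest timestamp wins, regardless of arrival order.

def reconcile(cells):
    """cells: list of (timestamp_us, value); value None models a tombstone."""
    return max(cells, key=lambda c: c[0])[1]

# Coordinator A (clock 2 s fast) stamps a DELETE; coordinator B (clock
# correct) stamps the INSERT that happens 1 s of wall time LATER.
delete = (1_000_002_000_000, None)   # tombstone, stamped "in the future"
insert = (1_000_001_000_000, "v1")   # later insert, but lower timestamp

assert reconcile([delete, insert]) is None  # the newer insert is invisible
```

With client-side timestamps from a single (or NTP-synced) source, the insert would carry the higher timestamp and win as expected.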
Re: Cluster scaling
Hi Branislav, I quickly went through the code and noticed that you are updating the RF from code and expecting that Cassandra will automatically redistribute replicas as per the new RF. This is not how it works. After updating the RF, you need to run repair on all the nodes to make sure that the data replicas are as per the new RF. Please refer to https://docs.datastax.com/en/cql/3.1/cql/cql_using/update_ks_rf_t.html . This would give you reliable results. It would be good if you explained the exact purpose of your exercise; the tests seem to be more of academic interest. You are adding several variables to your tests, but each of these params has an entirely different purpose: 1. Batch/no batch depends on business atomicity needs. 2. Read/no read depends on business requirements. 3. RF depends on the fault tolerance needed. Thanks, Anuj On Wed, 8 Feb, 2017 at 9:09 PM, Branislav Janosik -T (bjanosik - AAP3 INC at Cisco) wrote: Hi all, I have a cluster of three nodes and would like to ask some questions about the performance. I wrote a small benchmarking tool in Java that mirrors (read, write) operations that we do in the real project. The problem is that it is not scaling like it should. The program runs two tests: one using a batch statement and one without the batch. The operation sequence is: optional select, insert, update, insert. I run the tool on my server with 128 threads (the # of threads has no influence on the performance), creating usually 100K resources for testing purposes. 
The average results (operations per second) with the use of a batch statement are:

Replication Factor = 1:
                 with reading   without reading
1-node cluster   37K            46K
2-node cluster   37K            47K
3-node cluster   39K            70K

Replication Factor = 2:
                 with reading   without reading
2-node cluster   21K            40K
3-node cluster   30K            48K

The average results (operations per second) without the use of a batch statement are:

Replication Factor = 1:
                 with reading   without reading
1-node cluster   31K            20K
2-node cluster   38K            39K
3-node cluster   45K            87K

Replication Factor = 2:
                 with reading   without reading
2-node cluster   19K            22K
3-node cluster   26K            36K

The Cassandra VM specs are: 16 CPUs, 16GB and two 32GB of RAM, at least 30GB of disk space for each node. Non-SSD; each VM is on a separate physical server. The code is available here: https://github.com/bjanosik/CassandraBenchTool.git . It can be built with Maven, and then you can use the jar in the target directory with java -jar target/cassandra-test-1.0-SNAPSHOT-jar-with-dependencies.jar. Thank you for any help.
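The need to repair after changing RF can be shown with a toy placement model (plain Python; the node names, the hash-based placement, and the key are illustrative, not Cassandra's actual replica placement):

```python
# Toy model: raising RF changes where replicas SHOULD live, but existing
# data does not move until repair streams it to the new replicas.

NODES = ["n1", "n2", "n3"]

def owners(key, rf):
    """Replicas for a key: the 'primary' node plus the next rf-1 nodes."""
    start = hash(key) % len(NODES)
    return {NODES[(start + i) % len(NODES)] for i in range(rf)}

key = "some-partition"
data_on = owners(key, rf=1)      # where the data actually is (written at RF=1)

new_owners = owners(key, rf=2)   # placement after ALTER KEYSPACE ... RF=2
missing = new_owners - data_on   # replicas that own the key but lack the data

assert len(missing) == 1         # exactly one new replica has no data yet
# Only after running repair on the nodes is the gap closed:
data_on |= missing
assert data_on == new_owners
```

Until that repair runs, reads routed to the new replica can miss data, which makes benchmark results after an in-code RF change unreliable.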
Re: Disc size for cluster
Adding to what Benjamin said: it is hard to estimate disk space if you are using STCS for a table where rows are updated frequently, leading to a lot of fragmentation. STCS may also lead to scenarios where tombstones are not evicted for a long time. You may go live and everything goes well for months; then gradually you realize that large sstables are holding on to tombstones because they are not getting compacted. It is not easy to test disk space requirements with precision upfront unless you test your system with real data patterns for some time. Your life can be much easier if you take care of the following points with STCS:
1. If you can afford some extra IO, go for a slightly more aggressive STCS strategy using one or more of the following settings: min_threshold=2, bucket_high=2, unchecked_tombstone_compaction=true. Which of these to use depends on your use case; study these settings.
2. Estimate the free disk required for compactions at any point in time. For example, suppose you have 5 tables with 3 TB of data in total, and you estimate that the data distribution will be as follows: A: 800 GB, B: 700 GB, C: 600 GB, D: 500 GB, E: 400 GB. If you have concurrent_compactors=3 and 90% of the data of your three largest tables is getting compacted simultaneously, you will need 90/100 * (800+700+600) GB ≈ 1.9 TB of free disk space. So you won't need 6 TB of disk for 3 TB of data; 4.9 TB would do.
3. Take a 10-15% buffer for future schema changes and calculation errors. Better safe than sorry :)
Thanks Anuj On Thu, 26 Jan, 2017 at 2:41 PM, Benjamin Roth wrote: Hi! This is basically right, but: 1. How do you know the 3TB storage will be 3TB in Cassandra? This depends on how the data is serialized and compressed, how often it changes, and on your compaction settings. 2. 50% free space with STCS is only required if you do a full compaction of a single CF that takes all the space. Normally you need as much free space as the target SSTable of a compaction will take. 
If you split your data across more CFs, it's unlikely you will really hit this value... probably you should do some tests. But in the end it is always good to have some headroom. I personally would scale out if free space is < 30%, but that always depends on your model. 2017-01-26 9:56 GMT+01:00 Raphael Vogel: Hi, just want to validate my estimation for a C* cluster which should have around 3 TB of usable storage. Assuming an RF of 3 and the SizeTiered compaction strategy, is it correct that SizeTiered compaction needs (in the worst case) 50% free disc space during compaction? So this would then result in a cluster of 3TB x 3 x 2 == 18 TB of raw storage? Thanks and Regards, Raphael Vogel -- Benjamin Roth Prokurist Jaumo GmbH · www.jaumo.com Wehrstraße 46 · 73035 Göppingen · Germany Phone +49 7161 304880-6 · Fax +49 7161 304880-1 AG Ulm · HRB 731058 · Managing Director: Jens Kammerer
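Anuj's worked estimate in this thread can be written out as a quick calculation (figures copied from his example; this is an estimate under his stated assumptions, not a general sizing rule):

```python
# Free-space headroom for concurrent STCS compactions (per the example above).
table_sizes_gb = {"A": 800, "B": 700, "C": 600, "D": 500, "E": 400}
concurrent_compactors = 3
compacting_fraction = 0.90   # assume 90% of each table compacts at once

# Worst case: the 3 largest tables compact simultaneously.
largest = sorted(table_sizes_gb.values(), reverse=True)[:concurrent_compactors]
headroom_gb = compacting_fraction * sum(largest)
assert round(headroom_gb) == 1890            # ~1.9 TB, matching the estimate

total_data_tb = 3
disk_needed_tb = total_data_tb + headroom_gb / 1000
assert round(disk_needed_tb, 1) == 4.9       # vs. 6 TB under a naive 2x rule
```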
Re: Join_ring=false Use Cases
Thanks all! I think the intent of the JIRA https://issues.apache.org/jira/browse/CASSANDRA-6961 was primarily to deal with stale information after outages and to give an opportunity to repair the data before a node joins the cluster. If a node started with join_ring=false doesn't accept writes while the repair is happening, the purpose of the JIRA is defeated, as it will anyway lead to stale information. Seems to be a defect. Thanks, Anuj On Wednesday, 21 December 2016 2:53 AM, kurt Greaves <k...@instaclustr.com> wrote: It seems that you're correct in saying that writes don't propagate to a node that has join_ring set to false, so I'd say this is a flaw. In reality I can't see many actual use cases in regards to node outages with the current implementation. The main usage, I'd think, would be to have additional coordinators for CPU-heavy workloads. To make it actually useful for repairs/outages, we'd need another option to turn on writes, so that it behaved similarly to write survey mode (but on already-bootstrapped nodes). Is there a reason we don't have this already? Or does it exist somewhere I'm not aware of? On 20 December 2016 at 17:40, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: No responses yet :) Any C* expert who could help on the join_ring use case and the concern raised? Thanks Anuj On Tue, 13 Dec, 2016 at 11:31 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Hi, I need to understand the use case of join_ring=false in case of node outages. As per https://issues.apache.org/jira/browse/CASSANDRA-6961, you would want join_ring=false when you have to repair a node before bringing it back after some considerable outage. The problem I see with join_ring=false is that, unlike autobootstrap, the node will NOT accept writes while you are running repair on it. 
If a node was down for 5 hours and you bring it back with join_ring=false, repair the node for 7 hours, and then make it join the ring, it will STILL have missed writes, because while repair was running (7 hrs) writes only went to the other nodes. So, if you want to make sure that reads served by the restored node at CL ONE return consistent data after the node has joined, you won't get that, as writes have been missed while the node was being repaired. And if you work with Read/Write CL=QUORUM, even if you bring back the node without join_ring=false, you would get the desired consistency anyway. So, how would join_ring provide any additional consistency in this case? I can see join_ring=false being useful only when I am restoring from a snapshot or bootstrapping and there are dropped mutations in my cluster which are not fixed by hinted handoff. For example: 3 nodes A, B, C working at Read/Write CL QUORUM; hinted handoff window = 3 hrs.
10 AM: Snapshot taken on all 3 nodes.
11 AM: Node B goes down for 4 hours.
3 PM: Node B comes up but data is not repaired. So, 1 hr of dropped mutations (2-3 PM) is not fixed via hinted handoff.
5 PM: Node A crashes.
6 PM: Node A restored from the 10 AM snapshot, started with join_ring=false, repaired, and then joined to the cluster.
In the above restore-from-snapshot example, updates from 2-3 PM were outside the 3-hour hinted handoff window, so node B won't get those updates. Node A's data for 2-3 PM is already lost. So, the 2-3 PM updates are only on one replica, i.e. node C, and the minimum consistency needed is QUORUM, so join_ring=false would help. But this is a very specific use case. Thanks Anuj
Re: Join_ring=false Use Cases
No responses yet :) Any C* expert who could help on the join_ring use case and the concern raised? Thanks Anuj On Tue, 13 Dec, 2016 at 11:31 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Hi, I need to understand the use case of join_ring=false in case of node outages. As per https://issues.apache.org/jira/browse/CASSANDRA-6961, you would want join_ring=false when you have to repair a node before bringing it back after a considerable outage. The problem I see with join_ring=false is that, unlike auto-bootstrap, the node will NOT accept writes while you are running repair on it. If a node was down for 5 hours and you bring it back with join_ring=false, repair the node for 7 hours, and then make it join the ring, it will STILL have missed writes, because while repair was running (7 hrs) writes only went to the other nodes. So, if you want to make sure that reads served by the restored node at CL ONE return consistent data after the node has joined, you won't get that, as writes have been missed while the node was being repaired. And if you work with Read/Write CL=QUORUM, even if you bring back the node without join_ring=false, you would get the desired consistency anyway. So, how would join_ring provide any additional consistency in this case? I can see join_ring=false being useful only when I am restoring from a snapshot or bootstrapping and there are dropped mutations in my cluster which are not fixed by hinted handoff. For example: 3 nodes A, B, C working at Read/Write CL QUORUM; hinted handoff window = 3 hrs.
10 AM: Snapshot taken on all 3 nodes.
11 AM: Node B goes down for 4 hours.
3 PM: Node B comes up but data is not repaired. So, 1 hr of dropped mutations (2-3 PM) is not fixed via hinted handoff.
5 PM: Node A crashes.
6 PM: Node A restored from the 10 AM snapshot, started with join_ring=false, repaired, and then joined to the cluster.
In the above restore-from-snapshot example, updates from 2-3 PM were outside the 3-hour hinted handoff window, so node B won't get those updates. Node A's data for 2-3 PM is already lost. So, the 2-3 PM updates are only on one replica, i.e. node C, and the minimum consistency needed is QUORUM, so join_ring=false would help. But this is a very specific use case. Thanks Anuj
Re: Join_ring=false Use Cases
Can anyone help me with join_ring and address my concerns? Thanks Anuj On Tue, 13 Dec, 2016 at 11:31 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Hi, I need to understand the use case of join_ring=false in case of node outages. As per https://issues.apache.org/jira/browse/CASSANDRA-6961, you would want join_ring=false when you have to repair a node before bringing it back after a considerable outage. The problem I see with join_ring=false is that, unlike auto-bootstrap, the node will NOT accept writes while you are running repair on it. If a node was down for 5 hours and you bring it back with join_ring=false, repair the node for 7 hours, and then make it join the ring, it will STILL have missed writes, because while repair was running (7 hrs) writes only went to the other nodes. So, if you want to make sure that reads served by the restored node at CL ONE return consistent data after the node has joined, you won't get that, as writes have been missed while the node was being repaired. And if you work with Read/Write CL=QUORUM, even if you bring back the node without join_ring=false, you would get the desired consistency anyway. So, how would join_ring provide any additional consistency in this case? I can see join_ring=false being useful only when I am restoring from a snapshot or bootstrapping and there are dropped mutations in my cluster which are not fixed by hinted handoff. For example: 3 nodes A, B, C working at Read/Write CL QUORUM; hinted handoff window = 3 hrs.
10 AM: Snapshot taken on all 3 nodes.
11 AM: Node B goes down for 4 hours.
3 PM: Node B comes up but data is not repaired. So, 1 hr of dropped mutations (2-3 PM) is not fixed via hinted handoff.
5 PM: Node A crashes.
6 PM: Node A restored from the 10 AM snapshot, started with join_ring=false, repaired, and then joined to the cluster.
In the above restore-from-snapshot example, updates from 2-3 PM were outside the 3-hour hinted handoff window, so node B won't get those updates. Node A's data for 2-3 PM is already lost. So, the 2-3 PM updates are only on one replica, i.e. node C, and the minimum consistency needed is QUORUM, so join_ring=false would help. But this is a very specific use case. Thanks Anuj
Re: Configure NTP for Cassandra
Thanks Martin. Agreed, setting up our own internal servers will help save some firewall traffic, simplify security management, and reduce load on public servers, which is the ethical thing to do. As the blog recommended setting up your own internal servers for Cassandra, I wanted to make sure that there are no Cassandra-specific benefits, e.g. better relative time synchronization achieved with an internal setup. So, I would conclude it this way: even though it's not good practice to access external NTP servers directly from Cassandra nodes, Cassandra can still achieve tight relative time synchronization using reliable external servers. There is no mandate to set up your own pool of internal NTP servers for BETTER time synchronization. Thanks for your inputs. Anuj On Wed, 14 Dec, 2016 at 3:22 AM, Martin Schröder <mar...@oneiros.de> wrote: 2016-11-26 20:20 GMT+01:00 Anuj Wadehra <anujw_2...@yahoo.co.in>: > 1. If my ISP is providing me a pool of reliable NTP servers, should > I set up my own internal servers anyway or can I sync Cassandra nodes > directly to the ISP-provided servers and select one of the servers as > preferred for relative clock synchronization? Set up three NTP servers which use the provider servers _and_ pool servers, and sync your other machines from these servers (and maybe get GPS receivers for your NTP servers). This reduces NTP traffic at your firewall (your servers act as proxies) and reduces load on public servers. > 2. As per my understanding, peer association is ONLY for a backup scenario. > If a peer loses its time synchronization source, then other peers can be used > for time synchronization, thus providing an HA service. But when everything > is OK (happy path), does defining NTP servers synced from different sources > as peers lead them to converge time as mentioned in some forums? Maybe; but the difference will be negligible (sub-millisecond). I wouldn't worry about that. Best Martin
Re: Configure NTP for Cassandra
Thanks for the NTP link. Most of us are Cassandra users and must be using NTP (or other time synchronization methods) to ensure relative time synchronization in our Cassandra clusters. I hope there are people on the mailing list who can answer these questions with respect to Cassandra. There is just one detailed blog on NTP best practices for Cassandra, and I think answering these questions is important rather than just creating an internal NTP pool with the recommended settings. Thanks Anuj On Wed, 14 Dec, 2016 at 12:07 AM, Jim Witschey <jim.witsc...@datastax.com> wrote: You might find more NTP experts on the NTP questions mailing list: http://lists.ntp.org/listinfo/questions On Tue, Dec 13, 2016 at 1:25 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: > Any NTP experts willing to take up these questions? > > Thanks > Anuj > > On Sun, 27 Nov, 2016 at 12:52 AM, Anuj Wadehra > <anujw_2...@yahoo.co.in> wrote: > Hi, > > One popular NTP setup recommended for Cassandra users is described at > https://blog.logentries.com/2014/03/synchronizing-clocks-in-a-cassandra-cluster-pt-2-solutions/ > . > > Summary of the article: > The setup recommends a dedicated pool of internal NTP servers which are > associated as peers to provide an HA NTP service. Cassandra nodes sync to > this dedicated pool but define one internal NTP server as the preferred server > to ensure relative clock synchronization. The internal NTP servers sync to > external NTP servers. > > My questions: > > 1. If my ISP is providing me a pool of reliable NTP servers, should > I set up my own internal servers anyway or can I sync Cassandra nodes > directly to the ISP-provided servers and select one of the servers as > preferred for relative clock synchronization? > > > I agree. If you have to rely on the public NTP pool, which selects random servers > for sync, having an internal NTP server pool is justified for getting tight > relative sync as described in the blog. > > 2.
As per my understanding, peer association is ONLY for a backup scenario. > If a peer loses its time synchronization source, then other peers can be used > for time synchronization, thus providing an HA service. But when everything > is OK (happy path), does defining NTP servers synced from different sources > as peers lead them to converge time as mentioned in some forums? > > e.g. if A and B are peers and their times are 9:00:00 and 9:00:10 after > syncing with their respective time sources, will they converge their clocks > to 9:00:05? > > I doubt the above claim regarding time convergence. Also, no formal doc says > that. Comments? > > > Thanks > Anuj >
Re: Configure NTP for Cassandra
Any NTP experts willing to take up these questions? Thanks Anuj On Sun, 27 Nov, 2016 at 12:52 AM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Hi, One popular NTP setup recommended for Cassandra users is described at https://blog.logentries.com/2014/03/synchronizing-clocks-in-a-cassandra-cluster-pt-2-solutions/. Summary of the article: the setup recommends a dedicated pool of internal NTP servers which are associated as peers to provide an HA NTP service. Cassandra nodes sync to this dedicated pool but define one internal NTP server as the preferred server to ensure relative clock synchronization. The internal NTP servers sync to external NTP servers. My questions:
1. If my ISP is providing me a pool of reliable NTP servers, should I set up my own internal servers anyway, or can I sync Cassandra nodes directly to the ISP-provided servers and select one of them as preferred for relative clock synchronization? I agree that if you have to rely on the public NTP pool, which selects random servers for sync, having an internal NTP server pool is justified for getting tight relative sync as described in the blog.
2. As per my understanding, peer association is ONLY for a backup scenario: if a peer loses its time synchronization source, then other peers can be used for time synchronization, thus providing an HA service. But when everything is OK (happy path), does defining NTP servers synced from different sources as peers lead them to converge time, as mentioned in some forums? E.g. if A and B are peers and their times are 9:00:00 and 9:00:10 after syncing with their respective time sources, will they converge their clocks to 9:00:05? I doubt the above claim regarding time convergence. Also, no formal doc says that. Comments? Thanks Anuj
Join_ring=false Use Cases
Hi, I need to understand the use case of join_ring=false in case of node outages. As per https://issues.apache.org/jira/browse/CASSANDRA-6961, you would want join_ring=false when you have to repair a node before bringing it back after a considerable outage. The problem I see with join_ring=false is that, unlike auto-bootstrap, the node will NOT accept writes while you are running repair on it. If a node was down for 5 hours and you bring it back with join_ring=false, repair the node for 7 hours, and then make it join the ring, it will STILL have missed writes, because while repair was running (7 hrs) writes only went to the other nodes. So, if you want to make sure that reads served by the restored node at CL ONE return consistent data after the node has joined, you won't get that, as writes have been missed while the node was being repaired. And if you work with Read/Write CL=QUORUM, even if you bring back the node without join_ring=false, you would get the desired consistency anyway. So, how would join_ring provide any additional consistency in this case? I can see join_ring=false being useful only when I am restoring from a snapshot or bootstrapping and there are dropped mutations in my cluster which are not fixed by hinted handoff. For example: 3 nodes A, B, C working at Read/Write CL QUORUM; hinted handoff window = 3 hrs.
10 AM: Snapshot taken on all 3 nodes.
11 AM: Node B goes down for 4 hours.
3 PM: Node B comes up but data is not repaired. So, 1 hr of dropped mutations (2-3 PM) is not fixed via hinted handoff.
5 PM: Node A crashes.
6 PM: Node A restored from the 10 AM snapshot, started with join_ring=false, repaired, and then joined to the cluster.
In the above restore-from-snapshot example, updates from 2-3 PM were outside the 3-hour hinted handoff window, so node B won't get those updates. Node A's data for 2-3 PM is already lost. So, the 2-3 PM updates are only on one replica, i.e. node C, and the minimum consistency needed is QUORUM, so join_ring=false would help. But this is a very specific use case. Thanks Anuj
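The timeline reasoning above can be sketched with a small helper (hypothetical illustration, not Cassandra code): a hint is only stored for a mutation if the target node has been down for less than the hint window when the mutation arrives.

```python
from datetime import datetime, timedelta

HINT_WINDOW = timedelta(hours=3)  # analogous to max_hint_window_in_ms

def mutation_hinted(down_at, mutation_at, window=HINT_WINDOW):
    """True if a hint would be stored for the downed replica.

    Hints stop being written once the node has been down longer
    than the hint window.
    """
    return mutation_at - down_at <= window

down = datetime(2016, 12, 13, 11, 0)           # Node B goes down at 11 AM
print(mutation_hinted(down, datetime(2016, 12, 13, 13, 30)))  # 1:30 PM -> True
print(mutation_hinted(down, datetime(2016, 12, 13, 14, 30)))  # 2:30 PM -> False
```

This reproduces the example: mutations from 2-3 PM fall outside the 3-hour window, so node B never receives them via hints.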
Re: Single cluster node restore
Hi Petr, If data corruption means accidental data deletion via Cassandra commands, you have to restore the entire cluster with the latest snapshots. This may lead to data loss, as there may be valid updates after the snapshot was taken but before the data deletion. Restoring a single node with a snapshot won't help, as Cassandra replicated the accidental deletes to all nodes. If data corruption means accidental deletion of some sstable files from the file system of a node, repair would fix it. If data corruption means unreadable data due to hardware issues etc., you will have two options after replacing the disk: bootstrap, or restore a snapshot on the single affected node. If you have huge data per node, e.g. 300 GB, you may want to restore from a snapshot followed by repair. Restoring a snapshot on a single node is faster than streaming all data via bootstrap. If the node is not recoverable and must be replaced, you should be able to auto-bootstrap or restore from a snapshot with auto-bootstrap set to false. I haven't replaced a dead node with a snapshot, but there should not be any issues, as token ranges don't change when you replace a node. Thanks Anuj On Tue, 29 Nov, 2016 at 11:08 PM, Petr Malik wrote: Hi. I have a question about Cassandra backup-restore strategies. As far as I understand, Cassandra has been designed to survive hardware failures by relying on data replication. It seems like people still want backup/restore for the case when somebody accidentally deletes data or the data gets otherwise corrupted. In that case, restoring all keyspace/table snapshots on all nodes should bring it back. I am asking because I often read directions on restoring a single node in a cluster. I am just wondering under what circumstances this could be done safely. Please correct me if I am wrong, but restoring just a single node does not really roll back the data, as the newer (corrupt) data will be served by other replicas and eventually propagated to the restored node. Right?
In fact, by doing so one may end up reintroducing deleted data... Also, since Cassandra distributes the data throughout the cluster, it is not clear on which node any particular (corrupt) data resides, and hence which node to restore. I guess this is a long way of asking whether there is an advantage to trying to restore just a single node in a Cassandra cluster, as opposed to, say, replacing the dead node and letting Cassandra handle the replication. Thanks.
Configure NTP for Cassandra
Hi, One popular NTP setup recommended for Cassandra users is described at https://blog.logentries.com/2014/03/synchronizing-clocks-in-a-cassandra-cluster-pt-2-solutions/. Summary of the article: the setup recommends a dedicated pool of internal NTP servers which are associated as peers to provide an HA NTP service. Cassandra nodes sync to this dedicated pool but define one internal NTP server as the preferred server to ensure relative clock synchronization. The internal NTP servers sync to external NTP servers. My questions:
1. If my ISP is providing me a pool of reliable NTP servers, should I set up my own internal servers anyway, or can I sync Cassandra nodes directly to the ISP-provided servers and select one of them as preferred for relative clock synchronization? I agree that if you have to rely on the public NTP pool, which selects random servers for sync, having an internal NTP server pool is justified for getting tight relative sync as described in the blog.
2. As per my understanding, peer association is ONLY for a backup scenario: if a peer loses its time synchronization source, then other peers can be used for time synchronization, thus providing an HA service. But when everything is OK (happy path), does defining NTP servers synced from different sources as peers lead them to converge time, as mentioned in some forums? E.g. if A and B are peers and their times are 9:00:00 and 9:00:10 after syncing with their respective time sources, will they converge their clocks to 9:00:05? I doubt the above claim regarding time convergence. Also, no formal doc says that. Comments? Thanks Anuj
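The setup described in the blog can be sketched roughly with ntp.conf fragments like the following (illustrative only; hostnames and pool entries are placeholders, and the exact settings should come from the blog and the NTP documentation):

```
# On each internal NTP server: sync upstream, peer with the other internal servers
server 0.pool.ntp.org iburst
peer   ntp1.internal.example
peer   ntp2.internal.example

# On each Cassandra node: sync only to the internal pool, one server marked preferred
server ntp1.internal.example iburst prefer
server ntp2.internal.example iburst
server ntp3.internal.example iburst
```

The `prefer` keyword is what gives the nodes a common reference for tight relative synchronization, while the extra `server` lines provide failover.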
Re: Some questions to updating and tombstone
Hi Boying, I agree with Vladimir. If compaction does not compact the two sstables containing the updates soon, disk space will be wasted. For example, if the updates are not close in time, the first update might be in a big sstable by the time the second update is being written to a new small sstable; STCS won't compact them together soon. Just writing column values with a new timestamp shouldn't create any tombstones, but if data is not merged for long, disk space issues may arise. If you are on STCS, just to get an idea of the extent of the problem you can run a major compaction and see the amount of disk space reclaimed (don't do this in production, as major compaction has its own side effects). Which compaction strategy are you using? Are these updates done with TTL? Thanks Anuj On Mon, 14 Nov, 2016 at 1:54 PM, Vladimir Yudovin wrote: Hi Boying, UPDATE writes a new value with a new timestamp. The old value is not a tombstone, but remains until compaction. gc_grace_period is not related to this. Best regards, Vladimir Yudovin, Winguzone - Hosted Cloud Cassandra. Launch your cluster in minutes. On Mon, 14 Nov 2016 03:02:21 -0500 Lu, Boying wrote: Hi, All, Will Cassandra generate a new tombstone when updating a column by using a CQL update statement? And is there any way to get the number of tombstones of a column family, since we want to avoid generating too many tombstones within gc_grace_period? Thanks Boying
Re: Handle Leap Seconds with Cassandra
Thanks Ben for taking out time for the detailed reply!! We don't need strict ordering for all operations, but we are looking at scenarios where 2 quick updates to the same column of the same row are possible. By quick updates, I mean >300 ms. Configuring NTP properly (as mentioned in some blogs in your link) should give fair relative accuracy between the Cassandra nodes. But a leap second takes the clock back an ENTIRE second (huge), and the probability of an old write overwriting the new one increases drastically. So, we want to be proactive about things. I agree that you should avoid such scenarios with design (if possible). Good to know that you guys have set up your own NTP servers as per the recommendation. Curious: do you also do some monitoring around NTP? Thanks Anuj On Fri, 28 Oct, 2016 at 12:25 AM, Ben Bromhead <b...@instaclustr.com> wrote: If you need guaranteed strict ordering in a distributed system, I would not use Cassandra; Cassandra does not provide this out of the box. I would look to a system that uses Lamport or vector clocks. Based on your description of how your system runs at the moment (and how close your updates are together), you have either already experienced out-of-order updates or there is a real possibility you will in the future. Sorry to be so dire, but if you do require causal consistency / strict ordering, you are not getting it at the moment. Distributed systems theory is really tricky, even for people that are "experts" on distributed systems over unreliable networks (I would certainly not put myself in that category). People have made a very good name for themselves by showing that the vast majority of distributed databases have had bugs when it comes to their various consistency models and the claims these databases make. So make sure you really do need guaranteed causal consistency/strict ordering, or whether you can design around it (e.g. using conflict-free replicated data types) or choose a system that is designed to provide it.
Having said that... here are some hacky things you could do in Cassandra to try and get this behaviour, which I in no way endorse doing :) - Cassandra counters do leverage a logical clock per shard, and you could hack something together with counters and lightweight transactions, but you would want to do your homework on counter accuracy before diving into it, as I don't know if the implementation is safe in the context of your question. Also, this would probably require a significant rework of your application plus a significant performance hit. I would invite a counter guru to jump in here... - You can leverage the fact that timestamps are monotonic if you isolate writes to a single node for a single shard... but you then lose Cassandra's availability guarantees; e.g. a keyspace with an RF of 1 and a CL of > ONE will get monotonic timestamps (if generated on the server side). - Continuing down the path of isolating writes to a single node for a given shard, you could also isolate writes to the primary replica using your client driver during the leap second (make it a minute either side of the leap), but again you lose out on availability, and you are probably already experiencing out-of-order writes given how close your writes and updates are. A note on NTP: NTP is generally fine if you use it to keep the clocks synced between the Cassandra nodes. If you are interested in how we have implemented NTP at Instaclustr, see our blog post on it: https://www.instaclustr.com/blog/2015/11/05/apache-cassandra-synchronization/. Ben On Thu, 27 Oct 2016 at 10:18 Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Hi Ben, Thanks for your reply. We don't use timestamps in the primary key. We rely on server-side timestamps generated by the coordinator. So, no functions at the client side would help. Yes, drifts can create problems too. But even if you ensure that nodes are perfectly synced with NTP, you will surely mess up the order of updates during the leap second (interleaving).
Some applications update the same column of the same row quickly (within a second), and reversing the order would corrupt the data. I am interested in learning how people relying on a strict order of updates handle the leap second scenario when the clock goes back one second (the same second is repeated). What kind of tricks do people use to ensure that server-side timestamps are monotonic? As per my understanding, NTP slew mode may not be suitable for Cassandra, as it may cause unpredictable drift amongst the Cassandra nodes. Ideas? Thanks Anuj Sent from Yahoo Mail on Android On Thu, 20 Oct, 2016 at 11:25 PM, Ben Bromhead <b...@instaclustr.com> wrote: http://www.datastax.com/dev/blog/preparing-for-the-leap-second gives a pretty good overview. If you are using a timestamp as part of your primary key, this is the situation where you could end up overwriting data. I would suggest using t
Re: Handle Leap Seconds with Cassandra
Hi Ben, Thanks for your reply. We don't use timestamps in the primary key. We rely on server-side timestamps generated by the coordinator. So, no functions at the client side would help. Yes, drifts can create problems too. But even if you ensure that nodes are perfectly synced with NTP, you will surely mess up the order of updates during the leap second (interleaving). Some applications update the same column of the same row quickly (within a second), and reversing the order would corrupt the data. I am interested in learning how people relying on a strict order of updates handle the leap second scenario when the clock goes back one second (the same second is repeated). What kind of tricks do people use to ensure that server-side timestamps are monotonic? As per my understanding, NTP slew mode may not be suitable for Cassandra, as it may cause unpredictable drift amongst the Cassandra nodes. Ideas? Thanks Anuj Sent from Yahoo Mail on Android On Thu, 20 Oct, 2016 at 11:25 PM, Ben Bromhead <b...@instaclustr.com> wrote: http://www.datastax.com/dev/blog/preparing-for-the-leap-second gives a pretty good overview. If you are using a timestamp as part of your primary key, this is the situation where you could end up overwriting data. I would suggest using timeuuid instead, which will ensure that you get different primary keys even for data inserted at the exact same timestamp. The blog post also suggests using certain monotonic timestamp classes in Java; however, these will not help you if you have multiple clients that may overwrite data. As for the interleaving or out-of-order problem, this is hard to address in Cassandra without resorting to external coordination or LWTs. If you are relying on a wall clock to guarantee order in a distributed system, you will get yourself into trouble even without leap seconds (clock drift, NTP inaccuracy etc). On Thu, 20 Oct 2016 at 10:30 Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Hi, I would like to know how you guys handle leap seconds with Cassandra.
I am not bothered about the livelock issue, as we are using appropriate versions of Linux and Java. I am more interested in finding an optimum answer to the following question: how do you handle wrong ordering of multiple writes (on the same row and column) during the leap second? You may overwrite the new value with the old one (disaster). And downtime is no option :) I can see that CASSANDRA-9131 is still open. FYI, we are on 2.0.14. Thanks Anuj -- Ben Bromhead, CTO | Instaclustr, +1 650 284 9692, Managed Cassandra / Spark on AWS, Azure and Softlayer
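One of the tricks asked about above (monotonic timestamp generation, which some drivers later adopted client-side) can be sketched as follows. This is a simplified illustration, not actual driver or Cassandra code:

```python
import threading
import time

class MonotonicTimestampGenerator:
    """Microsecond timestamps that never go backwards, even if the wall
    clock does (e.g. when a leap second repeats a second)."""

    def __init__(self):
        self._last = 0
        self._lock = threading.Lock()

    def next(self):
        with self._lock:
            now = int(time.time() * 1_000_000)
            # If the clock went backwards (or stood still), bump by 1 us
            self._last = max(now, self._last + 1)
            return self._last

gen = MonotonicTimestampGenerator()
a, b, c = gen.next(), gen.next(), gen.next()
assert a < b < c  # strictly increasing even within the same microsecond
```

The trade-off: during the repeated second the generator drifts ahead of the wall clock by at most the number of writes issued, but write ordering is preserved on that one source. It does not help when multiple independent clients write to the same cell.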
Handle Leap Seconds with Cassandra
Hi, I would like to know how you guys handle leap seconds with Cassandra. I am not bothered about the livelock issue, as we are using appropriate versions of Linux and Java. I am more interested in finding an optimum answer to the following question: how do you handle wrong ordering of multiple writes (on the same row and column) during the leap second? You may overwrite the new value with the old one (disaster). And downtime is no option :) I can see that CASSANDRA-9131 is still open. FYI, we are on 2.0.14. Thanks Anuj
Re: Cassandra installation best practices
Hi Mehdi, You can refer to https://docs.datastax.com/en/landing_page/doc/landing_page/recommendedSettings.html. Thanks Anuj On Mon, 17 Oct, 2016 at 10:20 PM, Mehdi Bada wrote: Hi all, Do any best practices exist for installing Cassandra in a production environment? Some standard to follow? For instance, the file system type etc.
Re: Repair in Multi Datacenter - Should you use -dc Datacenter repair or repair with -pr
Hi Leena, Do you have a firewall between the two DCs? If yes, connection resets can be caused by Cassandra trying to use a TCP connection which has already been closed by the firewall. Please make sure that you set a high connection timeout at the firewall. Also, make sure your servers are not overloaded. Please see https://developer.ibm.com/answers/questions/231996/why-do-we-get-the-error-connection-reset-by-peer-d.html for general causes of connection resets. Also, as I said earlier, the Cassandra troubleshooting docs explain it well: https://docs.datastax.com/en/cassandra/2.0/cassandra/troubleshooting/trblshootIdleFirewall.html. Make sure the firewall and node TCP settings are in sync, such that nodes close a TCP connection before the firewall does. With firewall timeouts, we generally see Merkle tree requests/responses failing between nodes in the two DCs, and then repair hangs forever. Not sure how Merkle tree creation, which is node specific, would be impacted by a multi-DC setup. Are repairs with the -local option completing without problems? Thanks Anuj
Re: Repair in Multi Datacenter - Should you use -dc Datacenter repair or repair with -pr
Hi Leena, The first thing you should be concerned about is: why does the repair -pr operation not complete? Second comes the question: which repair option is best? One probable cause of stuck repairs is a firewall between the DCs closing TCP connections; if Cassandra tries to use such connections, repairs will hang. Please refer to https://docs.datastax.com/en/cassandra/2.0/cassandra/troubleshooting/trblshootIdleFirewall.html. We faced that. Also make sure you comply with the basic bandwidth requirement between DCs: recommended is 1000 Mb/s (1 gigabit) or greater. Answers to your specific questions: 1. As per my understanding, not all replicas will participate in DC-local repairs, and thus the repair would be ineffective. You need to make sure that all replicas of the data in all DCs are in sync. 2. Every DC is not a ring; all DCs together form a token ring. So, I think yes, you should run repair -pr on all nodes. 3. Yes. I don't have experience with incremental repairs, but you can run repair -pr on all nodes of all DCs. Regarding the best approach to repair, you should see some repair presentations from Cassandra Summit 2016; all are online now. I attended the summit, and people using large clusters generally use sub-range repairs to repair their clusters. But such large deployments are on older Cassandra versions, and these deployments generally don't use vnodes, so people know easily which nodes hold which token range. Thanks Anuj
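Sub-range repair, mentioned above, works by repairing small token ranges one at a time (`nodetool repair -st <start> -et <end>`). A sketch of splitting the full Murmur3 token ring into equal sub-ranges, to show the idea (illustrative only; real tools split the ranges a node actually owns, not the whole ring):

```python
# Murmur3Partitioner token range boundaries
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1

def split_ring(num_splits):
    """Divide the full Murmur3 token range into contiguous sub-ranges."""
    step = (MAX_TOKEN - MIN_TOKEN) // num_splits
    ranges, start = [], MIN_TOKEN
    for i in range(num_splits):
        # Last range absorbs the rounding remainder
        end = MAX_TOKEN if i == num_splits - 1 else start + step
        ranges.append((start, end))
        start = end
    return ranges

for st, et in split_ring(4):
    print(f"nodetool repair -st {st} -et {et}")
```

Repairing many small ranges keeps each Merkle tree exchange short, which makes a hang from a firewall-dropped connection both less likely and cheaper to retry.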
Re: Multiple Network Interfaces in non-EC2
Hi Amir, I would like to understand your requirement first. Why do you need the multi-interface configuration mentioned at http://docs.datastax.com/en/cassandra/3.x/cassandra/configuration/configMultiNetworks.html with a single-DC setup? As per my understanding, you could simply set listen_address to the private IP and not set the broadcast_address and listen_on_broadcast_address properties at all. You could use your private IP everywhere, because you don't have any other DC which would connect using the public IP. With multiple DCs, you need a public IP for communicating with nodes in other DCs, and that's where you need a private IP for internal communication and a public IP for cross-DC communication. Let me know if using the private IP solves your problem. Also, if you have a specific use case for the multi-interface configuration, you could add a NAT rule to route traffic on the public IP to your private IP (route traffic on the Cassandra port only). This could act as a workaround until the JIRA is fixed. Let me know if you see any issues with the workaround. Thanks Anuj
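For a single-DC cluster, the suggestion above amounts to a cassandra.yaml fragment like this (the address is a placeholder for the node's private IP):

```yaml
# Single DC: use the private IP everywhere
listen_address: 10.0.0.5
rpc_address: 10.0.0.5
# broadcast_address and listen_on_broadcast_address deliberately left unset;
# they are only needed when other DCs must reach this node on a public IP
```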
Re: Open File Handles for Deleted sstables
Restarting may be a temporary workaround but can't be a permanent solution; after some days the problem will come back. Thanks Anuj Sent from Yahoo Mail on Android On Thu, 29 Sep, 2016 at 12:54 AM, sai krishnam raju potturi <pskraj...@gmail.com> wrote: restarting the cassandra service helped get rid of those files in our situation. thanks Sai On Wed, Sep 28, 2016 at 3:15 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Hi, We are facing an issue where Cassandra holds open file handles for deleted sstable files. These open file handles keep increasing with time and eventually lead to a disk-space crisis. This is visible via the lsof command. There are no exceptions in the logs. We suspect a race condition where compactions/repairs and reads act on the same sstable. I have gone through a few JIRAs but have not been able to correlate the issue with those tickets. We are using 2.0.14. OS is Red Hat Linux. Any suggestions? Thanks Anuj
Open File Handles for Deleted sstables
Hi, We are facing an issue where Cassandra holds open file handles for deleted sstable files. These open file handles keep increasing with time and eventually lead to a disk-space crisis. This is visible via the lsof command. There are no exceptions in the logs. We suspect a race condition where compactions/repairs and reads act on the same sstable. I have gone through a few JIRAs but have not been able to correlate the issue with those tickets. We are using 2.0.14. OS is Red Hat Linux. Any suggestions? Thanks Anuj
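The symptom described above can be spotted with lsof roughly like this. A sketch only: the way the Cassandra process is located (matching on "CassandraDaemon") and the Data.db filter are example choices, not part of the thread:

```shell
# Count handles Cassandra still holds on sstable data files whose
# directory entries have already been unlinked from disk.
lsof -p "$(pgrep -f CassandraDaemon)" | grep '(deleted)' | grep -c 'Data.db'
```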
Re: Preferred IP is NULL
Finally, the multiple-DC setup is working as expected on 2.0. So the recipe for configuring multiple interfaces on 2.0 is as follows: use GossipingPropertyFileSnitch so that the preferred IP (private IP) is used for intra-DC communication. Moreover, you need a NAT rule to route traffic on the public interface to the private interface. The NAT rule is needed due to CASSANDRA-9748 (no process listens on the broadcast address). Thanks Anuj Sent from Yahoo Mail on Android On Mon, 22 Aug, 2016 at 11:55 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: We are using PropertyFileSnitch. Thanks Anuj Sent from Yahoo Mail on Android On Mon, 22 Aug, 2016 at 7:09 PM, Paulo Motta <pauloricard...@gmail.com> wrote: What snitch are you using? If GPFS you need to enable the prefer_local=true flag (this is automatic on EC2MRS). 2016-08-21 22:24 GMT-03:00 Anuj Wadehra <anujw_2...@yahoo.co.in>: Hi Paulo, I am aware of CASSANDRA-9748. It says that Cassandra only listens on listen_address and not broadcast_address. To overcome that, I can add a NAT rule to route all traffic on the public IP to the private IP. But why is the preferred IP set to null in the peers table? What's the expected value? Even if I add a NAT rule as a workaround for CASSANDRA-9748, what if the public interface is down on a node? My traffic would still fail. I want at least the nodes in my local DC to contact each other on the private IP. I thought the preferred IP was for that purpose, so I am focusing on fixing the null value of the preferred IPs. Thanks Anuj On Sun, 21 Aug, 2016 at 7:10 PM, Paulo Motta <pauloricard...@gmail.com> wrote: See CASSANDRA-9748, I think it might be related. 2016-08-20 15:20 GMT-03:00 Anuj Wadehra <anujw_2...@yahoo.co.in>: Hi, We use multiple interfaces in a multi-DC setup. The broadcast address is the public IP while the listen address is the private IP. I don't understand why the preferred IP in the peers table is null for all rows. There is very little documentation on the role of the preferred IP and when it is set. As per the code, TCP connections use the preferred IP.
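The prefer_local flag Paulo mentions is configured in cassandra-rackdc.properties when GossipingPropertyFileSnitch is in use. A sketch; the DC and rack names are examples:

```properties
# cassandra-rackdc.properties
dc=DC1
rack=RAC1
# Use the node's private (listen) address for intra-DC traffic
# instead of the broadcast (public) address.
prefer_local=true
```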
According to my understanding, nodes in the local DC should preferably use the private IP to connect, and that is what is referred to as a node's preferred IP. Setup: Cassandra 2.0.14 Thanks Anuj
Re: Preferred IP is NULL
We are using PropertyFileSnitch. Thanks Anuj Sent from Yahoo Mail on Android On Mon, 22 Aug, 2016 at 7:09 PM, Paulo Motta <pauloricard...@gmail.com> wrote: What snitch are you using? If GPFS you need to enable the prefer_local=true flag (this is automatic on EC2MRS). 2016-08-21 22:24 GMT-03:00 Anuj Wadehra <anujw_2...@yahoo.co.in>: Hi Paulo, I am aware of CASSANDRA-9748. It says that Cassandra only listens on listen_address and not broadcast_address. To overcome that, I can add a NAT rule to route all traffic on the public IP to the private IP. But why is the preferred IP set to null in the peers table? What's the expected value? Even if I add a NAT rule as a workaround for CASSANDRA-9748, what if the public interface is down on a node? My traffic would still fail. I want at least the nodes in my local DC to contact each other on the private IP. I thought the preferred IP was for that purpose, so I am focusing on fixing the null value of the preferred IPs. Thanks Anuj On Sun, 21 Aug, 2016 at 7:10 PM, Paulo Motta <pauloricard...@gmail.com> wrote: See CASSANDRA-9748, I think it might be related. 2016-08-20 15:20 GMT-03:00 Anuj Wadehra <anujw_2...@yahoo.co.in>: Hi, We use multiple interfaces in a multi-DC setup. The broadcast address is the public IP while the listen address is the private IP. I don't understand why the preferred IP in the peers table is null for all rows. There is very little documentation on the role of the preferred IP and when it is set. As per the code, TCP connections use the preferred IP. According to my understanding, nodes in the local DC should preferably use the private IP to connect, and that is what is referred to as a node's preferred IP. Setup: Cassandra 2.0.14 Thanks Anuj
Re: Preferred IP is NULL
Hi Paulo, I am aware of CASSANDRA-9748. It says that Cassandra only listens on listen_address and not broadcast_address. To overcome that, I can add a NAT rule to route all traffic on the public IP to the private IP. But why is the preferred IP set to null in the peers table? What's the expected value? Even if I add a NAT rule as a workaround for CASSANDRA-9748, what if the public interface is down on a node? My traffic would still fail. I want at least the nodes in my local DC to contact each other on the private IP. I thought the preferred IP was for that purpose, so I am focusing on fixing the null value of the preferred IPs. Thanks Anuj On Sun, 21 Aug, 2016 at 7:10 PM, Paulo Motta <pauloricard...@gmail.com> wrote: See CASSANDRA-9748, I think it might be related. 2016-08-20 15:20 GMT-03:00 Anuj Wadehra <anujw_2...@yahoo.co.in>: Hi, We use multiple interfaces in a multi-DC setup. The broadcast address is the public IP while the listen address is the private IP. I don't understand why the preferred IP in the peers table is null for all rows. There is very little documentation on the role of the preferred IP and when it is set. As per the code, TCP connections use the preferred IP. According to my understanding, nodes in the local DC should preferably use the private IP to connect, and that is what is referred to as a node's preferred IP. Setup: Cassandra 2.0.14 Thanks Anuj
Preferred IP is NULL
Hi, We use multiple interfaces in a multi-DC setup. The broadcast address is the public IP while the listen address is the private IP. I don't understand why the preferred IP in the peers table is null for all rows. There is very little documentation on the role of the preferred IP and when it is set. As per the code, TCP connections use the preferred IP. According to my understanding, nodes in the local DC should preferably use the private IP to connect, and that is what is referred to as a node's preferred IP. Setup: Cassandra 2.0.14 Thanks Anuj
Re: Public Interface Failure in Multiple DC setup
Hi, Can someone take these questions? Thanks Anuj On Thu, 11 Aug, 2016 at 8:30 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Hi, Setup: Cassandra 2.0.14 with PropertyFileSnitch, 2 data centers. Every node has broadcast_address = public IP (bond0) and listen_address = private IP (bond1). As per the DataStax docs (https://docs.datastax.com/en/cassandra/2.0/cassandra/configuration/configMultiNetworks.html), "For intra-network or region traffic, Cassandra switches to the private IP after establishing a connection". This means that even for traffic within a DC, Cassandra would contact a node on the broadcast address, i.e. the public IP, and then switch to the private IP. Query: if we shut down bond0 (the public interface) on a node: 1. Will read/write requests from coordinators in the local DC be routed to the node on the listen address (private IP, bond1), or will the node be treated as DOWN? 2. Will gossip be able to discover the node? If the node uses its private interface (bond1) to send gossip messages to other nodes on their public/broadcast addresses, will other nodes in the local and remote DCs see the node (with bond0 down) as UP? I am aware that https://issues.apache.org/jira/browse/CASSANDRA-9748 is an open issue in 2.0.14. But even for later releases, I am interested in the behavior when the public interface is down and PropertyFileSnitch is used. Thanks Anuj
Public Interface Failure in Multiple DC setup
Hi, Setup: Cassandra 2.0.14 with PropertyFileSnitch, 2 data centers. Every node has broadcast_address = public IP (bond0) and listen_address = private IP (bond1). As per the DataStax docs (https://docs.datastax.com/en/cassandra/2.0/cassandra/configuration/configMultiNetworks.html), "For intra-network or region traffic, Cassandra switches to the private IP after establishing a connection". This means that even for traffic within a DC, Cassandra would contact a node on the broadcast address, i.e. the public IP, and then switch to the private IP. Query: if we shut down bond0 (the public interface) on a node: 1. Will read/write requests from coordinators in the local DC be routed to the node on the listen address (private IP, bond1), or will the node be treated as DOWN? 2. Will gossip be able to discover the node? If the node uses its private interface (bond1) to send gossip messages to other nodes on their public/broadcast addresses, will other nodes in the local and remote DCs see the node (with bond0 down) as UP? I am aware that https://issues.apache.org/jira/browse/CASSANDRA-9748 is an open issue in 2.0.14. But even for later releases, I am interested in the behavior when the public interface is down and PropertyFileSnitch is used. Thanks Anuj
Re: (C)* stable version after 3.5
Hi Alain, This caught my attention: "Also I am not sure if the 2.2 major version is something you can skip while upgrading through a rolling restart. I believe you can, but it is not what is recommended." Why do you think that skipping 2.2 is not recommended when NEWS.txt suggests otherwise? Can you elaborate? Thanks Anuj On Tue, 12 Jul, 2016 at 7:31 PM, Alain RODRIGUEZ wrote: Hi, The only "fix" release after 3.5 is 3.7. Yet it is hard to say if it is more stable; we can hope so. For tick-tock releases (3.X), odd numbers are fix releases and even numbers are feature releases. Not sure why you want something above 3.5, but take care: those versions are really recent and less tested, so maybe not that "stable". If you want something more stable, I believe you can go with 3.0.8. Yet I am not telling you not to do that; some people need to start testing new things, right? So if you choose 3.7 because you want some feature from there, it is perfectly OK, just move carefully: maybe read some open tickets and previous experiences from the community, and test the upgrade process first on a dev cluster. Also I am not sure if the 2.2 major version is something you can skip while upgrading through a rolling restart. I believe you can, but it is not what is recommended. Testing will let you know anyway. Good luck and tell us how it went :-). C*heers, --- Alain Rodriguez - alain@thelastpickle.com France The Last Pickle - Apache Cassandra Consulting http://www.thelastpickle.com 2016-07-12 11:05 GMT+02:00 Varun Barala: Hi all users, Currently we are using cassandra-2.1.13 but we want to upgrade 2.1.13 to 3.x in production. Could anyone please tell me which is the most stable Cassandra version after 3.5? Thanking You!! Regards, Varun Barala
Evict Tombstones with STCS
Hi, We are using C* 2.0.x. What options are available if disk space is too full to compact the huge sstables produced by STCS (created long ago but not getting compacted because min_compaction_threshold is 4)? We suspect that a huge amount of space will be released when the 2 largest sstables are compacted together such that tombstone eviction is possible. But there is not enough space to compact them together, assuming that the compaction would need free disk of at least size of sstable1 + size of sstable2. I read the STCS code, and if no sets of sstables are available for compaction, it should pick an individual sstable for a single-sstable compaction. But somehow the huge sstables are not participating in single-sstable compactions. Is it due to the default 20% tombstone threshold? And if so, forceUserDefinedCompaction or setting unchecked_tombstone_compaction to true won't help either, as tombstones are less than 20% and not much disk would be recovered. It is not possible to add additional disks either. We see a huge difference in disk utilization across nodes; maybe some nodes were able to get away with tombstones while others didn't manage to evict them. Would be good to know more alternatives from the community. Thanks Anuj Sent from Yahoo Mail on Android
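For reference, the two compaction knobs mentioned above are per-table subproperties. A sketch, assuming a hypothetical table my_ks.my_table (the threshold value is just an illustration):

```sql
-- Lower the ratio of droppable tombstones that makes a single
-- sstable eligible for a tombstone compaction (default 0.2),
-- and allow such compactions even when the sstable overlaps others.
ALTER TABLE my_ks.my_table WITH compaction = {
  'class': 'SizeTieredCompactionStrategy',
  'tombstone_threshold': '0.05',
  'unchecked_tombstone_compaction': 'true'
};
```

Note that, as the post says, if the actual droppable-tombstone ratio in the big sstables is low, rewriting them this way reclaims little space.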
Re: Production Ready/Stable DataStax Java Driver
Thanks Alex! We are starting to use CQL for the first time (we have been on Thrift till now), so I think it makes sense to use Java driver 3.0.1 directly instead of 2.1.10. As the 3.x driver supports all 1.2+ Cassandra versions, I would also like to better understand the motivation for having 2.1 releases simultaneously with 3.x releases of the Java driver. One obvious reason would be the "breaking changes" in 3.x: 2.1.x bug-fix releases give existing 2.1 users some breathing room to accommodate those breaking changes in their code, instead of forcing them to make the changes at short notice and upgrade to 3.x immediately. Is that understanding correct? Thanks Anuj Sent from Yahoo Mail on Android On Sun, 8 May, 2016 at 9:01 PM, Alex Popescu <al...@datastax.com> wrote: Hi Anuj, All released versions of the DataStax Java driver are production ready: 1. they all go through the complete QA cycle; 2. we merge all bug fixes and improvements upstream. Now, if you are asking which is currently the most deployed version, that's 2.1 (latest version 2.1.10.1 [1]). If you want to be ready for future Cassandra upgrades and benefit from the latest features of the Java driver, then that's the 3.0 branch (latest version 3.0.1 [2]). Last but not least, when making the decision you should also consider that our current focus and main development go into the 3.x branch, and that 2.1 is in maintenance mode (meaning that no new features will be added and it will only see critical bug fixes). Bottom line: if your application is not already developed against the 2.1 version of the Java driver, you should use the latest 3.0 release.
[1]: https://groups.google.com/a/lists.datastax.com/d/msg/java-driver-user/bYQSUvKQm5k/JduPTt7cGAAJ [2]: https://groups.google.com/a/lists.datastax.com/d/msg/java-driver-user/tOWZm4RVbm4/5E_aDAc8IAAJ On Sun, May 8, 2016 at 7:39 AM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Hi, Which DataStax Java Driver release is most stable (production ready) for Cassandra 2.1? Thanks Anuj -- Bests, Alex Popescu | @al3xandru Sen. Product Manager @ DataStax » DataStax Enterprise - the database for cloud applications. «
RE: Inconsistent Reads after Restoring Snapshot
Sean, I meant that commit-log archival was never part of the "restoring a snapshot" DataStax documentation. How is commitlog archival related to my concern? Please elaborate. Thanks Anuj Sent from Yahoo Mail on Android On Thu, 28 Apr, 2016 at 9:24 PM, sean_r_dur...@homedepot.com <sean_r_dur...@homedepot.com> wrote: https://docs.datastax.com/en/cassandra/2.0/cassandra/configuration/configLogArchive_t.html Sean Durity From: Anuj Wadehra [mailto:anujw_2...@yahoo.co.in] Sent: Wednesday, April 27, 2016 10:44 PM To: user@cassandra.apache.org Subject: RE: Inconsistent Reads after Restoring Snapshot No, we are not saving them. I have never read that in the DataStax documentation. Thanks Anuj Sent from Yahoo Mail on Android On Thu, 28 Apr, 2016 at 12:45 AM, sean_r_dur...@homedepot.com <sean_r_dur...@homedepot.com> wrote: What about the commitlogs? Are you saving those off anywhere in between the snapshot and the crash? Sean Durity From: Anuj Wadehra [mailto:anujw_2...@yahoo.co.in] Sent: Monday, April 25, 2016 10:26 PM To: User Subject: Inconsistent Reads after Restoring Snapshot Hi, We have 2.0.14. We use RF=3 and read/write at QUORUM. Moreover, we don't use incremental backups. As per the documentation at https://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_backup_snapshot_restore_t.html , if I need to restore a snapshot on a SINGLE node in a cluster, I would run repair at the end. But while the repair is going on, reads may be inconsistent. Consider the following scenario: 10 AM: Daily snapshot taken of node A and moved to the backup location. 11 AM: A record is inserted such that nodes A and B insert the record but there is a mutation drop on node C. 1 PM: Node A crashes and data is restored from the latest 10 AM snapshot. Now, only node B has the record.
Now, my question is: until the repair is completed on node A, a read at QUORUM may return inconsistent results depending on the nodes from which the data is read. If data is read from nodes A and C, nothing is returned; if data is read from nodes A and B, the record is returned. This is a vital point which is not highlighted anywhere. Please confirm my understanding. If my understanding is right, how do I make sure that my reads are not inconsistent while a node is being repaired after restoring a snapshot? I think auto-bootstrapping the node without joining the ring until the repair is completed is an alternative option, but snapshots save a lot of streaming compared to bootstrap. Will incremental backups guarantee that Thanks Anuj Sent from Yahoo Mail on Android The information in this Internet Email is confidential and may be legally privileged. It is intended solely for the addressee. Access to this Email by anyone else is unauthorized. If you are not the intended recipient, any disclosure, copying, distribution or any action taken or omitted to be taken in reliance on it, is prohibited and may be unlawful. When addressed to our clients any opinions or advice contained in this Email are subject to the terms and conditions expressed in any applicable governing The Home Depot terms of business or client engagement letter. The Home Depot disclaims all responsibility and liability for the accuracy and content of this attachment and for any damages or losses arising from any inaccuracies, errors, viruses, e.g., worms, trojan horses, etc., or other items of a destructive nature, which may be contained in this attachment and shall not be liable for direct, indirect, consequential or special damages in connection with this e-mail message or its attachment.
RE: Inconsistent Reads after Restoring Snapshot
No, we are not saving them. I have never read that in the DataStax documentation. Thanks Anuj Sent from Yahoo Mail on Android On Thu, 28 Apr, 2016 at 12:45 AM, sean_r_dur...@homedepot.com <sean_r_dur...@homedepot.com> wrote: What about the commitlogs? Are you saving those off anywhere in between the snapshot and the crash? Sean Durity From: Anuj Wadehra [mailto:anujw_2...@yahoo.co.in] Sent: Monday, April 25, 2016 10:26 PM To: User Subject: Inconsistent Reads after Restoring Snapshot Hi, We have 2.0.14. We use RF=3 and read/write at QUORUM. Moreover, we don't use incremental backups. As per the documentation at https://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_backup_snapshot_restore_t.html , if I need to restore a snapshot on a SINGLE node in a cluster, I would run repair at the end. But while the repair is going on, reads may be inconsistent. Consider the following scenario: 10 AM: Daily snapshot taken of node A and moved to the backup location. 11 AM: A record is inserted such that nodes A and B insert the record but there is a mutation drop on node C. 1 PM: Node A crashes and data is restored from the latest 10 AM snapshot. Now, only node B has the record. Now, my question is: until the repair is completed on node A, a read at QUORUM may return inconsistent results depending on the nodes from which the data is read. If data is read from nodes A and C, nothing is returned; if data is read from nodes A and B, the record is returned. This is a vital point which is not highlighted anywhere. Please confirm my understanding. If my understanding is right, how do I make sure that my reads are not inconsistent while a node is being repaired after restoring a snapshot? I think auto-bootstrapping the node without joining the ring until the repair is completed is an alternative option, but snapshots save a lot of streaming compared to bootstrap.
Will incremental backups guarantee that Thanks Anuj Sent from Yahoo Mail on Android
Re: Inconsistent Reads after Restoring Snapshot
Thanks Romain! So just to clarify, you are suggesting the following steps: 10 AM: Daily snapshot taken of node A and moved to the backup location. 11 AM: A record is inserted such that nodes A and B insert the record but there is a mutation drop on node C. 1 PM: Node A crashes. 1 PM: Follow these steps to restore the 10 AM snapshot on node A: 1. Restore the data as mentioned in https://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_backup_snapshot_restore_t.html with ONE EXCEPTION >> start node A with -Dcassandra.join_ring=false. 2. Run repair. 3. Restart node A with -Dcassandra.join_ring=true. Please confirm. I was not aware that join_ring can also be used during a normal restart; I thought it was only an option during auto-bootstrap :) Thanks Anuj On Tue, 26/4/16, Romain Hardouin <romainh...@yahoo.fr> wrote: Subject: Re: Inconsistent Reads after Restoring Snapshot To: "user@cassandra.apache.org" <user@cassandra.apache.org> Date: Tuesday, 26 April, 2016, 12:47 PM You can make a restore on the new node A (don't forget to set the token(s) in cassandra.yaml), start the node with -Dcassandra.join_ring=false and then run a repair on it. Have a look at https://issues.apache.org/jira/browse/CASSANDRA-6961 Best, Romain On Tuesday, 26 April 2016 at 4:26 AM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Hi, We have 2.0.14. We use RF=3 and read/write at QUORUM. Moreover, we don't use incremental backups. As per the documentation, if I need to restore a snapshot on a SINGLE node in a cluster, I would run repair at the end. But while the repair is going on, reads may be inconsistent. Consider the following scenario: 10 AM: Daily snapshot taken of node A and moved to the backup location. 11 AM: A record is inserted such that nodes A and B insert the record but there is a mutation drop on node C. 1 PM: Node A crashes and data is restored from the latest 10 AM snapshot. Now, only node B has the record.
Now, my question is: until the repair is completed on node A, a read at QUORUM may return inconsistent results depending on the nodes from which the data is read. If data is read from nodes A and C, nothing is returned; if data is read from nodes A and B, the record is returned. This is a vital point which is not highlighted anywhere. Please confirm my understanding. If my understanding is right, how do I make sure that my reads are not inconsistent while a node is being repaired after restoring a snapshot? I think auto-bootstrapping the node without joining the ring until the repair is completed is an alternative option, but snapshots save a lot of streaming compared to bootstrap. Will incremental backups guarantee that Thanks Anuj Sent from Yahoo Mail on Android
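The restore-then-repair recipe discussed in this thread can be sketched as follows. The paths and service commands are examples; the -Dcassandra.join_ring flag itself is the one from CASSANDRA-6961:

```shell
# On the replaced node A, after copying the snapshot into the data
# directory and setting the node's token(s) in cassandra.yaml:

# 1. Start Cassandra WITHOUT joining the ring, so coordinators cannot
#    route QUORUM reads to this node while its data is still stale.
JVM_OPTS="$JVM_OPTS -Dcassandra.join_ring=false" cassandra

# 2. Repair the node while it is invisible to client reads.
nodetool repair

# 3. Restart normally (join_ring defaults to true) to serve traffic.
nodetool stopdaemon
cassandra
```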
Re: Upgrading to SSD
Thanks All! Anuj On Mon, 25/4/16, Alain RODRIGUEZ <arodr...@gmail.com> wrote: Subject: Re: Upgrading to SSD To: user@cassandra.apache.org Date: Monday, 25 April, 2016, 2:45 PM Hi Anuj, "You could do the following instead to minimize server downtime: 1. rsync while the server is running 2. rsync again to get any new files 3. shut server down 4. rsync for the 3rd time 5. change directory in yaml and start back up" +1 Here are some more details about that process and a script doing most of the job: thelastpickle.com/blog/2016/02/25/removing-a-disk-mapping-from-cassandra.html Hope it will be useful to you. C*heers, --- Alain Rodriguez - alain@thelastpickle.com France The Last Pickle - Apache Cassandra Consulting http://www.thelastpickle.com 2016-04-23 21:47 GMT+02:00 Jonathan Haddad <j...@jonhaddad.com>: You could do the following instead to minimize server downtime: 1. rsync while the server is running 2. rsync again to get any new files 3. shut server down 4. rsync for the 3rd time 5. change directory in yaml and start back up On Sat, Apr 23, 2016 at 12:23 PM Clint Martin <clintlmar...@coolfiretechnologies.com> wrote: As long as you shut down the node before you start copying and moving stuff around, it shouldn't matter whether you take backups or snapshots or whatever. When you add the filesystem for the SSD, will you be removing the existing filesystem? Or will you be able to keep both filesystems mounted at the same time for the migration? If you can keep them both mounted at the same time, then an off-system backup isn't strictly necessary. Just change your data dir config in your yaml, copy the data and commitlog from the old dir to the new SSD, and restart the node. If you can't keep both filesystems mounted concurrently, then a remote system is necessary to copy the data to, but the steps and procedure are the same. Running repair before you do the migration isn't strictly necessary. Not a bad idea if you don't mind spending the time.
Definitely run repair after you restart the node, especially if the work takes longer than the hint window. As for your filesystems, there is really nothing special to worry about. Depending on which filesystem you use, there are recommendations for tuning and configuration that you should probably follow. (DataStax's recommendations as well as Al Tobey's tuning guide are great resources: https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html ) Clint On Apr 23, 2016 3:05 PM, "Anuj Wadehra" <anujw_2...@yahoo.co.in> wrote: Hi, We have a 3-node cluster on 2.0.14. We use read/write QUORUM and RF is 3. We want to move the data and commitlog directories from a SATA HDD to an SSD. We have planned a rolling upgrade. We plan to run repair -pr on all nodes to sync data upfront and then execute the following steps on each server, one by one: 1. Take a backup of the data/commitlog directories to an external server. 2. Change mount points so that the Cassandra data/commitlog directories now point to the SSD. 3. Copy files from the external backup to the SSD. 4. Start Cassandra. 5. Run full repair on the node before starting step 1 on the next node. Questions: 1. Is copying the commitlog and data directories good enough, or should we take a snapshot of each node and restore data from that snapshot? 2. Any precautions we need to take while moving data to the new SSD? Filesystem format of the two disks, etc.? 3. Should we drain data before taking the backup? We are also restoring the commitlog directory from backup. 4. I have added a repair to sync full data upfront and a repair after restoring data on each node. Sounds safe and logical? 5. Any problems you see with the mentioned approach? Any better approach? Thanks Anuj Sent from Yahoo Mail on Android
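The three-pass rsync migration recommended in this thread can be sketched like this. The source and destination mount points are examples:

```shell
# Pass 1: bulk copy while Cassandra is still running (bulk of the data).
rsync -a /var/lib/cassandra/ /mnt/ssd/cassandra/

# Pass 2: pick up files written since pass 1 (node still running).
rsync -a /var/lib/cassandra/ /mnt/ssd/cassandra/

# Flush memtables, stop the node, then do a final exact pass.
nodetool drain && service cassandra stop
rsync -a --delete /var/lib/cassandra/ /mnt/ssd/cassandra/

# Point data_file_directories and commitlog_directory in cassandra.yaml
# at the SSD paths, then start the node back up.
service cassandra start
```

The first two passes keep the final downtime short: only the small delta written between pass 2 and the shutdown has to be copied while the node is offline.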
Inconsistent Reads after Restoring Snapshot
Hi, We have 2.0.14. We use RF=3 and read/write at QUORUM. Moreover, we don't use incremental backups. As per the documentation at https://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_backup_snapshot_restore_t.html , if I need to restore a snapshot on a SINGLE node in a cluster, I would run repair at the end. But while the repair is going on, reads may be inconsistent. Consider the following scenario: 10 AM: Daily snapshot taken of node A and moved to the backup location. 11 AM: A record is inserted such that nodes A and B insert the record but there is a mutation drop on node C. 1 PM: Node A crashes and data is restored from the latest 10 AM snapshot. Now, only node B has the record. Now, my question is: until the repair is completed on node A, a read at QUORUM may return inconsistent results depending on the nodes from which the data is read. If data is read from nodes A and C, nothing is returned; if data is read from nodes A and B, the record is returned. This is a vital point which is not highlighted anywhere. Please confirm my understanding. If my understanding is right, how do I make sure that my reads are not inconsistent while a node is being repaired after restoring a snapshot? I think auto-bootstrapping the node without joining the ring until the repair is completed is an alternative option, but snapshots save a lot of streaming compared to bootstrap. Will incremental backups guarantee that Thanks Anuj Sent from Yahoo Mail on Android
Re: Unable to reliably count keys on a thrift CF
Hi Carlos, Please check if the JIRA : https://issues.apache.org/jira/browse/CASSANDRA-11467 fixes your problem. We had been facing row count issue with thrift cf / compact storage and this fixed it. Above is fixed in latest 2.1.14. Its a two line fix. So, you can also prepare a custom jar and check if that works. ThanksAnuj Sent from Yahoo Mail on Android On Thu, 21 Apr, 2016 at 9:29 PM, Carlos Alonsowrote: Hi guys. I've been struggling for the last days to find a reliable and stable way to count keys in a thrift column family. My idea is to basically iterate the whole ring using the token function, as documented here: https://docs.datastax.com/en/cql/3.1/cql/cql_using/paging_c.html in batches of 1 records The only corner case is that if there were more than 1 records in a single partition (not the case, but the program should still handle it) it explores the partition in depth by getting all records for that particular token (see below). In the end, all keys are saved into a hash to guarantee uniqueness. The count of unique keys is always different (and random, sometimes more keys, sometimes less are retrieved) and, of course, I'm sure no activity is going on in that cf. I'm running Cassandra 2.1.11 with MurMur3 partitioner. RF=3 and CL=QUORUM the column family structure is CREATE TABLE tbl ( key blob, column1 ascii, value blob, PRIMARY KEY(key, column1)) and I'm running the following script connection = open_cql_connectionresults = connection.execute("SELECT token(key), key FROM tbl LIMIT 1") keys_hash = {} // Hash to save the keys to guarantee uniquenesslast_token = niltoken = nil while results != nil results.each do |row| keys_hash[row['key']] = true token = row['token(key)'] end if token == last_token results = connection.execute("SELECT token(key), key FROM tbl WHERE token(key) = #{token}") else results = connection.execute("SELECT token(key), key FROM tbl WHERE token(key) >= #{token} LIMIT 1") end last_token = tokenend puts keys.keys.count What am I missing? 
Thanks! Carlos Alonso | Software Engineer | @calonso
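Below is a minimal in-memory model (in Python, for testability; tokens and the page size are toy values, and this is an assumed fix rather than the original author's code) of one way to make the ring walk in the script above deterministic: whenever a token lands on a page boundary, drain that token completely and then advance strictly past it, so a partition can never straddle two pages and no key is counted twice or skipped.

```python
def count_keys(ring, page=5):
    """Count unique keys by walking (token, key) pairs in ring order.

    A page is every row with token strictly greater than the last
    boundary; the boundary token is then drained in full before moving
    on, which handles partitions wider than one page.
    """
    ring = sorted(ring)                     # (token, key) pairs, ring order
    seen, last = set(), None
    while True:
        rows = [(t, k) for t, k in ring if last is None or t > last][:page]
        if not rows:
            break
        boundary = rows[-1][0]
        # take the page plus every remaining row sharing the boundary
        # token, then advance strictly past that token
        for t, k in ring:
            if (last is None or t > last) and t <= boundary:
                seen.add(k)
        last = boundary
    return len(seen)

# 30 keys hashed onto only 7 distinct tokens, so several "partitions"
# are wider than one page
ring = [(i % 7, f"k{i}") for i in range(30)]
assert count_keys(ring) == 30
```

The key difference from the original script is that the restart condition never uses `>=`, which re-reads the boundary row and is one way a loop like this can return a different count on every run.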
Re: StatusLogger is logging too many information
Hi, You can set the property gc_warn_threshold_in_ms in the yaml. For example, if your application is ok with a 2000ms pause, you can set the value to 2000 so that only GC pauses greater than 2000ms lead to a GC warning and status log. Please refer to https://issues.apache.org/jira/plugins/servlet/mobile#issue/CASSANDRA-8907 for details. Thanks, Anuj Sent from Yahoo Mail on Android On Mon, 25 Apr, 2016 at 3:20 PM, jason zhao yang wrote: Hi, Currently StatusLogger will log info when there are dropped messages or a GC of more than 200 ms. In my use case, there are about 1000 tables. StatusLogger is logging too much information for each table. I wonder if there is a way to reduce this log? For example, only print the thread pool information. Thanks.
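For reference, the setting discussed above lives in cassandra.yaml; a hypothetical excerpt using the 2000 ms example from the reply:

```yaml
# cassandra.yaml -- only GC pauses longer than this threshold (in ms)
# trigger the GCInspector warning and the accompanying StatusLogger dump
gc_warn_threshold_in_ms: 2000
```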
Upgrading to SSD
Hi, We have a 3-node cluster on 2.0.14. We use Read/Write QUORUM and RF is 3. We want to move the data and commitlog directories from a SATA HDD to an SSD. We have planned a rolling upgrade. We plan to run repair -pr on all nodes to sync data upfront and then execute the following steps on each server, one by one:
1. Take a backup of the data/commitlog directories to an external server.
2. Change mount points so that the Cassandra data/commitlog directories now point to the SSD.
3. Copy files from the external backup to the SSD.
4. Start Cassandra.
5. Run a full repair on the node before starting step 1 on the next node.
Questions:
1. Is copying the commitlog and data directories good enough, or should we take a snapshot of each node and restore data from that snapshot?
2. Any precautions we need to take while moving data to the new SSD? File system format of the two disks, etc.
3. Should we drain before taking the backup? We are also restoring the commitlog directory from backup.
4. I have added a repair to sync full data upfront and a repair after restoring data on each node. Sounds safe and logical?
5. Any problems you see with the mentioned approach? Any better approach?
Thanks, Anuj Sent from Yahoo Mail on Android
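The per-node steps above might be sketched as a shell script like the following (the paths, device name, and service commands are assumptions for illustration, not the poster's actual layout; with DRY_RUN=1, the default here, each command is only echoed so the sequence can be reviewed before anything runs):

```shell
#!/bin/sh
# Per-node SSD migration sketch; all paths and devices are hypothetical.
DRY_RUN=${DRY_RUN:-1}
run() {
  # echo the command in dry-run mode, execute it otherwise
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

run nodetool drain                                    # flush memtables before stopping
run service cassandra stop
run rsync -a /var/lib/cassandra/ /backup/cassandra/   # step 1: backup to external storage
run mount /dev/ssd1 /var/lib/cassandra                # step 2: remount data dir on the SSD
run rsync -a /backup/cassandra/ /var/lib/cassandra/   # step 3: restore onto the SSD
run service cassandra start                           # step 4
run nodetool repair                                   # step 5: full repair before the next node
```

Draining before the backup (step 3 of the questions) would make the commitlog copy largely moot, since a drained node has flushed its memtables.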
Re: Efficient Paging Option in Wide Rows
Hi, Can anyone take this question? Thanks, Anuj Sent from Yahoo Mail on Android On Sat, 23 Apr, 2016 at 2:30 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: I think I complicated the question, so I am trying to put it crisply. We have a table defined with a clustering key/column. We have 50,000 different clustering key values. If we want to fetch all 50,000 rows, which query option would be faster, and why? 1. A single primary key/partition key with 50,000 clustering keys: we will page through the single partition 500 records at a time. Thus, we will do 50,000/500 = 100 db hits, but for the same partition key. 2. 100 different primary keys, with each primary key having just 500 clustering key columns. Here also we will need 100 db hits, but for different partitions. Basically, I want to understand any optimizations built into CQL/Cassandra which make paging through a single partition more efficient than querying data from different partitions. Thanks, Anuj Sent from Yahoo Mail on Android On Fri, 22 Apr, 2016 at 8:27 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Hi, I have a wide row index table so that I can fetch all row keys corresponding to a column value. A row of index_table will look like:
ColValue1:bucket1 >> rowkey1, rowkey2 .. rowkeyn
..
ColValue1:bucketn >> rowkey1, rowkey2 .. rowkeyn
We will have buckets to avoid hotspots. Row keys of the main table are random numbers, and we will never do a column slice like:
Select * from index_table where key=xxx and Col > rowkey1 and col < rowkey10
Also, we will ALWAYS fetch all data for a given value of the index column, thus all buckets have to be read. Each index column value can map to thousands to millions of row keys in the main table. Based on our use case, there are two design choices in front of me: 1. Have a large number of buckets/rows for an index column value and have less data (around a few thousand) in each row. Thus, every time we want to fetch all row keys for an index col value, we will query more rows, and for each row we will have to page through data 500 records at a time. 2. Have fewer buckets/rows for an index column value. Every time we want to fetch all row keys for an index col value, we will query a smaller number of wider rows and then page through each wide row, reading 500 columns at a time. Which approach is more efficient? Approach 1: more rows with less data in each row. OR Approach 2: fewer rows with more data in each row. Either way, we are fetching only 500 records at a time in a query. Even in approach 2 (wider rows), we can query only small chunks of 500 at a time. Thanks, Anuj
Re: Efficient Paging Option in Wide Rows
I think I complicated the question, so I am trying to put it crisply. We have a table defined with a clustering key/column. We have 50,000 different clustering key values. If we want to fetch all 50,000 rows, which query option would be faster, and why? 1. A single primary key/partition key with 50,000 clustering keys: we will page through the single partition 500 records at a time. Thus, we will do 50,000/500 = 100 db hits, but for the same partition key. 2. 100 different primary keys, with each primary key having just 500 clustering key columns. Here also we will need 100 db hits, but for different partitions. Basically, I want to understand any optimizations built into CQL/Cassandra which make paging through a single partition more efficient than querying data from different partitions. Thanks, Anuj Sent from Yahoo Mail on Android On Fri, 22 Apr, 2016 at 8:27 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Hi, I have a wide row index table so that I can fetch all row keys corresponding to a column value. A row of index_table will look like:
ColValue1:bucket1 >> rowkey1, rowkey2 .. rowkeyn
..
ColValue1:bucketn >> rowkey1, rowkey2 .. rowkeyn
We will have buckets to avoid hotspots. Row keys of the main table are random numbers, and we will never do a column slice like:
Select * from index_table where key=xxx and Col > rowkey1 and col < rowkey10
Also, we will ALWAYS fetch all data for a given value of the index column, thus all buckets have to be read. Each index column value can map to thousands to millions of row keys in the main table. Based on our use case, there are two design choices in front of me: 1. Have a large number of buckets/rows for an index column value and have less data (around a few thousand) in each row. Thus, every time we want to fetch all row keys for an index col value, we will query more rows, and for each row we will have to page through data 500 records at a time. 2. Have fewer buckets/rows for an index column value. Every time we want to fetch all row keys for an index col value, we will query a smaller number of wider rows and then page through each wide row, reading 500 columns at a time. Which approach is more efficient? Approach 1: more rows with less data in each row. OR Approach 2: fewer rows with more data in each row. Either way, we are fetching only 500 records at a time in a query. Even in approach 2 (wider rows), we can query only small chunks of 500 at a time. Thanks, Anuj
Efficient Paging Option in Wide Rows
Hi, I have a wide row index table so that I can fetch all row keys corresponding to a column value. A row of index_table will look like:
ColValue1:bucket1 >> rowkey1, rowkey2 .. rowkeyn
..
ColValue1:bucketn >> rowkey1, rowkey2 .. rowkeyn
We will have buckets to avoid hotspots. Row keys of the main table are random numbers, and we will never do a column slice like:
Select * from index_table where key=xxx and Col > rowkey1 and col < rowkey10
Also, we will ALWAYS fetch all data for a given value of the index column, thus all buckets have to be read. Each index column value can map to thousands to millions of row keys in the main table. Based on our use case, there are two design choices in front of me: 1. Have a large number of buckets/rows for an index column value and have less data (around a few thousand) in each row. Thus, every time we want to fetch all row keys for an index col value, we will query more rows, and for each row we will have to page through data 500 records at a time. 2. Have fewer buckets/rows for an index column value. Every time we want to fetch all row keys for an index col value, we will query a smaller number of wider rows and then page through each wide row, reading 500 columns at a time. Which approach is more efficient? Approach 1: more rows with less data in each row. OR Approach 2: fewer rows with more data in each row. Either way, we are fetching only 500 records at a time in a query. Even in approach 2 (wider rows), we can query only small chunks of 500 at a time. Thanks, Anuj
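As a toy illustration of the trade-off described above (the bucket count, hash choice, and key format are assumptions, not the actual design): every read fans out to all buckets either way, and the total number of 500-record pages fetched comes out roughly the same, so the choice mostly shifts work between "more partitions" and "more pages per partition".

```python
import hashlib

N_BUCKETS = 16          # approach 1 = a large value, approach 2 = a small one
PAGE_SIZE = 500

def bucket_for(row_key: str) -> int:
    # deterministic spread of row keys across buckets to avoid hotspots
    return int(hashlib.md5(row_key.encode()).hexdigest(), 16) % N_BUCKETS

def index_partition(col_value: str, row_key: str) -> str:
    # partition key of index_table, e.g. "ColValue1:bucket7"
    return f"{col_value}:bucket{bucket_for(row_key)}"

def pages_to_read(total_keys: int, n_buckets: int) -> int:
    # every bucket is read, page by page; the total page count is what
    # the question above is really comparing
    per_bucket = -(-total_keys // n_buckets)        # ceiling division
    return n_buckets * -(-per_bucket // PAGE_SIZE)

keys = [f"rk{i}" for i in range(50_000)]
partitions = {index_partition("ColValue1", k) for k in keys}
assert len(partitions) <= N_BUCKETS
# few wide buckets vs many narrow buckets: similar total page reads
assert pages_to_read(50_000, 16) == 112    # 3125 keys/bucket -> 7 pages each
assert pages_to_read(50_000, 100) == 100   # 500 keys/bucket  -> 1 page each
```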
Re: Migrating to CQL and Non Compact Storage
Thanks Jim. I think you understand the pain of migrating TBs of data to new tables. There is no command to change from compact to non-compact storage, and the fastest solution, migrating data using Spark, is too slow for production systems. And the pain gets bigger when your performance dips after moving to a non-compact storage table. That's because non-compact storage is quite an inefficient storage format until 3.x, and it incurs a heavy penalty on row scan performance in analytics workloads. Please go through the link to understand how the old compact storage gives much better performance than non-compact storage as far as row scans are concerned: https://www.oreilly.com/ideas/apache-cassandra-for-analytics-a-performance-and-storage-analysis The flexibility of CQL comes at a heavy cost until 3.x. Thanks, Anuj Sent from Yahoo Mail on Android On Mon, 11 Apr, 2016 at 10:35 PM, Jim Ancona <j...@anconafamily.com> wrote: Jack, the Datastax link he posted (http://www.datastax.com/dev/blog/thrift-to-cql3) says that for column families with mixed dynamic and static columns: "The only solution to be able to access the column family fully is to remove the declared columns from the thrift schema altogether..." I think that page describes the problem and the potential solutions well. I haven't seen an answer to Anuj's question about why the native CQL solution using collections doesn't perform as well. Keep in mind that some of us understand CQL just fine but have working pre-CQL Thrift-based systems storing hundreds of terabytes of data, with requirements that mean that saying "bite the bullet and re-model your data" is not really helpful. Another quote from that Datastax link: "Thrift isn't going anywhere." Granted, that link is three-plus years old, but Thrift *is* now going away, so it's not unexpected that people will be trying to figure out how to deal with that. It's bad enough that we need to rewrite our clients to use CQL instead of Thrift.
It's not helpful to say that we should also re-model and migrate all our data. Jim On Mon, Apr 11, 2016 at 11:29 AM, Jack Krupansky <jack.krupan...@gmail.com> wrote: Sorry, but your message is too confusing - you say "reading dynamic columns in CQL" and "make the table schema-less", but neither has any relevance to CQL! 1. CQL tables always have schemas. 2. All columns in CQL are statically declared (even maps/collections are statically declared columns.) Granted, it is a challenge for Thrift users to get used to the terminology of CQL, but it is required. If necessary, review some of the free online training videos for data modeling. Unless your data model is very simple and directly translates into CQL, you probably do need to bite the bullet and re-model your data to exploit the features of CQL rather than fight CQL trying to mimic Thrift per se. In any case, take another shot at framing the problem and then maybe people here can help you out. -- Jack Krupansky On Mon, Apr 11, 2016 at 10:39 AM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Any comments or suggestions on this one? Thanks, Anuj Sent from Yahoo Mail on Android On Sun, 10 Apr, 2016 at 11:39 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Hi, We are on 2.0.14 and Thrift. We are planning to migrate to CQL soon but are facing some challenges. We have a cf with a mix of statically defined columns and dynamic columns (created at run time). For reading dynamic columns in CQL, we have two options: 1. Drop all columns and make the table schema-less. This way, we will get a CQL row for each column defined for a row key, as mentioned here: http://www.datastax.com/dev/blog/thrift-to-cql3 2. Migrate the entire data to a new non-compact storage table and create collections for the dynamic columns in the new table. In our case, we have observed that approach 2 causes 3 times slower performance in the range scan queries used by Spark. This is not acceptable. Cassandra 3 has an optimized storage engine, but we are not comfortable moving to 3.x in production. Moreover, data migration to the new table using Spark takes hours. Any suggestions for the two issues? Thanks, Anuj Sent from Yahoo Mail on Android
Re: Migrating to CQL and Non Compact Storage
Any comments or suggestions on this one? Thanks, Anuj Sent from Yahoo Mail on Android On Sun, 10 Apr, 2016 at 11:39 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Hi, We are on 2.0.14 and Thrift. We are planning to migrate to CQL soon but are facing some challenges. We have a cf with a mix of statically defined columns and dynamic columns (created at run time). For reading dynamic columns in CQL, we have two options: 1. Drop all columns and make the table schema-less. This way, we will get a CQL row for each column defined for a row key, as mentioned here: http://www.datastax.com/dev/blog/thrift-to-cql3 2. Migrate the entire data to a new non-compact storage table and create collections for the dynamic columns in the new table. In our case, we have observed that approach 2 causes 3 times slower performance in the range scan queries used by Spark. This is not acceptable. Cassandra 3 has an optimized storage engine, but we are not comfortable moving to 3.x in production. Moreover, data migration to the new table using Spark takes hours. Any suggestions for the two issues? Thanks, Anuj Sent from Yahoo Mail on Android
Migrating to CQL and Non Compact Storage
Hi, We are on 2.0.14 and Thrift. We are planning to migrate to CQL soon but are facing some challenges. We have a cf with a mix of statically defined columns and dynamic columns (created at run time). For reading dynamic columns in CQL, we have two options: 1. Drop all columns and make the table schema-less. This way, we will get a CQL row for each column defined for a row key, as mentioned here: http://www.datastax.com/dev/blog/thrift-to-cql3 2. Migrate the entire data to a new non-compact storage table and create collections for the dynamic columns in the new table. In our case, we have observed that approach 2 causes 3 times slower performance in the range scan queries used by Spark. This is not acceptable. Cassandra 3 has an optimized storage engine, but we are not comfortable moving to 3.x in production. Moreover, data migration to the new table using Spark takes hours. Any suggestions for the two issues? Thanks, Anuj Sent from Yahoo Mail on Android
Re: DataStax OpsCenter with Apache Cassandra
Thanks Jeff. If one needs to use OpsCenter with 2.2 or earlier versions of Apache Cassandra, is one required to buy a license for it separately? What are the options if someone wants to use OpsCenter with Apache Cassandra 3.x (commercial use)? Thanks, Anuj Sent from Yahoo Mail on Android On Sun, 10 Apr, 2016 at 10:42 PM, Jeff Jirsa <jeff.ji...@crowdstrike.com> wrote: It is possible to use OpsCenter for open source / community versions up to 2.2.x. It will not be possible in 3.0+ From: Anuj Wadehra Reply-To: "user@cassandra.apache.org" Date: Sunday, April 10, 2016 at 9:28 AM To: User Subject: DataStax OpsCenter with Apache Cassandra
DataStax OpsCenter with Apache Cassandra
Hi, Is it possible to use DataStax OpsCenter for monitoring Apache-distributed Cassandra in production? Or is it possible to use DataStax OpsCenter if you are not using DataStax Enterprise in production? Thanks, Anuj
Re: Does saveToCassandra work with Cassandra Lucene plugin ?
I used it with Java, and there every field of the POJO must map to the column names of the table. I think someone with Scala syntax knowledge can help you better. Thanks, Anuj Sent from Yahoo Mail on Android On Mon, 28 Mar, 2016 at 11:47 pm, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: With my limited experience with Spark, I can tell you that you need to make sure that all columns mentioned in SomeColumns are part of the CQL schema of the table. Thanks, Anuj Sent from Yahoo Mail on Android On Mon, 28 Mar, 2016 at 11:38 pm, Cleosson José Pirani de Souza <cso...@daitangroup.com> wrote: Hello, I am implementing the example on GitHub (https://github.com/Stratio/cassandra-lucene-index), and when I try to save the data using saveToCassandra I get the exception NoSuchElementException. If I use CassandraConnector.withSessionDo I am able to add elements into Cassandra and no exception is raised. The code:

import org.apache.spark.{SparkConf, SparkContext, Logging}
import com.datastax.spark.connector.cql.CassandraConnector
import com.datastax.spark.connector._

object App extends Logging {
  def main(args: Array[String]) {
    // Get the cassandra IP and create the spark context
    val cassandraIP = System.getenv("CASSANDRA_IP")
    val sparkConf = new SparkConf(true)
      .set("spark.cassandra.connection.host", cassandraIP)
      .set("spark.cleaner.ttl", "3600")
      .setAppName("Simple Spark Cassandra Example")
    val sc = new SparkContext(sparkConf)

    // Works
    CassandraConnector(sparkConf).withSessionDo { session =>
      session.execute("INSERT INTO demo.tweets(id, user, body, time, latitude, longitude) VALUES (19, 'Name', 'Body', '2016-03-19 09:00:00-0300', 39, 39)")
    }

    // Does not work
    val demo = sc.parallelize(Seq((9, "Name", "Body", "2016-03-29 19:00:00-0300", 29, 29)))
    // Raises the exception
    demo.saveToCassandra("demo", "tweets", SomeColumns("id", "user", "body", "time", "latitude", "longitude"))
  }
}

The exception:

16/03/28 14:15:41 INFO CassandraConnector: Connected to Cassandra cluster: Test Cluster
Exception in thread "main" java.util.NoSuchElementException: Column not found in demo.tweets
at com.datastax.spark.connector.cql.StructDef$$anonfun$columnByName$2.apply(Schema.scala:60)
at com.datastax.spark.connector.cql.StructDef$$anonfun$columnByName$2.apply(Schema.scala:60)
at scala.collection.Map$WithDefault.default(Map.scala:52)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:58)
at com.datastax.spark.connector.cql.TableDef$$anonfun$9.apply(Schema.scala:153)
at com.datastax.spark.connector.cql.TableDef$$anonfun$9.apply(Schema.scala:152)
at scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:722)
at scala.collection.immutable.Map$Map1.foreach(Map.scala:109)
at scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:721)
at com.datastax.spark.connector.cql.TableDef.<init>(Schema.scala:152)
at com.datastax.spark.connector.cql.Schema$$anonfun$com$datastax$spark$connector$cql$Schema$$fetchTables$1$2.apply(Schema.scala:283)
at com.datastax.spark.connector.cql.Schema$$anonfun$com$datastax$spark$connector$cql$Schema$$fetchTables$1$2.apply(Schema.scala:271)
at scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:722)
at scala.collection.immutable.Set$Set4.foreach(Set.scala:137)
at scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:721)
at com.datastax.spark.connector.cql.Schema$.com$datastax$spark$connector$cql$Schema$$fetchTables$1(Schema.scala:271)
at com.datastax.spark.connector.cql.Schema$$anonfun$com$datastax$spark$connector$cql$Schema$$fetchKeyspaces$1$2.apply(Schema.scala:295)
at com.datastax.spark.connector.cql.Schema$$anonfun$com$datastax$spark$connector$cql$Schema$$fetchKeyspaces$1$2.apply(Schema.scala:294)
at scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:722)
at scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:153)
at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:306)
at scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:721)
at com.datastax.spark.connector.cql.Schema$.com$datastax$spark$connector$cql$Schema$$fetchKeyspaces$1(Schema.scala:294)
at com.datastax.spark.connector.cql.Schema$$anonfun$fromCassandra$1.apply(Schema.scala:307)
at com.datastax.spark.connector.cql.Schema$$anonfun$fromCassandra$1.apply(Schema.scala:304)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withClusterDo$1.apply(CassandraConnector.scala:121)
at com.datastax.spark.connector.cql.
Re: Does saveToCassandra work with Cassandra Lucene plugin ?
With my limited experience with Spark, I can tell you that you need to make sure that all columns mentioned in SomeColumns are part of the CQL schema of the table. Thanks, Anuj Sent from Yahoo Mail on Android On Mon, 28 Mar, 2016 at 11:38 pm, Cleosson José Pirani de Souza wrote: Hello, I am implementing the example on GitHub (https://github.com/Stratio/cassandra-lucene-index), and when I try to save the data using saveToCassandra I get the exception NoSuchElementException. If I use CassandraConnector.withSessionDo I am able to add elements into Cassandra and no exception is raised. The code:

import org.apache.spark.{SparkConf, SparkContext, Logging}
import com.datastax.spark.connector.cql.CassandraConnector
import com.datastax.spark.connector._

object App extends Logging {
  def main(args: Array[String]) {
    // Get the cassandra IP and create the spark context
    val cassandraIP = System.getenv("CASSANDRA_IP")
    val sparkConf = new SparkConf(true)
      .set("spark.cassandra.connection.host", cassandraIP)
      .set("spark.cleaner.ttl", "3600")
      .setAppName("Simple Spark Cassandra Example")
    val sc = new SparkContext(sparkConf)

    // Works
    CassandraConnector(sparkConf).withSessionDo { session =>
      session.execute("INSERT INTO demo.tweets(id, user, body, time, latitude, longitude) VALUES (19, 'Name', 'Body', '2016-03-19 09:00:00-0300', 39, 39)")
    }

    // Does not work
    val demo = sc.parallelize(Seq((9, "Name", "Body", "2016-03-29 19:00:00-0300", 29, 29)))
    // Raises the exception
    demo.saveToCassandra("demo", "tweets", SomeColumns("id", "user", "body", "time", "latitude", "longitude"))
  }
}

The exception:

16/03/28 14:15:41 INFO CassandraConnector: Connected to Cassandra cluster: Test Cluster
Exception in thread "main" java.util.NoSuchElementException: Column not found in demo.tweets
at com.datastax.spark.connector.cql.StructDef$$anonfun$columnByName$2.apply(Schema.scala:60)
at com.datastax.spark.connector.cql.StructDef$$anonfun$columnByName$2.apply(Schema.scala:60)
at
scala.collection.Map$WithDefault.default(Map.scala:52)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:58)
at com.datastax.spark.connector.cql.TableDef$$anonfun$9.apply(Schema.scala:153)
at com.datastax.spark.connector.cql.TableDef$$anonfun$9.apply(Schema.scala:152)
at scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:722)
at scala.collection.immutable.Map$Map1.foreach(Map.scala:109)
at scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:721)
at com.datastax.spark.connector.cql.TableDef.<init>(Schema.scala:152)
at com.datastax.spark.connector.cql.Schema$$anonfun$com$datastax$spark$connector$cql$Schema$$fetchTables$1$2.apply(Schema.scala:283)
at com.datastax.spark.connector.cql.Schema$$anonfun$com$datastax$spark$connector$cql$Schema$$fetchTables$1$2.apply(Schema.scala:271)
at scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:722)
at scala.collection.immutable.Set$Set4.foreach(Set.scala:137)
at scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:721)
at com.datastax.spark.connector.cql.Schema$.com$datastax$spark$connector$cql$Schema$$fetchTables$1(Schema.scala:271)
at com.datastax.spark.connector.cql.Schema$$anonfun$com$datastax$spark$connector$cql$Schema$$fetchKeyspaces$1$2.apply(Schema.scala:295)
at com.datastax.spark.connector.cql.Schema$$anonfun$com$datastax$spark$connector$cql$Schema$$fetchKeyspaces$1$2.apply(Schema.scala:294)
at scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:722)
at scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:153)
at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:306)
at scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:721)
at com.datastax.spark.connector.cql.Schema$.com$datastax$spark$connector$cql$Schema$$fetchKeyspaces$1(Schema.scala:294)
at com.datastax.spark.connector.cql.Schema$$anonfun$fromCassandra$1.apply(Schema.scala:307)
at com.datastax.spark.connector.cql.Schema$$anonfun$fromCassandra$1.apply(Schema.scala:304)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withClusterDo$1.apply(CassandraConnector.scala:121)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withClusterDo$1.apply(CassandraConnector.scala:120)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:110)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:109)
at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:139)
at
Expiring Columns
Hi, I want to understand how expiring columns work in Cassandra. Query: The documentation says that once the TTL of a column expires, tombstones are created/marked when the sstable gets compacted. Is there a possibility that a query (range scan/row query) returns expired column data just because the sstable never participated in a compaction after the TTL of the column expired? For example:
10 AM - Data inserted with ttl=60 seconds
10:05 AM - A query is run on the inserted data
10:07 AM - The sstable is compacted and the column is marked as a tombstone
Will the query return expired data in the above scenario? If yes/no, why? Thanks, Anuj Sent from Yahoo Mail on Android
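A toy model of the scenario in question, under the assumption that expiration is also enforced on the read path (i.e., a cell whose TTL has elapsed is treated as dead when read, even before any compaction has written a tombstone for it):

```python
class Cell:
    """A stored value with an absolute expiry time (seconds since insert)."""
    def __init__(self, value, ttl_seconds, written_at):
        self.value = value
        self.expires_at = written_at + ttl_seconds

def read(cell, now):
    # read-time filter: an expired cell is invisible to queries
    # regardless of whether compaction has run yet
    return cell.value if now < cell.expires_at else None

c = Cell("data", ttl_seconds=60, written_at=0)   # inserted at 10:00, ttl=60s
assert read(c, now=30) == "data"                 # within the TTL: visible
assert read(c, now=300) is None                  # the 10:05 query: not returned
```

Under this model, compaction at 10:07 only reclaims the space; it is not what hides the data from the 10:05 query.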
IOException: MkDirs Failed to Create in Spark
Hi, We are using Spark with Cassandra. While using rdd.saveAsTextFile("/tmp/dr"), we get the following error when we run the application with root access. Spark is able to create two levels of directories but fails after that with the exception:

16/03/01 22:59:48 WARN TaskSetManager: Lost task 73.3 in stage 0.0 (TID 144, host1): java.io.IOException: Mkdirs failed to create file:/tmp/dr/_temporary/0/_temporary/attempt_201603012259__m_73_144
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:438)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:799)
at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1056)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1047)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Permissions on /tmp: chmod -R 777 /tmp has been executed, and the permissions look like: drwxrwxrwx. 31 root root 1.2K Mar 1 22:54 tmp. Forgive me for raising this question on the Cassandra mailing list. I think the Spark and Cassandra user bases overlap, so I expected help here. I am not yet part of the Spark mailing list. Thanks, Anuj
Re: how to read parent_repair_history table?
Hi Jimmy, We are on 2.0.x. We are planning to use JMX notifications for getting repair status. To repair the database, we call the forceTableRepairPrimaryRange JMX operation from our Java client application on each node. You can call other, more recent JMX methods for repair. I would be keen to know the pros/cons of handling repair status via JMX notifications vs. via database tables. We are planning to implement it as follows:
1. Before repairing each keyspace via JMX, register two listeners: one for listening to StorageService MBean notifications about repair status, and the other a connection listener for detecting connection failures and lost JMX notifications.
2. We ensure that if 256 success session notifications are received, the keyspace repair is successful. We have 256 ranges on each node.
3. If there are connection-closed notifications, we re-register the MBean listener and retry the repair once.
4. If there are lost notifications, we retry the repair once before failing it.
Thanks, Anuj Sent from Yahoo Mail on Android On Thu, 25 Feb, 2016 at 7:18 pm, Paulo Motta wrote: Hello Jimmy, The parent_repair_history table keeps track of start and finish information of a repair session. The other table, repair_history, keeps track of repair status as it progresses. So, you must first query the parent_repair_history table to check whether a repair started and finished, as well as its duration, and inspect the repair_history table to troubleshoot more specific details of a given repair session. Answering your questions below: > Is every invocation of nodetool repair recorded as one entry in the parent_repair_history CF, regardless of whether it is across DCs, a local node repair, or other options? Actually two entries, one for start and one for finish. > A repair job is done only if the "finished" column contains a value? and a repair job is successfully done only if there is no value in exception_message or exception_stacktrace? correct > What is the purpose of the successful_ranges column?
Do I have to check that they all match requested_ranges to ensure a successful run? correct > Ultimately, how to find out the overall repair health/status in a given cluster? Check whether repair is being executed on all nodes within gc_grace_seconds, and tune that value or troubleshoot problems otherwise. > Scanning through parent_repair_history and making sure all the known keyspaces have had a good repair run in recent days? Sounds good. You can check https://issues.apache.org/jira/browse/CASSANDRA-5839 for more information. 2016-02-25 3:13 GMT-03:00 Jimmy Lin: hi all, a few questions regarding how to read or digest the system_distributed.parent_repair_history CF, which I am very interested in using to find out our repair status...
- Is every invocation of nodetool repair recorded as one entry in the parent_repair_history CF, regardless of whether it is across DCs, a local node repair, or other options?
- A repair job is done only if the "finished" column contains a value? And a repair job is successfully done only if there is no value in exception_message or exception_stacktrace? What is the purpose of the successful_ranges column? Do I have to check they all match requested_ranges to ensure a successful run?
- Ultimately, how to find out the overall repair health/status in a given cluster? Scanning through parent_repair_history and making sure all the known keyspaces have had a good repair run in recent days?
---
CREATE TABLE system_distributed.parent_repair_history (
  parent_id timeuuid PRIMARY KEY,
  columnfamily_names set<text>,
  exception_message text,
  exception_stacktrace text,
  finished_at timestamp,
  keyspace_name text,
  requested_ranges set<text>,
  started_at timestamp,
  successful_ranges set<text>
)
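The status-tracking plan in steps 1-4 above could be modeled as follows (the range count and retry-once policy mirror the message; the actual JMX wiring is omitted, and the notification values are hypothetical stand-ins):

```python
EXPECTED_RANGES = 256   # one success notification expected per range

def repair_status(notifications, retried=False):
    """Decide the outcome of one keyspace repair from its notifications.

    notifications: sequence of "success" / "lost" / "connection_closed"
    retried: whether this run is already the one retry allowed.
    """
    successes = sum(1 for n in notifications if n == "success")
    disrupted = any(n in ("lost", "connection_closed") for n in notifications)
    if successes >= EXPECTED_RANGES:
        return "success"            # all 256 ranges reported success
    if disrupted and not retried:
        return "retry"              # re-register listener / re-run repair once
    return "failed"                 # incomplete even after the single retry

assert repair_status(["success"] * 256) == "success"
assert repair_status(["success"] * 10 + ["lost"]) == "retry"
assert repair_status(["success"] * 10 + ["lost"], retried=True) == "failed"
```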
Re: Restart Cassandra automatically
Hi Subharaj, Cassandra is built to be a fault-tolerant distributed db, suitable for building HA systems. As Cassandra provides multiple replicas for the same data, if a single node goes down in production, it won't bring down the cluster. In my opinion, if you aim to restart one or more failed Cassandra nodes without investigating the issue, you can damage system health rather than preserve it. Please set RF and CL appropriately to ensure that the system can tolerate node failures. Thanks, Anuj Sent from Yahoo Mail on Android On Fri, 5 Feb, 2016 at 9:56 am, Debraj Manna wrote: Hi, What is the best way to keep cassandra running? My requirement is that if for some reason cassandra stops, it should get started automatically. I tried to achieve this by adding cassandra to supervisord. My supervisor conf for cassandra looks like below:

[program:cassandra]
command=/bin/bash -c 'sleep 10 && bin/cassandra'
directory=/opt/cassandra/
autostart=true
autorestart=true
startretries=3
stderr_logfile=/var/log/cassandra_supervisor.err.log
stdout_logfile=/var/log/cassandra_supervisor.out.log

But it does not seem to work properly. Even if I stop cassandra from supervisor, the cassandra process still seems to be running if I do ps -ef | grep cassandra. I also tried the configuration mentioned in this question but still no luck. Can someone let me know the best way to keep cassandra running in a production environment? Environment - Cassandra 2.2.4 - Debian 8 Thanks,
Re: Debugging write timeouts on Cassandra 2.2.5
Hi Mike, Using batches with many rows puts a heavy load on the coordinator and is generally not considered good practice. With 1500 rows in a batch spanning different partition keys, even on a large cluster you will eventually end up waiting on every node in the cluster. This increases the likelihood of timeouts. As per my understanding of batches, I think you should revisit the batch size. Maybe people with expertise on batches can comment here. Thanks, Anuj Sent from Yahoo Mail on Android On Fri, 19 Feb, 2016 at 8:18 pm, Mike Heffner <m...@librato.com> wrote: Anuj, So we originally started testing with Java 8 + G1, however we were able to reproduce the same results with the default CMS settings that ship in the cassandra-env.sh from the Deb pkg. We didn't detect any large GC pauses during the runs. The query pattern during our testing was 100% writes, batching (via Thrift mostly) to 5 tables, between 6-1500 rows per batch. Mike On Thu, Feb 18, 2016 at 12:22 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: What's the GC overhead? Can you share your GC collector and settings? What's your query pattern? Do you use secondary indexes, batches, IN clause etc? Anuj Sent from Yahoo Mail on Android On Thu, 18 Feb, 2016 at 8:45 pm, Mike Heffner <m...@librato.com> wrote: Alain, Thanks for the suggestions. Sure, tpstats are here: https://gist.github.com/mheffner/a979ae1a0304480b052a. Looking at the metrics across the ring, there were no blocked tasks nor dropped messages. Iowait metrics look fine, so it doesn't appear to be blocking on disk. Similarly, there are no long GC pauses. We haven't noticed latency on any particular table higher than others or correlated around the occurrence of a timeout. We have noticed with further testing that running cassandra-stress against the ring, while our workload is writing to the same ring, will incur similar 10 second timeouts. If our workload is not writing to the ring, cassandra-stress will run without hitting timeouts.
This seems to imply that our workload pattern is causing something to block cluster-wide, since the stress tool writes to a different keyspace than our workload. I mentioned in another reply that we've tracked it to something between 2.0.x and 2.1.x, so we are focusing on narrowing down which point release it was introduced in. Cheers, Mike On Thu, Feb 18, 2016 at 3:33 AM, Alain RODRIGUEZ <arodr...@gmail.com> wrote: Hi Mike, What about the output of tpstats? I imagine you have dropped messages there. Any blocked threads? Could you paste this output here? Might this be due to some network hiccup accessing the disks, as they are EBS? Can you think of any way of checking this? Do you have a lot of GC logs, and how long are the pauses (use something like: grep -i 'GCInspector' /var/log/cassandra/system.log)? Something else you could check is local write stats, to see if only one table is affected or this is keyspace / cluster wide. You can use metrics exposed by Cassandra, or if you have no dashboards I believe a 'nodetool cfstats | grep -e 'Table:' -e 'Local'' should give you a rough idea of local latencies. Those are just things I would check; I don't have a clue what is happening here, hope this helps. C*heers, Alain Rodriguez, France The Last Pickle http://www.thelastpickle.com 2016-02-18 5:13 GMT+01:00 Mike Heffner <m...@librato.com>: Jaydeep, No, we don't use any lightweight transactions. Mike On Wed, Feb 17, 2016 at 6:44 PM, Jaydeep Chovatia <chovatia.jayd...@gmail.com> wrote: Are you guys using lightweight transactions in your write path? On Thu, Feb 11, 2016 at 12:36 AM, Fabrice Facorat <fabrice.faco...@gmail.com> wrote: Are your commitlog and data on the same disk? If yes, you should put commitlogs on a separate disk which doesn't have a lot of IO. Other IO may have a great impact on your commitlog writing and may even block it.
An example of the impact IO may have, even for async writes: https://engineering.linkedin.com/blog/2016/02/eliminating-large-jvm-gc-pauses-caused-by-background-io-traffic 2016-02-11 0:31 GMT+01:00 Mike Heffner <m...@librato.com>: > Jeff, > > We have both commitlog and data on a 4TB EBS with 10k IOPS. > > Mike > > On Wed, Feb 10, 2016 at 5:28 PM, Jeff Jirsa <jeff.ji...@crowdstrike.com> > wrote: >> >> What disk size are you using? >> >> >> >> From: Mike Heffner >> Reply-To: "user@cassandra.apache.org" >> Date: Wednesday, February 10, 2016 at 2:24 PM >> To: "user@cassandra.apache.org" >> Cc: Peter Norton >> Subject: Re: Debugging write timeouts on Cassandra 2.2.5 >> >> Paulo, >> >> Thanks for the suggestion, we ran some tests against CMS and saw the same >> timeouts. On that note though, we are going to try doubling the instance >> sizes and testing with double the heap (even though current usage is low).
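The batch-size advice in this thread can be applied mechanically: instead of one 1500-row batch spanning many partitions, split the rows into one batch per partition so each batch is served by the replicas that own that partition. A minimal sketch, with rows as plain dicts and a hypothetical "pk" partition-key field (the helper name is illustrative, not a driver API):

```python
from collections import defaultdict

def split_by_partition(rows, key="pk"):
    """Group rows by partition key so each batch touches a single
    partition, keeping the coordinator from fanning out cluster-wide.
    Hypothetical helper: rows are dicts, `key` names the partition key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return list(groups.values())

# 9 rows spread over 3 partitions -> 3 single-partition batches
rows = [{"pk": i % 3, "v": i} for i in range(9)]
batches = split_by_partition(rows)
print(len(batches))  # 3
```

Each resulting group would then be sent as its own (unlogged) batch, or simply as individual async writes, rather than as one giant multi-partition batch.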
Re: Debugging write timeouts on Cassandra 2.2.5
What's the GC overhead? Can you share your GC collector and settings? What's your query pattern? Do you use secondary indexes, batches, IN clause etc? Anuj Sent from Yahoo Mail on Android On Thu, 18 Feb, 2016 at 8:45 pm, Mike Heffner wrote: Alain, Thanks for the suggestions. Sure, tpstats are here: https://gist.github.com/mheffner/a979ae1a0304480b052a. Looking at the metrics across the ring, there were no blocked tasks nor dropped messages. Iowait metrics look fine, so it doesn't appear to be blocking on disk. Similarly, there are no long GC pauses. We haven't noticed latency on any particular table higher than others or correlated around the occurrence of a timeout. We have noticed with further testing that running cassandra-stress against the ring, while our workload is writing to the same ring, will incur similar 10 second timeouts. If our workload is not writing to the ring, cassandra-stress will run without hitting timeouts. This seems to imply that our workload pattern is causing something to block cluster-wide, since the stress tool writes to a different keyspace than our workload. I mentioned in another reply that we've tracked it to something between 2.0.x and 2.1.x, so we are focusing on narrowing down which point release it was introduced in. Cheers, Mike On Thu, Feb 18, 2016 at 3:33 AM, Alain RODRIGUEZ wrote: Hi Mike, What about the output of tpstats? I imagine you have dropped messages there. Any blocked threads? Could you paste this output here? Might this be due to some network hiccup accessing the disks, as they are EBS? Can you think of any way of checking this? Do you have a lot of GC logs, and how long are the pauses (use something like: grep -i 'GCInspector' /var/log/cassandra/system.log)? Something else you could check is local write stats, to see if only one table is affected or this is keyspace / cluster wide.
You can use metrics exposed by Cassandra, or if you have no dashboards I believe a 'nodetool cfstats | grep -e 'Table:' -e 'Local'' should give you a rough idea of local latencies. Those are just things I would check; I don't have a clue what is happening here, hope this helps. C*heers, Alain Rodriguez, France The Last Pickle http://www.thelastpickle.com 2016-02-18 5:13 GMT+01:00 Mike Heffner : Jaydeep, No, we don't use any lightweight transactions. Mike On Wed, Feb 17, 2016 at 6:44 PM, Jaydeep Chovatia wrote: Are you guys using lightweight transactions in your write path? On Thu, Feb 11, 2016 at 12:36 AM, Fabrice Facorat wrote: Are your commitlog and data on the same disk? If yes, you should put commitlogs on a separate disk which doesn't have a lot of IO. Other IO may have a great impact on your commitlog writing and may even block it. An example of the impact IO may have, even for async writes: https://engineering.linkedin.com/blog/2016/02/eliminating-large-jvm-gc-pauses-caused-by-background-io-traffic 2016-02-11 0:31 GMT+01:00 Mike Heffner : > Jeff, > > We have both commitlog and data on a 4TB EBS with 10k IOPS. > > Mike > > On Wed, Feb 10, 2016 at 5:28 PM, Jeff Jirsa > wrote: >> >> What disk size are you using? >> >> >> >> From: Mike Heffner >> Reply-To: "user@cassandra.apache.org" >> Date: Wednesday, February 10, 2016 at 2:24 PM >> To: "user@cassandra.apache.org" >> Cc: Peter Norton >> Subject: Re: Debugging write timeouts on Cassandra 2.2.5 >> >> Paulo, >> >> Thanks for the suggestion, we ran some tests against CMS and saw the same >> timeouts. On that note though, we are going to try doubling the instance >> sizes and testing with double the heap (even though current usage is low). >> >> Mike >> >> On Wed, Feb 10, 2016 at 3:40 PM, Paulo Motta >> wrote: >>> >>> Are you using the same GC settings as the staging 2.0 cluster? If not, >>> could you try using the default GC settings (CMS) and see if that changes >>> anything?
This is just a wild guess, but there were reports before of >>> G1-caused instabilities with small heap sizes (< 16GB - see CASSANDRA-10403 >>> for more context). Please ignore if you already tried reverting back to CMS. >>> >>> 2016-02-10 16:51 GMT-03:00 Mike Heffner : Hi all, We've recently embarked on a project to update our Cassandra infrastructure running on EC2. We are long time users of 2.0.x and are testing out a move to version 2.2.5 running on VPC with EBS. Our test setup is a 3 node, RF=3 cluster supporting a small write load (mirror of our staging load). We are writing at QUORUM and while p95's look good compared to our staging 2.0.x cluster, we are seeing frequent write operations that time out at the max write_request_timeout_in_ms (10 seconds). CPU across the cluster is < 10% and EBS write load is < 100
Re: Scenarios which need Repair
Hi, Can someone take this? Thanks, Anuj On Mon, 8 Feb, 2016 at 11:44 pm, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Hi, Setup: We are on 2.0.14. We have some deployments with just one DC (RF:3) while others have two DCs (RF:3, RF:3). We ALWAYS use LOCAL_QUORUM for both reads and writes. Scenario: We used to run nodetool repair on all tables every gc_grace_seconds. Recently, we decided to identify tables which really need repair and only run repair on those. We have identified basically two kinds of tables which don't need repair: TYPE-1: tables which have only inserts, no upserts, and deletion by TTL. TYPE-2: tables with a counter column. I don't have much experience with counters, but I can explain the use case. We use a counter column to keep a check on our traffic rate. Values are usually updated numerous times a minute and we need not be ACCURATE with the value; a few values here and there are OK. Questions: Can we COMPLETELY avoid maintenance repair on TYPE-1 and TYPE-2 tables? If yes, will there be any side effect of not repairing such data often in case of dropped mutations, failure scenarios etc.? What would be the scenarios when repair is needed on such tables? Thanks Anuj
Re: Moving Away from Compact Storage
Will it be possible to read the dynamic-column data from compact storage and transform it into a collection, e.g. a map, in the new table? Thanks, Anuj Sent from Yahoo Mail on Android On Wed, 3 Feb, 2016 at 12:28 am, DuyHai Doan <doanduy...@gmail.com> wrote: So there is no "static" (in the sense of CQL static) column in your legacy table. Just define a Scala case class to match this table and use Spark to dump the content to a new non-compact CQL table On Tue, Feb 2, 2016 at 7:55 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Our old table looks like this from cqlsh: CREATE TABLE table1 ( key text, "Col1" blob, "Col2" text, "Col3" text, "Col4" text, PRIMARY KEY (key)) WITH COMPACT STORAGE AND … And it will have some dynamic text data which we are planning to add to collections. Please let me know if you need more details. Thanks, Anuj Sent from Yahoo Mail on Android On Wed, 3 Feb, 2016 at 12:14 am, DuyHai Doan <doanduy...@gmail.com> wrote: Can you give the CREATE TABLE script for your old compact storage table? Or at least the cassandra-cli creation script On Tue, Feb 2, 2016 at 3:48 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Thanks DuyHai !! We were also thinking of doing it the "Spark" way but I was not sure that it would be so simple :) We have a compact storage CF with each row having some data in statically defined columns and other data in dynamic columns. Is the approach mentioned in the link adaptable to the scenario where we want to migrate the existing data to a non-compact CF with static columns and collections? Thanks Anuj On Tue, 2/2/16, DuyHai Doan <doanduy...@gmail.com> wrote: Subject: Re: Moving Away from Compact Storage To: user@cassandra.apache.org Date: Tuesday, 2 February, 2016, 12:57 AM Use Apache Spark to parallelize the data migration.
Look at this piece of code https://github.com/doanduyhai/Cassandra-Spark-Demo/blob/master/src/main/scala/usecases/MigrateAlbumsData.scala#L58-L60 If your source and target tables have the SAME structure (except for the COMPACT STORAGE clause), migration with Spark is 2 lines of code On Mon, Feb 1, 2016 at 8:14 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Hi, What's the fastest and most reliable way to migrate data from a compact storage table to a non-compact storage table? I was not able to find any command for dropping the compact storage directive, so I think migrating data is the only way... any suggestions? Thanks, Anuj
Re: Moving Away from Compact Storage
Our old table looks like this from cqlsh: CREATE TABLE table1 ( key text, "Col1" blob, "Col2" text, "Col3" text, "Col4" text, PRIMARY KEY (key)) WITH COMPACT STORAGE AND … And it will have some dynamic text data which we are planning to add to collections. Please let me know if you need more details. Thanks, Anuj Sent from Yahoo Mail on Android On Wed, 3 Feb, 2016 at 12:14 am, DuyHai Doan <doanduy...@gmail.com> wrote: Can you give the CREATE TABLE script for your old compact storage table? Or at least the cassandra-cli creation script On Tue, Feb 2, 2016 at 3:48 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Thanks DuyHai !! We were also thinking of doing it the "Spark" way but I was not sure that it would be so simple :) We have a compact storage CF with each row having some data in statically defined columns and other data in dynamic columns. Is the approach mentioned in the link adaptable to the scenario where we want to migrate the existing data to a non-compact CF with static columns and collections? Thanks Anuj On Tue, 2/2/16, DuyHai Doan <doanduy...@gmail.com> wrote: Subject: Re: Moving Away from Compact Storage To: user@cassandra.apache.org Date: Tuesday, 2 February, 2016, 12:57 AM Use Apache Spark to parallelize the data migration. Look at this piece of code https://github.com/doanduyhai/Cassandra-Spark-Demo/blob/master/src/main/scala/usecases/MigrateAlbumsData.scala#L58-L60 If your source and target tables have the SAME structure (except for the COMPACT STORAGE clause), migration with Spark is 2 lines of code On Mon, Feb 1, 2016 at 8:14 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Hi, What's the fastest and most reliable way to migrate data from a compact storage table to a non-compact storage table? I was not able to find any command for dropping the compact storage directive, so I think migrating data is the only way... any suggestions? Thanks, Anuj
Re: Moving Away from Compact Storage
Thanks DuyHai !! We were also thinking of doing it the "Spark" way but I was not sure that it would be so simple :) We have a compact storage CF with each row having some data in statically defined columns and other data in dynamic columns. Is the approach mentioned in the link adaptable to the scenario where we want to migrate the existing data to a non-compact CF with static columns and collections? Thanks Anuj On Tue, 2/2/16, DuyHai Doan <doanduy...@gmail.com> wrote: Subject: Re: Moving Away from Compact Storage To: user@cassandra.apache.org Date: Tuesday, 2 February, 2016, 12:57 AM Use Apache Spark to parallelize the data migration. Look at this piece of code https://github.com/doanduyhai/Cassandra-Spark-Demo/blob/master/src/main/scala/usecases/MigrateAlbumsData.scala#L58-L60 If your source and target tables have the SAME structure (except for the COMPACT STORAGE clause), migration with Spark is 2 lines of code On Mon, Feb 1, 2016 at 8:14 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Hi, What's the fastest and most reliable way to migrate data from a compact storage table to a non-compact storage table? I was not able to find any command for dropping the compact storage directive, so I think migrating data is the only way... any suggestions? Thanks, Anuj
Re: Cassandra's log is full of messages reset by peers even without traffic
Hi Jean, As mentioned in the DataStax link, your TCP connections will be marked dead after 300 + 75*9 = 975 seconds. Make sure that your firewall idle timeout is more than 975 seconds; otherwise the firewall will drop connections and you may face issues. You can also try setting all three values the same as mentioned in the link to see whether the problem gets resolved. Thanks, Anuj Sent from Yahoo Mail on Android On Mon, 1 Feb, 2016 at 9:18 pm, Jean Carlo <jean.jeancar...@gmail.com> wrote: Hello Anuj, I checked my settings and this is what I got: root@node001[SPH][BENCH][PnS3]:~$ sysctl -A | grep net.ipv4 | grep net.ipv4.tcp_keepalive_probes net.ipv4.tcp_keepalive_probes = 9 root@node001[SPH][BENCH][PnS3]:~$ sysctl -A | grep net.ipv4 | grep net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_intvl = 75 root@node001[SPH][BENCH][PnS3]:~$ sysctl -A | grep net.ipv4 | grep net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_time = 300 The tcp_keepalive_time is quite high in comparison with what is written in the doc https://docs.datastax.com/en/cassandra/2.1/cassandra/troubleshooting/trblshootIdleFirewall.html Do you think that is ok? Best regards Jean Carlo "The best way to predict the future is to invent it" Alan Kay On Fri, Jan 29, 2016 at 11:02 AM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Hi Jean, Please make sure that your firewall is not dropping TCP connections which are in use. The TCP keepalive on all nodes must be less than the firewall setting. Please refer to https://docs.datastax.com/en/cassandra/2.0/cassandra/troubleshooting/trblshootIdleFirewall.html for details on TCP settings. Thanks, Anuj Sent from Yahoo Mail on Android On Fri, 29 Jan, 2016 at 3:21 pm, Jean Carlo <jean.jeancar...@gmail.com> wrote: Hello guys, I have a Cassandra 2.1.12 cluster with 6 nodes.
All the logs on my nodes contain these messages, marked as INFO INFO [SharedPool-Worker-1] 2016-01-29 10:40:57,745 Message.java:532 - Unexpected exception during request; channel = [id: 0xff15eb8c, /172.16.162.4:9042] java.io.IOException: Error while read(...): Connection reset by peer at io.netty.channel.epoll.Native.readAddress(Native Method) ~[netty-all-4.0.23.Final.jar:4.0.23.Final] at io.netty.channel.epoll.EpollSocketChannel$EpollSocketUnsafe.doReadBytes(EpollSocketChannel.java:675) ~[netty-all-4.0.23.Final.jar:4.0.23.Final] at io.netty.channel.epoll.EpollSocketChannel$EpollSocketUnsafe.epollInReady(EpollSocketChannel.java:714) ~[netty-all-4.0.23.Final.jar:4.0.23.Final] at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:326) ~[netty-all-4.0.23.Final.jar:4.0.23.Final] at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:264) ~[netty-all-4.0.23.Final.jar:4.0.23.Final] at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) ~[netty-all-4.0.23.Final.jar:4.0.23.Final] at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) ~[netty-all-4.0.23.Final.jar:4.0.23.Final] at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60] This happens whether the cluster is stressed or not. Btw it is not production. The IP shown there (172.16.162.4) belongs to a node of the cluster; this is not the only node that appears, actually all the nodes' IPs are showing that reset-by-peer problem. Our cluster has more reads than writes, around 50 reads per second. Anyone got the same problem? Best regards Jean Carlo "The best way to predict the future is to invent it" Alan Kay
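The 975-second figure above comes straight from the three kernel settings: the idle time before the first probe, plus one interval per unacknowledged probe. As a quick sketch of the arithmetic (the helper name is illustrative):

```python
def dead_connection_seconds(keepalive_time, keepalive_intvl, keepalive_probes):
    """Worst-case time before the kernel declares an idle TCP connection
    dead: the idle timer plus one probe interval per failed probe
    (net.ipv4.tcp_keepalive_time/_intvl/_probes)."""
    return keepalive_time + keepalive_intvl * keepalive_probes

# Values from Jean's nodes
print(dead_connection_seconds(300, 75, 9))  # 975
```

Any firewall that drops idle connections sooner than this will kill sockets the kernel still considers alive, which matches the "Connection reset by peer" pattern in the logs.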
Cqlsh hangs & closes automatically
My cqlsh prompt hangs and closes if I try to fetch just 100 rows using a select * query. cassandra-cli does the job. Any solution? Thanks, Anuj
Re: Moving Away from Compact Storage
By dynamic columns, I mean columns not defined in the schema. In the current scenario, every row has some data in columns which are defined in the schema, while the rest of the data is in columns which are not. We used Thrift for inserting data. In the new schema, we want to create a collection column and put all the data which was in columns NOT defined in the schema into the collection. Thanks, Anuj Sent from Yahoo Mail on Android On Wed, 3 Feb, 2016 at 12:36 am, DuyHai Doan <doanduy...@gmail.com> wrote: You'll need to do the transformation in Spark, although I don't understand what you mean by "dynamic columns". Given the CREATE TABLE script you gave earlier, there is nothing such as dynamic columns On Tue, Feb 2, 2016 at 8:01 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Will it be possible to read the dynamic-column data from compact storage and transform it into a collection, e.g. a map, in the new table? Thanks, Anuj Sent from Yahoo Mail on Android On Wed, 3 Feb, 2016 at 12:28 am, DuyHai Doan <doanduy...@gmail.com> wrote: So there is no "static" (in the sense of CQL static) column in your legacy table. Just define a Scala case class to match this table and use Spark to dump the content to a new non-compact CQL table On Tue, Feb 2, 2016 at 7:55 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Our old table looks like this from cqlsh: CREATE TABLE table1 ( key text, "Col1" blob, "Col2" text, "Col3" text, "Col4" text, PRIMARY KEY (key)) WITH COMPACT STORAGE AND … And it will have some dynamic text data which we are planning to add to collections. Please let me know if you need more details. Thanks, Anuj Sent from Yahoo Mail on Android On Wed, 3 Feb, 2016 at 12:14 am, DuyHai Doan <doanduy...@gmail.com> wrote: Can you give the CREATE TABLE script for your old compact storage table? Or at least the cassandra-cli creation script On Tue, Feb 2, 2016 at 3:48 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Thanks DuyHai !!
We were also thinking of doing it the "Spark" way but I was not sure that it would be so simple :) We have a compact storage CF with each row having some data in statically defined columns and other data in dynamic columns. Is the approach mentioned in the link adaptable to the scenario where we want to migrate the existing data to a non-compact CF with static columns and collections? Thanks Anuj On Tue, 2/2/16, DuyHai Doan <doanduy...@gmail.com> wrote: Subject: Re: Moving Away from Compact Storage To: user@cassandra.apache.org Date: Tuesday, 2 February, 2016, 12:57 AM Use Apache Spark to parallelize the data migration. Look at this piece of code https://github.com/doanduyhai/Cassandra-Spark-Demo/blob/master/src/main/scala/usecases/MigrateAlbumsData.scala#L58-L60 If your source and target tables have the SAME structure (except for the COMPACT STORAGE clause), migration with Spark is 2 lines of code On Mon, Feb 1, 2016 at 8:14 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Hi, What's the fastest and most reliable way to migrate data from a compact storage table to a non-compact storage table? I was not able to find any command for dropping the compact storage directive, so I think migrating data is the only way... any suggestions? Thanks, Anuj
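The transformation discussed in this thread (Thrift-era dynamic columns folded into a CQL map) can be sketched as a plain per-row function, independent of whether Spark or a hand-rolled migrator runs it: keep the columns that exist in the new schema and put everything else into a map. The helper and column names below are illustrative only:

```python
# Columns statically defined in the old schema (from Anuj's CREATE TABLE)
STATIC_COLUMNS = {"key", "Col1", "Col2", "Col3", "Col4"}

def to_new_row(old_row):
    """Split a wide Thrift row into the statically defined columns plus a
    map collection ('dynamic' is a hypothetical map<text,text> column in
    the new non-compact table) holding everything else."""
    new_row = {k: v for k, v in old_row.items() if k in STATIC_COLUMNS}
    new_row["dynamic"] = {k: v for k, v in old_row.items()
                          if k not in STATIC_COLUMNS}
    return new_row

old = {"key": "u1", "Col1": b"\x00",
       "2016-01-01": "login", "2016-01-02": "logout"}
print(to_new_row(old)["dynamic"])  # {'2016-01-01': 'login', '2016-01-02': 'logout'}
```

In a Spark job this function would sit in the map step between reading the compact-storage table and writing the new one.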
Moving Away from Compact Storage
Hi, What's the fastest and most reliable way to migrate data from a compact storage table to a non-compact storage table? I was not able to find any command for dropping the compact storage directive, so I think migrating data is the only way... any suggestions? Thanks, Anuj
Re: Cassandra's log is full of messages reset by peers even without traffic
Hi Jean, Please make sure that your firewall is not dropping TCP connections which are in use. The TCP keepalive on all nodes must be less than the firewall setting. Please refer to https://docs.datastax.com/en/cassandra/2.0/cassandra/troubleshooting/trblshootIdleFirewall.html for details on TCP settings. Thanks, Anuj Sent from Yahoo Mail on Android On Fri, 29 Jan, 2016 at 3:21 pm, Jean Carlo wrote: Hello guys, I have a Cassandra 2.1.12 cluster with 6 nodes. All the logs on my nodes contain these messages, marked as INFO INFO [SharedPool-Worker-1] 2016-01-29 10:40:57,745 Message.java:532 - Unexpected exception during request; channel = [id: 0xff15eb8c, /172.16.162.4:9042] java.io.IOException: Error while read(...): Connection reset by peer at io.netty.channel.epoll.Native.readAddress(Native Method) ~[netty-all-4.0.23.Final.jar:4.0.23.Final] at io.netty.channel.epoll.EpollSocketChannel$EpollSocketUnsafe.doReadBytes(EpollSocketChannel.java:675) ~[netty-all-4.0.23.Final.jar:4.0.23.Final] at io.netty.channel.epoll.EpollSocketChannel$EpollSocketUnsafe.epollInReady(EpollSocketChannel.java:714) ~[netty-all-4.0.23.Final.jar:4.0.23.Final] at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:326) ~[netty-all-4.0.23.Final.jar:4.0.23.Final] at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:264) ~[netty-all-4.0.23.Final.jar:4.0.23.Final] at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) ~[netty-all-4.0.23.Final.jar:4.0.23.Final] at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) ~[netty-all-4.0.23.Final.jar:4.0.23.Final] at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60] This happens whether the cluster is stressed or not. Btw it is not production. The IP shown there (172.16.162.4) belongs to a node of the cluster; this is not the only node that appears, actually all the nodes' IPs are showing that reset-by-peer problem.
Our cluster has more reads than writes, around 50 reads per second. Anyone got the same problem? Best regards Jean Carlo "The best way to predict the future is to invent it" Alan Kay
Re: Read operations freeze for a few second while adding a new node
Hi Lorand, Do you see any different GC pattern during these 20 seconds? In 2.0.x, memtables create a lot of heap pressure, so in a way reads are not isolated from writes. Frankly speaking, I would have accepted a 20 second slowness, as scaling is a one-time activity. But maybe your business case doesn't make that acceptable. Such tough requirements often drive improvements. Thanks, Anuj Sent from Yahoo Mail on Android On Thu, 28 Jan, 2016 at 9:41 pm, Lorand Kasler wrote: Hi, We are struggling with a problem where, while adding nodes, around 5% of read operations freeze (i.e. time out after 1 second) for a few seconds (10-20 seconds). It might not seem much, but at the order of 200k requests per second that's quite a big disruption. It is well documented and known that adding nodes *has* an impact on the latency or the completion of requests, but is there a way to lessen it? It is completely okay for write operations to fail or get blocked while adding nodes, but having the read path impacted by this much (going from 30 millisecond 99th percentile latency to above 1 second) is what puzzles us. We have a 36 node cluster, every node owning ~120 GB of data. We are using Cassandra version 2.0.14 with vnodes and we are in the process of increasing the capacity of the cluster by roughly doubling the nodes. They have SSDs and have peak IO usage of ~30%. Apart from the latency metrics, only FlushWrites are blocked 18% of the time (based on the tpstats counters), but that can only lead to blocking writes and not reads? Thank you
Re: Using TTL for data purge
On second thought, if you are anyway reading the user table on each website access and can afford the extra IO, the first option looks more appropriate, as it will ease the pain of manual purging maintenance and won't need full table scans. Thanks, Anuj Sent from Yahoo Mail on Android On Sat, 23 Jan, 2016 at 12:16 am, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Give a deep thought to your use case. Different user tables/types may have different purge strategies based on how frequently a user account type is usually accessed, what the user count is for each user type, and so on. Thanks, Anuj Sent from Yahoo Mail on Android On Fri, 22 Jan, 2016 at 11:37 pm, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Hi Joseph, I am personally in favour of the second approach because I don't want to do a lot of IO just because a user is accessing the site several times a day. Options I see: 1. If you are on SSDs, test LCS and update the TTL of all columns at each access. This will make sure that the system can tolerate the extra IO. Advantage: No scheduling job needed. Deletion is seamless. Improved read performance compared to STCS. Disadvantage: To reinsert records with a new TTL you would do a read before write, which is an anti-pattern and slow. Active users will cause unnecessary IO just for updating the TTL. High IO due to LCS too. 2. Create a new table with the user id as key and last access time, instead of relying on inbuilt secondary indexes. Overwrite the last access time at each access. Schedule a job to read this table at regular intervals, maybe once a week, and manually delete users from the main table based on the last access time. You can test using LCS with the new table. Advantage: Lightweight writes for updating access time. Flexibility to update the deletion logic. Disadvantage: A manual scheduling job and code need to be implemented. The scheduler would need a slow full table scan of users to know the last access time. Full table scans could be done via token-based parallel CQL queries for better performance.
Using an Apache Spark job to find users to be purged would do that at tremendous speed. Secondary indexes are not suitable and don't scale well; I would suggest dropping them. Thanks, Anuj On Tue, 22 Dec, 2015 at 3:06 pm, jaalex.tech <jaalex.t...@gmail.com> wrote: Hi, I'm looking for suggestions/caveats on using TTL as a substitute for a manual data purge job. We have a few tables that hold user information - these could be guest or registered users, and there could be between 500K and 1M records created per day per table. Currently, these tables have a secondary-indexed updated_date column which is populated on each update. However, we have been getting timeouts when running queries using updated_date when the number of records is high, so I don't think this would be a reliable option in the long term when we need to purge records that have not been used for the last X days. In this scenario, is it advisable to include a high enough TTL (i.e. the amount of time we want these to last, could be 3 to 6 months) when inserting/updating records? There could be cases where the TTL may get reset after a couple of days/weeks, when the user visits the site again. The tables have a fixed number of columns, except for one which has a clustering key and may have at most 10 entries per partition key. I need to know the overhead of having so many rows with TTLs hanging around for a relatively long duration (weeks/months), and the impact it could have on performance/storage. If this is not a recommended approach, what would be an alternate design which could be used for a manual purge job, without using secondary indices? We are using Cassandra 2.0.x. Thanks, Joseph
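The token-based parallel full table scan mentioned above works by carving the full Murmur3 token range into contiguous subranges and issuing one range query per worker (e.g. SELECT ... WHERE token(id) > ? AND token(id) <= ?). A sketch of just the splitting arithmetic, assuming the default Murmur3Partitioner bounds (the helper is illustrative, not a driver API):

```python
# Murmur3Partitioner token bounds
MIN_TOKEN, MAX_TOKEN = -2**63, 2**63 - 1

def token_subranges(splits):
    """Carve the full token range into `splits` contiguous (lo, hi]
    subranges so a purge job can scan them in parallel."""
    span = MAX_TOKEN - MIN_TOKEN
    bounds = [MIN_TOKEN + span * i // splits for i in range(splits + 1)]
    bounds[-1] = MAX_TOKEN          # guard against integer-division drift
    return list(zip(bounds[:-1], bounds[1:]))

ranges = token_subranges(8)
print(len(ranges))  # 8 subranges covering the whole ring
```

Each subrange becomes one independent query, so eight workers (or Spark tasks) can sweep the whole table without any two scanning the same rows.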
Re: Using TTL for data purge
Give a deep thought to your use case. Different user tables/types may have different purge strategies based on how frequently a user account type is usually accessed, what the user count is for each user type, and so on. Thanks, Anuj Sent from Yahoo Mail on Android On Fri, 22 Jan, 2016 at 11:37 pm, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: Hi Joseph, I am personally in favour of the second approach because I don't want to do a lot of IO just because a user is accessing the site several times a day. Options I see: 1. If you are on SSDs, test LCS and update the TTL of all columns at each access. This will make sure that the system can tolerate the extra IO. Advantage: No scheduling job needed. Deletion is seamless. Improved read performance compared to STCS. Disadvantage: To reinsert records with a new TTL you would do a read before write, which is an anti-pattern and slow. Active users will cause unnecessary IO just for updating the TTL. High IO due to LCS too. 2. Create a new table with the user id as key and last access time, instead of relying on inbuilt secondary indexes. Overwrite the last access time at each access. Schedule a job to read this table at regular intervals, maybe once a week, and manually delete users from the main table based on the last access time. You can test using LCS with the new table. Advantage: Lightweight writes for updating access time. Flexibility to update the deletion logic. Disadvantage: A manual scheduling job and code need to be implemented. The scheduler would need a slow full table scan of users to know the last access time. Full table scans could be done via token-based parallel CQL queries for better performance. Using an Apache Spark job to find users to be purged would do that at tremendous speed. Secondary indexes are not suitable and don't scale well; I would suggest dropping them. Thanks, Anuj On Tue, 22 Dec, 2015 at 3:06 pm, jaalex.tech <jaalex.t...@gmail.com> wrote: Hi, I'm looking for suggestions/caveats on using TTL as a substitute for a manual data purge job.
We have a few tables that hold user information - these could be guest or registered users, and there could be between 500K and 1M records created per day per table. Currently, these tables have a secondary-indexed updated_date column which is populated on each update. However, we have been getting timeouts when running queries on updated_date when the number of records is high, so I don't think this would be a reliable option in the long term when we need to purge records that have not been used for the last X days.

In this scenario, is it advisable to set a high enough TTL (i.e. the amount of time we want these records to last, could be 3 to 6 months) when inserting/updating records? There could be cases where the TTL gets reset after a couple of days/weeks, when the user visits the site again. The tables have a fixed number of columns, except for one which has a clustering key and may have at most 10 entries per partition key. I need to know the overhead of having so many rows with TTLs hanging around for a relatively long duration (weeks/months), and the impact this could have on performance/storage. If this is not a recommended approach, what would be an alternate design for a manual purge job without using secondary indices? We are using Cassandra 2.0.x. Thanks, Joseph
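The token-based parallel full scan suggested in the reply above can be sketched as follows: split the Murmur3 token ring into N contiguous ranges and build one SELECT per range, which a scheduler (or a Spark job) could then run in parallel. The table and column names here are hypothetical, not from the thread.

```python
# Murmur3Partitioner token bounds (the default partitioner's full ring).
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1

def token_range_queries(table, pk, splits):
    """Build `splits` non-overlapping token-range SELECTs covering the ring."""
    step = (MAX_TOKEN - MIN_TOKEN) // splits
    queries = []
    start = MIN_TOKEN
    for i in range(splits):
        # Last range is stretched to MAX_TOKEN to cover rounding remainder.
        end = MAX_TOKEN if i == splits - 1 else start + step
        queries.append(
            f"SELECT {pk}, last_access FROM {table} "
            f"WHERE token({pk}) >= {start} AND token({pk}) <= {end}"
        )
        start = end + 1
    return queries
```

Each query can then be executed against a different coordinator; together they cover every partition exactly once.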
Re: Production with Single Node
And I think in a 3-node cluster, RAID 0 would do the job instead of RAID 5, so you would need fewer disks to get the same usable space, while Cassandra's replication protects you against disk failures and in fact entire node failures. Anuj Sent from Yahoo Mail on Android

On Sat, 23 Jan, 2016 at 10:30 am, Anuj Wadehra<anujw_2...@yahoo.co.in> wrote: I think Jonathan said it earlier. You may be happy with the performance for now because you are using the same commitlog settings that you use in large clusters. Test the recommended new setting so that you know the real picture, or be prepared to lose some data in case of failure. Other than durability, your single-node cluster would be a single point of failure for your site. RAID 5 will only protect you against a disk failure, but a server may be down for other reasons too. The question is: are you OK with the site going down? I would suggest using hardware with a smaller configuration to save cost on smaller sites and going ahead with a 3-node minimum. That way you provide all the good features of your design irrespective of the site; Cassandra is known to work on commodity servers too. Thanks, Anuj Sent from Yahoo Mail on Android

On Sat, 23 Jan, 2016 at 4:23 am, Jack Krupansky<jack.krupan...@gmail.com> wrote: You do of course have the simple technical matters, most of which need to be addressed with a proof-of-concept implementation, related to memory, storage, latency, and throughput. With a scaled cluster you can always add nodes to increase capacity and throughput and reduce latency, but with a single node you have limited flexibility. Just to be clear, Cassandra is still not recommended for "fat nodes" - even if you can fit tons of data on the node, you may not have the compute to satisfy throughput and latency requirements. And if you don't have enough system memory, the amount of storage is irrelevant.
Back to my original question: how much data (rows, columns), what kind of load pattern (heavy write, heavy update, heavy query), and what types of queries (primary key-only, slices, filtering, secondary indexes, etc.)? I do recall a customer who ran into problems because they had SSDs but only a very limited amount, so they were running out of storage. Having enough system memory for file system caching and off-heap data is important as well. -- Jack Krupansky

On Fri, Jan 22, 2016 at 5:07 PM, John Lammers <john.lamm...@karoshealth.com> wrote: Thanks for your response Jack. We are already sold on distributed databases, HA and scaling. We just have some small deployments coming up where there's no money for servers to run multiple Cassandra nodes. So, aside from the lack of HA, I'm asking whether a single Cassandra node would be viable in a production environment. (There would be RAID 5, and the RAID controller cache is backed by flash memory.) I'm asking because I'm concerned about using Cassandra in a way it's not designed for. That to me is the unsettling aspect. If this is a bad idea, give me the ammo I need to shoot it down - I need specific technical reasons. Thanks! --John

On Fri, Jan 22, 2016 at 4:47 PM, Jack Krupansky <jack.krupan...@gmail.com> wrote: If single-node Cassandra has the performance (and capacity) you need, the NoSQL data model and API are sufficient for your app, your dev, ops, and support teams are already familiar with and committed to Cassandra, and you don't need HA or scaling, then it sounds like you are set. You asked about risks, and normally lack of HA and scaling are unacceptable risks when people are looking at distributed databases. Most people on this list are dedicated to and passionate about distributed databases, HA, and scaling, so it is distinctly unsettling when somebody comes along who isn't interested in or committed to those same three qualities. But if single-node happens to work for you, then that's great.
-- Jack Krupansky
Re: Run Repairs when a Node is Down
Thanks Paulo for sharing the JIRA! I have added my comments there. "It is not advisable to remain with a down node for a long time without replacing it (with the risk of not being able to achieve consistency if another node goes down)." I am referring to a generic scenario where a cluster may afford 2+ node failures based on RF, but due to a single node failure the entire system health is in question, as the gc grace period is approaching and nodes are not getting repaired. You and others who are interested can join the discussion on the JIRA page: https://issues.apache.org/jira/browse/CASSANDRA-10446 Thanks, Anuj

On Tue, 19 Jan, 2016 at 1:21 am, Paulo Motta<pauloricard...@gmail.com> wrote: Hello Anuj, Repairing a range with down replicas may be valid if there are still QUORUM up replicas and writes use at least QUORUM. My understanding is that it was disabled as the default behavior in CASSANDRA-2290 to avoid misuse/confusion, and it's not advisable to remain with a down node for a long time without replacing it (with the risk of not being able to achieve consistency if another node goes down). Issue https://issues.apache.org/jira/browse/CASSANDRA-10446 was created to allow repairing ranges with down replicas with a special flag (--force). If you're interested, please add comments there and/or propose a patch. Thanks, Paulo

2016-01-17 1:33 GMT-03:00 Anuj Wadehra <anujw_2...@yahoo.co.in>: Hi, We are on 2.0.14, RF=3 in a 3-node cluster. We use repair -pr. Recently, we observed that repair -pr fails for all nodes if a node is down. Then I found the JIRA https://issues.apache.org/jira/plugins/servlet/mobile#issue/CASSANDRA-2290 where an intentional decision was taken to abort the repair if a replica is down. I need to understand the reasoning behind aborting the repair instead of proceeding with the available replicas.
I have the following concerns with this approach. We say that we have a fault-tolerant Cassandra system that can afford a single node failure because RF=3 and we read/write at QUORUM. But when a node goes down and we are not sure how much time will be needed to restore it, the entire system health is in question, as gc_grace_seconds is approaching and we are not able to run repair -pr on any of the nodes. Then there is a dilemma:

Remove the faulty node well before the gc grace period expires, so that we get enough time to save data by repairing the other two nodes? This may cause massive streaming, which may be unnecessary if we manage to bring the faulty node back up before the gc grace period. OR Wait and hope that the issue will be resolved before gc grace time, leaving some buffer to run repair -pr on all nodes. OR Increase the gc grace period temporarily. Then capacity planning should accommodate the extra storage needed for the extra gc grace that may be required in node failure scenarios.

Besides the reasoning behind the decision taken in CASSANDRA-2290, I need to understand the recommended approach for maintaining a fault-tolerant system which can handle node failures such that repair can run smoothly and system health is maintained at all times. Thanks, Anuj Sent from Yahoo Mail on Android
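The timing dilemma above can be made concrete with a small sketch. The helper name and the one-day safety margin are my own assumptions, not anything from Cassandra; the point is just the deadline arithmetic between node downtime and gc_grace_seconds.

```python
import datetime as dt

# Rough sketch of the dilemma above: given when a node went down and the
# table's gc_grace_seconds, estimate how many seconds of budget remain to
# restore or replace the node and still finish repair everywhere before
# tombstones become eligible for purge. `margin_seconds` is an assumed
# safety buffer for the repair itself.
def repair_budget_seconds(node_down_since, gc_grace_seconds, now,
                          margin_seconds=86400):
    deadline = node_down_since + dt.timedelta(seconds=gc_grace_seconds)
    return (deadline - now).total_seconds() - margin_seconds
```

A non-positive result would suggest replacing the node (or temporarily raising gc_grace_seconds) rather than waiting any longer.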
Impact of Changing Compaction Strategy
Hi, I need to understand whether all existing SSTables are recreated/updated when we change the compaction strategy from STCS to DTCS. SSTables are immutable by design, but do we make an exception for such cases and update the same files when an ALTER statement is fired to change the compaction strategy? Thanks, Anuj Sent from Yahoo Mail on Android
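For reference, the kind of ALTER statement the question refers to looks like the sketch below (the keyspace and table names are hypothetical). To the best of my knowledge it only updates table metadata: existing SSTable files on disk are left untouched and are reorganized gradually as future compactions pick them up.

```python
# The ALTER statement in question, as a CQL string (keyspace/table names
# are hypothetical). Changing the strategy is a metadata-only change;
# existing immutable SSTables are not rewritten at ALTER time.
alter_cql = (
    "ALTER TABLE my_ks.my_table WITH compaction = "
    "{'class': 'DateTieredCompactionStrategy'}"
)
print(alter_cql)
```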