Re: RFC: Cassandra Virtual Nodes
There is now a parent ticket for this issue in JIRA: https://issues.apache.org/jira/browse/CASSANDRA-4119 Comments and contributions are still welcome!

Cheers,
Sam

On 16 March 2012 23:38, Sam Overton s...@acunu.com wrote:

Hello cassandra-dev,

This is a long email. It concerns a significant change to Cassandra, so it deserves a thorough introduction. The summary is: we believe virtual nodes are the way forward. We would like to add virtual nodes to Cassandra and we are asking for comments, criticism and collaboration!

Cassandra's current partitioning scheme is sub-optimal for bootstrap, decommission, repair and re-balance operations, and places the burden on users to properly calculate tokens (a common cause of mistakes), which is a recurring pain-point. Virtual nodes have a variety of benefits over the one-to-one mapping of host to key range which Cassandra currently supports. Among these benefits are:

* Even load balancing when growing and shrinking the cluster

A virtual node scheme ensures that all hosts in a cluster have an even portion of the total data, and a new node bootstrapped into the cluster will assume its share of the data. Doubling or halving the cluster to ensure even load distribution would no longer be necessary.

* Distributed rebuild

When sizing a cluster, one of the considerations is the amount of time required to recover from a failed node. This is the exposure time, during which a secondary failure could cause data loss. In order to guarantee an upper bound on the exposure time, the amount of data which can be stored on each host is limited by the amount of time taken to recover the required replica count. At Acunu we have found that the exposure time is frequently the limiting factor which dictates the maximum allowed node size in customers' clusters.

Using a virtual node scheme, the data stored on one host is not replicated on just RF-1 other physical hosts. Each virtual node is replicated to RF-1 other virtual nodes, which may be on a different set of physical hosts to replicas of other virtual nodes stored on the same host. This means data for one host is replicated evenly across the entire cluster. In the event of a failure, then, restoring the replica count can be done in a fully distributed way. Each host in the cluster participates in the rebuild, drastically reducing the exposure time and allowing more data to be stored on a single host while still maintaining an acceptable upper bound on the likelihood of secondary failure. This reduces TCO concerns.

* Greater failure tolerance in streaming

Operations which require streaming of a large range of data, e.g. bootstrap, decommission, repair, etc., incur a heavy cost if an error (e.g. a dropped network connection) is encountered during the streaming. Currently the whole range must be re-streamed, and this could constitute a very large amount of data. Virtual nodes reduce the impact of streaming failures, since each virtual node is a much smaller range of the key-space, so re-streaming a whole virtual node is a much cheaper process.

* Evenly distributed impact of streaming operations

Streaming operations such as bootstrap, repair, et al. would involve every node in the cluster. This would distribute the load of these operations across the whole cluster, and could be staggered so that only a small subset of nodes were affected at once, similar to staggered repair[1].

* Possibility for active load balancing

Load balancing in Cassandra currently involves moving a token to increase/reduce the amount of key-space for which a host is responsible. This only allows load balancing between neighbouring nodes, so it could involve moving more than one token just to redistribute a single overloaded node. Virtual nodes could allow load balancing at a much finer granularity, so heavily loaded portions of the key-space could be redistributed to more lightly loaded hosts by reassigning one or more virtual nodes.

Implementing a virtual node scheme in Cassandra is not an insignificant amount of work, and it will touch a large amount of the codebase related to partitioning, placement, routing, gossip, and so on. We do believe that this is possible to do incrementally, and in such a way that there is an easy upgrade path for pre-virtual-node deployments. It would not, however, touch the storage layer. The virtual node concept is solely for partitioning and placement, not for segregating the data storage of the host, so all keys for all virtual nodes on a host would be stored in the same SSTables.

We are not proposing the adoption of the same scheme used by Voldemort[2] and described in the Dynamo paper[3]. We feel this scheme is too different from Cassandra's current distribution model to be a viable target for incremental development. Their scheme also fixes the number of virtual nodes for the lifetime of the cluster, which can prove to be a ceiling to scaling the cluster if
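The distributed-rebuild argument above can be illustrated with a small simulation. This is a sketch, not Cassandra's actual placement code: the names (`ring`, `replica_hosts`, `rebuild_participants`), the random token assignment, and the unit token space are illustrative assumptions, with replicas chosen as the next RF distinct hosts clockwise in the style of SimpleStrategy.

```python
import random

def ring(hosts, vnodes_per_host, seed=42):
    """Assign vnodes_per_host random tokens to each host; return the sorted ring."""
    rnd = random.Random(seed)
    return sorted((rnd.random(), h) for h in hosts for _ in range(vnodes_per_host))

def replica_hosts(tokens, idx, rf):
    """Primary plus the next rf-1 distinct hosts clockwise (SimpleStrategy-style)."""
    hosts, i, n = [], idx, len(tokens)
    while len(hosts) < rf:
        h = tokens[i][1]
        if h not in hosts:
            hosts.append(h)
        i = (i + 1) % n
    return hosts

def rebuild_participants(hosts, vnodes_per_host, rf=3):
    """How many other hosts hold replicas of the ranges owned by hosts[0]?"""
    tokens = ring(hosts, vnodes_per_host)
    participants = set()
    for i, (_, h) in enumerate(tokens):
        if h == hosts[0]:
            participants.update(replica_hosts(tokens, i, rf))
    participants.discard(hosts[0])
    return len(participants)

hosts = ["host%d" % i for i in range(20)]
print(rebuild_participants(hosts, 1))   # single token per host: only 2 neighbours can help rebuild
print(rebuild_participants(hosts, 64))  # 64 vnodes per host: nearly all 19 other hosts participate
```

With one token per host the rebuild load falls on just RF-1 neighbours; with many vnodes per host the replicas of the failed host's data are spread across almost the entire cluster, which is what lets every host contribute to the rebuild.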
Re: RFC: Cassandra Virtual Nodes
The SSTable indices should still be scanned for size-tiered compaction. Am I missing anything here?

No, I don't think you did. In fact, depending on the size of your SSTable, a contiguous range (or the entire SSTable) may or may not be affected by a cleanup/move or any type of topology change. There is lots of room for optimization here. After loading the indexes we actually know the start/end range for an SSTable, so we can include/exclude it in any such operation.

-- Ben Coverston DataStax -- The Apache Cassandra Company
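The include/exclude decision Ben describes boils down to an interval-overlap check on each sstable's token bounds. A minimal sketch, assuming the index gives us each sstable's min/max token; `sstable_overlaps` is a hypothetical helper, not an actual Cassandra function:

```python
def sstable_overlaps(sstable_min, sstable_max, op_lo, op_hi):
    """True if the sstable's token span intersects the operation's range,
    i.e. the sstable cannot be skipped for this cleanup/move."""
    return sstable_min <= op_hi and op_lo <= sstable_max

# An sstable covering tokens [0.1, 0.3] can be skipped for a move of [0.5, 0.6],
# but not for one of [0.25, 0.6]:
print(sstable_overlaps(0.1, 0.3, 0.5, 0.6))   # False
print(sstable_overlaps(0.1, 0.3, 0.25, 0.6))  # True
```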
Re: RFC: Cassandra Virtual Nodes
No I don't think you did, in fact, depending on the size of your SSTable a contiguous range (or the entire SSTable) may or may not be affected by a cleanup/move or any type of topology change. There is lots of room for optimization here. After loading the indexes we actually know the start/end range for an SSTable so we can include/exclude it in any such operation.

Just note that unless there is some correlation between range and how these sstables were created to begin with (as with leveled compaction), you're highly unlikely to be able to optimize here. For uniformly distributed tokens (hashed keys), all sstables are likely to contain almost the entire possible token range.

-- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
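Peter's point is easy to check with a quick simulation: under a hashing partitioner, even a modest number of uniformly distributed row tokens spans nearly the whole token space, so per-sstable min/max bounds exclude almost nothing. A sketch, with the unit token space [0, 1) as an illustrative assumption:

```python
import random

def sstable_token_span(rows, seed):
    """Width of the token range covered by one sstable's rows under a
    hashing partitioner (tokens modeled as uniform draws from [0, 1))."""
    rnd = random.Random(seed)
    tokens = [rnd.random() for _ in range(rows)]
    return max(tokens) - min(tokens)

# Across many simulated flushes, every sstable covers nearly the whole token
# space, so min/max bounds almost never let us skip an sstable outright.
spans = [sstable_token_span(10_000, seed) for seed in range(20)]
print(min(spans))  # very close to 1.0
```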
Re: RFC: Cassandra Virtual Nodes
On Sat, Mar 24, 2012 at 7:55 AM, Peter Schuller peter.schul...@infidyne.com wrote: [snip]

As Peter pointed out, with the random partitioner the rows of a specific range may be scattered across all sstables. Unless whole sstables can be ignored, disk seeks are the performance killer here.
Re: RFC: Cassandra Virtual Nodes
On 22 March 2012 05:48, Zhu Han schumi@gmail.com wrote:

I second it. Are there any goals we missed which cannot be achieved by assigning multiple tokens to a single node?

This is exactly the proposed solution. The discussion is about how to implement this, and about the methods of choosing tokens and the replication strategy.

Richard.
Re: RFC: Cassandra Virtual Nodes
On Thu, Mar 22, 2012 at 6:20 PM, Richard Low r...@acunu.com wrote:

On 22 March 2012 05:48, Zhu Han schumi@gmail.com wrote: I second it. Are there any goals we missed which cannot be achieved by assigning multiple tokens to a single node?

This is exactly the proposed solution. The discussion is about how to implement this, and the methods of choosing tokens and replication strategy.

Does the new scheme still require the node to iterate over all sstables to build the Merkle tree or to stream data for partition-level repair and moves? The disk IO triggered by the above steps could be very time-consuming if the dataset on a single node is very large. It could be much more costly than the network IO, especially when concurrent repair tasks hit the same node. Are there any good ideas on this?
Re: RFC: Cassandra Virtual Nodes
You would have to iterate through all sstables on the system to repair one vnode, yes: but building the tree for just one range of the data means that huge portions of the sstable files can be skipped. It should scale down linearly as the number of vnodes increases (i.e., with 100 vnodes, it will take 1/100th the time to repair one vnode). The story is less good for nodetool cleanup, however, which still has to truck over the entire dataset. (The partitions/buckets in my crush-inspired scheme address this by allowing each ring segment, in vnode terminology, to be stored separately in the file system.)

-- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
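The "huge portions can be skipped" argument can be sketched with a sorted per-sstable token index and a binary search for the vnode's bounds. `rows_in_range` is an illustrative helper under that assumption, not Cassandra's actual index format:

```python
import bisect
import random

# A per-sstable index of row tokens, kept sorted (tokens modeled as uniform
# draws from [0, 1) under a hashing partitioner).
rnd = random.Random(1)
index = sorted(rnd.random() for _ in range(100_000))

def rows_in_range(index, lo, hi):
    """Binary-search the index for a vnode's bounds; rows outside are skipped."""
    i = bisect.bisect_left(index, lo)
    j = bisect.bisect_right(index, hi)
    return j - i

# Building the tree for one of 100 vnode ranges touches only ~1% of the rows:
print(rows_in_range(index, 0.42, 0.43))
```

Note this skips *rows*, not seeks: as Zhu Han points out downthread, the index of every sstable must still be consulted, which is where the per-file overhead comes from.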
Re: RFC: Cassandra Virtual Nodes
On Fri, Mar 23, 2012 at 6:54 AM, Peter Schuller peter.schul...@infidyne.com wrote:

You would have to iterate through all sstables on the system to repair one vnode, yes: but building the tree for just one range of the data means that huge portions of the sstable files can be skipped. It should scale down linearly as the number of vnodes increases (i.e., with 100 vnodes, it will take 1/100th the time to repair one vnode).

The SSTable indices should still be scanned for size-tiered compaction. Am I missing anything here?

The story is less good for nodetool cleanup however, which still has to truck over the entire dataset. (The partitions/buckets in my crush-inspired scheme addresses this by allowing that each ring segment, in vnode terminology, be stored separately in the file system.)

But the number of files can be a big problem if there are hundreds of vnodes and millions of sstables on the same physical node. We would need a way to pin sstable inodes in memory; otherwise, the average number of disk IOs to access a row in an sstable could be five or more.
Re: RFC: Cassandra Virtual Nodes
On Tue, Mar 20, 2012 at 9:53 PM, Jonathan Ellis jbel...@gmail.com wrote:

It's reasonable that we can attach different levels of importance to these things. Taking a step back, I have two main points:

1) vnodes add enormous complexity to *many* parts of Cassandra. I'm skeptical of the cost:benefit ratio here.

1a) The benefit is lower in my mind because many of the problems solved by vnodes can be solved "well enough" for "most people", for some value of those two phrases, without vnodes.

2) I'm not okay with a "commit something half-baked and sort it out later" approach.

I must admit I find this a little disheartening. The discussion has barely started. No one has had a chance to discuss implementation specifics so that the rest of us could understand *how* disruptive it would be (a necessary requirement in weighing cost:benefit), or what an incremental approach would look like, and yet work has already begun on shutting this down.

Unless I'm reading you wrong, your mandate (I say mandate because you hinted at a veto elsewhere) is "No" to anything complex or invasive (for some value of each). The only alternative would seem to be a phased or incremental approach, but you seem to be saying "No" to that as well.

There seems to be quite a bit of interest in having virtual nodes (and there has been for as long as I can remember); the only serious reservations relate to the difficulty/complexity. Is there really no way to put our heads together and figure out how to properly manage that aspect?

On Tue, Mar 20, 2012 at 11:10 AM, Richard Low r...@acunu.com wrote:

On 20 March 2012 14:55, Jonathan Ellis jbel...@gmail.com wrote: Here's how I see Sam's list:

* Even load balancing when growing and shrinking the cluster

Nice to have, but post-bootstrap load balancing works well in practice (and is improved by TRP).

Post-bootstrap load balancing without vnodes necessarily streams more data than is necessary. Vnodes stream the minimal amount. In fact, post-bootstrap load balancing currently streams a constant fraction of your data: the network traffic involved in a rebalance increases linearly with the size of your cluster. With vnodes it decreases linearly. Including removing the ops overhead of running the load balance and calculating new tokens, this makes removing post-bootstrap load balancing a pretty big deal.

* Greater failure tolerance in streaming

Directly addressed by TRP.

Agreed.

* Evenly distributed impact of streaming operations

Not a problem in practice with stream throttling.

Throttling slows them down, increasing rebuild times and so increasing downtime.

* Possibility for active load balancing

Not really a feature of vnodes per se, but as with the other load balancing point, this is also improved by TRP.

Again with the caveat that more data is streamed with TRP. Vnodes remove the need for any load balancing with RP.

* Distributed rebuild

This is the 20% that TRP does not address. Nice to have? Yes. Can I live without it? I have so far. Is this alone worth the complexity of vnodes? No, it is not. Especially since there are probably other approaches that we can take to mitigate this, one of which Rick has suggested in a separate sub-thread.

Distributed rebuild means you can store more data per node with the same failure probabilities. This is frequently a limiting factor on how much data you can store per node, increasing cluster sizes unnecessarily. I'd argue that this alone is worth the complexity of vnodes.

Richard.

-- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com

-- Eric Evans Acunu | http://www.acunu.com | @acunu
Re: RFC: Cassandra Virtual Nodes
On Wed, Mar 21, 2012 at 9:50 AM, Eric Evans eev...@acunu.com wrote: [snip]

I have also thought about how I would like vnodes to work from an operational perspective rather than a software one. I would like these features:

1) No more RAID 0. If a machine is responsible for 4 vnodes, they should correspond to four JBOD disks.
2) Vnodes should be able to be hot-plugged. My normal Cassandra chassis would be a 2U with 6 drive bays. Imagine I have 10 nodes. Now if my chassis dies, I should be able to take the disks out and physically plug them into another chassis. Then in Cassandra I should be able to run a command like: nodetool attach '/mnt/disk6'. disk6 should contain all its data and its vnode information. Now this would be awesome for upgrades/migrations/etc.
Re: RFC: Cassandra Virtual Nodes
Hi Edward,

1) No more raid 0. If a machine is responsible for 4 vnodes they should correspond to four JBOD disks.

So each vnode corresponds to a disk? I suppose we could have a separate data directory per disk, but I think this should be a separate, subsequent change. However, do note that making the vnode ~the size of a disk (and only having 4-8 per machine) would make any non-hotswap rebuilds slower. To get the fast distributed rebuilds, you need to have at least as many vnodes per node as you have nodes in the cluster. And you would still need the distributed rebuilds to deal with disk failure.

2) Vnodes should be able to be hot-plugged. My normal Cassandra chassis would be a 2U with 6 drive bays. Imagine I have 10 nodes. Now if my chassis dies I should be able to take the disks out and physically plug them into another chassis. Then in Cassandra I should be able to run a command like: nodetool attach '/mnt/disk6'. disk6 should contain all its data and its vnode information. Now this would be awesome for upgrades/migrations/etc.

You know, you're not the first person I've spoken to who has asked for this! I do wonder whether it is optimising for the right thing though - in my experience, disks fail more often than machines.

Thanks,
Tom
Re: RFC: Cassandra Virtual Nodes
On Wed, Mar 21, 2012 at 8:50 AM, Eric Evans eev...@acunu.com wrote:

I must admit I find this a little disheartening. The discussion has barely started. No one has had a chance to discuss implementation specifics so that the rest of us could understand *how* disruptive it would be (a necessary requirement in weighing cost:benefit), or what an incremental approach would look like, and yet work has already begun on shutting this down.

This isn't the first time vnodes has been brought up, so I've thought at least a little bit about what that would entail for TokenMetadata, StorageProxy, streaming, CFS, and on down the stack. And it scares me. So, I wanted to make my points about cost:benefit, and about not committing with intentions to work out the details later, up front. But if you're still in brainstorming mode, carry on.

Unless I'm reading you wrong, your mandate (I say mandate because you hinted at a veto elsewhere), is No to anything complex or invasive (for some value of each). The only alternative would seem to be a phased or incremental approach, but you seem to be saying No to that as well. There seems to be quite a bit of interest in having virtual nodes (and there has been for as long as I can remember), the only serious reservations relate to the difficulty/complexity. Is there really no way to put our heads together and figure out how to properly manage that aspect?

At the risk of putting words in your mouth, I think your concern is that you don't want to go off and put man-months into a vnode implementation, only to have me come back and say, "Sorry, it's too complicated, -1." Which is totally reasonable; I get that. I would suggest that the best way to mitigate that is, when you are ready, to put together as detailed an implementation plan as possible ahead of time, before you start generating patchsets. Then we can put some meat on the discussion, more meaningful than vague "that scares me" statements from yours truly.
-- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Re: RFC: Cassandra Virtual Nodes
On Wed, Mar 21, 2012 at 3:24 PM, Tom Wilkie t...@acunu.com wrote:

Hi Edward

1) No more raid 0. If a machine is responsible for 4 vnodes they should correspond to four JBOD disks.

So each vnode corresponds to a disk? I suppose we could have a separate data directory per disk, but I think this should be a separate, subsequent change.

I think having more micro-ranges makes the process much easier. Imagine a token ring 0-30:

Node1 | major range 0-10  | disk 1: 0-2,   disk 2: 3-4,   disk 3: 5-7,   disk 4: 8-10
Node2 | major range 11-20 | disk 1: 11-12, disk 2: 13-14, disk 3: 15-17, disk 4: 18-20
Node3 | major range 21-30 | disk 1: 21-22, disk 2: 23-24, disk 3: 25-27, disk 4: 28-30

Adding a 4th node is easy: if you are at the data center, just take disk 4 out of each node and place it in the new server :) Software-wise it is the same deal: each node streams off only disk 4 to the new node. Now at this point disk 4 is idle and each machine should rebalance its own data across its 4 disks.

However, do note that making the vnode ~the size of a disk (and only having 4-8 per machine) would make any non-hotswap rebuilds slower. To get the fast distributed rebuilds, you need to have at least as many vnodes per node as you have nodes in the cluster. And you would still need the distributed rebuilds to deal with disk failure.

2) Vnodes should be able to be hot-plugged. My normal Cassandra chassis would be a 2U with 6 drive bays. Imagine I have 10 nodes. Now if my chassis dies I should be able to take the disks out and physically plug them into another chassis. Then in Cassandra I should be able to run a command like: nodetool attach '/mnt/disk6'. disk6 should contain all its data and its vnode information. Now this would be awesome for upgrades/migrations/etc.

You know, you're not the first person I've spoken to who has asked for this! I do wonder whether it is optimising for the right thing though - in my experience, disks fail more often than machines.

Thanks
Tom
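Edward's disk-per-range layout can be expressed as a simple lookup table; `locate` is a hypothetical helper for illustrating the mapping (with inclusive bounds), not proposed code:

```python
def locate(token, ring):
    """ring: {node: {disk: (lo, hi)}}; return the (node, disk) whose
    inclusive token range contains the given token."""
    for node, disks in ring.items():
        for disk, (lo, hi) in disks.items():
            if lo <= token <= hi:
                return node, disk
    raise KeyError(token)

# The three-node, four-disk layout from the example above:
ring = {
    "node1": {"disk1": (0, 2),   "disk2": (3, 4),   "disk3": (5, 7),   "disk4": (8, 10)},
    "node2": {"disk1": (11, 12), "disk2": (13, 14), "disk3": (15, 17), "disk4": (18, 20)},
    "node3": {"disk1": (21, 22), "disk2": (23, 24), "disk3": (25, 27), "disk4": (28, 30)},
}
print(locate(9, ring))   # ('node1', 'disk4')
print(locate(19, ring))  # ('node2', 'disk4')
```

Adding a 4th node then amounts to reassigning each node's disk4 entry to the new node, which is exactly the "pull a disk from each chassis" operation described above.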
Re: RFC: Cassandra Virtual Nodes
Software wise it is the same deal. Each node streams off only disk 4 to the new node. I think an implication on software is that if you want to make specific selections of partitions to move, you are effectively incompatible with deterministically generating the mapping of partition to responsible node. I.e., it probably means the vnode information must be kept as state. It is probably difficult to reconcile with balancing solutions like consistent hashing/crush/etc. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: RFC: Cassandra Virtual Nodes
I just see vnodes as a way to make the problem smaller, and by making the problem smaller the overall system is more agile. I.e. rather than 1 node streaming 100GB, 4 nodes each stream 25GB. Moves by hand are not so bad because they take 1/4 the time. The simplest vnode implementation is VMware: just make sure that 3 consecutive vnodes do not end up on the same host. This is wasteful because we have 4 JVMs. I envision vnodes as a Cassandra "master" being a shared cache, memtables, and manager for what we today consider a Cassandra instance. It makes it simple to think about.

On Wednesday, March 21, 2012, Peter Schuller peter.schul...@infidyne.com wrote:

Software wise it is the same deal. Each node streams off only disk 4 to the new node.

I think an implication on software is that if you want to make specific selections of partitions to move, you are effectively incompatible with deterministically generating the mapping of partition to responsible node. I.e., it probably means the vnode information must be kept as state. It is probably difficult to reconcile with balancing solutions like consistent hashing/crush/etc.

-- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: RFC: Cassandra Virtual Nodes
I envision vnodes as Cassandra master being a shared cache, memtables, and manager for what we today consider a Cassandra instance.

It might be kind of problematic: when you are moving a vnode you want the data associated with it to move too, otherwise it will be a pain to clean up after that (something like nt clean). I think a vnode should be as isolated as possible to reduce the impact when it is moving (which will become a normal cluster operation). Just my 2 cents.

Regards,
/VJ

On Wed, Mar 21, 2012 at 5:41 PM, Edward Capriolo edlinuxg...@gmail.com wrote: [snip]
Re: RFC: Cassandra Virtual Nodes
A friend pointed out to me privately that I came across pretty harsh in this thread. While I stand by my technical concerns, I do want to acknowledge that Sam's proposal here indicates a strong grasp of the principles involved, and a deeper level of thought into the issues than I think anyone else has brought to date. Thanks for putting that energy into it, Sam, and I look forward to seeing how you approach the implementation.

On Fri, Mar 16, 2012 at 6:38 PM, Sam Overton s...@acunu.com wrote: [snip]
Re: RFC: Cassandra Virtual Nodes
On Tue, Mar 20, 2012 at 11:24 PM, Jeremiah Jordan jeremiah.jor...@morningstar.com wrote:

So taking a step back: if we want vnodes, why can't we just give every node 100 tokens instead of only one? It seems to me this would have less impact on the rest of the code. It would just look like you had a 500 node cluster instead of a 5 node cluster. Your replication strategy would have to know about the physical machines so that data gets replicated correctly, but there is already some concept of this with the data-center-aware and rack-aware stuff. From what I see, I think you could get most of the benefits of vnodes by implementing a new placement strategy that did something like this, and you wouldn't have to touch (and maybe break) code in other places. Am I crazy? Naive? Once you had this set up, you could start to implement the vnode-like stuff on top of it, like bootstrapping nodes in one token at a time, and taking them on from the whole cluster, not just your neighbor, etc.

I second it. Are there any goals we missed which cannot be achieved by assigning multiple tokens to a single node?

-Jeremiah Jordan

From: Rick Branson [rbran...@datastax.com] Sent: Monday, March 19, 2012 5:16 PM To: dev@cassandra.apache.org Subject: Re: RFC: Cassandra Virtual Nodes

I think if we could go back and rebuild Cassandra from scratch, vnodes would likely be implemented from the beginning. However, I'm concerned that implementing them now could be a big distraction from more productive uses of all of our time, and introduce major potential stability issues into what is becoming a business-critical piece of infrastructure for many people. However, instead of just complaining and pedantry, I'd like to offer a feasible alternative: has there been consideration given to the idea of supporting a single token range for a node?
While not theoretically as capable as vnodes, it seems to me to be more practical, as it would have a significantly lower impact on the codebase and provides a much clearer migration path. It also seems to solve a majority of complaints regarding operational issues with Cassandra clusters.

Each node would have a lower and an upper token, which would form a range that would be actively distributed via gossip. Read and replication requests would only be routed to a replica when the key of these operations matched the replica's token range in the gossip tables. Each node would locally store its own current active token range as well as a target token range it's moving towards. As a new node undergoes bootstrap, the bounds would be gradually expanded to allow it to handle requests for a wider range of the keyspace as it moves towards its target token range.

This idea boils down to a move from hard cutovers to smoother operations, by gradually adjusting active token ranges over a period of time. It would apply to token change operations (nodetool 'move' and 'removetoken') as well. Failure during streaming could be recovered at the bounds instead of restarting the whole process, as the active bounds would effectively track the progress of bootstrap target token changes. Implicitly, these operations would be throttled to some degree. Node repair (AES) could also be modified using the same overall ideas to provide a more gradual impact on the cluster, similar to the ideas given in CASSANDRA-3721.

While this doesn't spread the load of these operations over the cluster as evenly as vnodes do, that is likely an issue that could be worked around by performing concurrent (throttled) bootstrap and node repair (AES) operations. It does allow some kind of active load balancing; clearly this is not as flexible or as useful as vnodes, but you should be using RandomPartitioner or sort-of-randomized keys with OPP, right?
;) As a side note: vnodes fail to provide solutions to node-based limitations that seem to me to cause a substantial portion of operational issues, such as the impact of node restarts / upgrades, and GC- and compaction-induced latency. I think some progress could be made here by allowing a pack of independent Cassandra nodes to be run on a single host; somewhat (but nowhere near entirely) similar to the pre-fork model used by some UNIX-based servers. Input?

-- Rick Branson DataStax
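A minimal sketch of the active/target range idea above (the class and method names are hypothetical, not anything in Cassandra): a node advertises an active range that starts narrow and widens towards its target range as streaming progresses, and requests are only routed to it for tokens inside the active range.

```python
class ActiveRange:
    # A node's currently served token range, plus the target range it is
    # moving towards; the active bounds widen gradually as streaming
    # progresses, replacing a hard cutover with a smooth one.
    def __init__(self, lower, upper, target_lower, target_upper):
        self.lower, self.upper = lower, upper
        self.target_lower, self.target_upper = target_lower, target_upper

    def serves(self, token):
        # Route a request here only if the token is inside the active
        # range (half-open interval, as on the Cassandra ring).
        return self.lower < token <= self.upper

    def expand(self, step):
        # After streaming another `step` tokens' worth of data on each
        # side, widen the active bounds towards the target bounds.
        self.lower = max(self.target_lower, self.lower - step)
        self.upper = min(self.target_upper, self.upper + step)

# A bootstrapping node starts with an empty active range at its token
# and grows towards its target range:
r = ActiveRange(lower=500, upper=500, target_lower=400, target_upper=600)
assert not r.serves(460)
r.expand(step=50)
assert r.serves(460) and r.serves(540)
```

A streaming failure would resume from the current active bounds rather than restarting, since the bounds record how far bootstrap has progressed.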
Re: RFC: Cassandra Virtual Nodes
On 20 March 2012 04:35, Vijay vijay2...@gmail.com wrote: On Mon, Mar 19, 2012 at 8:24 PM, Eric Evans eev...@acunu.com wrote: I'm guessing you're referring to Rick's proposal about ranges per node?

Maybe; what I mean is a little simpler than that... We can consider every node having multiple contiguous ranges, and moving those ranges for bootstrap etc., instead of finding the mid-point etc. in the bootstrap code. Once we get that working all the way down to the FS/Streaming, then we can move those ranges and assign them to nodes in random orders. Hope that makes sense.

I agree that this should be approached in incremental steps. Rick already raised concerns about stability issues which might arise from changing large parts of the code at once. I would anticipate the first step to be, exactly as you suggest, to support multiple tokens per host instead of just one. Presumably in your suggestion you imagine these tokens to define contiguous ranges for a given host, so that the distribution model is the same as before, but bootstrap can be done incrementally. This would be a great first step. The extension to a virtual node scheme as described previously is then fairly trivial: the only additional change needed is to assign the tokens in some other way which does not restrict the ranges to being contiguous.

-- Sam Overton Acunu | http://www.acunu.com | @acunu
Re: RFC: Cassandra Virtual Nodes
On Tue, Mar 20, 2012 at 6:40 AM, Sam Overton s...@acunu.com wrote: On 20 March 2012 04:35, Vijay vijay2...@gmail.com wrote: On Mon, Mar 19, 2012 at 8:24 PM, Eric Evans eev...@acunu.com wrote: I'm guessing you're referring to Rick's proposal about ranges per node?

Maybe; what I mean is a little simpler than that... We can consider every node having multiple contiguous ranges, and moving those ranges for bootstrap etc., instead of finding the mid-point etc. in the bootstrap code. Once we get that working all the way down to the FS/Streaming, then we can move those ranges and assign them to nodes in random orders. Hope that makes sense.

I agree that this should be approached in incremental steps. Rick already raised concerns about stability issues which might arise from changing large parts of the code at once. I would anticipate the first step to be, exactly as you suggest, to support multiple tokens per host instead of just one. Presumably in your suggestion you imagine these tokens to define contiguous ranges for a given host, so that the distribution model is the same as before, but bootstrap can be done incrementally. This would be a great first step. The extension to a virtual node scheme as described previously is then fairly trivial: the only additional change needed is to assign the tokens in some other way which does not restrict the ranges to being contiguous.

Sounds good to me. What can an upgrading user expect in the way of disruption? What would be required to move an existing cluster from one token per node to virtual nodes? Could this be made transparent?

-- Eric Evans Acunu | http://www.acunu.com | @acunu
Re: RFC: Cassandra Virtual Nodes
I like this idea. It feels like a good 80/20 solution -- 80% of the benefits, 20% of the effort.

More like 5% of the effort. I can't even enumerate all the places full vnode support would change, but an active token range concept would be relatively limited in scope. Full vnodes feels a lot more like the counters quagmire, where Digg/Twitter worked on it for... 8? months, and then DataStax worked on it for about 6 months post-commit, and we're still finding the occasional bug-since-0.7 there. With the benefit of hindsight, as bad as maintaining that patchset out of tree was, committing it as early as we did was a mistake. We won't do that again. (On the bright side, git makes maintaining such a patchset easier now.)

On Mon, Mar 19, 2012 at 5:16 PM, Rick Branson rbran...@datastax.com wrote:

I think if we could go back and rebuild Cassandra from scratch, vnodes would likely be implemented from the beginning. However, I'm concerned that implementing them now could be a big distraction from more productive uses of all of our time and introduce major potential stability issues into what is becoming a business-critical piece of infrastructure for many people. However, instead of just complaining and pedantry, I'd like to offer a feasible alternative: has there been consideration given to the idea of supporting a single token range for a node? While not theoretically as capable as vnodes, it seems to me to be more practical, as it would have a significantly lower impact on the codebase and provides a much clearer migration path. It also seems to solve a majority of complaints regarding operational issues with Cassandra clusters. Each node would have a lower and an upper token, which would form a range that would be actively distributed via gossip. Read and replication requests would only be routed to a replica when the key of these operations matched the replica's token range in the gossip tables.
Each node would locally store its own current active token range as well as a target token range it's moving towards. As a new node undergoes bootstrap, the bounds would be gradually expanded to allow it to handle requests for a wider range of the keyspace as it moves towards its target token range. This idea boils down to a move from hard cutovers to smoother operations, by gradually adjusting active token ranges over a period of time. It would apply to token change operations (nodetool 'move' and 'removetoken') as well. Failure during streaming could be recovered at the bounds instead of restarting the whole process, as the active bounds would effectively track the progress of bootstrap target token changes. Implicitly, these operations would be throttled to some degree. Node repair (AES) could also be modified using the same overall ideas to provide a more gradual impact on the cluster, similar to the ideas given in CASSANDRA-3721. While this doesn't spread the load of these operations over the cluster as evenly as vnodes do, that is likely an issue that could be worked around by performing concurrent (throttled) bootstrap and node repair (AES) operations. It does allow some kind of active load balancing; clearly this is not as flexible or as useful as vnodes, but you should be using RandomPartitioner or sort-of-randomized keys with OPP, right? ;)

As a side note: vnodes fail to provide solutions to node-based limitations that seem to me to cause a substantial portion of operational issues, such as the impact of node restarts / upgrades, and GC- and compaction-induced latency. I think some progress could be made here by allowing a pack of independent Cassandra nodes to be run on a single host; somewhat (but nowhere near entirely) similar to the pre-fork model used by some UNIX-based servers. Input?

-- Rick Branson DataStax

-- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Re: RFC: Cassandra Virtual Nodes
On Tue, Mar 20, 2012 at 8:39 AM, Jonathan Ellis jbel...@gmail.com wrote:

I like this idea. It feels like a good 80/20 solution -- 80% of the benefits, 20% of the effort. More like 5% of the effort. I can't even enumerate all the places full vnode support would change, but an active token range concept would be relatively limited in scope.

It only addresses 1 of Sam's original 5 points, so I wouldn't call it an 80% solution.

Full vnodes feels a lot more like the counters quagmire, where Digg/Twitter worked on it for... 8? months, and then DataStax worked on it for about 6 months post-commit, and we're still finding the occasional bug-since-0.7 there. With the benefit of hindsight, as bad as maintaining that patchset out of tree was, committing it as early as we did was a mistake. We won't do that again. (On the bright side, git makes maintaining such a patchset easier now.)

And yet counters have become a very important feature for Cassandra; we're better off with them than without. I think there were a number of problems with how counters went down that could be avoided here. For one, we can take a phased, incremental approach, rather than waiting 8 months to drop a large patchset.

On Mon, Mar 19, 2012 at 5:16 PM, Rick Branson rbran...@datastax.com wrote: I think if we could go back and rebuild Cassandra from scratch, vnodes would likely be implemented from the beginning. However, I'm concerned that implementing them now could be a big distraction from more productive uses of all of our time and introduce major potential stability issues into what is becoming a business-critical piece of infrastructure for many people. However, instead of just complaining and pedantry, I'd like to offer a feasible alternative: has there been consideration given to the idea of supporting a single token range for a node?
While not theoretically as capable as vnodes, it seems to me to be more practical, as it would have a significantly lower impact on the codebase and provides a much clearer migration path. It also seems to solve a majority of complaints regarding operational issues with Cassandra clusters. Each node would have a lower and an upper token, which would form a range that would be actively distributed via gossip. Read and replication requests would only be routed to a replica when the key of these operations matched the replica's token range in the gossip tables. Each node would locally store its own current active token range as well as a target token range it's moving towards. As a new node undergoes bootstrap, the bounds would be gradually expanded to allow it to handle requests for a wider range of the keyspace as it moves towards its target token range. This idea boils down to a move from hard cutovers to smoother operations, by gradually adjusting active token ranges over a period of time. It would apply to token change operations (nodetool 'move' and 'removetoken') as well. Failure during streaming could be recovered at the bounds instead of restarting the whole process, as the active bounds would effectively track the progress of bootstrap target token changes. Implicitly, these operations would be throttled to some degree. Node repair (AES) could also be modified using the same overall ideas to provide a more gradual impact on the cluster, similar to the ideas given in CASSANDRA-3721. While this doesn't spread the load of these operations over the cluster as evenly as vnodes do, that is likely an issue that could be worked around by performing concurrent (throttled) bootstrap and node repair (AES) operations. It does allow some kind of active load balancing; clearly this is not as flexible or as useful as vnodes, but you should be using RandomPartitioner or sort-of-randomized keys with OPP, right?
;) As a side note: vnodes fail to provide solutions to node-based limitations that seem to me to cause a substantial portion of operational issues, such as the impact of node restarts / upgrades, and GC- and compaction-induced latency. I think some progress could be made here by allowing a pack of independent Cassandra nodes to be run on a single host; somewhat (but nowhere near entirely) similar to the pre-fork model used by some UNIX-based servers. Input?

-- Rick Branson DataStax

-- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com

-- Eric Evans Acunu | http://www.acunu.com | @acunu
Re: RFC: Cassandra Virtual Nodes
I like this idea. It feels like a good 80/20 solution -- 80% of the benefits, 20% of the effort. More like 5% of the effort. I can't even enumerate all the places full vnode support would change, but an active token range concept would be relatively limited in scope. It only addresses 1 of Sam's original 5 points, so I wouldn't call it an 80% solution.

To support a form of DF, I think some tweaking of the replica placement could achieve this effect quite well. We could introduce a variable into replica placement, which I'm going to (incorrectly) call DF for the purposes of illustration. The key range for a node would be sub-divided by DF (1 by default), and this would be used to further distribute replica selection based on this sub-partition. Currently, the offset formula works out to be something like this:

offset = replica

For RandomPartitioner, DF placement might look something like:

offset = replica + (token % DF)

Now, I realize replica selection is actually much more complicated than this, but these formulas are for illustration purposes. Modifying replica placement and the partitioners to support this seems straightforward, but I'm unsure of what's required to get it working for ring management operations. On the surface, it does seem like this could be added without any kind of difficult migration support. Thoughts?
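The offset formulas above can be made concrete with a toy model (purely illustrative, like the formulas themselves -- this is not Cassandra's actual replica selection): positions on a ring of `ring_size` nodes, where the replica walk starts at an offset that depends on the key's token modulo DF.

```python
def replicas(ring_size, primary, token, rf, df):
    # offset = replica + (token % DF): the RF replicas for a key sit at
    # sub-partition-dependent offsets from the primary's ring position,
    # so different slices of a node's range use different neighbours.
    return [(primary + replica + (token % df)) % ring_size
            for replica in range(rf)]

# With DF=1 (today's behaviour) every key on a node replicates to the
# same RF-1 neighbours:
assert replicas(10, primary=3, token=12345, rf=3, df=1) == [3, 4, 5]
# With DF=4 the neighbour set shifts with the key's sub-partition:
assert replicas(10, primary=3, token=12346, rf=3, df=4) == [5, 6, 7]
```

The effect is that a node's data spreads over roughly RF-1+DF neighbours instead of exactly RF-1, which is the "distribution factor" being discussed.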
Re: RFC: Cassandra Virtual Nodes
On 20 March 2012 13:37, Eric Evans eev...@acunu.com wrote: On Tue, Mar 20, 2012 at 6:40 AM, Sam Overton s...@acunu.com wrote: On 20 March 2012 04:35, Vijay vijay2...@gmail.com wrote: Maybe; what I mean is a little simpler than that... We can consider every node having multiple contiguous ranges, and moving those ranges for bootstrap etc., instead of finding the mid-point etc. in the bootstrap code. Once we get that working all the way down to the FS/Streaming, then we can move those ranges and assign them to nodes in random orders. Hope that makes sense.

I agree that this should be approached in incremental steps. Rick already raised concerns about stability issues which might arise from changing large parts of the code at once. I would anticipate the first step to be, exactly as you suggest, to support multiple tokens per host instead of just one. Presumably in your suggestion you imagine these tokens to define contiguous ranges for a given host, so that the distribution model is the same as before, but bootstrap can be done incrementally. This would be a great first step. The extension to a virtual node scheme as described previously is then fairly trivial: the only additional change needed is to assign the tokens in some other way which does not restrict the ranges to being contiguous.

Sounds good to me. What can an upgrading user expect in the way of disruption? What would be required to move an existing cluster from one token per node to virtual nodes? Could this be made transparent?

The disruption for an end-user would be no more than the same rolling upgrade process that they have to go through currently to upgrade to a new version. This is how I envisage it working:

* When a node is upgraded and the new version starts up in an old cluster, it would split its own token range into multiple sub-ranges by assigning itself more tokens in its own range.
* These tokens could then be gossiped to any other new versions in the cluster.
The old versions don't need to know about these intermediate tokens, because distribution is exactly the same: node ranges are still contiguous.
* Once every node has been upgraded, distribution is still the same as before, but now ranges are split into sub-ranges.
* The benefits of vnodes start to become apparent when adding new nodes to the cluster: a new node bootstrapping would take an even amount of data from each other node, and would not require doubling the cluster to maintain balance.
* As more nodes are added to the cluster, it gets closer to full vnode distribution, as more of the original hosts' ranges get reassigned to new nodes.

If the user wants to migrate to full vnode functionality straight away, then they can do a rolling migration (decommission/bootstrap). During this migration there would be some imbalance in the cluster, but once all of the old nodes have been migrated, the cluster would be balanced.

-- Sam Overton Acunu | http://www.acunu.com | @acunu
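The first upgrade step above can be sketched as follows (a toy model with a hypothetical `split_range` helper; Cassandra's real token type and ring size differ): a node takes its contiguous range and carves it into equal sub-ranges by choosing extra tokens inside it. Ownership is unchanged, because the sub-ranges are still contiguous on the same host.

```python
def split_range(prev_token, my_token, num_tokens, ring_size=2**127):
    # Return `num_tokens` tokens evenly spaced in (prev_token, my_token],
    # the last of which is the node's original token, so the union of the
    # sub-ranges is exactly the original range. Modular arithmetic
    # handles the range that wraps around the ring.
    width = (my_token - prev_token) % ring_size
    return [(prev_token + width * i // num_tokens) % ring_size
            for i in range(1, num_tokens + 1)]

assert split_range(prev_token=0, my_token=100, num_tokens=4) == [25, 50, 75, 100]
# A wrapping range also splits correctly:
assert split_range(prev_token=90, my_token=10, num_tokens=2, ring_size=100) == [0, 10]
```

Gossiping these extra tokens changes nothing for old nodes, which still see one contiguous range per host; only later, non-contiguous reassignment of sub-ranges yields full vnode behaviour.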
Re: RFC: Cassandra Virtual Nodes
On Tue, Mar 20, 2012 at 9:08 AM, Eric Evans eev...@acunu.com wrote: On Tue, Mar 20, 2012 at 8:39 AM, Jonathan Ellis jbel...@gmail.com wrote: I like this idea. It feels like a good 80/20 solution -- 80% of the benefits, 20% of the effort. More like 5% of the effort. I can't even enumerate all the places full vnode support would change, but an active token range concept would be relatively limited in scope. It only addresses 1 of Sam's original 5 points, so I wouldn't call it an 80% solution.

I guess a more accurate way to put this is: only 20% of Sam's list is an actual pain point that doesn't get addressed by The Rick Proposal [TRP]. Here's how I see Sam's list:

* Even load balancing when growing and shrinking the cluster: nice to have, but post-bootstrap load balancing works well in practice (and is improved by TRP).
* Greater failure tolerance in streaming: directly addressed by TRP.
* Evenly distributed impact of streaming operations: not a problem in practice with stream throttling.
* Possibility for active load balancing: not really a feature of vnodes per se, but as with the other load balancing point, this is also improved by TRP.
* Distributed rebuild: this is the 20% that TRP does not address. Nice to have? Yes. Can I live without it? I have so far. Is this alone worth the complexity of vnodes? No, it is not. Especially since there are probably other approaches we can take to mitigate this, one of which Rick has suggested in a separate sub-thread.

Full vnodes feels a lot more like the counters quagmire, where Digg/Twitter worked on it for... 8? months, and then DataStax worked on it for about 6 months post-commit, and we're still finding the occasional bug-since-0.7 there. With the benefit of hindsight, as bad as maintaining that patchset out of tree was, committing it as early as we did was a mistake. We won't do that again. (On the bright side, git makes maintaining such a patchset easier now.)
And yet counters have become a very important feature for Cassandra; we're better off with them than without.

False dichotomy (we could have waited for a better counter design), but that's mostly irrelevant to my point that jamming incomplete code in-tree to sort out later is a bad idea.

I think there were a number of problems with how counters went down that could be avoided here. For one, we can take a phased, incremental approach, rather than waiting 8 months to drop a large patchset.

If there are incremental improvements to be made that justify themselves independently, then I agree. Small, self-contained steps are a good thing. A good example is https://issues.apache.org/jira/browse/CASSANDRA-2319, a product of The Grand Storage Engine Redesign of 674 fame. But when things don't naturally break down into such mini-features, I'm -1 on committing code that has no purpose other than to be a foundation for later commits. I've seen people get bored or assigned to other projects too often to just trust that those later commits will indeed be forthcoming. Or even if Sam [for instance] is still working hard on it, it's very easy for unforeseen difficulties to come up that invalidate the original approach. Since we were talking about counters, the original vector clock approach -- which we ended up ripping out, painfully -- is a good example. Once bitten, twice shy.

-- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
RE: RFC: Cassandra Virtual Nodes
So taking a step back: if we want vnodes, why can't we just give every node 100 tokens instead of only one? It seems to me this would have less impact on the rest of the code. It would just look like you had a 500 node cluster instead of a 5 node cluster. Your replication strategy would have to know about the physical machines so that data gets replicated correctly, but there is already some concept of this with the data-center-aware and rack-aware stuff. From what I see, I think you could get most of the benefits of vnodes by implementing a new placement strategy that did something like this, and you wouldn't have to touch (and maybe break) code in other places. Am I crazy? Naive? Once you had this set up, you could start to implement the vnode-like stuff on top of it, like bootstrapping nodes in one token at a time, and taking them on from the whole cluster, not just your neighbor, etc.

-Jeremiah Jordan

From: Rick Branson [rbran...@datastax.com] Sent: Monday, March 19, 2012 5:16 PM To: dev@cassandra.apache.org Subject: Re: RFC: Cassandra Virtual Nodes

I think if we could go back and rebuild Cassandra from scratch, vnodes would likely be implemented from the beginning. However, I'm concerned that implementing them now could be a big distraction from more productive uses of all of our time and introduce major potential stability issues into what is becoming a business-critical piece of infrastructure for many people. However, instead of just complaining and pedantry, I'd like to offer a feasible alternative: has there been consideration given to the idea of supporting a single token range for a node? While not theoretically as capable as vnodes, it seems to me to be more practical, as it would have a significantly lower impact on the codebase and provides a much clearer migration path. It also seems to solve a majority of complaints regarding operational issues with Cassandra clusters.
Each node would have a lower and an upper token, which would form a range that would be actively distributed via gossip. Read and replication requests would only be routed to a replica when the key of these operations matched the replica's token range in the gossip tables. Each node would locally store its own current active token range as well as a target token range it's moving towards. As a new node undergoes bootstrap, the bounds would be gradually expanded to allow it to handle requests for a wider range of the keyspace as it moves towards its target token range. This idea boils down to a move from hard cutovers to smoother operations, by gradually adjusting active token ranges over a period of time. It would apply to token change operations (nodetool 'move' and 'removetoken') as well. Failure during streaming could be recovered at the bounds instead of restarting the whole process, as the active bounds would effectively track the progress of bootstrap target token changes. Implicitly, these operations would be throttled to some degree. Node repair (AES) could also be modified using the same overall ideas to provide a more gradual impact on the cluster, similar to the ideas given in CASSANDRA-3721. While this doesn't spread the load of these operations over the cluster as evenly as vnodes do, that is likely an issue that could be worked around by performing concurrent (throttled) bootstrap and node repair (AES) operations. It does allow some kind of active load balancing; clearly this is not as flexible or as useful as vnodes, but you should be using RandomPartitioner or sort-of-randomized keys with OPP, right? ;)

As a side note: vnodes fail to provide solutions to node-based limitations that seem to me to cause a substantial portion of operational issues, such as the impact of node restarts / upgrades, and GC- and compaction-induced latency.
I think some progress could be made here by allowing a pack of independent Cassandra nodes to be run on a single host; somewhat (but nowhere near entirely) similar to the pre-fork model used by some UNIX-based servers. Input?

-- Rick Branson DataStax
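Jeremiah's point about the placement strategy knowing the physical machines can be sketched like this (illustrative only; `replicas_for` and the tiny ring are hypothetical, not Cassandra code): with many tokens per node, the replica walk must skip tokens belonging to hosts already chosen, so that RF replicas land on RF distinct physical machines.

```python
def replicas_for(ring, start_index, rf):
    # Walk the (token, host) ring clockwise from the key's position,
    # collecting tokens until RF *distinct* physical hosts are found.
    # Without the distinctness check, two replicas could land on two
    # tokens of the same machine.
    chosen = []
    for i in range(len(ring)):
        _, host = ring[(start_index + i) % len(ring)]
        if host not in chosen:
            chosen.append(host)
            if len(chosen) == rf:
                break
    return chosen

# Three physical hosts, each holding more than one token:
ring = [(10, "h1"), (20, "h2"), (30, "h1"), (40, "h3"), (50, "h2")]
assert replicas_for(ring, start_index=1, rf=3) == ["h2", "h1", "h3"]
```

The existing rack- and data-center-aware strategies already do a similar distinctness walk over racks and DCs, which is why this feels like a natural extension.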
Re: RFC: Cassandra Virtual Nodes
On 19 March 2012 23:41, Peter Schuller peter.schul...@infidyne.com wrote:

Using this ring bucket in the CRUSH topology (with the hash function being the identity function) would give the exact same distribution properties as the virtual node strategy that I suggested previously, but of course with much better topology awareness.

I will have to re-read your original post. I seem to have missed something :) I did, and I may or may not understand what you mean. Are you comparing vnodes + hashing with CRUSH + pre-partitioning by hash + identity hash as you traverse down the topology tree?

Yes. I was just trying to illustrate that it's not necessary to have CRUSH doing the partitioning and placement of primary replicas. The same functionality can be achieved by having logically separate placement (a ring with virtual nodes) and a replication strategy which implements the CRUSH algorithm for replica placement. I think you agreed with this further down your previous reply anyway; perhaps I was just being too verbose :) The reason I'm trying to make that distinction is that it would be less work than wholesale replacing the entire distribution logic in Cassandra with CRUSH. I'm not sure if that's exactly what your design is suggesting?

-- Sam Overton Acunu | http://www.acunu.com | @acunu
Re: RFC: Cassandra Virtual Nodes
On 20 March 2012 14:50, Rick Branson rbran...@datastax.com wrote:

To support a form of DF, I think some tweaking of the replica placement could achieve this effect quite well. We could introduce a variable into replica placement, which I'm going to (incorrectly) call DF for the purposes of illustration. The key range for a node would be sub-divided by DF (1 by default), and this would be used to further distribute replica selection based on this sub-partition. Currently, the offset formula works out to be something like this: offset = replica. For RandomPartitioner, DF placement might look something like: offset = replica + (token % DF). Now, I realize replica selection is actually much more complicated than this, but these formulas are for illustration purposes. Modifying replica placement and the partitioners to support this seems straightforward, but I'm unsure of what's required to get it working for ring management operations. On the surface, it does seem like this could be added without any kind of difficult migration support. Thoughts?

This solution increases the DF, which has the advantage of providing some balancing when a node is down temporarily: the reads and writes it would have served are now distributed around ~DF nodes. However, it doesn't provide any distributed rebuild. In fact, any distribution mechanism with one token per node cannot have distributed rebuild: should a node fail, the next node in the ring takes over twice the token range, so must hold twice the data. This node will limit the rebuild time - 'nodetool removetoken' will have to replicate the data of the failed node onto this node.

Increasing the distribution factor without speeding up rebuild increases the failure probability - both for data loss and for being unable to reach required consistency levels. The failure probability is a trade-off between rebuild time and distribution factor: lower rebuild time helps, and lower distribution factor helps.
Cassandra as it is now has the longest rebuild time and the lowest possible distribution factor. The original vnodes scheme is the other extreme: the shortest rebuild time and the largest possible distribution factor. It turns out that the rebuild time is more important, so this decreases failure probability (with some assumptions you can show it decreases by a factor of RF! - I'll spare you the math, but can send it if you're interested). This scheme has the longest rebuild time and a (tuneable) distribution factor larger than the lowest. That necessarily increases the failure probability over both Cassandra as it is now and vnode schemes, so I'd be very careful about choosing it.

Richard.
Re: RFC: Cassandra Virtual Nodes
Each node would have a lower and an upper token, which would form a range that would be actively distributed via gossip. Read and replication requests would only be routed to a replica when the key of these operations matched the replica's token range in the gossip tables. Each node would locally store its own current active token range as well as a target token range it's moving towards.

How is this significantly different from just using nodetool move (in post-1.x) more rapidly and on smaller segments at a time? There is the ring-delay stuff which makes it unworkable to do at high granularity, but that should apply to the active range solution too.

-- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: RFC: Cassandra Virtual Nodes
Hi Radim, The number of virtual nodes for each host would be configurable by the user, in much the same way that initial_token is configurable now. A host taking a larger number of virtual nodes (tokens) would have proportionately more of the data. This is how we anticipate support for heterogeneity in cluster hardware. Yes, but this is good only for the random partitioner. For the ordered partitioner you need to be able to split the token space on highly loaded servers. With virtual tokens it will move load to a random node. What if that random node is also a hotspot node? Administration will be more difficult because you don't know where the workload lands after you reduce the number of tokens held by a node.
Re: RFC: Cassandra Virtual Nodes
Hi Peter, It's great to hear that others have come to some of the same conclusions! I think a CRUSH-like strategy for topologically aware replication/routing/locality is a great idea. I think I can see three mostly orthogonal sets of functionality that we're concerned with: a) a virtual node partitioning scheme (to support heterogeneity and management simplicity) b) topology aware replication c) topology aware routing First of all, I think that while (c) depends on (b) it does not affect partitioning or replication directly, so I'm going to set that aside for the moment and talk just about the former two. I'll summarise your design here, mainly to make sure that I understand it, but also to refer back to it: 1. The hash-space is partitioned into a fixed number of partitions 2. The CRUSH algorithm is run - select(1, disk) - over the topology using each partition as a key, to get an assignment of partition -> physical host (primary) 2a. adding or removing a node requires re-running CRUSH to recalculate the partition assignment (and move data) 3. The CRUSH algorithm is run - select(RF-1, disk) - over the topology using each primary host id, to get an assignment of primary host -> RF-1 replicas 3a. adding or removing a node requires re-running CRUSH to recalculate replica assignment (which might be a different set of hosts to before?) Here are some thoughts: (clarification: when I'm talking about buckets, I'm referring to the same concept as in the CRUSH paper!) One of my concerns about using CRUSH exactly as described in the paper is that it seems to be sub-optimal in the amount of data that it moves after modifying the topology. The authors of the paper introduce several bucket types (uniform, list, tree, straw) which appear to be various sub-optimal alternatives to consistent hashing, with various trade-offs. Why not use consistent hashing? 
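The two-phase assignment summarized in steps 1-3 above could be sketched roughly as follows. Note that `select` here is a stand-in deterministic hash-based pick (essentially rendezvous hashing), not the real CRUSH bucket traversal; host names and partition counts are illustrative:

```python
# Loose sketch of the two-phase CRUSH-style mapping described above:
# partition -> primary host (step 2), then primary -> RF-1 replicas
# keyed on the primary host id (step 3).
import hashlib

HOSTS = ["h1", "h2", "h3", "h4", "h5", "h6"]

def select(key, n, exclude=()):
    """Deterministically pick n distinct hosts for a key (a stand-in
    for CRUSH's select; really just rendezvous hashing)."""
    ranked = sorted(
        (h for h in HOSTS if h not in exclude),
        key=lambda h: hashlib.md5(f"{key}:{h}".encode()).hexdigest())
    return ranked[:n]

NUM_PARTITIONS, RF = 16, 3
placement = {}
for p in range(NUM_PARTITIONS):
    primary = select(p, 1)[0]                              # step 2
    replicas = select(primary, RF - 1, exclude=[primary])  # step 3
    placement[p] = [primary] + replicas

print(placement[0])
```

Because step 3 is keyed on the primary host id alone, every partition with the same primary shares the same replica set - which is what limits the distribution factor relative to hashing per partition.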
Given (2a) and (3a) I think we might end up moving way too much data when the set of replicas changes completely for a given host. Let's suppose we introduce our own bucket type called a ring bucket. Each item in a ring bucket is assigned an equal, non-contiguous portion of the key hash-space, which determines which keys are assigned to it. When an item is added to the ring bucket, it takes an equal portion of the hash-space from every other item already in the bucket. And vice-versa for removals. It's easy to see that this ring bucket implements consistent hashing with some unspecified virtual node scheme. Additions and removals would be optimal (only Δw/W keys require moving when the topology changes). Using this ring bucket in the CRUSH topology, (with the hash function being the identity function) would give the exact same distribution properties as the virtual node strategy that I suggested previously, but of course with much better topology awareness. This makes it evident that the partitioning scheme, and a CRUSH-like replication scheme are orthogonal concerns. In the same way as NTS currently uses the ring to provide distribution at DC and rack level by conceptually separating the ring into a distinct logical ring for each DC, a CrushReplicationStrategy could use the ring as its bucketing function to distribute partitions in the topology. This brings me on to (1) and the reasons for our choice of virtual node scheme - choose N random tokens - instead of the Dynamo-like scheme that you suggest where the partitions are fixed in advance. With the Dynamo scheme, the size of a virtual node partition will only ever grow as more data is inserted. Since the number of partitions is fixed when the cluster is created, the partition size is unbounded. There are certain advantages to having a limit on partition size. Streaming failures that cause retries do not have to resend so much data. 
Streaming operations can be staggered in smaller chunks to minimise the impact on the nodes involved. Load balancing can operate on a finer granularity. In the N tokens per node scheme, adding nodes to the cluster decreases the partition size and so gives some control over how much data is stored in each partition. The average size can be reduced by adding more machines to the cluster. The other concern you mentioned was "The probability of data loss increases linearly with cluster size," but you also acknowledge that "In making this determination, one must take into account that if a larger `DF` makes reconstruction/replacement significantly faster, that also decreases the time window in which multiple failures can occur. Increasing `DF` is thus not *necessarily* increasing the total probability of data loss (for small values of `DF`)." Our calculations lead us to believe that in fact the shorter rebuild window more than compensates for the increased probability of multiple failure, so with DF=N the probability of data loss is minimised. The CRUSH paper also states: With 2-way mirroring these two factors cancel each other out, while overall data safety with
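The ring-bucket behaviour described above can be sketched with a minimal consistent-hashing implementation (the vnode count and host names are illustrative, not a proposal): adding an item takes a roughly equal slice from every existing item, so only about Δw/W of the keys move.

```python
# Minimal consistent-hashing sketch of the "ring bucket": each item
# owns many small, non-contiguous slices of the hash space, so adding
# an item steals a roughly equal share from every existing item.
import hashlib
from bisect import bisect

def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class RingBucket:
    def __init__(self, vnodes_per_item=64):
        self.vnodes = vnodes_per_item
        self.ring = []  # sorted list of (token, item)

    def add(self, item):
        for i in range(self.vnodes):
            self.ring.append((h(f"{item}:{i}"), item))
        self.ring.sort()

    def locate(self, key):
        tokens = [t for t, _ in self.ring]
        # Key belongs to the first vnode clockwise of its hash.
        return self.ring[bisect(tokens, h(key)) % len(self.ring)][1]

bucket = RingBucket()
for item in ["host1", "host2", "host3"]:
    bucket.add(item)

keys = [f"key{i}" for i in range(1000)]
before = {k: bucket.locate(k) for k in keys}
bucket.add("host4")  # takes an equal slice from every existing item
moved = sum(1 for k in keys if bucket.locate(k) != before[k])
print(f"{moved / len(keys):.0%} of keys moved")  # roughly a quarter, not all
```

Growing from 3 to 4 items moves roughly 1/4 of the keys, the optimal Δw/W fraction; a naive rehash-everything bucket would move nearly all of them.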
Re: RFC: Cassandra Virtual Nodes
On 19 March 2012 09:23, Radim Kolar h...@filez.com wrote: Hi Radim, The number of virtual nodes for each host would be configurable by the user, in much the same way that initial_token is configurable now. A host taking a larger number of virtual nodes (tokens) would have proportionately more of the data. This is how we anticipate support for heterogeneity in cluster hardware. Yes, but this is good only for random partitioner. For ordered you need to be able split token space on highly loaded servers. With virtual tokens it will move load to random node. What if random node will be also hotspot node? Administration will be more difficult because you don't know where workload lands after you reduce number of tokens held by node. For OPP we envisage an external management process performing active load balancing. The initial token assignment would be random within some user-specified range corresponding to the range of their keys. The load would then be monitored and hot-spots would be moved by reassigning virtual nodes to lightly loaded machines, or introducing new tokens into hot ranges. It makes sense that this would not be a manual process, but there would certainly be more control than just increasing or decreasing the number of tokens assigned to a node. -- Sam Overton Acunu | http://www.acunu.com | @acunu
Re: RFC: Cassandra Virtual Nodes
On Mon, Mar 19, 2012 at 4:15 PM, Sam Overton s...@acunu.com wrote: On 19 March 2012 09:23, Radim Kolar h...@filez.com wrote: Hi Radim, The number of virtual nodes for each host would be configurable by the user, in much the same way that initial_token is configurable now. A host taking a larger number of virtual nodes (tokens) would have proportionately more of the data. This is how we anticipate support for heterogeneity in cluster hardware. Yes, but this is good only for the random partitioner. For the ordered partitioner you need to be able to split the token space on highly loaded servers. With virtual tokens it will move load to a random node. What if that random node is also a hotspot node? Administration will be more difficult because you don't know where the workload lands after you reduce the number of tokens held by a node. For OPP we envisage an external management process performing active load balancing. The initial token assignment would be random within some user-specified range corresponding to the range of their keys. The load would then be monitored and hot-spots would be moved by reassigning virtual nodes to lightly loaded machines, or introducing new tokens into hot ranges. It makes sense that this would not be a manual process, but there would certainly be more control than just increasing or decreasing the number of tokens assigned to a node. -- Sam Overton Acunu | http://www.acunu.com | @acunu For OPP the problem of load balancing is more profound. Now you need vnodes per keyspace because you cannot expect each keyspace to have the same distribution. With three keyspaces you are not sure which one is causing the hotness. I think OPP should just go away.
Re: RFC: Cassandra Virtual Nodes
For OPP the problem of load balancing is more profound. Now you need vnodes per keyspace because you cannot expect each keyspace to have the same distribution. With three keyspaces you are not sure which one is causing the hotness. I think OPP should just go away. That's a good point, but isn't that the same problem with trying to balance tokens with OPP currently?
Re: RFC: Cassandra Virtual Nodes
On Mon, Mar 19, 2012 at 4:24 PM, Sam Overton s...@acunu.com wrote: For OPP the problem of load balancing is more profound. Now you need vnodes per keyspace because you cannot expect each keyspace to have the same distribution. With three keyspaces you are not sure which one is causing the hotness. I think OPP should just go away. That's a good point, but isn't that the same problem with trying to balance tokens with OPP currently? Yes. I was bringing this up because the external management process you suggested performing active load balancing will have to be smart enough to understand this. Right now, since this is done manually, it is the user's problem.
Re: RFC: Cassandra Virtual Nodes
I think if we could go back and rebuild Cassandra from scratch, vnodes would likely be implemented from the beginning. However, I'm concerned that implementing them now could be a big distraction from more productive uses of all of our time and introduce major potential stability issues into what is becoming a business critical piece of infrastructure for many people. However, instead of just complaining and pedantry, I'd like to offer a feasible alternative: Has there been consideration given to the idea of supporting a single token range for a node? While not theoretically as capable as vnodes, it seems to me to be more practical as it would have a significantly lower impact on the codebase and provides a much clearer migration path. It also seems to solve a majority of complaints regarding operational issues with Cassandra clusters. Each node would have a lower and an upper token, which would form a range that would be actively distributed via gossip. Read and replication requests would only be routed to a replica when the key of these operations matched the replica's token range in the gossip tables. Each node would locally store its own current active token range as well as a target token range it is moving towards. As a new node undergoes bootstrap, the bounds would be gradually expanded to allow it to handle requests for a wider range of the keyspace as it moves towards its target token range. This idea boils down to a move from hard cutovers to smoother operations by gradually adjusting active token ranges over a period of time. It would apply to token change operations (nodetool 'move' and 'removetoken') as well. Failure during streaming could be recovered at the bounds instead of restarting the whole process, as the active bounds would effectively track the progress of bootstrap and target token changes. Implicitly these operations would be throttled to some degree. 
Node repair (AES) could also be modified using the same overall ideas to provide a more gradual impact on the cluster, similar to the ideas given in CASSANDRA-3721. While this doesn't spread the load of these operations over the cluster as evenly as vnodes do, this could likely be worked around by performing concurrent (throttled) bootstrap and node repair (AES) operations. It does allow some kind of active load balancing, though clearly not as flexible or as useful as vnodes - but you should be using RandomPartitioner or sort-of-randomized keys with OPP anyway, right? ;) As a side note: vnodes fail to provide solutions to node-based limitations that seem to me to cause a substantial portion of operational issues, such as the impact of node restarts / upgrades and GC and compaction induced latency. I think some progress could be made here by allowing a pack of independent Cassandra nodes to be run on a single host; somewhat (but nowhere near entirely) similar to a pre-fork model used by some UNIX-based servers. Input? -- Rick Branson DataStax
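The gradual active-range idea described above could be sketched like this. All names are hypothetical and the token space is simplified to small integers; this only illustrates widening an active range toward a target as streaming progresses:

```python
# Hypothetical sketch of the "active token range" proposal: a
# bootstrapping node advertises a current active range that is
# gradually widened toward its target range, and it only serves
# requests whose keys fall inside the active range.

class BootstrappingNode:
    def __init__(self, target_lower, target_upper):
        self.target = (target_lower, target_upper)
        # Start with an empty range at the lower bound.
        self.active = (target_lower, target_lower)

    def expand(self, step):
        """Widen the active upper bound toward the target as data for
        the newly covered slice finishes streaming. A streaming failure
        only loses progress past the current bound, not the whole range."""
        lower, upper = self.active
        self.active = (lower, min(upper + step, self.target[1]))

    def serves(self, token):
        lower, upper = self.active
        return lower < token <= upper

node = BootstrappingNode(0, 100)
node.expand(25)
print(node.serves(10), node.serves(60))  # True False
node.expand(50)
print(node.serves(60))  # True
```

The key property is that the active bound doubles as a progress marker: restarting after a failed stream resumes from the bound rather than re-streaming the entire target range.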
Re: RFC: Cassandra Virtual Nodes
a) a virtual node partitioning scheme (to support heterogeneity and management simplicity) b) topology aware replication c) topology aware routing I would add (d) limiting the distribution factor to decrease the probability of data loss/multiple failures within a replica set. First of all, I think that while (c) depends on (b) it does not affect partitioning or replication directly, so I'm going to set that aside for the moment and talk just about the former two. Agreed (but I think (d) relates). 1. The hash-space is partitioned into a fixed number of partitions 2. The CRUSH algorithm is run - select(1, disk) - over the topology using each partition as a key, to get an assignment of partition -> physical host (primary) 2a. adding or removing a node requires re-running CRUSH to recalculate the partition assignment (and move data) (or any other arbitrary change, yes) 3. The CRUSH algorithm is run - select(RF-1, disk) - over the topology using each primary host id, to get an assignment of primary host -> RF-1 replicas 3a. adding or removing a node requires re-running CRUSH to recalculate replica assignment (which might be a different set of hosts to before?) Yes. Cassandra would have a minimum of two topologies; the current and the next topology. Each would imply a mapping of partition -> replica set, and that mapping will potentially be different between the two. Reads would always be served from the current topology. Writes would go to the union of the current and the next topology, taking care to tie replicas together correctly for consistency level purposes (this is what CASSANDRA-3901 and CASSANDRA-3833 are talking about). Any topology change is treated the same from the read/write path perspective, regardless of whether you're adding a node, removing a node, adding an entire rack, or even an entire data center. No added complexity is introduced beyond the base case. 
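The read/write behaviour during a topology transition described above might be sketched as follows (a toy model with made-up hosts and partition counts, not actual Cassandra code): reads use only the current topology, while writes go to the union of the current and next topologies so no future replica misses an update.

```python
# Sketch of the two-topology transition: reads from the current
# mapping, writes to the union of current and next mappings.

def replicas(topology, key):
    """topology: dict mapping partition id -> list of replica hosts."""
    return topology[key % len(topology)]

current = {0: ["A", "B"], 1: ["B", "C"], 2: ["C", "A"]}
next_top = {0: ["A", "D"], 1: ["B", "C"], 2: ["D", "A"]}  # host D joining

def read_targets(key):
    return replicas(current, key)

def write_targets(key):
    return sorted(set(replicas(current, key)) | set(replicas(next_top, key)))

print(read_targets(0))   # ['A', 'B']
print(write_targets(0))  # ['A', 'B', 'D'] - D receives writes while joining
```

Once the next topology's data has finished streaming, it simply becomes the current topology; adding a node, a rack, or a data center all look the same to the read/write path.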
One of my concerns about using CRUSH exactly as described in the paper is that it seems to be sub-optimal in the amount of data that it moves after modifying the topology. The authors of the paper introduce several bucket types (uniform, list, tree, straw) which appear to be various sub-optimal alternatives to consistent hashing, with various trade-offs. Why not use consistent hashing? Given (2a) and (3a) I think we might end up moving way too much data when the set of replicas changes completely for a given host. One of the benefits of pre-partitioning to a fixed set of partitions is that we can pre-calculate the mapping. This removes the CPU efficiency trade-off of the straw bucket, and the straw bucket would be a good choice. Consistent hashing: It's totally doable to use consistent hashing at each node in the topology. It is not without its own trade-offs though, because the granularity of weighting you want to support, and the accuracy of it, relate directly to the number of vnodes per child you need to keep in your consistent hashing ring. Taking granularity, accuracy target and number of children into account can easily lead to very large numbers of vnodes. (At least experimentally from when I've implemented and played with the simple form of consistent hashing in the past. I don't currently have good mathematical evidence.) Let's suppose we introduce our own bucket type called a ring bucket. Each item in a ring bucket is assigned an equal, non-contiguous portion of the key hash-space, which determines which keys are assigned to it. When an item is added to the ring bucket, it takes an equal portion of the hash-space from every other item already in the bucket. And vice-versa for removals. It's easy to see that this ring bucket implements consistent hashing with some unspecified virtual node scheme. Additions and removals would be optimal (only Δw/W keys require moving when the topology changes). 
This is my understanding of what you meant by consistent hashing, and what I refer to above. Using this ring bucket in the CRUSH topology, (with the hash function being the identity function) would give the exact same distribution properties as the virtual node strategy that I suggested previously, but of course with much better topology awareness. I will have to re-read your original post. I seem to have missed something :) This makes it evident that the partitioning scheme, and a CRUSH-like replication scheme are orthogonal concerns. In the same way as NTS currently uses the ring to provide distribution at DC and rack level by conceptually separating the ring into a distinct logical ring for each DC, a CrushReplicationStrategy could use the ring as its bucketing function to distribute partitions in the topology. Yes, agreed. Also, the distribution factor limiting is also compatible with non-crush by hash chaining from the primary replica instead of the row key. This brings me on to (1) and the reasons for our choice of virtual node scheme - choose N random tokens -
Re: RFC: Cassandra Virtual Nodes
Using this ring bucket in the CRUSH topology, (with the hash function being the identity function) would give the exact same distribution properties as the virtual node strategy that I suggested previously, but of course with much better topology awareness. I will have to re-read your original post. I seem to have missed something :) I did, and I may or may not understand what you mean. Are you comparing vnodes + hashing, with CRUSH + pre-partitioning by hash + identity hash as you traverse down the topology tree? -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: RFC: Cassandra Virtual Nodes
(I may comment on other things more later) As a side note: vnodes fail to provide solutions to node-based limitations that seem to me to cause a substantial portion of operational issues such as impact of node restarts / upgrades, GC and compaction induced latency. I Actually, it does. At least assuming DF > RF (as in the original proposal, and mine). The impact of a node suffering from a performance degradation is mitigated because the effects are spread out over DF-1 (N-1 in the original post) nodes instead of just RF nodes. think some progress could be made here by allowing a pack of independent Cassandra nodes to be run on a single host; somewhat (but nowhere near entirely) similar to a pre-fork model used by some UNIX-based servers. I have pretty significant knee-jerk negative reactions to that idea to be honest, even if the pack is limited to a handful of instances. In order for vnodes to be useful with random placement, we'd need much more than a handful of vnodes per node (cassandra instances in a pack in that model). -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: RFC: Cassandra Virtual Nodes
I also did create a ticket https://issues.apache.org/jira/browse/CASSANDRA-3768 with some of the reasons why I would like to see vnodes in Cassandra. It can also potentially reduce the SSTable seeks which a node has to do to query data in SizeTieredCompaction if extended to the filesystem. But 110% agree with Peter, we need to take incremental steps and start with the existing bootstrapping. Maybe we can start it by making a set of Ranges/Tokens map to a node instead of one token, and then maybe building things around the movement of those ranges. I have been thinking about this for a while but having trouble getting to a point where I am comfortable changing big chunks of code. Regards, /VJ On Mon, Mar 19, 2012 at 4:45 PM, Peter Schuller peter.schul...@infidyne.com wrote: (I may comment on other things more later) As a side note: vnodes fail to provide solutions to node-based limitations that seem to me to cause a substantial portion of operational issues such as impact of node restarts / upgrades, GC and compaction induced latency. I Actually, it does. At least assuming DF > RF (as in the original proposal, and mine). The impact of a node suffering from a performance degradation is mitigated because the effects are spread out over DF-1 (N-1 in the original post) nodes instead of just RF nodes. think some progress could be made here by allowing a pack of independent Cassandra nodes to be run on a single host; somewhat (but nowhere near entirely) similar to a pre-fork model used by some UNIX-based servers. I have pretty significant knee-jerk negative reactions to that idea to be honest, even if the pack is limited to a handful of instances. In order for vnodes to be useful with random placement, we'd need much more than a handful of vnodes per node (cassandra instances in a pack in that model). -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: RFC: Cassandra Virtual Nodes
On Mon, Mar 19, 2012 at 4:45 PM, Peter Schuller peter.schul...@infidyne.com wrote: As a side note: vnodes fail to provide solutions to node-based limitations that seem to me to cause a substantial portion of operational issues such as impact of node restarts / upgrades, GC and compaction induced latency. I Actually, it does. At least assuming DF > RF (as in the original proposal, and mine). The impact of a node suffering from a performance degradation is mitigated because the effects are spread out over DF-1 (N-1 in the original post) nodes instead of just RF nodes. You've got me on one of those after some re-thought. Any node outage (an upgrade/restart) definitely has a big impact, mitigated by distributing the load more evenly, but (and correct me if I'm wrong) for things like additional latency caused by GC/compaction, those requests will just be slower rather than timing out or getting redirected via the dynamic snitch. think some progress could be made here by allowing a pack of independent Cassandra nodes to be run on a single host; somewhat (but nowhere near entirely) similar to a pre-fork model used by some UNIX-based servers. I have pretty significant knee-jerk negative reactions to that idea to be honest, even if the pack is limited to a handful of instances. In order for vnodes to be useful with random placement, we'd need much more than a handful of vnodes per node (cassandra instances in a pack in that model). Fair enough, I'm not super fond of the idea personally, but I don't see a way around the limitations of the current JVM GC without multiple processes. After rethinking my ideas a bit, I think what I've settled on is to keep the existing node tokens, but add an additional active token that would be used to determine the data range that a node is ready to receive reads for. This should gain all of the benefits highlighted in my earlier post, but with less complexity in implementation. 
Node repair (AES) would still allow ranges to be specified.
Re: RFC: Cassandra Virtual Nodes
On Mon, Mar 19, 2012 at 9:37 PM, Vijay vijay2...@gmail.com wrote: I also did create a ticket https://issues.apache.org/jira/browse/CASSANDRA-3768 with some of the reasons why I would like to see vnodes in Cassandra. It can also potentially reduce the SSTable seeks which a node has to do to query data in SizeTieredCompaction if extended to the filesystem. But 110% agree with Peter, we need to take incremental steps and start with the existing bootstrapping. I'm guessing you're referring to Rick's proposal about ranges per node? Maybe we can start it by making a set of Ranges/Tokens map to a node instead of one token, and then maybe building things around the movement of those ranges. I have been thinking about this for a while but having trouble getting to a point where I am comfortable changing big chunks of code. It might help to see some more detail in the proposals. Both ideas seem invasive, vnodes more so, but there are many more benefits as well. On Mon, Mar 19, 2012 at 4:45 PM, Peter Schuller peter.schul...@infidyne.com wrote: (I may comment on other things more later) As a side note: vnodes fail to provide solutions to node-based limitations that seem to me to cause a substantial portion of operational issues such as impact of node restarts / upgrades, GC and compaction induced latency. I Actually, it does. At least assuming DF > RF (as in the original proposal, and mine). The impact of a node suffering from a performance degradation is mitigated because the effects are spread out over DF-1 (N-1 in the original post) nodes instead of just RF nodes. think some progress could be made here by allowing a pack of independent Cassandra nodes to be run on a single host; somewhat (but nowhere near entirely) similar to a pre-fork model used by some UNIX-based servers. I have pretty significant knee-jerk negative reactions to that idea to be honest, even if the pack is limited to a handful of instances. 
In order for vnodes to be useful with random placement, we'd need much more than a handful of vnodes per node (cassandra instances in a pack in that model). -- Eric Evans Acunu | http://www.acunu.com | @acunu
Re: RFC: Cassandra Virtual Nodes
On Mon, Mar 19, 2012 at 8:24 PM, Eric Evans eev...@acunu.com wrote: I'm guessing you're referring to Rick's proposal about ranges per node? Maybe; what I mean is a little simpler than that... We can consider every node having multiple conservative ranges and moving those ranges for bootstrap etc., instead of finding the mid point etc. in the bootstrap code. Once we get that working all the way to the FS/streaming layer, then we can move those ranges and assign those ranges to nodes in random order. Hope it makes sense.
Re: RFC: Cassandra Virtual Nodes
I don't like that every node will have the same portion of data. 1. We are using nodes with different HW sizes (number of disks) 2. Especially with the ordered partitioner there tend to be hotspots, and you must assign a smaller portion of data to nodes holding hotspots
Re: RFC: Cassandra Virtual Nodes
On 17 March 2012 11:15, Radim Kolar h...@filez.com wrote: I don't like that every node will have same portion of data. 1. We are using nodes with different HW sizes (number of disks) 2. especially with ordered partitioner there tends to be hotspots and you must assign smaller portion of data to nodes holding hotspots Hi Radim, The number of virtual nodes for each host would be configurable by the user, in much the same way that initial_token is configurable now. A host taking a larger number of virtual nodes (tokens) would have proportionately more of the data. This is how we anticipate support for heterogeneity in cluster hardware. Sam -- Sam Overton Acunu | http://www.acunu.com | @acunu
Re: RFC: Cassandra Virtual Nodes
On Sat, Mar 17, 2012 at 7:38 AM, Sam Overton s...@acunu.com wrote: Hello cassandra-dev, This is a long email. It concerns a significant change to Cassandra, so deserves a thorough introduction. *The summary is*: we believe virtual nodes are the way forward. We would like to add virtual nodes to Cassandra and we are asking for comments, criticism and collaboration! Cassandra's current partitioning scheme is sub-optimal for bootstrap, decommission, repair and re-balance operations, and places the burden on users to properly calculate tokens (a common cause of mistakes), which is a recurring pain-point. Virtual nodes have a variety of benefits over the one-to-one mapping of host to key range which Cassandra currently supports. Among these benefits are: * Even load balancing when growing and shrinking the cluster A virtual node scheme ensures that all hosts in a cluster have an even portion of the total data, and a new node bootstrapped into the cluster will assume its share of the data. Doubling, or halving the cluster to ensure even load distribution would no longer be necessary. * Distributed rebuild When sizing a cluster, one of the considerations is the amount of time required to recover from a failed node. This is the exposure time, during which a secondary failure could cause data loss. In order to guarantee an upper bound on the exposure time, the amount of data which can be stored on each host is limited by the amount of time taken to recover the required replica count. At Acunu we have found that the exposure time is frequently the limiting factor which dictates the maximum allowed node size in customers' clusters. Using a virtual node scheme, the data stored on one host is not replicated on just RF-1 other physical hosts. Each virtual node is replicated to RF-1 other virtual nodes which may be on a different set of physical hosts to replicas of other virtual nodes stored on the same host. 
This means data for one host is replicated evenly across the entire cluster. In the event of a failure then, restoring the replica count can be done in a fully distributed way. Each host in the cluster participates in the rebuild, drastically reducing the exposure time, allowing more data to be stored on a single host while still maintaining an acceptable upper bound on the likelihood of secondary failure. This reduces TCO concerns. * Greater failure tolerance in streaming Operations which require streaming of a large range of data, eg. bootstrap, decommission, repair, etc. incur a heavy cost if an error (eg. dropped network connection) is encountered during the streaming. Currently the whole range must be re-streamed, and this could constitute a very large amount of data. Virtual nodes reduce the impact of streaming failures, since each virtual node is a much smaller range of the key-space, so re-streaming a whole virtual node is a much cheaper process. * Evenly distributed impact of streaming operations Streaming operations such as bootstrap, repair, et al. would involve every node in the cluster. This would distribute the load of these operations across the whole cluster, and could be staggered so that only a small subset of nodes were affected at once, similar to staggered repair[1]. * Possibility for active load balancing Load balancing in Cassandra currently involves moving a token to increase/reduce the amount of key-space for which a host is responsible. This only allows load balancing between neighbouring nodes, so it could involve moving more than one token just to redistribute a single overloaded node. Virtual nodes could allow load balancing on a much finer granularity, so heavily loaded portions of the key-space could be redistributed to lighter-loaded hosts by reassigning one or more virtual nodes. 
Implementing a virtual node scheme in Cassandra is not an insignificant amount of work, and it will touch a large amount of the codebase related to partitioning, placement, routing, gossip, and so on. We do believe that this is possible to do incrementally, and in such a way that there is an easy upgrade path for pre-virtual-node deployments. It would not however touch the storage layer. The virtual node concept is solely for partitioning and placement, not for segregating the data storage of the host, so all keys for all virtual nodes on a host would be stored in the same SSTables. We are not proposing the adoption of the same scheme used by Voldemort[2] and described in the Dynamo paper[3]. We feel this scheme is too different from Cassandra's current distribution model to be a viable target for incremental development. Their scheme also fixes the number of virtual nodes for the lifetime of the cluster, which can prove to be a ceiling to scaling the cluster if the virtual nodes grow too large. The proposed design is: * Assign each host T random tokens. Will it work well for OrderedPartitioner? For load balance
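The first step of the proposed design - assign each host T random tokens - can be sketched as follows (T, the token space, and the host names are illustrative, as is the fixed seed used to make the sketch deterministic):

```python
# Sketch of "assign each host T random tokens": each host picks T
# random positions on the ring, so its data is split into T small,
# non-contiguous ranges interleaved with every other host's ranges.
import random

TOKEN_SPACE = 2**127  # RandomPartitioner-style token space

def assign_tokens(T, rng=random.Random(42)):
    return sorted(rng.randrange(TOKEN_SPACE) for _ in range(T))

ring = []
for host in ["host1", "host2", "host3"]:
    for token in assign_tokens(T=8):
        ring.append((token, host))
ring.sort()

# Each host now owns 8 interleaved slices of the ring rather than one
# contiguous arc, so a joining or rebuilding host exchanges data with
# many peers instead of just its neighbours.
print([host for _, host in ring[:6]])
```

This is what gives the distributed-rebuild and fine-grained load-balancing properties described earlier: the ranges a host owns, and their replicas, are scattered across the whole cluster.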
Re: RFC: Cassandra Virtual Nodes
On Sat, Mar 17, 2012 at 11:15 AM, Radim Kolar h...@filez.com wrote:
> I don't like that every node will have the same portion of data.
> 1. We are using nodes with different HW sizes (number of disks).
> 2. Especially with the ordered partitioner there tend to be hotspots, and you must assign a smaller portion of the data to the nodes holding them.

Yes, these are exactly the sorts of problems that virtual nodes are meant to make easier.

-- Eric Evans Acunu | http://www.acunu.com | @acunu
Re: RFC: Cassandra Virtual Nodes
I agree that having smaller regions would help the rebalancing situation, both with RP and BOP. However, I am not sure that dividing tables across disks will give any better performance. You will have more seeking spindles, and you could possibly sub-divide token ranges into separate files, but the FS cache is shared across all disks, so that is a wash.

On Saturday, March 17, 2012, Eric Evans eev...@acunu.com wrote:
> Yes, these are exactly the sorts of problems that virtual nodes are meant to make easier.
Re: RFC: Cassandra Virtual Nodes
On Sat, Mar 17, 2012 at 3:22 PM, Zhu Han schumi@gmail.com wrote, quoting Sam Overton's original proposal (reproduced in full earlier in this thread):
> The proposed design is:
> * Assign each host T random tokens.

Will it work well for OrderedPartitioner? For load balancing?
Re: RFC: Cassandra Virtual Nodes
> *The summary is*: we believe virtual nodes are the way forward. We would like to add virtual nodes to Cassandra and we are asking for comments, criticism and collaboration!

I am very happy to see some momentum on this, and I would like to go even further than what you propose. The main reasons why I do not think simply adding vnodes and making random assignments is the best end goal are:

(1) The probability of data loss increases linearly with cluster size.
(2) It does not take network topology into account.

What follows is mostly a half-finished long text that I have been sitting on for a few months but not finished/posted. Unfortunately, due to time constraints, I do not have the possibility to go through everything in detail, update it with current information, and specifically address what you already said, so there will be overlap with your original post. However, given your email and the momentum in this thread, I really wanted to post something rather than not. It would be awesome if interested parties had a chance to read the referenced CRUSH paper, and the deltas proposed below.

The goals are to address everything you already wanted in your post, while also addressing:

(1) Probability of data loss
(2) Network topology awareness

The following text will first try to paint a picture of the goals that I have in mind, and then go on to the actual proposed solution. The proposition is very short and undetailed for now, and there is plenty of discussion and detail to fill in. I apologize, but again, I really want to post something now that this is being brought up.

BEGIN un-polished text (we = I):

= CRUSHing Cassandra
Author: Peter Schuller peter.schul...@infidyne.com

This is a proposal for a significant re-design of some fundamentals of Cassandra, aimed at addressing a number of current issues as well as anticipating future issues. It is particularly aimed at large clusters, but as a side-effect should improve the small-cluster experience as well.
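Point (1) can be illustrated with a small Monte Carlo sketch (my own toy model, not from the CRUSH paper): when each vnode's replica set is drawn uniformly at random, DF is effectively N, and the chance that some vnode loses all of its replicas grows with the number of hosts:

```python
import random

def loss_probability(n_hosts, vnodes_per_host, rf, p_fail, trials=2000, seed=1):
    """Estimate P(some vnode has all rf replicas on failed hosts) when each
    vnode's replica set is drawn uniformly at random (DF effectively = N)
    and hosts fail independently with probability p_fail."""
    rng = random.Random(seed)
    hosts = range(n_hosts)
    replica_sets = [tuple(rng.sample(hosts, rf))
                    for _ in range(n_hosts * vnodes_per_host)]
    losses = 0
    for _ in range(trials):
        failed = {h for h in hosts if rng.random() < p_fail}
        if any(all(h in failed for h in rs) for rs in replica_sets):
            losses += 1
    return losses / trials

small = loss_probability(n_hosts=12, vnodes_per_host=32, rf=3, p_fail=0.05)
large = loss_probability(n_hosts=48, vnodes_per_host=32, rf=3, p_fail=0.05)
# More hosts -> more distinct replica sets -> more ways for some rf-subset
# of simultaneously failed hosts to wipe out a vnode entirely.
print(small, large)
```

The per-host failure probability and vnode count here are arbitrary; the point is only the trend, which matches the "linearly with cluster size" claim above.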
== New terminology: Distribution factor

A Cassandra cluster is today said to have `N` nodes, and data is replicated at a particular replication factor (`RF`). The placement of replicas is such that all rows that have a certain node `N1` as their primary replica are located on a specific set of `RF-1` other nodes. In addition, `N1` holds secondary replicas of data for `RF-1` other nodes. In total, it shares data with `2RF - 2` other nodes. The number of nodes with which a node shares data is the distribution factor. In the case of Cassandra, `DF = 2RF - 2`.

== Goals

The goals this suggestion attempts to help solve include the following.

=== Goal: DF should not be tied to RF, nor N

`DF` is important for these reasons:

* The `DF` determines how many nodes are involved in re-constructing a lost node after failure; the higher the `DF`, the less of a performance impact reconstruction has on the remaining nodes.
* The `DF` determines the significance, with respect to read/write load on other nodes, of a node being down.
* The `DF` affects the probability of multiple failures causing data loss, since one loses data if any `RF` nodes within a group of `DF` nodes all go down.

Having `DF` tied to `RF` as Cassandra does now has its problems. A single node failure has a significant effect on the performance characteristics of neighboring nodes (in terms relative to the normal level of load on the neighbors). On large data sets, a failed node needing reconstruction is a significant event, as it:

* Increases the load on neighbors just from going down.
* Further increases the load on neighbors as they have to stream data (adding I/O and cache thrashing).

This typically leads to the desire to throttle / rate-limit reconstruction, which adds to the reconstruction window, in addition to the fact that reconstruction was already naturally bottlenecking on the neighbors.
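The `DF = 2RF - 2` figure can be sanity-checked with a toy model of today's placement (a sketch of SimpleStrategy-style placement, where a range is stored on its primary and the next `RF-1` hosts clockwise around the ring):

```python
N, RF = 10, 3  # hosts on the ring (in token order) and replication factor

def replicas(primary, n=N, rf=RF):
    """SimpleStrategy-style placement: primary plus the next rf-1 hosts."""
    return {(primary + i) % n for i in range(rf)}

node = 0
partners = set()  # every other node appearing in a replica set with `node`
for primary in range(N):
    rs = replicas(primary)
    if node in rs:
        partners |= rs - {node}

print(sorted(partners))  # [1, 2, 8, 9]
print(len(partners))     # 4 == 2*RF - 2
```

Node 0 shares data only with its `RF-1` successors (which hold its secondary replicas) and its `RF-1` predecessors (whose secondary replicas it holds), giving `DF = 2RF - 2` regardless of `N`.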
The other extreme is to tie `DF` to `N`, such that the data contained on one node has its secondary replicas spread out over the entire ring. This is an unacceptable choice because the probability that some combination of failures causes data loss increases linearly with the cluster size. In other words, we want `DF` to be tied to neither `RF` nor `N`. Rather, `DF` should be chosen as a trade-off between its effects:

* The higher the `DF`, the higher the probability of data loss in case of multiple failures.
* The higher the `DF`, the faster a lost node can be reconstructed/replaced.
* The higher the `DF`, the less impact node failures have on the performance requirements of other nodes.

In making this determination, one must take into account that if a larger `DF` makes reconstruction/replacement significantly faster, it also decreases the time window in which multiple failures can occur. Increasing `DF` thus does not *necessarily* increase the total probability of data loss (for small values of `DF`).

=== Goal: Topologically aware redundancy

We maintain the goal of being topology aware for the purpose
Re: RFC: Cassandra Virtual Nodes
Point of clarification: my use of the term "bucket" is completely unrelated to the term "bucket" used in the CRUSH paper. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)