Re: RFC: Cassandra Virtual Nodes

2012-04-10 Thread Sam Overton
There is now a parent ticket for this issue in JIRA:
https://issues.apache.org/jira/browse/CASSANDRA-4119

Comments and contributions are still welcome!

Cheers,

Sam

On 16 March 2012 23:38, Sam Overton s...@acunu.com wrote:
 Hello cassandra-dev,

 This is a long email. It concerns a significant change to Cassandra, so
 deserves a thorough introduction.

 The summary is: we believe virtual nodes are the way forward. We would like
 to add virtual nodes to Cassandra and we are asking for comments, criticism
 and collaboration!

 Cassandra's current partitioning scheme is sub-optimal for bootstrap,
 decommission, repair and re-balance operations, and places the burden on
 users to properly calculate tokens (a common cause of mistakes), which is a
 recurring pain-point.

 Virtual nodes have a variety of benefits over the one-to-one mapping of host
 to key range which Cassandra currently supports.

 Among these benefits are:

 * Even load balancing when growing and shrinking the cluster
 A virtual node scheme ensures that all hosts in a cluster have an even
 portion of the total data, and a new node bootstrapped into the cluster will
 assume its share of the data. Doubling, or halving the cluster to ensure
 even load distribution would no longer be necessary.

 * Distributed rebuild
 When sizing a cluster, one of the considerations is the amount of time
 required to recover from a failed node. This is the exposure time, during
 which a secondary failure could cause data loss. In order to guarantee an
 upper bound on the exposure time, the amount of data which can be stored on
 each host is limited by the amount of time taken to recover the required
 replica count. At Acunu we have found that the exposure time is frequently
 the limiting factor which dictates the maximum allowed node size in
 customers' clusters.

 Using a virtual node scheme, the data stored on one host is not replicated
 on just RF-1 other physical hosts. Each virtual node is replicated to RF-1
 other virtual nodes which may be on a different set of physical hosts to
 replicas of other virtual nodes stored on the same host. This means data for
 one host is replicated evenly across the entire cluster.

 In the event of a failure then, restoring the replica count can be done in a
 fully distributed way. Each host in the cluster participates in the rebuild,
 drastically reducing the exposure time, allowing more data to be stored on a
 single host while still maintaining an acceptable upper bound on the
 likelihood of secondary failure. This reduces TCO concerns.
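A toy sketch of this replica spread (illustration only, not Cassandra code; the host count, vnode count and replication factor below are arbitrary): it builds a ring of randomly assigned tokens and counts how many distinct hosts end up holding replicas of one host's data.

import java.util.*;

public class VnodeReplicaSpread
{
    public static void main(String[] args)
    {
        int hosts = 20, vnodesPerHost = 64, rf = 3;
        Random rnd = new Random(42);

        // ring: token -> host, with tokens chosen uniformly at random per host
        TreeMap<Long, Integer> ring = new TreeMap<>();
        for (int h = 0; h < hosts; h++)
            for (int v = 0; v < vnodesPerHost; v++)
                ring.put(rnd.nextLong(), h);

        // for every range owned by host 0, walk clockwise to find the RF-1
        // distinct other hosts holding replicas of that range
        Set<Integer> replicaHosts = new HashSet<>();
        for (Map.Entry<Long, Integer> vnode : ring.entrySet())
        {
            if (vnode.getValue() != 0)
                continue;
            Set<Integer> distinct = new HashSet<>(Collections.singleton(0));
            long t = vnode.getKey();
            while (distinct.size() < rf)
            {
                Map.Entry<Long, Integer> next = ring.higherEntry(t);
                if (next == null)
                    next = ring.firstEntry(); // wrap around the ring
                t = next.getKey();
                distinct.add(next.getValue());
            }
            distinct.remove(0);
            replicaHosts.addAll(distinct);
        }
        System.out.println("host 0's data is replicated on " + replicaHosts.size()
                           + " of the other " + (hosts - 1) + " hosts");
    }
}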

 * Greater failure tolerance in streaming
 Operations which require streaming of a large range of data, eg. bootstrap,
 decommission, repair, etc. incur a heavy cost if an error (eg. dropped
 network connection) is encountered during the streaming. Currently the whole
 range must be re-streamed, and this could constitute a very large amount of
 data. Virtual nodes reduce the impact of streaming failures, since each
 virtual node is a much smaller range of the key-space, so re-streaming a
 whole virtual node is a much cheaper process.

 * Evenly distributed impact of streaming operations
 Streaming operations such as bootstrap, repair, et al. would involve every
 node in the cluster. This would distribute the load of these operations
 across the whole cluster, and could be staggered so that only a small subset
 of nodes were affected at once, similar to staggered repair[1].

 * Possibility for active load balancing
 Load balancing in Cassandra currently involves moving a token to
 increase/reduce the amount of key-space for which a host is responsible.
 This only allows load balancing between neighbouring nodes, so it could
 involve moving more than one token just to redistribute a single overloaded
 node. Virtual nodes could allow load balancing on a much finer granularity,
 so heavily loaded portions of the key-space could be redistributed to
 lighter-loaded hosts by reassigning one or more virtual nodes.


 Implementing a virtual node scheme in Cassandra is not an insignificant
 amount of work, and it will touch a large amount of the codebase related to
 partitioning, placement, routing, gossip, and so on. We do believe that this
 is possible to do incrementally, and in such a way that there is an easy
 upgrade path for pre-virtual-node deployments.

 It would not however touch the storage layer. The virtual node concept is
 solely for partitioning and placement, not for segregating the data storage
 of the host, so all keys for all virtual nodes on a host would be stored in
 the same SSTables.

 We are not proposing the adoption of the same scheme used by Voldemort[2]
 and described in the Dynamo paper[3]. We feel this scheme is too different
 from Cassandra's current distribution model to be a viable target for
 incremental development. Their scheme also fixes the number of virtual nodes
 for the lifetime of the cluster, which can prove to be a ceiling to scaling
 the cluster if 

Re: RFC: Cassandra Virtual Nodes

2012-03-23 Thread Ben Coverston
 The SSTable indices should still be scanned for size tiered compaction.
 Do I miss anything here?


No, I don't think you did. In fact, depending on the size of your SSTable, a
contiguous range (or the entire SSTable) may or may not be affected by a
cleanup/move or any type of topology change. There is lots of room for
optimization here. After loading the indexes we actually know the start/end
range for an SSTable, so we can include/exclude it in any such operation.
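A rough illustration of that include/exclude check -- a hypothetical helper, not the actual Cassandra API, and it ignores ranges that wrap around the ring:

final class SSTableRangeFilter
{
    /**
     * Decide whether an SSTable can be skipped for an operation that only
     * touches one vnode's (start, end] token range.
     */
    static boolean canSkip(long sstableFirstToken, long sstableLastToken,
                           long rangeStart, long rangeEnd)
    {
        // skip if the sstable lies entirely before or entirely after the range
        return sstableLastToken <= rangeStart || sstableFirstToken > rangeEnd;
    }
}

As Peter notes below, with randomly hashed keys most size-tiered SSTables will span nearly the whole ring, so a check like this mostly pays off when SSTables are range-correlated (e.g. leveled).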

-- 
Ben Coverston
DataStax -- The Apache Cassandra Company


Re: RFC: Cassandra Virtual Nodes

2012-03-23 Thread Peter Schuller
 No I don't think you did, in fact, depending on the size of your SSTable a
 contiguous range (or the entire SSTable) may or may not be affected by a
 cleanup/move or any type of topology change. There is lots of room for
 optimization here. After loading the indexes we actually know start/end
 range for an SSTable so we can include/exclude it in any such operation.

Just note that unless there is some correlation between range and
these sstables being created to begin with (like with leveled), you're
highly unlikely to be able to optimize here. For uniformly distributed
tokens (hashed keys), all sstables are likely to have almost the
entire possible token range in them.

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


Re: RFC: Cassandra Virtual Nodes

2012-03-23 Thread Zhu Han
On Sat, Mar 24, 2012 at 7:55 AM, Peter Schuller peter.schul...@infidyne.com
 wrote:

  No I don't think you did, in fact, depending on the size of your SSTable
 a
  contiguous range (or the entire SSTable) may or may not be affected by a
  cleanup/move or any type of topology change. There is lots of room for
  optimization here. After loading the indexes we actually know start/end
  range for an SSTable so we can include/exclude it in any such operation


 Just note that unless there is some correlation between range and
 these sstables being created to begin with (like with leveled), you're
 highly unlikely to be able to optimize here. For uniformly distributed
 tokens (hashed keys), all sstables are likely to have almost the
 entire possible token range in them.


As Peter pointed out, for the random partitioner the rows of a specific range
might be scattered across all sstables.

Unless whole sstables can be ignored, disk seeks are the performance killer
here.




 --
 / Peter Schuller (@scode, http://worldmodscode.wordpress.com)



Re: RFC: Cassandra Virtual Nodes

2012-03-22 Thread Richard Low
On 22 March 2012 05:48, Zhu Han schumi@gmail.com wrote:

 I second it.

 Is there some goals we missed which can not be achieved by assigning
 multiple tokens to a single node?

This is exactly the proposed solution.  The discussion is about how to
implement this, and the methods of choosing tokens and replication
strategy.

Richard.


Re: RFC: Cassandra Virtual Nodes

2012-03-22 Thread Zhu Han
On Thu, Mar 22, 2012 at 6:20 PM, Richard Low r...@acunu.com wrote:

 On 22 March 2012 05:48, Zhu Han schumi@gmail.com wrote:

  I second it.
 
  Is there some goals we missed which can not be achieved by assigning
  multiple tokens to a single node?

 This is exactly the proposed solution.  The discussion is about how to
 implement this, and the methods of choosing tokens and replication
 strategy.


Does the new scheme still require the node to iterate over all sstables to
build the merkle tree or to stream data for partition-level
repair and move?

The disk IO triggered by the above steps could be very time-consuming if the
dataset on a single node is very large.  It could be much more costly than
the network IO, especially when concurrent repair tasks hit the same
node.

Are there any good ideas on this?


 Richard.



Re: RFC: Cassandra Virtual Nodes

2012-03-22 Thread Peter Schuller
 You would have to iterate through all sstables on the system to repair one
 vnode, yes: but building the tree for just one range of the data means that
 huge portions of the sstables files can be skipped. It should scale down
 linearly as the number of vnodes increases (ie, with 100 vnodes, it will
 take 1/100th the time to repair one vnode).

The story is less good for nodetool cleanup however, which still has
to truck over the entire dataset.

(The partitions/buckets in my crush-inspired scheme address this by
allowing each ring segment, in vnode terminology, to be stored
separately in the file system.)

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


Re: RFC: Cassandra Virtual Nodes

2012-03-22 Thread Zhu Han
On Fri, Mar 23, 2012 at 6:54 AM, Peter Schuller peter.schul...@infidyne.com
 wrote:

  You would have to iterate through all sstables on the system to repair
 one
  vnode, yes: but building the tree for just one range of the data means
 that
  huge portions of the sstables files can be skipped. It should scale down
  linearly as the number of vnodes increases (ie, with 100 vnodes, it will
  take 1/100th the time to repair one vnode).


The SSTable indices should still be scanned for size-tiered compaction.
Am I missing anything here?


 The story is less good for nodetool cleanup however, which still has
 to truck over the entire dataset.

 (The partitions/buckets in my crush-inspired scheme addresses this by
 allowing that each ring segment, in vnode terminology, be stored
 separately in the file system.)


But the number of files can be a big problem if there are hundreds of
vnodes and millions of sstables on the same physical node.

We need a way to pin sstable inodes in memory.  Otherwise, the average
number of disk IOs needed to access a row in an sstable could be five
or more.



 --
 / Peter Schuller (@scode, http://worldmodscode.wordpress.com)



Re: RFC: Cassandra Virtual Nodes

2012-03-21 Thread Eric Evans
On Tue, Mar 20, 2012 at 9:53 PM, Jonathan Ellis jbel...@gmail.com wrote:
 It's reasonable that we can attach different levels of importance to
 these things.  Taking a step back, I have two main points:

 1) vnodes add enormous complexity to *many* parts of Cassandra.  I'm
 skeptical of the cost:benefit ratio here.

 1a) The benefit is lower in my mind because many of the problems
 solved by vnodes can be solved "well enough" for "most people", for
 some value of those two phrases, without vnodes.

 2) I'm not okay with a "commit something half-baked and sort it out
 later" approach.

I must admit I find this a little disheartening.  The discussion has
barely started.  No one has had a chance to discuss implementation
specifics so that the rest of us could understand *how* disruptive it
would be (a necessary requirement in weighing cost:benefit), or what
an incremental approach would look like, and yet work has already
begun on shutting this down.

Unless I'm reading you wrong, your mandate (I say mandate because you
hinted at a veto elsewhere) is "No" to anything complex or invasive
(for some value of each).  The only alternative would seem to be a
phased or incremental approach, but you seem to be saying "No" to that
as well.

There seems to be quite a bit of interest in having virtual nodes (and
there has been for as long as I can remember); the only serious
reservations relate to the difficulty/complexity.  Is there really no
way to put our heads together and figure out how to properly manage
that aspect?

 On Tue, Mar 20, 2012 at 11:10 AM, Richard Low r...@acunu.com wrote:
 On 20 March 2012 14:55, Jonathan Ellis jbel...@gmail.com wrote:
 Here's how I see Sam's list:

 * Even load balancing when growing and shrinking the cluster

 Nice to have, but post-bootstrap load balancing works well in practice
 (and is improved by TRP).

 Post-bootstrap load balancing without vnodes necessarily streams more
 data than is necessary.  Vnodes streams the minimal amount.

 In fact, post-bootstrap load balancing currently streams a constant
 fraction of your data - the network traffic involved in a rebalance
 increases linearly with the size of your cluster.  With vnodes it
 decreases linearly.

 Including removing the ops overhead of running the load balance and
 calculating new tokens, this makes removing post-bootstrap load
 balancing a pretty big deal.

 * Greater failure tolerance in streaming

 Directly addressed by TRP.

 Agreed.

 * Evenly distributed impact of streaming operations

 Not a problem in practice with stream throttling.

 Throttling slows them down, increasing rebuild times so increasing downtime.

 * Possibility for active load balancing

 Not really a feature of vnodes per se, but as with the other load
 balancing point, this is also improved by TRP.

 Again with the caveat that more data is streamed with TRP.  Vnodes
 removes the need for any load balancing with RP.

 * Distributed rebuild

 This is the 20% that TRP does not address.  Nice to have?  Yes.  Can I
 live without it?  I have so far.  Is this alone worth the complexity
 of vnodes?  No, it is not.  Especially since there are probably other
 approaches that we can take to mitigate this, one of which Rick has
 suggested in a separate sub-thread.

 Distributed rebuild means you can store more data per node with the
 same failure probabilities.  This is frequently a limiting factor on
 how much data you can store per node, increasing cluster sizes
 unnecessarily.  I'd argue that this alone is worth the complexity of
 vnodes.

 Richard.



 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com



-- 
Eric Evans
Acunu | http://www.acunu.com | @acunu


Re: RFC: Cassandra Virtual Nodes

2012-03-21 Thread Edward Capriolo
On Wed, Mar 21, 2012 at 9:50 AM, Eric Evans eev...@acunu.com wrote:
 On Tue, Mar 20, 2012 at 9:53 PM, Jonathan Ellis jbel...@gmail.com wrote:
 It's reasonable that we can attach different levels of importance to
 these things.  Taking a step back, I have two main points:

 1) vnodes add enormous complexity to *many* parts of Cassandra.  I'm
 skeptical of the cost:benefit ratio here.

 1a) The benefit is lower in my mind because many of the problems
 solved by vnodes can be solved well enough for most people, for
 some value of those two phrases, without vnodes.

 2) I'm not okay with a commit something half-baked and sort it out
 later approach.

 I must admit I find this a little disheartening.  The discussion has
 barely started.  No one has had a chance to discuss implementation
 specifics so that the rest of us could understand *how* disruptive it
 would be (a necessary requirement in weighing cost:benefit), or what
 an incremental approach would look like, and yet work has already
 begun on shutting this down.

 Unless I'm reading you wrong, your mandate (I say mandate because you
 hinted at a veto elsewhere), is No to anything complex or invasive
 (for some value of each).  The only alternative would seem to be a
 phased or incremental approach, but you seem to be saying No to that
 as well.

 There seems to be quite a bit of interest in having virtual nodes (and
 there has been for as long as I can remember), the only serious
 reservations relate to the difficulty/complexity.  Is there really no
 way to put our heads together and figure out how to properly manage
 that aspect?

 On Tue, Mar 20, 2012 at 11:10 AM, Richard Low r...@acunu.com wrote:
 On 20 March 2012 14:55, Jonathan Ellis jbel...@gmail.com wrote:
 Here's how I see Sam's list:

 * Even load balancing when growing and shrinking the cluster

 Nice to have, but post-bootstrap load balancing works well in practice
 (and is improved by TRP).

 Post-bootstrap load balancing without vnodes necessarily streams more
 data than is necessary.  Vnodes streams the minimal amount.

 In fact, post-bootstrap load balancing currently streams a constant
 fraction of your data - the network traffic involved in a rebalance
 increases linearly with the size of your cluster.  With vnodes it
 decreases linearly.

 Including removing the ops overhead of running the load balance and
 calculating new tokens, this makes removing post-bootstrap load
 balancing a pretty big deal.

 * Greater failure tolerance in streaming

 Directly addressed by TRP.

 Agreed.

 * Evenly distributed impact of streaming operations

 Not a problem in practice with stream throttling.

 Throttling slows them down, increasing rebuild times so increasing downtime.

 * Possibility for active load balancing

 Not really a feature of vnodes per se, but as with the other load
 balancing point, this is also improved by TRP.

 Again with the caveat that more data is streamed with TRP.  Vnodes
 removes the need for any load balancing with RP.

 * Distributed rebuild

 This is the 20% that TRP does not address.  Nice to have?  Yes.  Can I
 live without it?  I have so far.  Is this alone worth the complexity
 of vnodes?  No, it is not.  Especially since there are probably other
 approaches that we can take to mitigate this, one of which Rick has
 suggested in a separate sub-thread.

 Distributed rebuild means you can store more data per node with the
 same failure probabilities.  This is frequently a limiting factor on
 how much data you can store per node, increasing cluster sizes
 unnecessarily.  I'd argue that this alone is worth the complexity of
 vnodes.

 Richard.



 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com



 --
 Eric Evans
 Acunu | http://www.acunu.com | @acunu

I have also thought about how I would like vnodes to work from an
operational perspective rather than a software one. I would like these
features:
1) No more raid 0. If a machine is responsible for 4 vnodes they
should correspond to four JBOD disks.

2) Vnodes should be able to be hot plugged. My normal cassandra chassis
would be a 2U with 6 drive bays. Imagine I have 10 nodes. Now if my
chassis dies I should be able to take the disks out and physically
plug them into another chassis. Then in cassandra I should be able to
run a command like:
nodetool attach '/mnt/disk6'. disk6 should contain all data and its
vnode information.

Now this would be awesome for upgrades/migrations/etc.


Re: RFC: Cassandra Virtual Nodes

2012-03-21 Thread Tom Wilkie
Hi Edward

 1) No more raid 0. If a machine is responsible for 4 vnodes they
 should correspond to for JBOD.

So each vnode corresponds to a disk?  I suppose we could have a
separate data directory per disk, but I think this should be a
separate, subsequent change.

However, do note that making the vnode ~size of a disk (and only having
4-8 per machine) would make any non-hotswap rebuilds slower.  To get
the fast distributed rebuilds, you need to have at least as many
vnodes per node as you do nodes in the cluster.  And you would still
need the distributed rebuilds to deal with disk failure.
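A back-of-envelope sketch of that scaling argument (all numbers are made up, and it models only streaming parallelism, not disk or compaction limits):

public class RebuildTimeEstimate
{
    public static void main(String[] args)
    {
        int nodes = 100, dataPerNodeGB = 1000, perPeerStreamMBps = 50;

        for (int vnodesPerNode : new int[] { 4, 32, 128 })
        {
            // roughly one distinct source peer per vnode, capped by cluster size;
            // hence Tom's point that you want about as many vnodes as nodes
            int peers = Math.min(nodes - 1, vnodesPerNode);
            double seconds = dataPerNodeGB * 1024.0 / (peers * perPeerStreamMBps);
            System.out.printf("%3d vnodes/node -> ~%d source peers, rebuild ~%.0f s%n",
                              vnodesPerNode, peers, seconds);
        }
    }
}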

 2) Vnodes should be able to be hot pluged. My normal cassandra chassis
 would be a 2U with 6 drive bays. Imagine I have 10 nodes. Now if my
 chassis dies I should be able to take the disks out and physically
 plug them into another chassis. Then in cassandra I should be able to
 run a command like.
 nodetool attach '/mnt/disk6'. disk6 should contain all data an it's
 vnode information.

 Now this would be awesome for upgrades/migrations/etc.

You know, you're not the first person I've spoken to who has asked for
this!  I do wonder whether it is optimising for the right thing though
- in my experience, disks fail more often than machines.

Thanks

Tom


Re: RFC: Cassandra Virtual Nodes

2012-03-21 Thread Jonathan Ellis
On Wed, Mar 21, 2012 at 8:50 AM, Eric Evans eev...@acunu.com wrote:
 I must admit I find this a little disheartening.  The discussion has
 barely started.  No one has had a chance to discuss implementation
 specifics so that the rest of us could understand *how* disruptive it
 would be (a necessary requirement in weighing cost:benefit), or what
 an incremental approach would look like, and yet work has already
 begun on shutting this down.

This isn't the first time vnodes have been brought up, so I've thought
at least a little bit about what that would entail for TokenMetadata,
StorageProxy, streaming, CFS, and on down the stack.  And it scares
me.  So, I wanted to make my points about cost:benefit and about not
committing with intentions to work out the details later, up front.

But if you're still in brainstorming mode, carry on.

 Unless I'm reading you wrong, your mandate (I say mandate because you
 hinted at a veto elsewhere), is No to anything complex or invasive
 (for some value of each).  The only alternative would seem to be a
 phased or incremental approach, but you seem to be saying No to that
 as well.

 There seems to be quite a bit of interest in having virtual nodes (and
 there has been for as long as I can remember), the only serious
 reservations relate to the difficulty/complexity.  Is there really no
 way to put our heads together and figure out how to properly manage
 that aspect?

At the risk of putting words in your mouth, I think your concern is
that you don't want to go off and put man-months into a vnode
implementation, only to have me come back and say, "Sorry, it's too
complicated, -1."

Which is totally reasonable, I get that.

I would suggest that the best way to mitigate that is, when you are
ready, to put together as detailed an implementation plan as possible
ahead of time before you start generating patchsets.  Then we can put
some meat on the discussion more meaningful than vague "that scares
me" statements from yours truly.

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: RFC: Cassandra Virtual Nodes

2012-03-21 Thread Edward Capriolo
On Wed, Mar 21, 2012 at 3:24 PM, Tom Wilkie t...@acunu.com wrote:
 Hi Edward

 1) No more raid 0. If a machine is responsible for 4 vnodes they
 should correspond to for JBOD.

 So each vnode corresponds to a disk?  I suppose we could have a
 separate data directory per disk, but I think this should be a
 separate, subsequent change.

I think having more micro-ranges makes the process much easier.
Imagine a token ring 0-30:

Node1 | major range 0-10  | disk 1 0-2,   disk 2 3-4,   disk 3 5-7,   disk 4 8-10
Node2 | major range 11-20 | disk 1 11-12, disk 2 13-14, disk 3 15-17, disk 4 18-20
Node3 | major range 21-30 | disk 1 21-22, disk 2 23-24, disk 3 25-27, disk 4 28-30

Adding a 4th node is easy:
If you are at the data center, just take disk 4 out of each node and
place it in the new server :)
Software-wise it is the same deal. Each node streams off only disk 4
to the new node.

Now at this point disk 4 is idle and each machine should rebalance
its own data across its 4 disks.
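A sketch of how a key's token might be routed to a disk under this hypothetical layout (illustration only: the directory names are invented, the sub-ranges are split evenly rather than hand-picked as above, and Cassandra today does not segregate storage by range):

import java.util.*;

final class DiskRangeLayout
{
    private final NavigableMap<Long, String> diskByRangeEnd = new TreeMap<>();

    // split the node's major range (start, end] evenly, one sub-range per data dir
    DiskRangeLayout(long rangeStart, long rangeEnd, List<String> dataDirs)
    {
        long span = rangeEnd - rangeStart;
        for (int i = 0; i < dataDirs.size(); i++)
            diskByRangeEnd.put(rangeStart + span * (i + 1) / dataDirs.size(), dataDirs.get(i));
    }

    String dirForToken(long token)
    {
        Map.Entry<Long, String> e = diskByRangeEnd.ceilingEntry(token);
        if (e == null)
            throw new IllegalArgumentException("token outside this node's range");
        return e.getValue();
    }

    public static void main(String[] args)
    {
        // roughly Node1 from the example above: major range 0-10 over four disks
        DiskRangeLayout node1 = new DiskRangeLayout(0, 10,
                Arrays.asList("/mnt/disk1", "/mnt/disk2", "/mnt/disk3", "/mnt/disk4"));
        System.out.println(node1.dirForToken(9)); // /mnt/disk4
    }
}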

 However, do note that making the vnode ~size of a disk (and only have
 4-8 per machine) would make any non-hotswap rebuilds slower.  To get
 the fast distributed rebuilds, you need to have at least as many
 vnodes per node as you do nodes in the cluster.  And you would still
 need the distributed rebuilds to deal with disk failure.

 2) Vnodes should be able to be hot pluged. My normal cassandra chassis
 would be a 2U with 6 drive bays. Imagine I have 10 nodes. Now if my
 chassis dies I should be able to take the disks out and physically
 plug them into another chassis. Then in cassandra I should be able to
 run a command like.
 nodetool attach '/mnt/disk6'. disk6 should contain all data an it's
 vnode information.

 Now this would be awesome for upgrades/migrations/etc.

 You know, your not the first person I've spoke to who has asked for
 this!  I do wonder whether it is optimising for the right thing though
 - in my experience, disks fail more often than machines.

 Thanks

 Tom


Re: RFC: Cassandra Virtual Nodes

2012-03-21 Thread Peter Schuller
 Software wise it is the same deal. Each node streams off only disk 4
 to the new node.

I think an implication for the software is that if you want to make
specific selections of partitions to move, you are effectively
incompatible with deterministically generating the mapping of
partition to responsible node. I.e., it probably means the vnode
information must be kept as state. It is probably difficult to
reconcile with balancing solutions like consistent hashing/crush/etc.

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


Re: RFC: Cassandra Virtual Nodes

2012-03-21 Thread Edward Capriolo
I just see vnodes as a way to make the problem smaller, and by making the
problem smaller the overall system is more agile; i.e. rather than 1 node
streaming 100 GB, the 4 nodes stream 25 GB each. Moves by hand are not so bad
because they take 1/4 the time.

The simplest vnode implementation is VMware: just make sure that 3
consecutive vm nodes do not end up on the same host. This is wasteful
because we have 4 JVMs.

I envision vnodes as a Cassandra master being a shared cache, memtables, and
manager for what we today consider a Cassandra instance. Makes it simple
to think about.

On Wednesday, March 21, 2012, Peter Schuller peter.schul...@infidyne.com
wrote:
 Software wise it is the same deal. Each node streams off only disk 4
 to the new node.

 I think an implication on software is that if you want to make
 specific selections of partitions to move, you are effectively
 incompatible with deterministically generating the mapping of
 partition to responsible node. I.e., it probably means the vnode
 information must be kept as state. It is probably difficult to
 reconcile with balancing solutions like consistent hashing/crush/etc.

 --
 / Peter Schuller (@scode, http://worldmodscode.wordpress.com)



Re: RFC: Cassandra Virtual Nodes

2012-03-21 Thread Vijay
 I envision vnodes as Cassandra master being a shared cache,memtables,
and manager for what we today consider a Cassandra  instance.

It might be kind of problematic: when you are moving the nodes, you want the
data associated with the node to move too; otherwise it will be a pain to
clean up after that (something like nt clean). I think a vnode should be as
isolated as possible to reduce the impact when it is moving (which
will become a normal cluster operation). Just my 2 cents.

Regards,
/VJ



On Wed, Mar 21, 2012 at 5:41 PM, Edward Capriolo edlinuxg...@gmail.comwrote:

 I just see vnodes as a way to make the problem smaller and by making the
 problem smaller the overall system is more agile. Aka rather then 1 node
 streaming 100 gb the 4 nodes stream 25gb. Moves by hand are not so bad
 because the take 1/4th the time.

 The most simple vnode implementation is vmware.  Just make sure that 3 vm
 nodes consecutive nodes so not end up on the same host. This is wasteful
 because we have 4 jvms.

 I envision vnodes as Cassandra master being a shared cache,memtables, and
 manager for what we today consider a Cassandra  instance. Makes it simple
 to think about.

 On Wednesday, March 21, 2012, Peter Schuller peter.schul...@infidyne.com
 wrote:
  Software wise it is the same deal. Each node streams off only disk 4
  to the new node.
 
  I think an implication on software is that if you want to make
  specific selections of partitions to move, you are effectively
  incompatible with deterministically generating the mapping of
  partition to responsible node. I.e., it probably means the vnode
  information must be kept as state. It is probably difficult to
  reconcile with balancing solutions like consistent hashing/crush/etc.
 
  --
  / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
 



Re: RFC: Cassandra Virtual Nodes

2012-03-21 Thread Jonathan Ellis
A friend pointed out to me privately that I came across pretty harsh
in this thread.  While I stand by my technical concerns, I do want to
acknowledge that Sam's proposal here indicates a strong grasp of the
principles involved, and a deeper level of thought into the issues
than I think anyone else has brought to date.  Thanks for putting that
energy into it, Sam, and I look forward to seeing how you approach the
implementation.

On Fri, Mar 16, 2012 at 6:38 PM, Sam Overton s...@acunu.com wrote:
 Hello cassandra-dev,

 This is a long email. It concerns a significant change to Cassandra, so
 deserves a thorough introduction.

 *The summary is*: we believe virtual nodes are the way forward. We would
 like to add virtual nodes to Cassandra and we are asking for comments,
 criticism and collaboration!

 Cassandra's current partitioning scheme is sub-optimal for bootstrap,
 decommission, repair and re-balance operations, and places the burden on
 users to properly calculate tokens (a common cause of mistakes), which is a
 recurring pain-point.

 Virtual nodes have a variety of benefits over the one-to-one mapping of
 host to key range which Cassandra currently supports.

 Among these benefits are:

 * Even load balancing when growing and shrinking the cluster
 A virtual node scheme ensures that all hosts in a cluster have an even
 portion of the total data, and a new node bootstrapped into the cluster
 will assume its share of the data. Doubling, or halving the cluster to
 ensure even load distribution would no longer be necessary.

 * Distributed rebuild
 When sizing a cluster, one of the considerations is the amount of time
 required to recover from a failed node. This is the exposure time, during
 which a secondary failure could cause data loss. In order to guarantee an
 upper bound on the exposure time, the amount of data which can be stored on
 each host is limited by the amount of time taken to recover the required
 replica count. At Acunu we have found that the exposure time is frequently
 the limiting factor which dictates the maximum allowed node size in
 customers' clusters.

 Using a virtual node scheme, the data stored on one host is not replicated
 on just RF-1 other physical hosts. Each virtual node is replicated to RF-1
 other virtual nodes which may be on a different set of physical hosts to
 replicas of other virtual nodes stored on the same host. This means data
 for one host is replicated evenly across the entire cluster.

 In the event of a failure then, restoring the replica count can be done in
 a fully distributed way. Each host in the cluster participates in the
 rebuild, drastically reducing the exposure time, allowing more data to be
 stored on a single host while still maintaining an acceptable upper bound
 on the likelihood of secondary failure. This reduces TCO concerns.

 * Greater failure tolerance in streaming
 Operations which require streaming of a large range of data, eg. bootstrap,
 decommission, repair, etc. incur a heavy cost if an error (eg. dropped
 network connection) is encountered during the streaming. Currently the
 whole range must be re-streamed, and this could constitute a very large
 amount of data. Virtual nodes reduce the impact of streaming failures,
 since each virtual node is a much smaller range of the key-space, so
 re-streaming a whole virtual node is a much cheaper process.

 * Evenly distributed impact of streaming operations
 Streaming operations such as bootstrap, repair, et al. would involve every
 node in the cluster. This would distribute the load of these operations
 across the whole cluster, and could be staggered so that only a small
 subset of nodes were affected at once, similar to staggered repair[1].

 * Possibility for active load balancing
 Load balancing in Cassandra currently involves moving a token to
 increase/reduce the amount of key-space for which a host is responsible.
 This only allows load balancing between neighbouring nodes, so it could
 involve moving more than one token just to redistribute a single overloaded
 node. Virtual nodes could allow load balancing on a much finer granularity,
 so heavily loaded portions of the key-space could be redistributed to
 lighter-loaded hosts by reassigning one or more virtual nodes.


 Implementing a virtual node scheme in Cassandra is not an insignificant
 amount of work, and it will touch a large amount of the codebase related to
 partitioning, placement, routing, gossip, and so on. We do believe that
 this is possible to do incrementally, and in such a way that there is an
 easy upgrade path for pre-virtual-node deployments.

 It would not however touch the storage layer. The virtual node concept is
 solely for partitioning and placement, not for segregating the data storage
 of the host, so all keys for all virtual nodes on a host would be stored in
 the same SSTables.

 We are not proposing the adoption of the same scheme used by Voldemort[2]
 and described in the Dynamo paper[3]. We 

Re: RFC: Cassandra Virtual Nodes

2012-03-21 Thread Zhu Han
On Tue, Mar 20, 2012 at 11:24 PM, Jeremiah Jordan 
jeremiah.jor...@morningstar.com wrote:

 So taking a step back, if we want vnodes why can't we just give every
 node 100 tokens instead of only one?  Seems to me this would have less
 impact on the rest of the code.  It would just look like you had a 500 node
 cluster, instead of a 5 node cluster.  Your replication strategy would have
 to know about the physical machines so that data gets replicated right, but
 there is already some concept of this with the data center aware and rack
 aware stuff.

 From what I see I think you could get most of the benefits of vnodes by
 implementing a new Placement Strategy that did something like this, and you
 wouldn't have to touch (and maybe break) code in other places.

 Am I crazy? Naive?

 Once you had this setup, you can start to implement the vnode like stuff
 on top of it.  Like bootstrapping nodes in one token at a time, and taking
 them on from the whole cluster, not just your neighbor. etc. etc.


I second it.

Are there some goals we missed which cannot be achieved by assigning
multiple tokens to a single node?



 -Jeremiah Jordan

 
 From: Rick Branson [rbran...@datastax.com]
 Sent: Monday, March 19, 2012 5:16 PM
 To: dev@cassandra.apache.org
 Subject: Re: RFC: Cassandra Virtual Nodes

 I think if we could go back and rebuild Cassandra from scratch, vnodes
 would likely be implemented from the beginning. However, I'm concerned that
 implementing them now could be a big distraction from more productive uses
 of all of our time and introduce major potential stability issues into what
 is becoming a business critical piece of infrastructure for many people.
 However, instead of just complaining and pedantry, I'd like to offer a
 feasible alternative:

 Has there been consideration given to the idea of a supporting a single
 token range for a node?

 While not theoretically as capable as vnodes, it seems to me to be more
 practical as it would have a significantly lower impact on the codebase and
 provides a much clearer migration path. It also seems to solve a majority
 of complaints regarding operational issues with Cassandra clusters.

 Each node would have a lower and an upper token, which would form a range
 that would be actively distributed via gossip. Read and replication
 requests would only be routed to a replica when the key of these operations
 matched the replica's token range in the gossip tables. Each node would
 locally store it's own current active token range as well as a target token
 range it's moving towards.

 As a new node undergoes bootstrap, the bounds would be gradually expanded
 to allow it to handle requests for a wider range of the keyspace as it
 moves towards it's target token range. This idea boils down to a move from
 hard cutovers to smoother operations by gradually adjusting active token
 ranges over a period of time. It would apply to token change operations
 (nodetool 'move' and 'removetoken') as well.

 Failure during streaming could be recovered at the bounds instead of
 restarting the whole process as the active bounds would effectively track
 the progress for bootstrap & target token changes. Implicitly these
 operations would be throttled to some degree. Node repair (AES) could also
 be modified using the same overall ideas provide a more gradual impact on
 the cluster overall similar as the ideas given in CASSANDRA-3721.

 While this doesn't spread the load over the cluster for these operations
 evenly like vnodes does, this is likely an issue that could be worked
 around by performing concurrent (throttled) bootstrap & node repair (AES)
 operations. It does allow some kind of active load balancing, but clearly
 this is not as flexible or as useful as vnodes, but you should be using
 RandomPartitioner or sort-of-randomized keys with OPP right? ;)

 As a side note: vnodes fail to provide solutions to node-based limitations
 that seem to me to cause a substantial portion of operational issues such
 as impact of node restarts / upgrades, GC and compaction induced latency. I
 think some progress could be made here by allowing a pack of independent
 Cassandra nodes to be ran on a single host; somewhat (but nowhere near
 entirely) similar to a pre-fork model used by some UNIX-based servers.

 Input?

 --
 Rick Branson
 DataStax



Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Sam Overton
On 20 March 2012 04:35, Vijay vijay2...@gmail.com wrote:
 On Mon, Mar 19, 2012 at 8:24 PM, Eric Evans eev...@acunu.com wrote:

 I'm guessing you're referring to Rick's proposal about ranges per node?


 May be, what i mean is little more simple than that... We can consider
 every node having a multiple conservative ranges and moving those ranges
 for bootstrap etc, instead of finding the mid point etc in the bootstrap
 code. Once we get that working all the way to the FS/Streaming then we can
 move those ranges and assign those ranges to nodes in random orders. Hope
 it makes sense.

I agree that this should be approached in incremental steps. Rick
already raised concerns about stability issues which might arise from
changing large parts of code at once.

I would anticipate the first step to be, exactly as you suggest, to
support multiple tokens per host instead of just one. Presumably in
your suggestion you imagine these tokens to define contiguous ranges
for a given host, so that the distribution model is the same as
before, but bootstrap can be done incrementally.

This would be a great first step. The extension to a virtual node
scheme as described previously is then fairly trivial. The only
additional change needed is to assign the tokens in some other way
which does not restrict the ranges to being contiguous.
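A minimal sketch of that token assignment for the full vnode step (illustration only, not Cassandra's actual token handling): each host simply draws several tokens at random from the whole ring, so its ranges are no longer contiguous.

import java.util.*;

final class RandomVnodeTokens
{
    // draw numTokens distinct tokens uniformly at random from the whole ring
    static SortedSet<Long> assign(int numTokens, Random rnd)
    {
        SortedSet<Long> tokens = new TreeSet<>();
        while (tokens.size() < numTokens)
            tokens.add(rnd.nextLong()); // retry on the (unlikely) collision
        return tokens;
    }

    public static void main(String[] args)
    {
        System.out.println(assign(8, new Random()));
    }
}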

-- 
Sam Overton
Acunu | http://www.acunu.com | @acunu


Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Eric Evans
On Tue, Mar 20, 2012 at 6:40 AM, Sam Overton s...@acunu.com wrote:
 On 20 March 2012 04:35, Vijay vijay2...@gmail.com wrote:
 On Mon, Mar 19, 2012 at 8:24 PM, Eric Evans eev...@acunu.com wrote:

 I'm guessing you're referring to Rick's proposal about ranges per node?


 May be, what i mean is little more simple than that... We can consider
 every node having a multiple conservative ranges and moving those ranges
 for bootstrap etc, instead of finding the mid point etc in the bootstrap
 code. Once we get that working all the way to the FS/Streaming then we can
 move those ranges and assign those ranges to nodes in random orders. Hope
 it makes sense.

 I agree that this should be approached in incremental steps. Rick
 already raised concerns about stability issues which might arise from
 changing large parts of code at once.

 I would anticipate the first step to be, exactly as you suggest, to
 support multiple tokens per host instead of just one. Presumably in
 your suggestion you imagine these tokens to define contiguous ranges
 for a given host, so that the distribution model is the same as
 before, but bootstrap can be done incrementally.

 This would be a great first step. The extension to a virtual node
 scheme as described previously is then fairly trivial. The only
 additional change needed is to assign the tokens in some other way
 which does not restrict the ranges to being contiguous.

Sounds good to me.

What can an upgrading user expect in the way of disruption?  What
would be required to move an existing cluster from one token per node
to virtual nodes?  Could this be made transparent?

-- 
Eric Evans
Acunu | http://www.acunu.com | @acunu


Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Jonathan Ellis
I like this idea.  It feels like a good 80/20 solution -- 80% of the
benefits, 20% of the effort.  More like 5% of the effort.  I can't
even enumerate all the places full vnode support would change, but an
active token range concept would be relatively limited in scope.

Full vnodes feels a lot more like the counters quagmire, where
Digg/Twitter worked on it for... 8? months, and then DataStax worked
on it for about 6 months post-commit, and we're still finding
the occasional bug-since-0.7 there.  With the benefit of hindsight, as
bad as maintaining that patchset was out of tree, committing it as
early as we did was a mistake.  We won't do that again.  (On the
bright side, git makes maintaining such a patchset easier now.)

On Mon, Mar 19, 2012 at 5:16 PM, Rick Branson rbran...@datastax.com wrote:
 I think if we could go back and rebuild Cassandra from scratch, vnodes
 would likely be implemented from the beginning. However, I'm concerned that
 implementing them now could be a big distraction from more productive uses
 of all of our time and introduce major potential stability issues into what
 is becoming a business critical piece of infrastructure for many people.
 However, instead of just complaining and pedantry, I'd like to offer a
 feasible alternative:

 Has there been consideration given to the idea of a supporting a single
 token range for a node?

 While not theoretically as capable as vnodes, it seems to me to be more
 practical as it would have a significantly lower impact on the codebase and
 provides a much clearer migration path. It also seems to solve a majority
 of complaints regarding operational issues with Cassandra clusters.

 Each node would have a lower and an upper token, which would form a range
 that would be actively distributed via gossip. Read and replication
 requests would only be routed to a replica when the key of these operations
 matched the replica's token range in the gossip tables. Each node would
 locally store it's own current active token range as well as a target token
 range it's moving towards.

 As a new node undergoes bootstrap, the bounds would be gradually expanded
 to allow it to handle requests for a wider range of the keyspace as it
 moves towards it's target token range. This idea boils down to a move from
 hard cutovers to smoother operations by gradually adjusting active token
 ranges over a period of time. It would apply to token change operations
 (nodetool 'move' and 'removetoken') as well.

 Failure during streaming could be recovered at the bounds instead of
 restarting the whole process as the active bounds would effectively track
 the progress for bootstrap & target token changes. Implicitly these
 operations would be throttled to some degree. Node repair (AES) could also
 be modified using the same overall ideas provide a more gradual impact on
 the cluster overall similar as the ideas given in CASSANDRA-3721.

 While this doesn't spread the load over the cluster for these operations
 evenly like vnodes does, this is likely an issue that could be worked
 around by performing concurrent (throttled) bootstrap & node repair (AES)
 operations. It does allow some kind of active load balancing, but clearly
 this is not as flexible or as useful as vnodes, but you should be using
 RandomPartitioner or sort-of-randomized keys with OPP right? ;)

 As a side note: vnodes fail to provide solutions to node-based limitations
 that seem to me to cause a substantial portion of operational issues such
 as impact of node restarts / upgrades, GC and compaction induced latency. I
 think some progress could be made here by allowing a pack of independent
 Cassandra nodes to be ran on a single host; somewhat (but nowhere near
 entirely) similar to a pre-fork model used by some UNIX-based servers.

 Input?

 --
 Rick Branson
 DataStax



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Eric Evans
On Tue, Mar 20, 2012 at 8:39 AM, Jonathan Ellis jbel...@gmail.com wrote:
 I like this idea.  It feels like a good 80/20 solution -- 80% of the
 benefits, 20% of the effort.  More like 5% of the effort.  I can't
 even enumerate all the places full vnode support would change, but an
 active token range concept would be relatively limited in scope.

It only addresses 1 of Sam's original 5 points, so I wouldn't call it
an 80% solution.

 Full vnodes feels a lot more like the counters quagmire, where
 Digg/Twitter worked on it for... 8? months, and then DataStax worked
 on it about for about 6 months post-commit, and we're still finding
 the occasional bug-since-0.7 there.  With the benefit of hindsight, as
 bad as maintaining that patchset was out of tree, committing it as
 early as we did was a mistake.  We won't do that again.  (On the
 bright side, git makes maintaining such a patchset easier now.)

And yet counters have become a very important feature for Cassandra;
we're better off with them than without.

I think there were a number of problems with how counters went down
that could be avoided here.  For one, we can take a phased,
incremental approach, rather than waiting 8 months to drop a large
patchset.

 On Mon, Mar 19, 2012 at 5:16 PM, Rick Branson rbran...@datastax.com wrote:
 I think if we could go back and rebuild Cassandra from scratch, vnodes
 would likely be implemented from the beginning. However, I'm concerned that
 implementing them now could be a big distraction from more productive uses
 of all of our time and introduce major potential stability issues into what
 is becoming a business critical piece of infrastructure for many people.
 However, instead of just complaining and pedantry, I'd like to offer a
 feasible alternative:

 Has there been consideration given to the idea of a supporting a single
 token range for a node?

 While not theoretically as capable as vnodes, it seems to me to be more
 practical as it would have a significantly lower impact on the codebase and
 provides a much clearer migration path. It also seems to solve a majority
 of complaints regarding operational issues with Cassandra clusters.

 Each node would have a lower and an upper token, which would form a range
 that would be actively distributed via gossip. Read and replication
 requests would only be routed to a replica when the key of these operations
 matched the replica's token range in the gossip tables. Each node would
 locally store it's own current active token range as well as a target token
 range it's moving towards.

 As a new node undergoes bootstrap, the bounds would be gradually expanded
 to allow it to handle requests for a wider range of the keyspace as it
 moves towards it's target token range. This idea boils down to a move from
 hard cutovers to smoother operations by gradually adjusting active token
 ranges over a period of time. It would apply to token change operations
 (nodetool 'move' and 'removetoken') as well.

 Failure during streaming could be recovered at the bounds instead of
 restarting the whole process as the active bounds would effectively track
 the progress for bootstrap & target token changes. Implicitly these
 operations would be throttled to some degree. Node repair (AES) could also
 be modified using the same overall ideas provide a more gradual impact on
 the cluster overall similar as the ideas given in CASSANDRA-3721.

 While this doesn't spread the load over the cluster for these operations
 evenly like vnodes does, this is likely an issue that could be worked
 around by performing concurrent (throttled) bootstrap & node repair (AES)
 operations. It does allow some kind of active load balancing, but clearly
 this is not as flexible or as useful as vnodes, but you should be using
 RandomPartitioner or sort-of-randomized keys with OPP right? ;)

 As a side note: vnodes fail to provide solutions to node-based limitations
 that seem to me to cause a substantial portion of operational issues such
 as impact of node restarts / upgrades, GC and compaction induced latency. I
 think some progress could be made here by allowing a pack of independent
 Cassandra nodes to be ran on a single host; somewhat (but nowhere near
 entirely) similar to a pre-fork model used by some UNIX-based servers.

 Input?

 --
 Rick Branson
 DataStax



 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com



-- 
Eric Evans
Acunu | http://www.acunu.com | @acunu


Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Rick Branson
  I like this idea. It feels like a good 80/20 solution -- 80% of the
  benefits, 20% of the effort. More like 5% of the effort. I can't
  even enumerate all the places full vnode support would change, but an
  active token range concept would be relatively limited in scope.
 
 
 It only addresses 1 of Sam's original 5 points, so I wouldn't call it
 an 80% solution.
 
To support a form of DF, I think some tweaking of the replica placement could 
achieve this effect quite well. We could introduce a variable into replica 
placement, which I'm going to incorrectly call DF for the purposes of 
illustration. The key range for a node would be sub-divided by DF (1 by 
default) and this would be used to further distribute replica selection based 
on this sub-partition. 

Currently, the offset formula works out to be something like this:

offset = replica

For RandomPartitioner, DF placement might look something like:

offset = replica + (token % DF)

Now, I realize replica selection is actually much more complicated than this, 
but these formulas are for illustration purposes.

Modifying replica placement & the partitioners to support this seems 
straightforward, but I'm unsure of what's required to get it working for ring 
management operations. On the surface, it does seem like this could be added 
without any kind of difficult migration support. 
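To make the illustration concrete, a minimal sketch of that offset calculation (purely illustrative: real replica selection also has to handle racks, DCs and ring wrap-around, and the endpoint names below are made up):

import java.util.*;

final class DfPlacementSketch
{
    static List<String> replicasFor(long token, List<String> ringOrderedEndpoints, int rf, int df)
    {
        List<String> result = new ArrayList<>(rf);
        int shift = (int) Math.floorMod(token, (long) df);  // the "token % DF" sub-partition
        for (int replica = 0; replica < rf; replica++)
        {
            int offset = replica + shift;                   // offset = replica + (token % DF)
            result.add(ringOrderedEndpoints.get(offset % ringOrderedEndpoints.size()));
        }
        return result;
    }

    public static void main(String[] args)
    {
        List<String> ring = Arrays.asList("A", "B", "C", "D", "E");
        System.out.println(replicasFor(10, ring, 3, 2)); // [A, B, C]
        System.out.println(replicasFor(11, ring, 3, 2)); // [B, C, D]
    }
}

With DF=2 in the example, keys in the two sub-partitions of the same node's range fan out to different replica sets, which is the distributed-rebuild effect being aimed for.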

Thoughts?




Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Sam Overton
On 20 March 2012 13:37, Eric Evans eev...@acunu.com wrote:
 On Tue, Mar 20, 2012 at 6:40 AM, Sam Overton s...@acunu.com wrote:
 On 20 March 2012 04:35, Vijay vijay2...@gmail.com wrote:
 May be, what i mean is little more simple than that... We can consider
 every node having a multiple conservative ranges and moving those ranges
 for bootstrap etc, instead of finding the mid point etc in the bootstrap
 code. Once we get that working all the way to the FS/Streaming then we can
 move those ranges and assign those ranges to nodes in random orders. Hope
 it makes sense.

 I agree that this should be approached in incremental steps. Rick
 already raised concerns about stability issues which might arise from
 changing large parts of code at once.

 I would anticipate the first step to be, exactly as you suggest, to
 support multiple tokens per host instead of just one. Presumably in
 your suggestion you imagine these tokens to define contiguous ranges
 for a given host, so that the distribution model is the same as
 before, but bootstrap can be done incrementally.

 This would be a great first step. The extension to a virtual node
 scheme as described previously is then fairly trivial. The only
 additional change needed is to assign the tokens in some other way
 which does not restrict the ranges to being contiguous.

 Sounds good to me.

 What can an upgrading user expect in the way of disruption?  What
 would be required to move an existing cluster from one token per node
 to virtual nodes?  Could this be made transparent?


The disruption for an end-user would be no more than the same rolling
upgrade process that they have to go through currently to upgrade to a
new version.

This is how I envisage it working:
* When a node is upgraded and the new version starts up in an old
cluster, it would split its own token range into multiple sub-ranges
by assigning itself more tokens in its own range (see the sketch below)
* These tokens could then be gossiped to any other new versions in the
cluster. The old versions don't need to know about these intermediate
tokens because distribution is exactly the same - node ranges are
still contiguous
* Once every node has been upgraded, distribution is still the same as
before, but now ranges are split into sub-ranges
* The benefits of vnodes start to become apparent when adding new
nodes to the cluster - a new node bootstrapping would take an even
amount of data from each other node and would not require doubling the
cluster to maintain balance
* As more nodes are added to the cluster it gets closer to full vnode
distribution as more of the original hosts' ranges get reassigned to
new nodes

If the user wants to migrate to full vnode functionality straight away
then they can do a rolling migration (decommission/bootstrap). During
this migration there would be some imbalance in the cluster, but once
all of the old nodes have been migrated, the cluster would be
balanced.
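A minimal sketch of the range-splitting step from the first bullet above (illustration only; it assumes a simple non-wrapping long token space):

import java.util.*;

final class SplitOwnRange
{
    // split the contiguous range (previousToken, ownToken] this node already
    // owns into numTokens sub-ranges by giving the node extra tokens
    static List<Long> tokensAfterUpgrade(long previousToken, long ownToken, int numTokens)
    {
        List<Long> tokens = new ArrayList<>();
        long span = ownToken - previousToken;
        for (int i = 1; i < numTokens; i++)
            tokens.add(previousToken + span * i / numTokens);
        tokens.add(ownToken); // the original token remains the last sub-range's end
        return tokens;
    }

    public static void main(String[] args)
    {
        // e.g. a node that owned (0, 100] now advertises four sub-ranges
        System.out.println(tokensAfterUpgrade(0, 100, 4)); // [25, 50, 75, 100]
    }
}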

-- 
Sam Overton
Acunu | http://www.acunu.com | @acunu


Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Jonathan Ellis
On Tue, Mar 20, 2012 at 9:08 AM, Eric Evans eev...@acunu.com wrote:
 On Tue, Mar 20, 2012 at 8:39 AM, Jonathan Ellis jbel...@gmail.com wrote:
 I like this idea.  It feels like a good 80/20 solution -- 80% of the
 benefits, 20% of the effort.  More like 5% of the effort.  I can't
 even enumerate all the places full vnode support would change, but an
 active token range concept would be relatively limited in scope.

 It only addresses 1 of Sam's original 5 points, so I wouldn't call it
 an 80% solution.

I guess a more accurate way to put this is, only 20% of Sam's list is
an actual pain point that doesn't get addressed by The Rick Proposal
[TRP].

Here's how I see Sam's list:

* Even load balancing when growing and shrinking the cluster

Nice to have, but post-bootstrap load balancing works well in practice
(and is improved by TRP).

* Greater failure tolerance in streaming

Directly addressed by TRP.

* Evenly distributed impact of streaming operations

Not a problem in practice with stream throttling.

* Possibility for active load balancing

Not really a feature of vnodes per se, but as with the other load
balancing point, this is also improved by TRP.

* Distributed rebuild

This is the 20% that TRP does not address.  Nice to have?  Yes.  Can I
live without it?  I have so far.  Is this alone worth the complexity
of vnodes?  No, it is not.  Especially since there are probably other
approaches that we can take to mitigate this, one of which Rick has
suggested in a separate sub-thread.

 Full vnodes feels a lot more like the counters quagmire, where
 Digg/Twitter worked on it for... 8? months, and then DataStax worked
 on it about for about 6 months post-commit, and we're still finding
 the occasional bug-since-0.7 there.  With the benefit of hindsight, as
 bad as maintaining that patchset was out of tree, committing it as
 early as we did was a mistake.  We won't do that again.  (On the
 bright side, git makes maintaining such a patchset easier now.)

 And yet counters have become a very important feature for Cassandra;
 We're better off with them, than without.

False dichotomy (we could have waited for a better counter design),
but that's mostly irrelevant to my point that jamming incomplete code
in-tree to sort out later is a bad idea.

 I think there were a number of problems with how counters went down
 that could be avoided here.  For one, we can take a phased,
 incremental approach, rather than waiting 8 months to drop a large
 patchset.

If there are incremental improvements to be made that justify
themselves independently, then I agree.  Small, self-contained steps
are a good thing.  A good example is
https://issues.apache.org/jira/browse/CASSANDRA-2319, a product of The
Grand Storage Engine Redesign of 674 fame.

But, when things don't naturally break down into such mini-features,
then I'm -1 on committing code that has no purpose other than to be a
foundation for later commits.  I've seen people get bored or assigned
to other projects too often to just trust that those later commits
will indeed be forthcoming.  Or even if Sam [for instance] is still
working hard on it, it's very easy for unforeseen difficulties to come
up that invalidate the original approach.  Since we were talking about
counters, the original vector clock approach -- that we ended up
ripping out, painfully -- is a good example.  Once bitten, twice shy.

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


RE: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Jeremiah Jordan
So taking a step back, if we want vnodes why can't we just give every node 
100 tokens instead of only one?  Seems to me this would have less impact on the 
rest of the code.  It would just look like you had a 500 node cluster, instead 
of a 5 node cluster.  Your replication strategy would have to know about the 
physical machines so that data gets replicated right, but there is already some 
concept of this with the data center aware and rack aware stuff.

From what I see I think you could get most of the benefits of vnodes by 
implementing a new Placement Strategy that did something like this, and you 
wouldn't have to touch (and maybe break) code in other places.

Am I crazy? Naive?

Once you had this set up, you could start to implement the vnode-like features on top 
of it, like bootstrapping nodes one token at a time and taking data from 
the whole cluster, not just their neighbours, etc.
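
For illustration, a placement strategy along those lines might walk a
multi-token ring and skip tokens whose physical host has already been
chosen; the ring representation here is hypothetical, not Cassandra's
actual AbstractReplicationStrategy API:

    import java.util.*;

    /** Hypothetical sketch of physical-host-aware replica selection over a multi-token ring. */
    public final class MultiTokenPlacement {

        /** ring: sorted token -> physical host owning that token. */
        static List<String> replicasFor(long keyToken, NavigableMap<Long, String> ring, int rf) {
            List<String> replicas = new ArrayList<>();
            Set<String> chosenHosts = new HashSet<>();
            for (Iterator<String> it = ringIterator(ring, keyToken);
                 it.hasNext() && replicas.size() < rf; ) {
                String host = it.next();
                if (chosenHosts.add(host))   // skip tokens on hosts we already picked
                    replicas.add(host);
            }
            return replicas;
        }

        /** Walk the ring starting at the first token at or after the key, wrapping around. */
        private static Iterator<String> ringIterator(NavigableMap<Long, String> ring, long start) {
            List<String> ordered = new ArrayList<>(ring.tailMap(start, true).values());
            ordered.addAll(ring.headMap(start, false).values());
            return ordered.iterator();
        }

        public static void main(String[] args) {
            NavigableMap<Long, String> ring = new TreeMap<>();
            ring.put(10L, "hostA"); ring.put(20L, "hostB"); ring.put(30L, "hostA");
            ring.put(40L, "hostC"); ring.put(50L, "hostB");
            System.out.println(replicasFor(15L, ring, 3)); // [hostB, hostA, hostC]
        }
    }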

-Jeremiah Jordan



Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Sam Overton
On 19 March 2012 23:41, Peter Schuller peter.schul...@infidyne.com wrote:
 Using this ring bucket in the CRUSH topology, (with the hash function
 being the identity function) would give the exact same distribution
 properties as the virtual node strategy that I suggested previously,
 but of course with much better topology awareness.

 I will have to re-read your orignal post. I seem to have missed something :)

 I did, and I may or may not understand what you mean.

 Are you comparing vnodes + hashing, with CRUSH + pre-partitioning by
 hash + identity hash as you traverse down the topology tree?

Yes. I was just trying to illustrate that it's not necessary to have
CRUSH doing the partitioning and placement of primary replicas. The
same functionality can be achieved by having logically separate
placement (a ring with virtual nodes) and a replication strategy which
implements the CRUSH algorithm for replica placement. I think you
agreed with this further down your previous reply anyway, perhaps I
was just being too verbose :)

The reason I'm trying to make that distinction is because it will be
less work than wholesale replacing the entire distribution logic in
Cassandra with CRUSH. I'm not sure if that's exactly what your design
is suggesting?

-- 
Sam Overton
Acunu | http://www.acunu.com | @acunu


Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Richard Low
On 20 March 2012 14:50, Rick Branson rbran...@datastax.com wrote:

 To support a form of DF, I think some tweaking of the replica placement could 
 achieve this effect quite well. We could introduce a variable into replica 
 placement, which I'm going to incorrectly call DF for the purposes of 
 illustration. The key range for a node would be sub-divided by DF (1 by 
 default) and this would be used to further distribute replica selection 
 based on this sub-partition.

 Currently, the offset formula works out to be something like this:

 offset = replica

 For RandomPartitioner, DF placement might look something like:

 offset = replica + (token % DF)

 Now, I realize replica selection is actually much more complicated than this, 
 but these formulas are for illustration purposes.

 Modifying replica placement & the partitioners to support this seems 
 straightforward, but I'm unsure of what's required to get it working for ring 
 management operations. On the surface, it does seem like this could be added 
 without any kind of difficult migration support.

 Thoughts?
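
Read literally, the quoted offset idea is a one-line change to replica
selection; a toy sketch only, with the quoted caveat that real replica
selection is considerably more involved:

    /** Illustrative only: the sub-partitioned replica offset from the proposal above. */
    public final class OffsetSketch {

        /** Current behaviour: replica r sits r ring positions past the primary. */
        static int currentOffset(int replica) {
            return replica;
        }

        /** Proposed: shift by (token mod DF) so one node's range fans out over ~DF replica sets. */
        static int dfOffset(int replica, long token, int df) {
            return replica + (int) Math.floorMod(token, (long) df);
        }

        public static void main(String[] args) {
            // With DF=4, keys in the same primary range map to 4 different replica offsets.
            for (long token = 0; token < 8; token++)
                System.out.println("token " + token + " -> offset " + dfOffset(1, token, 4));
        }
    }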

This solution increases the DF, which has the advantage of providing
some balancing when a node is down temporarily.  The reads and writes
it would have served are now distributed around ~DF nodes.

However, it doesn't have any distributed rebuild.  In fact, any
distribution mechanism with one token per node cannot have distributed
rebuild.  Should a node fail, the next node in the ring has twice the
token range so must have twice the data.  This node will limit the
rebuild time - 'nodetool removetoken' will have to replicate the data
of the failed node onto this node.

Increasing the distribution factor without speeding up rebuild
increases the failure probability - both for data loss or being unable
to reach required consistency levels.  The failure probability is a
trade-off between rebuild time and distribution factor.  Lower rebuild
time helps, and lower distribution factor helps.

Cassandra as it is now has the longest rebuild time and lowest
possible distribution factor.  The original vnodes scheme is the other
extreme - shortest rebuild time and largest possible distribution
factor.  It turns out that the rebuild time is more important, so this
decreases failure probability (with some assumptions you can show it
decreases by a factor RF! - I'll spare you the math but can send it if
you're interested).

This scheme has the longest rebuild time and a (tuneable) distribution
factor larger than the lowest.  That necessarily increases the
failure probability over both Cassandra as it is now and the vnode
schemes, so I'd be very careful about choosing it.

Richard.


Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Peter Schuller
 Each node would have a lower and an upper token, which would form a range
 that would be actively distributed via gossip. Read and replication
 requests would only be routed to a replica when the key of these operations
 matched the replica's token range in the gossip tables. Each node would
 locally store it's own current active token range as well as a target token
 range it's moving towards.

How is this significantly different than just using nodetool move
(in post-1.x) more rapidly and on smaller segments at a time?

There is the ring delay stuff which makes it unworkable to do at high
granularity, but that should apply to the active range solution too.

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Radim Kolar




 Hi Radim,

 The number of virtual nodes for each host would be configurable by the
 user, in much the same way that initial_token is configurable now. A host
 taking a larger number of virtual nodes (tokens) would have proportionately
 more of the data. This is how we anticipate support for heterogeneity in
 cluster hardware.

Yes, but this is good only for the random partitioner. For the ordered one
you need to be able to split the token space on highly loaded servers. With
virtual tokens it will move the load to a random node.
What if the random node is also a hotspot node? Administration will be more
difficult because you don't know where the workload lands after you reduce
the number of tokens held by a node.


Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Sam Overton
Hi Peter,

It's great to hear that others have come to some of the same conclusions!

I think a CRUSH-like strategy for topologically aware
replication/routing/locality is a great idea. I think I can see three
mostly orthogonal sets of functionality that we're concerned with:

a) a virtual node partitioning scheme (to support heterogeneity and
management simplicity)
b) topology aware replication
c) topology aware routing

First of all, I think that while (c) depends on (b) it does not affect
partitioning or replication directly, so I'm going to set that aside
for the moment and talk just about the former two.

I'll summarise your design here, mainly to make sure that I understand
it, but also to refer back to it:

1. The hash-space is partitioned into a fixed number of partitions
2. The CRUSH algorithm is run - select(1, disk) - over the topology
using each partition as a key, to get an assignment of partition ->
physical host (primary)
2a. adding or removing a node requires re-running CRUSH to recalculate
the partition assignment (and move data)
3. The CRUSH algorithm is run - select(RF-1, disk) - over the topology
using each primary host id, to get an assignment of primary host ->
RF-1 replicas
3a. adding or removing a node requires re-running CRUSH to recalculate
replica assignment (which might be a different set of hosts to
before?)

Here are some thoughts:
(clarification: when I'm talking about buckets, I'm referring to the
same concept as in the CRUSH paper!)

One of my concerns about using CRUSH exactly as described in the paper
is that it seems to be sub-optimal in the amount of data that it moves
after modifying the topology. The authors of the paper introduce
several bucket types (uniform, list, tree, straw) which appear to be
various sub-optimal alternatives to consistent hashing, with various
trade-offs. Why not use consistent hashing? Given (2a) and (3a) I
think we might end up moving way too much data when the set of
replicas changes completely for a given host.

Let's suppose we introduce our own bucket type called a ring bucket.
Each item in a ring bucket is assigned an equal, non-contiguous
portion of the key hash-space, which determines which keys are
assigned to it. When an item is added to the ring bucket, it takes an
equal portion of the hash-space from every other item already in the
bucket. And vice-versa for removals. It's easy to see that this ring
bucket implements consistent hashing with some unspecified virtual
node scheme. Additions and removals would be optimal (only Δw/W
keys require moving when the topology changes).
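
A minimal sketch of such a ring bucket, using the familiar
many-points-per-item flavour of consistent hashing as one concrete way to
get the behaviour described above; the hash choice and names are
illustrative only:

    import java.nio.charset.StandardCharsets;
    import java.util.Map;
    import java.util.NavigableMap;
    import java.util.TreeMap;
    import java.util.zip.CRC32;

    /** Illustrative ring bucket: consistent hashing with a fixed number of points per item. */
    public final class RingBucket {
        private final NavigableMap<Long, String> ring = new TreeMap<>();
        private final int pointsPerItem;

        RingBucket(int pointsPerItem) { this.pointsPerItem = pointsPerItem; }

        void addItem(String item) {
            for (int i = 0; i < pointsPerItem; i++)
                ring.put(hash(item + "#" + i), item);   // new item takes a slice from every existing item
        }

        void removeItem(String item) {
            for (int i = 0; i < pointsPerItem; i++)
                ring.remove(hash(item + "#" + i));      // its slices return to the remaining items
        }

        /** Select the item owning this key (CRUSH would call this with a partition id as the key). */
        String select(long key) {
            Map.Entry<Long, String> e = ring.ceilingEntry(key);
            return e != null ? e.getValue() : ring.firstEntry().getValue();
        }

        private static long hash(String s) {
            CRC32 crc = new CRC32();                    // stand-in hash; anything uniform would do
            crc.update(s.getBytes(StandardCharsets.UTF_8));
            return crc.getValue();
        }
    }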

Using this ring bucket in the CRUSH topology, (with the hash function
being the identity function) would give the exact same distribution
properties as the virtual node strategy that I suggested previously,
but of course with much better topology awareness.

This makes it evident that the partitioning scheme, and a CRUSH-like
replication scheme are orthogonal concerns. In the same way as NTS
currently uses the ring to provide distribution at DC and rack level
by conceptually separating the ring into distinct logical rings for
each DC, a CrushReplicationStrategy could use the ring as its
bucketing function to distribute partitions in the topology.

This brings me on to (1) and the reasons for our choice of virtual
node scheme - choose N random tokens - instead of the Dynamo-like
scheme that you suggest where the partitions are fixed in advance.
With the Dynamo scheme, the size of a virtual node partition will only
ever grow as more data is inserted. Since the number of partitions is
fixed when the cluster is created, the partition size is unbounded.

There are certain advantages to having a limit on partition size.
Streaming failures that cause retries do not have to resend so much
data. Streaming operations can be staggered in smaller chunks to
minimise the impact on the nodes involved. Load balancing can operate
on a finer granularity.

In the N tokens per node scheme, adding nodes to the cluster decreases
the partition size and so gives some control about how much data is
stored in each partition. The average size can be reduced by adding
more machines to the cluster.

The other concern you mentioned was
 The probability of data loss increases linearly with cluster size.

but you also acknowledge that

 In making this determination, one must take into account that if a
 larger `DF` makes reconstruction/replacement significantly faster,
 that also decreases the time window in which multiple failures can
 occurr. Increasing `DF` is thus not *necessarily* increasing the total
 probability of data loss (for small values of `DF`).

Our calculations lead us to believe that in fact the shorter rebuild
window more than compensates for the increased probability of multiple
failure, so with DF=N the probability of data loss is minimised.

The CRUSH paper also states:

With 2-way mirroring these two factors cancel
each other out, while overall data safety with 

Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Sam Overton
On 19 March 2012 09:23, Radim Kolar h...@filez.com wrote:


 Hi Radim,

 The number of virtual nodes for each host would be configurable by the
 user, in much the same way that initial_token is configurable now. A host
 taking a larger number of virtual nodes (tokens) would have
 proportionately
 more of the data. This is how we anticipate support for heterogeneity in
 cluster hardware.

 Yes, but this is good only for random partitioner. For ordered you need to
 be able split token space on highly loaded servers. With virtual tokens it
 will move load to random node.
 What if random node will be also hotspot node? Administration will be more
 difficult because you don't know where workload lands after you reduce
 number of tokens held by node.

For OPP we envisage an external management process performing active
load balancing. The initial token assignment would be random within
some user-specified range corresponding to the range of their keys.
The load would then be monitored and hot-spots would be moved by
reassigning virtual nodes to lightly loaded machines, or introducing
new tokens into hot ranges. It makes sense that this would not be a
manual process, but there would certainly be more control than just
increasing or decreasing the number of tokens assigned to a node.
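
For illustration, the balancer loop might look roughly like the sketch
below; every name here is hypothetical and how load is actually measured
is left completely open:

    import java.util.Comparator;
    import java.util.List;

    /** Hypothetical external balancer for an OPP cluster; not a real Cassandra component. */
    public final class OppBalancer {

        record VnodeLoad(String host, long rangeStart, long rangeEnd, double load) {}

        interface Cluster {
            List<VnodeLoad> sampleLoadPerVnode();               // e.g. scraped from JMX metrics
            String leastLoadedHost();
            void reassignVnode(VnodeLoad vnode, String toHost); // move the whole sub-range
            void splitVnode(VnodeLoad vnode, long newToken);    // introduce a new token into a hot range
        }

        static void rebalanceOnce(Cluster cluster, double hotThreshold) {
            VnodeLoad hottest = cluster.sampleLoadPerVnode().stream()
                    .max(Comparator.comparingDouble(VnodeLoad::load))
                    .orElseThrow();
            if (hottest.load() < hotThreshold)
                return;                                          // nothing to do this round
            if (hottest.load() > 2 * hotThreshold)
                // The range itself is hot: introduce a new token in the middle of it first
                // (midpoint computed naively, ignoring overflow and wrap-around).
                cluster.splitVnode(hottest, (hottest.rangeStart() + hottest.rangeEnd()) / 2);
            cluster.reassignVnode(hottest, cluster.leastLoadedHost());
        }
    }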

-- 
Sam Overton
Acunu | http://www.acunu.com | @acunu


Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Edward Capriolo
On Mon, Mar 19, 2012 at 4:15 PM, Sam Overton s...@acunu.com wrote:
 On 19 March 2012 09:23, Radim Kolar h...@filez.com wrote:


 Hi Radim,

 The number of virtual nodes for each host would be configurable by the
 user, in much the same way that initial_token is configurable now. A host
 taking a larger number of virtual nodes (tokens) would have
 proportionately
 more of the data. This is how we anticipate support for heterogeneity in
 cluster hardware.

 Yes, but this is good only for random partitioner. For ordered you need to
 be able split token space on highly loaded servers. With virtual tokens it
 will move load to random node.
 What if random node will be also hotspot node? Administration will be more
 difficult because you don't know where workload lands after you reduce
 number of tokens held by node.

 For OPP we envisage an external management process performing active
 load balancing. The initial token assignment would be random within
 some user-specified range corresponding to the range of their keys.
 The load would then be monitored and hot-spots would be moved by
 reassigning virtual nodes to lightly loaded machines, or introducing
 new tokens into hot ranges. It makes sense that this would not be a
 manual process, but there would certainly be more control than just
 increasing or decreasing the number of tokens assigned to a node.

 --
 Sam Overton
 Acunu | http://www.acunu.com | @acunu

For OPP the problem of load balancing is more profound. Now you need
vnodes per keyspace because you cannot expect each keyspace to have
the same distribution. With three keyspaces you are now unsure as to
which one is causing the hotness. I think OPP should just go away.


Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Sam Overton
 For OPP the problem of load balancing is more profound. Now you need
 vnodes per keyspace because you can not expect each keyspace to have
 the same distribution. With three keyspaces you are not unsure as to
 which was is causing the hotness. I think OPP should just go away.

That's a good point, but isn't that the same problem with trying to
balance tokens with OPP currently?


Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Edward Capriolo
On Mon, Mar 19, 2012 at 4:24 PM, Sam Overton s...@acunu.com wrote:
 For OPP the problem of load balancing is more profound. Now you need
 vnodes per keyspace because you can not expect each keyspace to have
 the same distribution. With three keyspaces you are not unsure as to
 which was is causing the hotness. I think OPP should just go away.

 That's a good point, but isn't that the same problem with trying to
 balance tokens with OPP currently?

Yes. I was bringing this up because the external management process
you suggested performing active load balancing will have to be smart
enough to understand this. Right now, since this is done manually, it is
the user's problem.


Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Rick Branson
I think if we could go back and rebuild Cassandra from scratch, vnodes
would likely be implemented from the beginning. However, I'm concerned that
implementing them now could be a big distraction from more productive uses
of all of our time and introduce major potential stability issues into what
is becoming a business critical piece of infrastructure for many people.
However, instead of just complaining and pedantry, I'd like to offer a
feasible alternative:

Has there been consideration given to the idea of supporting a single
token range for a node?

While not theoretically as capable as vnodes, it seems to me to be more
practical as it would have a significantly lower impact on the codebase and
provides a much clearer migration path. It also seems to solve a majority
of complaints regarding operational issues with Cassandra clusters.

Each node would have a lower and an upper token, which would form a range
that would be actively distributed via gossip. Read and replication
requests would only be routed to a replica when the key of these operations
matched the replica's token range in the gossip tables. Each node would
locally store its own current active token range as well as a target token
range it's moving towards.

As a new node undergoes bootstrap, the bounds would be gradually expanded
to allow it to handle requests for a wider range of the keyspace as it
moves towards its target token range. This idea boils down to a move from
hard cutovers to smoother operations by gradually adjusting active token
ranges over a period of time. It would apply to token change operations
(nodetool 'move' and 'removetoken') as well.
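
A sketch of the per-node state this implies, with the active bounds
creeping toward the target as streaming progresses; class and method
names are hypothetical:

    /** Hypothetical per-node state for the "active token range" idea. */
    public final class ActiveRange {
        private long activeLower, activeUpper;   // range this node currently serves (gossiped)
        private long targetLower, targetUpper;   // range it is moving towards (bootstrap/move)

        ActiveRange(long lower, long upper) {
            activeLower = targetLower = lower;
            activeUpper = targetUpper = upper;
        }

        /** Routing check: only serve reads/replication for keys inside the active range. */
        boolean serves(long token) {
            return token > activeLower && token <= activeUpper;   // ignores ring wrap-around
        }

        /** Called on a 'move'/'removetoken'/bootstrap; assumes the target contains the active range. */
        void setTarget(long lower, long upper) {
            targetLower = lower;
            targetUpper = upper;
        }

        /** Called as each streamed chunk completes: widen the active range one step at a time. */
        void expandTowardsTarget(long step) {
            activeLower = Math.max(targetLower, activeLower - step);
            activeUpper = Math.min(targetUpper, activeUpper + step);
            // On a streaming failure, restart from the current active bounds rather than from scratch.
        }
    }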

Failure during streaming could be recovered at the bounds instead of
restarting the whole process, as the active bounds would effectively track
the progress for bootstrap & target token changes. Implicitly these
operations would be throttled to some degree. Node repair (AES) could also
be modified using the same overall ideas to provide a more gradual impact on
the cluster overall, similar to the ideas given in CASSANDRA-3721.

While this doesn't spread the load over the cluster for these operations
evenly like vnodes does, this is likely an issue that could be worked
around by performing concurrent (throttled) bootstrap & node repair (AES)
operations. It does allow some kind of active load balancing, though clearly
not as flexible or as useful as vnodes; but you should be using
RandomPartitioner or sort-of-randomized keys with OPP, right? ;)

As a side note: vnodes fail to provide solutions to node-based limitations
that seem to me to cause a substantial portion of operational issues such
as impact of node restarts / upgrades, GC and compaction induced latency. I
think some progress could be made here by allowing a pack of independent
Cassandra nodes to be run on a single host; somewhat (but nowhere near
entirely) similar to a pre-fork model used by some UNIX-based servers.

Input?

--
Rick Branson
DataStax


Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Peter Schuller
 a) a virtual node partitioning scheme (to support heterogeneity and
 management simplicity)
 b) topology aware replication
 c) topology aware routing

I would add (d) limiting the distribution factor to decrease the
probability of data loss/multiple failures within a replica set.

 First of all, I think that while (c) depends on (b) it does not affect
 partitioning or replication directly, so I'm going to set that aside
 for the moment and talk just about the former two.

Agreed (but I think (d) relates).

 1. The hash-space is partitioned into a fixed number of partitions
 2. The CRUSH algorithm is run - select(1, disk) - over the topology
 using each partition as a key, to get an assignment of partition -
 physical host (primary)
 2a. adding or removing a node requires re-running CRUSH to recalculate
 the partition assignment (and move data)

(or any other arbitrary change, yes)

 3. The CRUSH algorithm is run - select(RF-1, disk) - over the topology
 using each primary host id, to get an assignment of primary host -
 RF-1 replicas
 3a. adding or removing a node requires re-running CRUSH to recalculate
 replica assignment (which might be a different set of hosts to
 before?)

Yes. Cassandra would have a minimum of two topologies: the current
and the next topology. Each would imply a mapping of partition ->
replica set, and that mapping will potentially be different between
the two.

Reads would always be served from the current topology. Writes would
go to the union of the current and the next topology, taking care to
tie replicas together correctly for consistency level purposes (this
is what CASSANDRA-3901 and CASSANDRA-3833 are talking about).

Any topology change is treated the same from the read/write path
perspective, regardless of whether you're adding a node, removing a
node, adding an entire rack, or even an entire data center. No added
complexity is introduced beyond the base case.
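
Sketched as code, the routing rule during a transition is small; the
names here are hypothetical and the consistency-level bookkeeping
discussed in CASSANDRA-3901/3833 is elided:

    import java.util.LinkedHashSet;
    import java.util.Set;
    import java.util.function.Function;

    /** Hypothetical routing during a topology transition: read from current, write to the union. */
    public final class TransitionRouting {

        static Set<String> readEndpoints(long partition, Function<Long, Set<String>> current) {
            return current.apply(partition);
        }

        static Set<String> writeEndpoints(long partition,
                                          Function<Long, Set<String>> current,
                                          Function<Long, Set<String>> next) {
            Set<String> union = new LinkedHashSet<>(current.apply(partition));
            union.addAll(next.apply(partition));   // identical to reads when no transition is in progress
            return union;
        }
    }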

 One of my concerns about using CRUSH exactly as described in the paper
 is that it seems to be sub-optimal in the amount of data that it moves
 after modifying the topology. The authors of the paper introduce
 several bucket types (uniform, list, tree, straw) which appear to be
 various sub-optimal alternatives to consistent hashing, with various
 trade-offs. Why not use consistent hashing? Given (2a) and (3a) I
 think we might end up moving way too much data when the set of
 replicas changes completely for a given host.

One of the benefits of pre-partitioning to a fixed set of partitions
is that we can pre-calculate the mapping. This removes the CPU
efficiency trade-off of the straw bucket, and the straw bucket would
be a good choice.

Consistent hashing: It's totally doable to use consistent hashing at
each node in the topology. It is not without its own trade-offs
though, because the granularity of weighting you want to support, and
the accuracy of it, relates directly to the number of vnodes per child
you need to keep in your consistent hashing ring. Taking granularity,
accuracy target and number of children into account can easily lead to
very large numbers of vnodes.

(At least experimentally from when I've implemented and played with
the simple form of consistent hashing in the past. I don't currently
have good mathematical evidence.)

 Let's suppose we introduce our own bucket type called a ring bucket.
 Each item in a ring bucket is assigned an equal, non-contiguous
 portion of the key hash-space, which determines which keys are
 assigned to it. When an item is added to the ring bucket, it takes an
 equal portion of the hash-space from every other item already in the
 bucket. And vice-versa for removals. It's easy to see that this ring
 bucket implements consistent hashing with some unspecified virtual
 node scheme. Additions and removals would be optimal (only \deltaw/W
 keys require moving when the topology changes).

This is my understanding of what you meant by consistent hashing, and
what I refer to above.

 Using this ring bucket in the CRUSH topology, (with the hash function
 being the identity function) would give the exact same distribution
 properties as the virtual node strategy that I suggested previously,
 but of course with much better topology awareness.

I will have to re-read your original post. I seem to have missed something :)

 This makes it evident that the partitioning scheme, and a CRUSH-like
 replication scheme are orthogonal concerns. In the same way as NTS
 currently uses the ring to provide distribution at DC and rack level
 by conceptually separating the ring into a distinct logical rings for
 each DC, a CrushReplicationStrategy could use the ring as its
 bucketing function to distribute partitions in the topology.

Yes, agreed. Also, the distribution factor limiting is compatible
with non-CRUSH schemes by hash chaining from the primary replica instead of
the row key.

 This brings me on to (1) and the reasons for our choice of virtual
 node scheme - choose N random tokens - 

Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Peter Schuller
 Using this ring bucket in the CRUSH topology, (with the hash function
 being the identity function) would give the exact same distribution
 properties as the virtual node strategy that I suggested previously,
 but of course with much better topology awareness.

 I will have to re-read your orignal post. I seem to have missed something :)

I did, and I may or may not understand what you mean.

Are you comparing vnodes + hashing, with CRUSH + pre-partitioning by
hash + identity hash as you traverse down the topology tree?

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Peter Schuller
(I may comment on other things more later)

 As a side note: vnodes fail to provide solutions to node-based limitations
 that seem to me to cause a substantial portion of operational issues such
 as impact of node restarts / upgrades, GC and compaction induced latency. I

Actually, it does. At least assuming DF > RF (as in the original
proposal, and mine). The impact of a node suffering from a performance
degradation is mitigated because the effects are spread out over DF-1
(N-1 in the original post) nodes instead of just RF nodes.

 think some progress could be made here by allowing a pack of independent
 Cassandra nodes to be ran on a single host; somewhat (but nowhere near
 entirely) similar to a pre-fork model used by some UNIX-based servers.

I have pretty significant knee-jerk negative reactions to that idea to
be honest, even if the pack is limited to a handful of instances. In
order for vnodes to be useful with random placement, we'd need much
more than a handful of vnodes per node (cassandra instances in a
pack in that model).

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Vijay
I also did create a ticket
https://issues.apache.org/jira/browse/CASSANDRA-3768 with some of the
reasons why I would like to see vnodes in Cassandra.
It can also potentially reduce the SSTable seeks which a node has to do to
query data in SizeTieredCompaction if extended to the filesystem.

But 110% agree with Peter, we need to take incremental steps and start with
the existing bootstrapping.
Maybe we can start by making a set of ranges/tokens per node instead of
one token, and then build things around the movement of those
ranges.
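
Concretely, that first step might be as small as replacing the single
token per endpoint with a set; a hypothetical sketch:

    import java.util.Map;
    import java.util.NavigableSet;
    import java.util.TreeSet;
    import java.util.concurrent.ConcurrentHashMap;

    /** Hypothetical sketch: endpoint -> set of tokens, instead of a single token per endpoint. */
    public final class MultiTokenMetadata {
        private final Map<String, NavigableSet<Long>> tokensByEndpoint = new ConcurrentHashMap<>();

        void addTokens(String endpoint, Iterable<Long> tokens) {
            NavigableSet<Long> set = tokensByEndpoint.computeIfAbsent(endpoint, e -> new TreeSet<>());
            tokens.forEach(set::add);
        }

        /** "Movement of ranges" then reduces to handing a token from one endpoint to another. */
        void moveToken(long token, String from, String to) {
            if (tokensByEndpoint.getOrDefault(from, new TreeSet<>()).remove(token))
                tokensByEndpoint.computeIfAbsent(to, e -> new TreeSet<>()).add(token);
        }
    }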

I have been thinking about this for a while but having trouble getting to a
point where I am comfortable changing big chunks of code.

Regards,
/VJ



On Mon, Mar 19, 2012 at 4:45 PM, Peter Schuller peter.schul...@infidyne.com
 wrote:

 (I may comment on other things more later)

  As a side note: vnodes fail to provide solutions to node-based
 limitations
  that seem to me to cause a substantial portion of operational issues such
  as impact of node restarts / upgrades, GC and compaction induced
 latency. I

 Actually, it does. At least assumign DF  RF (as in the original
 proposal, and mine). The impact of a node suffering from a performance
 degradation is mitigated because the effects are spread out over DF-1
 (N-1 in the original post) nodes instead of just RF nodes.

  think some progress could be made here by allowing a pack of
 independent
  Cassandra nodes to be ran on a single host; somewhat (but nowhere near
  entirely) similar to a pre-fork model used by some UNIX-based servers.

 I have pretty significant knee-jerk negative reactions to that idea to
 be honest, even if the pack is limited to a handful of instances. In
 order for vnodes to be useful with random placement, we'd need much
 more than a handful of vnodes per node (cassandra instances in a
 pack in that model).

 --
 / Peter Schuller (@scode, http://worldmodscode.wordpress.com)



Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Rick Branson
On Mon, Mar 19, 2012 at 4:45 PM, Peter Schuller
peter.schul...@infidyne.com wrote:
  As a side note: vnodes fail to provide solutions to node-based limitations
  that seem to me to cause a substantial portion of operational issues such
  as impact of node restarts / upgrades, GC and compaction induced latency. I

 Actually, it does. At least assumign DF  RF (as in the original
 proposal, and mine). The impact of a node suffering from a performance
 degradation is mitigated because the effects are spread out over DF-1
 (N-1 in the original post) nodes instead of just RF nodes.

You've got me on one of those after some rethinking. For any node
outage (an upgrade/restart) it definitely has a big impact by distributing
the load more evenly, but (and correct me if I'm wrong) for things
like additional latency caused by GC/compaction, those requests will
just be slower rather than timing out or getting redirected via the
dynamic snitch.

  think some progress could be made here by allowing a pack of independent
  Cassandra nodes to be ran on a single host; somewhat (but nowhere near
  entirely) similar to a pre-fork model used by some UNIX-based servers.

 I have pretty significant knee-jerk negative reactions to that idea to
 be honest, even if the pack is limited to a handful of instances. In
 order for vnodes to be useful with random placement, we'd need much
 more than a handful of vnodes per node (cassandra instances in a
 pack in that model).


Fair enough, I'm not super fond of the idea personally, but I don't
see a way around the limitations of the current JVM GC without
multiple processes.

After rethinking my ideas a bit, what I've actually settled on is to
keep the existing node tokens, but add an additional active token that
would be used to determine the data
range that a node is ready to receive reads for. This should gain all
of the benefits highlighted in my earlier post, but with less
complexity in implementation. Node repair (AES) would still allow
ranges to be specified.


Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Eric Evans
On Mon, Mar 19, 2012 at 9:37 PM, Vijay vijay2...@gmail.com wrote:
 I also did create a ticket
 https://issues.apache.org/jira/browse/CASSANDRA-3768 with some of the
 reason why I would like to see vnodes in cassandra.
 It can also potentially reduce the SSTable seeks which a node has to do to
 query data in SizeTireCompaction if extended to the filesystem.

 But 110% agree with Peter, we need to take incremental steps and start with
 the existing bootstrapping.

I'm guessing you're referring to Rick's proposal about ranges per node?

 May be we can start it by making a set of Ranges/Token to a node insted of
 one token. And then may be building things around the movement of those
 ranges.

 I have been thinking about this for a while but having trouble to get to a
 point where i am comfortable changing big chunks of code.

It might help to see some more detail in the proposals.  Both ideas
seem invasive, vnodes more so, but there are many more benefits as
well.

 On Mon, Mar 19, 2012 at 4:45 PM, Peter Schuller peter.schul...@infidyne.com
 wrote:

 (I may comment on other things more later)

  As a side note: vnodes fail to provide solutions to node-based
 limitations
  that seem to me to cause a substantial portion of operational issues such
  as impact of node restarts / upgrades, GC and compaction induced
 latency. I

 Actually, it does. At least assumign DF  RF (as in the original
 proposal, and mine). The impact of a node suffering from a performance
 degradation is mitigated because the effects are spread out over DF-1
 (N-1 in the original post) nodes instead of just RF nodes.

  think some progress could be made here by allowing a pack of
 independent
  Cassandra nodes to be ran on a single host; somewhat (but nowhere near
  entirely) similar to a pre-fork model used by some UNIX-based servers.

 I have pretty significant knee-jerk negative reactions to that idea to
 be honest, even if the pack is limited to a handful of instances. In
 order for vnodes to be useful with random placement, we'd need much
 more than a handful of vnodes per node (cassandra instances in a
 pack in that model).


-- 
Eric Evans
Acunu | http://www.acunu.com | @acunu


Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Vijay
On Mon, Mar 19, 2012 at 8:24 PM, Eric Evans eev...@acunu.com wrote:

 I'm guessing you're referring to Rick's proposal about ranges per node?


Maybe; what I mean is a little simpler than that... We can consider
every node having multiple consecutive ranges and moving those ranges
for bootstrap etc., instead of finding the mid-point etc. in the bootstrap
code. Once we get that working all the way down to the FS/streaming, then we
can move those ranges and assign them to nodes in random order. Hope
it makes sense.


Re: RFC: Cassandra Virtual Nodes

2012-03-17 Thread Radim Kolar

I don't like that every node will have the same portion of data.

1. We are using nodes with different HW sizes (number of disks)
2. Especially with the ordered partitioner there tend to be hotspots and
you must assign a smaller portion of data to nodes holding the hotspots





Re: RFC: Cassandra Virtual Nodes

2012-03-17 Thread Sam Overton
On 17 March 2012 11:15, Radim Kolar h...@filez.com wrote:

 I don't like that every node will have same portion of data.

 1. We are using nodes with different HW sizes (number of disks)
 2.  especially with ordered partitioner there tends to be hotspots and you
 must assign smaller portion of data to nodes holding hotspots


Hi Radim,

The number of virtual nodes for each host would be configurable by the
user, in much the same way that initial_token is configurable now. A host
taking a larger number of virtual nodes (tokens) would have proportionately
more of the data. This is how we anticipate support for heterogeneity in
cluster hardware.

Sam

-- 
Sam Overton
Acunu | http://www.acunu.com | @acunu


Re: RFC: Cassandra Virtual Nodes

2012-03-17 Thread Zhu Han
On Sat, Mar 17, 2012 at 7:38 AM, Sam Overton s...@acunu.com wrote:

 Hello cassandra-dev,

 This is a long email. It concerns a significant change to Cassandra, so
 deserves a thorough introduction.

 *The summary is*: we believe virtual nodes are the way forward. We would
 like to add virtual nodes to Cassandra and we are asking for comments,
 criticism and collaboration!

 Cassandra's current partitioning scheme is sub-optimal for bootstrap,
 decommission, repair and re-balance operations, and places the burden on
 users to properly calculate tokens (a common cause of mistakes), which is a
 recurring pain-point.

 Virtual nodes have a variety of benefits over the one-to-one mapping of
 host to key range which Cassandra currently supports.

 Among these benefits are:

 * Even load balancing when growing and shrinking the cluster
 A virtual node scheme ensures that all hosts in a cluster have an even
 portion of the total data, and a new node bootstrapped into the cluster
 will assume its share of the data. Doubling, or halving the cluster to
 ensure even load distribution would no longer be necessary.

 * Distributed rebuild
 When sizing a cluster, one of the considerations is the amount of time
 required to recover from a failed node. This is the exposure time, during
 which a secondary failure could cause data loss. In order to guarantee an
 upper bound on the exposure time, the amount of data which can be stored on
 each host is limited by the amount of time taken to recover the required
 replica count. At Acunu we have found that the exposure time is frequently
 the limiting factor which dictates the maximum allowed node size in
 customers' clusters.

 Using a virtual node scheme, the data stored on one host is not replicated
 on just RF-1 other physical hosts. Each virtual node is replicated to RF-1
 other virtual nodes which may be on a different set of physical hosts to
 replicas of other virtual nodes stored on the same host. This means data
 for one host is replicated evenly across the entire cluster.

 In the event of a failure then, restoring the replica count can be done in
 a fully distributed way. Each host in the cluster participates in the
 rebuild, drastically reducing the exposure time, allowing more data to be
 stored on a single host while still maintaining an acceptable upper bound
 on the likelihood of secondary failure. This reduces TCO concerns.

 * Greater failure tolerance in streaming
 Operations which require streaming of a large range of data, eg. bootstrap,
 decommission, repair, etc. incur a heavy cost if an error (eg. dropped
 network connection) is encountered during the streaming. Currently the
 whole range must be re-streamed, and this could constitute a very large
 amount of data. Virtual nodes reduce the impact of streaming failures,
 since each virtual node is a much smaller range of the key-space, so
 re-streaming a whole virtual node is a much cheaper process.

 * Evenly distributed impact of streaming operations
 Streaming operations such as bootstrap, repair, et al. would involve every
 node in the cluster. This would distribute the load of these operations
 across the whole cluster, and could be staggered so that only a small
 subset of nodes were affected at once, similar to staggered repair[1].

 * Possibility for active load balancing
 Load balancing in Cassandra currently involves moving a token to
 increase/reduce the amount of key-space for which a host is responsible.
 This only allows load balancing between neighbouring nodes, so it could
 involve moving more than one token just to redistribute a single overloaded
 node. Virtual nodes could allow load balancing on a much finer granularity,
 so heavily loaded portions of the key-space could be redistributed to
 lighter-loaded hosts by reassigning one or more virtual nodes.


 Implementing a virtual node scheme in Cassandra is not an insignificant
 amount of work, and it will touch a large amount of the codebase related to
 partitioning, placement, routing, gossip, and so on. We do believe that
 this is possible to do incrementally, and in such a way that there is an
 easy upgrade path for pre-virtual-node deployments.

 It would not however touch the storage layer. The virtual node concept is
 solely for partitioning and placement, not for segregating the data storage
 of the host, so all keys for all virtual nodes on a host would be stored in
 the same SSTables.

 We are not proposing the adoption of the same scheme used by Voldemort[2]
 and described in the Dynamo paper[3]. We feel this scheme is too different
 from Cassandra's current distribution model to be a viable target for
 incremental development. Their scheme also fixes the number of virtual
 nodes for the lifetime of the cluster, which can prove to be a ceiling to
 scaling the cluster if the virtual nodes grow too large.

 The proposed design is:
 * Assign each host T random tokens.


Will it work well for OrderedPartitioner?

For load balance 

Re: RFC: Cassandra Virtual Nodes

2012-03-17 Thread Eric Evans
On Sat, Mar 17, 2012 at 11:15 AM, Radim Kolar h...@filez.com wrote:
 I don't like that every node will have same portion of data.

 1. We are using nodes with different HW sizes (number of disks)
 2.  especially with ordered partitioner there tends to be hotspots and you
 must assign smaller portion of data to nodes holding hotspots

Yes, these are exactly the sorts of problems that virtual nodes are
meant to make easier.

-- 
Eric Evans
Acunu | http://www.acunu.com | @acunu


Re: RFC: Cassandra Virtual Nodes

2012-03-17 Thread Edward Capriolo
I agree having smaller regions would help the rebalancing situation both
with RP and BOP. However I am not sure if dividing tables across disks
will give any better performance. You will have more seeking spindles and
can possibly sub-divide token ranges into separate files. But FS cache will
get shared across all disks so that is a wash.

On Saturday, March 17, 2012, Eric Evans eev...@acunu.com wrote:
 On Sat, Mar 17, 2012 at 11:15 AM, Radim Kolar h...@filez.com wrote:
 I don't like that every node will have same portion of data.

 1. We are using nodes with different HW sizes (number of disks)
 2.  especially with ordered partitioner there tends to be hotspots and
you
 must assign smaller portion of data to nodes holding hotspots

 Yes, these are exactly the sorts of problems that virtual nodes are
 meant to make easier.

 --
 Eric Evans
 Acunu | http://www.acunu.com | @acunu



Re: RFC: Cassandra Virtual Nodes

2012-03-17 Thread Eric Evans
On Sat, Mar 17, 2012 at 3:22 PM, Zhu Han schumi@gmail.com wrote:
 On Sat, Mar 17, 2012 at 7:38 AM, Sam Overton s...@acunu.com wrote:
 This is a long email. It concerns a significant change to Cassandra, so
 deserves a thorough introduction.

 *The summary is*: we believe virtual nodes are the way forward. We would
 like to add virtual nodes to Cassandra and we are asking for comments,
 criticism and collaboration!

 Cassandra's current partitioning scheme is sub-optimal for bootstrap,
 decommission, repair and re-balance operations, and places the burden on
 users to properly calculate tokens (a common cause of mistakes), which is a
 recurring pain-point.

 Virtual nodes have a variety of benefits over the one-to-one mapping of
 host to key range which Cassandra currently supports.

 Among these benefits are:

 * Even load balancing when growing and shrinking the cluster
 A virtual node scheme ensures that all hosts in a cluster have an even
 portion of the total data, and a new node bootstrapped into the cluster
 will assume its share of the data. Doubling, or halving the cluster to
 ensure even load distribution would no longer be necessary.

 * Distributed rebuild
 When sizing a cluster, one of the considerations is the amount of time
 required to recover from a failed node. This is the exposure time, during
 which a secondary failure could cause data loss. In order to guarantee an
 upper bound on the exposure time, the amount of data which can be stored on
 each host is limited by the amount of time taken to recover the required
 replica count. At Acunu we have found that the exposure time is frequently
 the limiting factor which dictates the maximum allowed node size in
 customers' clusters.

 Using a virtual node scheme, the data stored on one host is not replicated
 on just RF-1 other physical hosts. Each virtual node is replicated to RF-1
 other virtual nodes which may be on a different set of physical hosts to
 replicas of other virtual nodes stored on the same host. This means data
 for one host is replicated evenly across the entire cluster.

 In the event of a failure then, restoring the replica count can be done in
 a fully distributed way. Each host in the cluster participates in the
 rebuild, drastically reducing the exposure time, allowing more data to be
 stored on a single host while still maintaining an acceptable upper bound
 on the likelihood of secondary failure. This reduces TCO concerns.

 * Greater failure tolerance in streaming
 Operations which require streaming of a large range of data, eg. bootstrap,
 decommission, repair, etc. incur a heavy cost if an error (eg. dropped
 network connection) is encountered during the streaming. Currently the
 whole range must be re-streamed, and this could constitute a very large
 amount of data. Virtual nodes reduce the impact of streaming failures,
 since each virtual node is a much smaller range of the key-space, so
 re-streaming a whole virtual node is a much cheaper process.

 * Evenly distributed impact of streaming operations
 Streaming operations such as bootstrap, repair, et al. would involve every
 node in the cluster. This would distribute the load of these operations
 across the whole cluster, and could be staggered so that only a small
 subset of nodes were affected at once, similar to staggered repair[1].

 * Possibility for active load balancing
 Load balancing in Cassandra currently involves moving a token to
 increase/reduce the amount of key-space for which a host is responsible.
 This only allows load balancing between neighbouring nodes, so it could
 involve moving more than one token just to redistribute a single overloaded
 node. Virtual nodes could allow load balancing on a much finer granularity,
 so heavily loaded portions of the key-space could be redistributed to
 lighter-loaded hosts by reassigning one or more virtual nodes.


 Implementing a virtual node scheme in Cassandra is not an insignificant
 amount of work, and it will touch a large amount of the codebase related to
 partitioning, placement, routing, gossip, and so on. We do believe that
 this is possible to do incrementally, and in such a way that there is an
 easy upgrade path for pre-virtual-node deployments.

 It would not however touch the storage layer. The virtual node concept is
 solely for partitioning and placement, not for segregating the data storage
 of the host, so all keys for all virtual nodes on a host would be stored in
 the same SSTables.

 We are not proposing the adoption of the same scheme used by Voldemort[2]
 and described in the Dynamo paper[3]. We feel this scheme is too different
 from Cassandra's current distribution model to be a viable target for
 incremental development. Their scheme also fixes the number of virtual
 nodes for the lifetime of the cluster, which can prove to be a ceiling to
 scaling the cluster if the virtual nodes grow too large.

 The proposed design is:
 * Assign each host T random tokens.


 Will it work 

Re: RFC: Cassandra Virtual Nodes

2012-03-17 Thread Peter Schuller
 *The summary is*: we believe virtual nodes are the way forward. We would
 like to add virtual nodes to Cassandra and we are asking for comments,
 criticism and collaboration!

I am very happy to see some momentum on this, and I would like to go
even further than what you propose. The main reasons why I do not
think simply adding vnodes and making random assignments is the best
end goal are:

(1) The probability of data loss increases linearly with cluster size.
(2) It does not take network topology into account.

What follows is mostly a half-finished long text that I have been
sitting on for a few months but not finished/posted. Unfortunately I
do not have the possibility (due to time constraints) to go through
everything in detail and update with current information and to
specifically address what you already said, so there will be overlap
with your original post. However given your E-Mail and the momentum in
this thread, I really wanted to post something rather than not. It
would be awesome if interested parties had a chance to read the
referenced CRUSH paper, and the deltas proposed below.

The goals are to address everything you already wanted in your post,
while also addressing:

(1) Probability of data loss
(2) Network topology awareness

The following text will first try to paint a picture of the goals that
I have in mind, and then go on to the actual proposed solution. The
proposition is very very short and undetailed now and there is plenty
of discussion and details to fill in. I apologize, but again, I really
want to post something now that this is being brought up.

BEGIN un-polished text (we = I):=

= CRUSHing Cassandra

Author: Peter Schuller peter.schul...@infidyne.com

This is a proposal for a significant re-design of some fundamentals of
Cassandra, aimed at addressing a number of current issues as well as
anticipating future issues. It is particularly aimed at large
clusters, but as a side-effect should improve the small cluster
experience as well.

== New terminology: Distribution factor

A Cassandra cluster is today said to have `N` nodes, and data is
replicated at a particular replication factor (`RF`). The placement of
replicas is such that all rows that have a certain node `N1` as their
primary replica are located on a specific set of `RF-1` other
nodes. In addition, it holds secondary replicas of data for `RF-1`
other nodes. In total, it shares data with `2RF - 2` other nodes.

The number of nodes with whom a node shares data is the distribution
factor. In the case of Cassandra, `DF = 2RF - 2` (for example, with
`RF = 3` each node shares data with 4 others).

== Goals

The goals this suggestion attempts to help solve include the following.

=== Goal: DF should not be tied to RF, nor N

`DF` is important for these reasons:

* The `DF` determines how many nodes are involved in re-constructing a
  lost node after failure; the higher the `DF`, the less of a
  performance impact a reconstruction has on remaining nodes.
* The `DF` determines the impact on other nodes, with respect to
  read/write load, of a node being down.
* The `DF` affects the probability of multiple failures causing data
  loss, since one loses data if any `RF` nodes within a group of `DF`
  nodes all go down.

Having `DF` tied to `RF` like Cassandra does now has its problems. A
single node failure has a significant effect on the performance
characteristics of neighboring nodes (in terms relative to the normal
level of load on the neighbors).

On large data sets, a failed node needing reconstruction is a
significant event, as it

* Increases the load on neighbors just from going down.
* Further increases the load on neighbors as they have to stream data
  (adding I/O and cache thrashing).

This typically leads to the desire to throttle / rate limit
reconstruction, which adds to the reconstruction window in addition to
the fact that it was already naturally bottlenecking on neighbors.

The other extreme is to tie `DF` to `N`, such that the data contained
on one node has its secondary replicas spread out over the entire
ring. This is an unacceptable choice because the probability of multiple
failures increases linearly with the cluster size.

In other words, we want `DF` to be tied to neither `RF` nor
`N`. Rather, `DF` should be chosen as a trade-off between the effects
of `DF`:

* The higher the `DF`, the higher the probability of data loss in case
  of multiple failures.
* The higher the `DF`, the faster to reconstruct/replace a lost node.
* The higher the `DF`, the less impact is seen on node failures on the
  performance requirements on other nodes.

In making this determination, one must take into account that if a
larger `DF` makes reconstruction/replacement significantly faster,
that also decreases the time window in which multiple failures can
occur. Increasing `DF` is thus not *necessarily* increasing the total
probability of data loss (for small values of `DF`).

=== Goal: Topologically aware redundancy

We maintain the goal of being topology aware for the purpose 

Re: RFC: Cassandra Virtual Nodes

2012-03-17 Thread Peter Schuller
Point of clarification: My use of the term bucket is completely
unrelated to the term bucket used in the CRUSH paper.

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)