Re: RFC: Cassandra Virtual Nodes

2012-04-10 Thread Sam Overton
There is now a parent ticket for this issue in JIRA: https://issues.apache.org/jira/browse/CASSANDRA-4119 Comments and contributions are still welcome! Cheers, Sam On 16 March 2012 23:38, Sam Overton s...@acunu.com wrote: Hello cassandra-dev, This is a long email. It concerns a significant

Re: RFC: Cassandra Virtual Nodes

2012-03-23 Thread Ben Coverston
The SSTable indices should still be scanned for size tiered compaction. Do I miss anything here? No I don't think you did, in fact, depending on the size of your SSTable a contiguous range (or the entire SSTable) may or may not be affected by a cleanup/move or any type of topology change.

Re: RFC: Cassandra Virtual Nodes

2012-03-23 Thread Peter Schuller
No I don't think you did, in fact, depending on the size of your SSTable a contiguous range (or the entire SSTable) may or may not be affected by a cleanup/move or any type of topology change. There is lots of room for optimization here. After loading the indexes we actually know start/end

Re: RFC: Cassandra Virtual Nodes

2012-03-23 Thread Zhu Han
On Sat, Mar 24, 2012 at 7:55 AM, Peter Schuller peter.schul...@infidyne.com wrote: No I don't think you did, in fact, depending on the size of your SSTable a contiguous range (or the entire SSTable) may or may not be affected by a cleanup/move or any type of topology change. There is lots

Re: RFC: Cassandra Virtual Nodes

2012-03-22 Thread Richard Low
On 22 March 2012 05:48, Zhu Han schumi@gmail.com wrote: I second it. Are there some goals we missed which cannot be achieved by assigning multiple tokens to a single node? This is exactly the proposed solution. The discussion is about how to implement this, and the methods of choosing

Re: RFC: Cassandra Virtual Nodes

2012-03-22 Thread Zhu Han
On Thu, Mar 22, 2012 at 6:20 PM, Richard Low r...@acunu.com wrote: On 22 March 2012 05:48, Zhu Han schumi@gmail.com wrote: I second it. Are there some goals we missed which cannot be achieved by assigning multiple tokens to a single node? This is exactly the proposed solution.

Re: RFC: Cassandra Virtual Nodes

2012-03-22 Thread Peter Schuller
You would have to iterate through all sstables on the system to repair one vnode, yes: but building the tree for just one range of the data means that huge portions of the sstables files can be skipped. It should scale down linearly as the number of vnodes increases (ie, with 100 vnodes, it

Re: RFC: Cassandra Virtual Nodes

2012-03-22 Thread Zhu Han
On Fri, Mar 23, 2012 at 6:54 AM, Peter Schuller peter.schul...@infidyne.com wrote: You would have to iterate through all sstables on the system to repair one vnode, yes: but building the tree for just one range of the data means that huge portions of the sstables files can be skipped. It

Re: RFC: Cassandra Virtual Nodes

2012-03-21 Thread Eric Evans
On Tue, Mar 20, 2012 at 9:53 PM, Jonathan Ellis jbel...@gmail.com wrote: It's reasonable that we can attach different levels of importance to these things.  Taking a step back, I have two main points: 1) vnodes add enormous complexity to *many* parts of Cassandra.  I'm skeptical of the

Re: RFC: Cassandra Virtual Nodes

2012-03-21 Thread Edward Capriolo
On Wed, Mar 21, 2012 at 9:50 AM, Eric Evans eev...@acunu.com wrote: On Tue, Mar 20, 2012 at 9:53 PM, Jonathan Ellis jbel...@gmail.com wrote: It's reasonable that we can attach different levels of importance to these things.  Taking a step back, I have two main points: 1) vnodes add enormous

Re: RFC: Cassandra Virtual Nodes

2012-03-21 Thread Tom Wilkie
Hi Edward 1) No more raid 0. If a machine is responsible for 4 vnodes they should correspond to for JBOD. So each vnode corresponds to a disk? I suppose we could have a separate data directory per disk, but I think this should be a separate, subsequent change. However, do note that making

Re: RFC: Cassandra Virtual Nodes

2012-03-21 Thread Jonathan Ellis
On Wed, Mar 21, 2012 at 8:50 AM, Eric Evans eev...@acunu.com wrote: I must admit I find this a little disheartening.  The discussion has barely started.  No one has had a chance to discuss implementation specifics so that the rest of us could understand *how* disruptive it would be (a

Re: RFC: Cassandra Virtual Nodes

2012-03-21 Thread Edward Capriolo
On Wed, Mar 21, 2012 at 3:24 PM, Tom Wilkie t...@acunu.com wrote: Hi Edward 1) No more raid 0. If a machine is responsible for 4 vnodes they should correspond to for JBOD. So each vnode corresponds to a disk?  I suppose we could have a separate data directory per disk, but I think this

Re: RFC: Cassandra Virtual Nodes

2012-03-21 Thread Peter Schuller
Software wise it is the same deal. Each node streams off only disk 4 to the new node. I think an implication on software is that if you want to make specific selections of partitions to move, you are effectively incompatible with deterministically generating the mapping of partition to

Re: RFC: Cassandra Virtual Nodes

2012-03-21 Thread Edward Capriolo
I just see vnodes as a way to make the problem smaller, and by making the problem smaller the overall system is more agile. That is, rather than 1 node streaming 100 GB, 4 nodes stream 25 GB each. Moves by hand are not so bad because they take 1/4th the time. The simplest vnode implementation is VMware.

Re: RFC: Cassandra Virtual Nodes

2012-03-21 Thread Vijay
I envision vnodes as Cassandra master being a shared cache, memtables, and manager for what we today consider a Cassandra instance. It might be kind of problematic: when you are moving the nodes, you want the data associated with the node to move too; otherwise it will be a pain to clean up after

Re: RFC: Cassandra Virtual Nodes

2012-03-21 Thread Jonathan Ellis
A friend pointed out to me privately that I came across pretty harsh in this thread. While I stand by my technical concerns, I do want to acknowledge that Sam's proposal here indicates a strong grasp of the principles involved, and a deeper level of thought into the issues than I think anyone

Re: RFC: Cassandra Virtual Nodes

2012-03-21 Thread Zhu Han
From: Rick Branson [rbran...@datastax.com] Sent: Monday, March 19, 2012 5:16 PM To: dev@cassandra.apache.org Subject: Re: RFC: Cassandra Virtual Nodes I think if we could go back and rebuild Cassandra from scratch, vnodes would likely be implemented from the beginning. However, I'm concerned

Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Sam Overton
On 20 March 2012 04:35, Vijay vijay2...@gmail.com wrote: On Mon, Mar 19, 2012 at 8:24 PM, Eric Evans eev...@acunu.com wrote: I'm guessing you're referring to Rick's proposal about ranges per node? Maybe; what I mean is a little simpler than that... We can consider every node having a

Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Eric Evans
On Tue, Mar 20, 2012 at 6:40 AM, Sam Overton s...@acunu.com wrote: On 20 March 2012 04:35, Vijay vijay2...@gmail.com wrote: On Mon, Mar 19, 2012 at 8:24 PM, Eric Evans eev...@acunu.com wrote: I'm guessing you're referring to Rick's proposal about ranges per node? Maybe; what I mean is

Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Jonathan Ellis
I like this idea. It feels like a good 80/20 solution -- 80% of the benefits, 20% of the effort. More like 5% of the effort. I can't even enumerate all the places full vnode support would change, but an active token range concept would be relatively limited in scope. Full vnodes feels a lot

Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Eric Evans
On Tue, Mar 20, 2012 at 8:39 AM, Jonathan Ellis jbel...@gmail.com wrote: I like this idea.  It feels like a good 80/20 solution -- 80% of the benefits, 20% of the effort.  More like 5% of the effort.  I can't even enumerate all the places full vnode support would change, but an active token

Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Rick Branson
I like this idea. It feels like a good 80/20 solution -- 80% of the benefits, 20% of the effort. More like 5% of the effort. I can't even enumerate all the places full vnode support would change, but an active token range concept would be relatively limited in scope. It only addresses

Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Sam Overton
On 20 March 2012 13:37, Eric Evans eev...@acunu.com wrote: On Tue, Mar 20, 2012 at 6:40 AM, Sam Overton s...@acunu.com wrote: On 20 March 2012 04:35, Vijay vijay2...@gmail.com wrote: Maybe; what I mean is a little simpler than that... We can consider every node having multiple

Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Jonathan Ellis
On Tue, Mar 20, 2012 at 9:08 AM, Eric Evans eev...@acunu.com wrote: On Tue, Mar 20, 2012 at 8:39 AM, Jonathan Ellis jbel...@gmail.com wrote: I like this idea.  It feels like a good 80/20 solution -- 80% of the benefits, 20% of the effort.  More like 5% of the effort.  I can't even enumerate

RE: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Jeremiah Jordan
neighbor. etc. etc. -Jeremiah Jordan From: Rick Branson [rbran...@datastax.com] Sent: Monday, March 19, 2012 5:16 PM To: dev@cassandra.apache.org Subject: Re: RFC: Cassandra Virtual Nodes I think if we could go back and rebuild Cassandra from scratch, vnodes

Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Sam Overton
On 19 March 2012 23:41, Peter Schuller peter.schul...@infidyne.com wrote: Using this ring bucket in the CRUSH topology, (with the hash function being the identity function) would give the exact same distribution properties as the virtual node strategy that I suggested previously, but of course

Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Richard Low
On 20 March 2012 14:50, Rick Branson rbran...@datastax.com wrote: To support a form of DF, I think some tweaking of the replica placement could achieve this effect quite well. We could introduce a variable into replica placement, which I'm going to incorrectly call DF for the purposes of

Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Peter Schuller
Each node would have a lower and an upper token, which would form a range that would be actively distributed via gossip. Read and replication requests would only be routed to a replica when the key of these operations matched the replica's token range in the gossip tables. Each node would
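The routing rule described in this message can be sketched roughly as follows. This is only an illustration under stated assumptions, not code from Cassandra or from the thread: the names `in_range`, `eligible_replicas` and `gossip_ranges` are hypothetical, tokens are plain integers, and the range is assumed to be half-open, `(lower, upper]`, with wrap-around when `lower >= upper`.

```python
def in_range(token, lower, upper):
    """Return True if token lies in (lower, upper] on a ring.

    Assumption: ranges are half-open and may wrap past the end of the ring.
    A range with lower == upper is treated as covering the whole ring.
    """
    if lower < upper:
        return lower < token <= upper
    # Wrapping range: everything after lower, plus everything up to upper.
    return token > lower or token <= upper


def eligible_replicas(token, replicas, gossip_ranges):
    """Keep only the replicas whose gossiped active range contains the token.

    gossip_ranges is a hypothetical stand-in for the gossip tables:
    a dict mapping node name -> (lower_token, upper_token).
    Replicas with no gossiped range are skipped.
    """
    result = []
    for node in replicas:
        bounds = gossip_ranges.get(node)
        if bounds is not None and in_range(token, *bounds):
            result.append(node)
    return result
```

For example, with `gossip_ranges = {"n1": (0, 50), "n2": (80, 20)}`, a request for token 10 could be routed to both `n1` and `n2`, while a request for token 60 would be routed to neither.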

Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Radim Kolar
Hi Radim, The number of virtual nodes for each host would be configurable by the user, in much the same way that initial_token is configurable now. A host taking a larger number of virtual nodes (tokens) would have proportionately more of the data. This is how we anticipate support for
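A minimal sketch of the idea in this message, that a host taking more tokens owns proportionately more of the data. Everything here is an assumption made for illustration: the 64-bit token space, random token selection, the MD5-based key hash, and the helper names `build_ring`, `owner` and `ownership` are not taken from the proposal or from Cassandra's code.

```python
import bisect
import hashlib
import random

RING_SIZE = 2 ** 64  # assumed token space, for illustration only


def build_ring(vnodes_per_host, seed=0):
    """Give each host its configured number of random tokens.

    Returns a list of (token, host) pairs sorted by token.
    """
    rng = random.Random(seed)
    ring = []
    for host, count in sorted(vnodes_per_host.items()):
        for _ in range(count):
            ring.append((rng.randrange(RING_SIZE), host))
    ring.sort()
    return ring


def owner(ring, key):
    """Return the host owning key: first token at or after the key's hash, wrapping."""
    h = int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")
    tokens = [t for t, _ in ring]
    i = bisect.bisect_left(tokens, h) % len(ring)
    return ring[i][1]


def ownership(ring):
    """Return the fraction of the token space owned by each host."""
    if len(ring) == 1:
        return {ring[0][1]: 1.0}
    shares = {}
    for i, (token, host) in enumerate(ring):
        prev = ring[i - 1][0]  # for i == 0 this wraps to the last token
        width = (token - prev) % RING_SIZE
        shares[host] = shares.get(host, 0.0) + width / RING_SIZE
    return shares
```

With `build_ring({"small": 256, "large": 512})`, `ownership` should report roughly one third of the ring for `small` and two thirds for `large`; the exact split varies with the random tokens chosen.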

Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Sam Overton
Hi Peter, It's great to hear that others have come to some of the same conclusions! I think a CRUSH-like strategy for topologically aware replication/routing/locality is a great idea. I think I can see three mostly orthogonal sets of functionality that we're concerned with: a) a virtual node

Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Sam Overton
On 19 March 2012 09:23, Radim Kolar h...@filez.com wrote: Hi Radim, The number of virtual nodes for each host would be configurable by the user, in much the same way that initial_token is configurable now. A host taking a larger number of virtual nodes (tokens) would have proportionately

Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Edward Capriolo
On Mon, Mar 19, 2012 at 4:15 PM, Sam Overton s...@acunu.com wrote: On 19 March 2012 09:23, Radim Kolar h...@filez.com wrote: Hi Radim, The number of virtual nodes for each host would be configurable by the user, in much the same way that initial_token is configurable now. A host taking a

Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Sam Overton
For OPP the problem of load balancing is more profound. Now you need vnodes per keyspace because you cannot expect each keyspace to have the same distribution. With three keyspaces you are now unsure as to which one is causing the hotness. I think OPP should just go away. That's a good

Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Edward Capriolo
On Mon, Mar 19, 2012 at 4:24 PM, Sam Overton s...@acunu.com wrote: For OPP the problem of load balancing is more profound. Now you need vnodes per keyspace because you cannot expect each keyspace to have the same distribution. With three keyspaces you are now unsure as to which one is causing

Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Rick Branson
I think if we could go back and rebuild Cassandra from scratch, vnodes would likely be implemented from the beginning. However, I'm concerned that implementing them now could be a big distraction from more productive uses of all of our time and introduce major potential stability issues into what

Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Peter Schuller
a) a virtual node partitioning scheme (to support heterogeneity and management simplicity) b) topology aware replication c) topology aware routing I would add (d) limiting the distribution factor to decrease the probability of data loss/multiple failures within a replica set. First of all,

Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Peter Schuller
Using this ring bucket in the CRUSH topology, (with the hash function being the identity function) would give the exact same distribution properties as the virtual node strategy that I suggested previously, but of course with much better topology awareness. I will have to re-read your

Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Peter Schuller
(I may comment on other things more later) As a side note: vnodes fail to provide solutions to node-based limitations that seem to me to cause a substantial portion of operational issues such as impact of node restarts / upgrades, GC and compaction induced latency. I Actually, it does. At

Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Vijay
I also created a ticket, https://issues.apache.org/jira/browse/CASSANDRA-3768, with some of the reasons why I would like to see vnodes in Cassandra. It can also potentially reduce the SSTable seeks which a node has to do to query data in SizeTieredCompaction if extended to the filesystem. But 110%

Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Rick Branson
On Mon, Mar 19, 2012 at 4:45 PM, Peter Schuller peter.schul...@infidyne.com wrote: As a side note: vnodes fail to provide solutions to node-based limitations that seem to me to cause a substantial portion of operational issues such as impact of node restarts / upgrades, GC and compaction

Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Eric Evans
On Mon, Mar 19, 2012 at 9:37 PM, Vijay vijay2...@gmail.com wrote: I also created a ticket, https://issues.apache.org/jira/browse/CASSANDRA-3768, with some of the reasons why I would like to see vnodes in Cassandra. It can also potentially reduce the SSTable seeks which a node has to do to

Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Vijay
On Mon, Mar 19, 2012 at 8:24 PM, Eric Evans eev...@acunu.com wrote: I'm guessing you're referring to Rick's proposal about ranges per node? Maybe; what I mean is a little simpler than that... We can consider every node having multiple conservative ranges and moving those ranges for

Re: RFC: Cassandra Virtual Nodes

2012-03-17 Thread Radim Kolar
I don't like that every node will have the same portion of data. 1. We are using nodes with different HW sizes (number of disks). 2. Especially with the ordered partitioner, there tend to be hotspots, and you must assign a smaller portion of data to nodes holding hotspots.

Re: RFC: Cassandra Virtual Nodes

2012-03-17 Thread Sam Overton
On 17 March 2012 11:15, Radim Kolar h...@filez.com wrote: I don't like that every node will have the same portion of data. 1. We are using nodes with different HW sizes (number of disks). 2. Especially with the ordered partitioner, there tend to be hotspots, and you must assign a smaller portion of

Re: RFC: Cassandra Virtual Nodes

2012-03-17 Thread Zhu Han
On Sat, Mar 17, 2012 at 7:38 AM, Sam Overton s...@acunu.com wrote: Hello cassandra-dev, This is a long email. It concerns a significant change to Cassandra, so deserves a thorough introduction. *The summary is*: we believe virtual nodes are the way forward. We would like to add virtual

Re: RFC: Cassandra Virtual Nodes

2012-03-17 Thread Eric Evans
On Sat, Mar 17, 2012 at 11:15 AM, Radim Kolar h...@filez.com wrote: I don't like that every node will have the same portion of data. 1. We are using nodes with different HW sizes (number of disks). 2. Especially with the ordered partitioner, there tend to be hotspots, and you must assign a smaller

Re: RFC: Cassandra Virtual Nodes

2012-03-17 Thread Edward Capriolo
I agree having smaller regions would help the rebalancing situation both with RP and BOP. However, I am not sure if dividing tables across disks will give any better performance. You will have more seeking spindles and can possibly subdivide token ranges into separate files. But fs cache will

Re: RFC: Cassandra Virtual Nodes

2012-03-17 Thread Eric Evans
On Sat, Mar 17, 2012 at 3:22 PM, Zhu Han schumi@gmail.com wrote: On Sat, Mar 17, 2012 at 7:38 AM, Sam Overton s...@acunu.com wrote: This is a long email. It concerns a significant change to Cassandra, so deserves a thorough introduction. *The summary is*: we believe virtual nodes are the

Re: RFC: Cassandra Virtual Nodes

2012-03-17 Thread Peter Schuller
*The summary is*: we believe virtual nodes are the way forward. We would like to add virtual nodes to Cassandra and we are asking for comments, criticism and collaboration! I am very happy to see some momentum on this, and I would like to go even further than what you propose. The main reasons

Re: RFC: Cassandra Virtual Nodes

2012-03-17 Thread Peter Schuller
Point of clarification: My use of the term bucket is completely unrelated to the term bucket used in the CRUSH paper. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)