There is now a parent ticket for this issue in JIRA:
https://issues.apache.org/jira/browse/CASSANDRA-4119
Comments and contributions are still welcome!
Cheers,
Sam
On 16 March 2012 23:38, Sam Overton s...@acunu.com wrote:
Hello cassandra-dev,
This is a long email. It concerns a significant
The SSTable indices should still be scanned for size-tiered compaction.
Did I miss anything here?
No, I don't think you did. In fact, depending on the size of your SSTable, a
contiguous range (or the entire SSTable) may or may not be affected by a
cleanup/move or any type of topology change. There is lots of room for
optimization here. After loading the indexes we actually know the start/end
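To make the optimization concrete: once an sstable's index is loaded we know its minimum and maximum token, so a cleanup or move only needs to rewrite the sstables whose token bounds overlap the affected range. A minimal Java sketch, assuming a non-wrapping range (SSTableMeta and TokenRange are hypothetical stand-ins, not Cassandra's real classes):

    import java.math.BigInteger;
    import java.util.List;

    public class SSTableSkip {
        /** Hypothetical per-sstable metadata: token bounds known once the index is loaded. */
        record SSTableMeta(String filename, BigInteger minToken, BigInteger maxToken) {}

        /** A half-open token range (start, end], assumed non-wrapping for brevity. */
        record TokenRange(BigInteger start, BigInteger end) {
            boolean intersects(SSTableMeta s) {
                // Overlap between (start, end] and [minToken, maxToken].
                return s.minToken().compareTo(end) <= 0
                    && s.maxToken().compareTo(start) > 0;
            }
        }

        /** Only sstables overlapping the moved range need rewriting; the rest are skipped. */
        static List<SSTableMeta> affectedBy(TokenRange moved, List<SSTableMeta> all) {
            return all.stream().filter(moved::intersects).toList();
        }

        public static void main(String[] args) {
            var s1 = new SSTableMeta("a-1-Data.db", BigInteger.ZERO, BigInteger.valueOf(100));
            var s2 = new SSTableMeta("a-2-Data.db", BigInteger.valueOf(500), BigInteger.valueOf(900));
            System.out.println(affectedBy(new TokenRange(BigInteger.valueOf(50),
                    BigInteger.valueOf(150)), List.of(s1, s2))); // only a-1-Data.db overlaps
        }
    }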
On Sat, Mar 24, 2012 at 7:55 AM, Peter Schuller peter.schul...@infidyne.com wrote:
No, I don't think you did. In fact, depending on the size of your SSTable, a
contiguous range (or the entire SSTable) may or may not be affected by a
cleanup/move or any type of topology change. There is lots
On 22 March 2012 05:48, Zhu Han schumi@gmail.com wrote:
I second it.
Are there any goals we missed which cannot be achieved by assigning
multiple tokens to a single node?
This is exactly the proposed solution. The discussion is about how to
implement this, and the methods of choosing
On Thu, Mar 22, 2012 at 6:20 PM, Richard Low r...@acunu.com wrote:
On 22 March 2012 05:48, Zhu Han schumi@gmail.com wrote:
I second it.
Are there any goals we missed which cannot be achieved by assigning
multiple tokens to a single node?
This is exactly the proposed solution.
You would have to iterate through all sstables on the system to repair one
vnode, yes: but building the tree for just one range of the data means that
huge portions of the sstable files can be skipped. It should scale down
linearly as the number of vnodes increases (i.e., with 100 vnodes, it
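To illustrate the skipping: the row index of an sstable is sorted by token, so repairing one vnode range needs only two binary searches to bound the bytes the validation scan must read. A sketch under that assumption (IndexEntry is illustrative, not Cassandra's actual validation compaction):

    import java.math.BigInteger;
    import java.util.List;

    public class VnodeRepairSlice {
        /** Hypothetical index entry: a row's token and its offset in the data file. */
        record IndexEntry(BigInteger token, long offset) {}

        /** Index position of the first entry with token >= t (lower bound). */
        static int lowerBound(List<IndexEntry> index, BigInteger t) {
            int lo = 0, hi = index.size();
            while (lo < hi) {
                int mid = (lo + hi) >>> 1;
                if (index.get(mid).token().compareTo(t) < 0) lo = mid + 1;
                else hi = mid;
            }
            return lo;
        }

        /** Byte range [from, to) holding tokens in [start, end); the rest is never read. */
        static long[] sliceFor(List<IndexEntry> index, long fileLength,
                               BigInteger start, BigInteger end) {
            int first = lowerBound(index, start);
            int afterLast = lowerBound(index, end);
            long from = first < index.size() ? index.get(first).offset() : fileLength;
            long to = afterLast < index.size() ? index.get(afterLast).offset() : fileLength;
            return new long[] { from, to };
        }
    }

With V evenly-sized vnodes, each such slice is roughly 1/V of the file, which is where the linear scaling comes from.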
On Fri, Mar 23, 2012 at 6:54 AM, Peter Schuller peter.schul...@infidyne.com wrote:
You would have to iterate through all sstables on the system to repair one
vnode, yes: but building the tree for just one range of the data means that
huge portions of the sstable files can be skipped. It
On Tue, Mar 20, 2012 at 9:53 PM, Jonathan Ellis jbel...@gmail.com wrote:
It's reasonable that we can attach different levels of importance to
these things. Taking a step back, I have two main points:
1) vnodes add enormous complexity to *many* parts of Cassandra. I'm
skeptical of the
On Wed, Mar 21, 2012 at 9:50 AM, Eric Evans eev...@acunu.com wrote:
On Tue, Mar 20, 2012 at 9:53 PM, Jonathan Ellis jbel...@gmail.com wrote:
It's reasonable that we can attach different levels of importance to
these things. Taking a step back, I have two main points:
1) vnodes add enormous
Hi Edward
1) No more RAID 0. If a machine is responsible for 4 vnodes, they
should correspond to four JBOD disks.
So each vnode corresponds to a disk? I suppose we could have a
separate data directory per disk, but I think this should be a
separate, subsequent change.
However, do note that making
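One way to read the vnode-per-disk idea, as a hypothetical sketch rather than an existing API: with one data directory per JBOD disk, pin each vnode to a directory, so a failed or decommissioned disk corresponds to a fixed set of vnodes to stream.

    import java.nio.file.Path;
    import java.util.List;

    public class VnodeDiskPlacement {
        // Round-robin pinning: vnode i always flushes to directory i mod N, so a
        // single disk's contents correspond to a fixed set of vnodes. Hypothetical.
        static Path directoryFor(int vnodeIndex, List<Path> dataDirectories) {
            return dataDirectories.get(vnodeIndex % dataDirectories.size());
        }
    }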
On Wed, Mar 21, 2012 at 8:50 AM, Eric Evans eev...@acunu.com wrote:
I must admit I find this a little disheartening. The discussion has
barely started. No one has had a chance to discuss implementation
specifics so that the rest of us could understand *how* disruptive it
would be (a
On Wed, Mar 21, 2012 at 3:24 PM, Tom Wilkie t...@acunu.com wrote:
Hi Edward
1) No more RAID 0. If a machine is responsible for 4 vnodes, they
should correspond to four JBOD disks.
So each vnode corresponds to a disk? I suppose we could have a
separate data directory per disk, but I think this
Software-wise it is the same deal. Each node streams off only disk 4
to the new node.
I think an implication for the software is that if you want to make
specific selections of partitions to move, you are effectively
incompatible with deterministically generating the mapping of
partition to
I just see vnodes as a way to make the problem smaller; by making the
problem smaller, the overall system is more agile. I.e., rather than 1 node
streaming 100 GB, the 4 nodes stream 25 GB each. Moves by hand are not so bad
because they take 1/4th the time.
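The back-of-envelope behind the 1/4th claim, with an assumed per-stream rate of 40 MB/s (illustrative numbers only):

    public class StreamTime {
        public static void main(String[] args) {
            double totalGb = 100, mbPerSec = 40;
            double oneSourceSec = totalGb * 1024 / mbPerSec;          // ~2560 s serial
            double fourSourcesSec = (totalGb / 4) * 1024 / mbPerSec;  // ~640 s in parallel
            System.out.printf("1 source: %.0f s, 4 parallel sources: %.0f s%n",
                    oneSourceSec, fourSourcesSec);
        }
    }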
The simplest vnode implementation is VMware.
I envision vnodes as a Cassandra master being a shared cache, memtables,
and manager for what we today consider a Cassandra instance.
It might be kind of problematic: when you are moving nodes, you want the
data associated with the node to move too; otherwise it will be a pain to
clean up after
A friend pointed out to me privately that I came across pretty harsh
in this thread. While I stand by my technical concerns, I do want to
acknowledge that Sam's proposal here indicates a strong grasp of the
principles involved, and a deeper level of thought into the issues
than I think anyone
From: Rick Branson [rbran...@datastax.com]
Sent: Monday, March 19, 2012 5:16 PM
To: dev@cassandra.apache.org
Subject: Re: RFC: Cassandra Virtual Nodes
I think if we could go back and rebuild Cassandra from scratch, vnodes
would likely be implemented from the beginning. However, I'm concerned
On 20 March 2012 04:35, Vijay vijay2...@gmail.com wrote:
On Mon, Mar 19, 2012 at 8:24 PM, Eric Evans eev...@acunu.com wrote:
I'm guessing you're referring to Rick's proposal about ranges per node?
Maybe what I mean is a little simpler than that... We can consider
every node having a
On Tue, Mar 20, 2012 at 6:40 AM, Sam Overton s...@acunu.com wrote:
On 20 March 2012 04:35, Vijay vijay2...@gmail.com wrote:
On Mon, Mar 19, 2012 at 8:24 PM, Eric Evans eev...@acunu.com wrote:
I'm guessing you're referring to Rick's proposal about ranges per node?
Maybe what I mean is
I like this idea. It feels like a good 80/20 solution -- 80% of the
benefits, 20% of the effort. More like 5% of the effort. I can't
even enumerate all the places full vnode support would change, but an
active token range concept would be relatively limited in scope.
Full vnodes feels a lot
On Tue, Mar 20, 2012 at 8:39 AM, Jonathan Ellis jbel...@gmail.com wrote:
I like this idea. It feels like a good 80/20 solution -- 80% of the
benefits, 20% of the effort. More like 5% of the effort. I can't
even enumerate all the places full vnode support would change, but an
active token
I like this idea. It feels like a good 80/20 solution -- 80% of the
benefits, 20% of the effort. More like 5% of the effort. I can't
even enumerate all the places full vnode support would change, but an
active token range concept would be relatively limited in scope.
It only addresses
On 20 March 2012 13:37, Eric Evans eev...@acunu.com wrote:
On Tue, Mar 20, 2012 at 6:40 AM, Sam Overton s...@acunu.com wrote:
On 20 March 2012 04:35, Vijay vijay2...@gmail.com wrote:
Maybe what I mean is a little simpler than that... We can consider
every node having multiple
On Tue, Mar 20, 2012 at 9:08 AM, Eric Evans eev...@acunu.com wrote:
On Tue, Mar 20, 2012 at 8:39 AM, Jonathan Ellis jbel...@gmail.com wrote:
I like this idea. It feels like a good 80/20 solution -- 80% of the
benefits, 20% of the effort. More like 5% of the effort. I can't
even enumerate
neighbor. etc. etc.
-Jeremiah Jordan
From: Rick Branson [rbran...@datastax.com]
Sent: Monday, March 19, 2012 5:16 PM
To: dev@cassandra.apache.org
Subject: Re: RFC: Cassandra Virtual Nodes
I think if we could go back and rebuild Cassandra from scratch, vnodes
On 19 March 2012 23:41, Peter Schuller peter.schul...@infidyne.com wrote:
Using this ring bucket in the CRUSH topology (with the hash function
being the identity function) would give the exact same distribution
properties as the virtual node strategy that I suggested previously,
but of course
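For readers who have not seen the term: a toy ring bucket is just a successor lookup on sorted positions, so with an identity hash it reproduces token-ring placement exactly. A TreeMap sketch (not one of the CRUSH paper's actual bucket implementations):

    import java.math.BigInteger;
    import java.util.TreeMap;

    public class RingBucket {
        private final TreeMap<BigInteger, String> ring = new TreeMap<>();

        void add(String item, BigInteger position) { ring.put(position, item); }

        /** Select the item owning position x: next entry clockwise, wrapping. */
        String select(BigInteger x) {
            var e = ring.ceilingEntry(x);
            return (e != null ? e : ring.firstEntry()).getValue(); // assumes non-empty ring
        }
    }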
On 20 March 2012 14:50, Rick Branson rbran...@datastax.com wrote:
To support a form of DF, I think some tweaking of the replica placement could
achieve this effect quite well. We could introduce a variable into replica
placement, which I'm going to incorrectly call DF for the purposes of
Each node would have a lower and an upper token, which would form a range
that would be actively distributed via gossip. Read and replication
requests would only be routed to a replica when the key of these operations
matched the replica's token range in the gossip tables. Each node would
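A minimal sketch of that routing check, assuming a non-wrapping range (ReplicaState is a hypothetical stand-in for the gossiped application state):

    import java.math.BigInteger;

    public class ActiveRangeRouting {
        /** Hypothetical gossiped state: a replica's active [lower, upper] token range. */
        record ReplicaState(String endpoint, BigInteger lower, BigInteger upper) {}

        /** The coordinator only routes a request here if the key's token is in range. */
        static boolean isActiveFor(ReplicaState replica, BigInteger token) {
            // A real ring must also handle ranges that wrap past the maximum token.
            return token.compareTo(replica.lower()) >= 0
                && token.compareTo(replica.upper()) <= 0;
        }
    }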
Hi Radim,
The number of virtual nodes for each host would be configurable by the
user, in much the same way that initial_token is configurable now. A host
taking a larger number of virtual nodes (tokens) would have proportionately
more of the data. This is how we anticipate support for
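For illustration, random assignment of a configurable number of tokens could look like the sketch below; the num_tokens name is an assumption, chosen to echo initial_token, and the ring is RandomPartitioner's [0, 2^127):

    import java.math.BigInteger;
    import java.util.SortedSet;
    import java.util.TreeSet;
    import java.util.concurrent.ThreadLocalRandom;

    public class TokenAllocation {
        /** Pick numTokens distinct uniform positions on the 2^127 ring. */
        static SortedSet<BigInteger> randomTokens(int numTokens) {
            SortedSet<BigInteger> tokens = new TreeSet<>();
            while (tokens.size() < numTokens) {
                // 127 random bits -> uniform token in [0, 2^127).
                tokens.add(new BigInteger(127, ThreadLocalRandom.current()));
            }
            return tokens;
        }

        public static void main(String[] args) {
            // A host configured with 2x the tokens owns ~2x the data in expectation.
            System.out.println(randomTokens(4));
        }
    }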
Hi Peter,
It's great to hear that others have come to some of the same conclusions!
I think a CRUSH-like strategy for topologically aware
replication/routing/locality is a great idea. I think I can see three
mostly orthogonal sets of functionality that we're concerned with:
a) a virtual node
On 19 March 2012 09:23, Radim Kolar h...@filez.com wrote:
Hi Radim,
The number of virtual nodes for each host would be configurable by the
user, in much the same way that initial_token is configurable now. A host
taking a larger number of virtual nodes (tokens) would have proportionately
On Mon, Mar 19, 2012 at 4:15 PM, Sam Overton s...@acunu.com wrote:
On 19 March 2012 09:23, Radim Kolar h...@filez.com wrote:
Hi Radim,
The number of virtual nodes for each host would be configurable by the
user, in much the same way that initial_token is configurable now. A host
taking a
For OPP the problem of load balancing is more profound. Now you need
vnodes per keyspace because you cannot expect each keyspace to have
the same distribution. With three keyspaces you are now unsure as to
which one is causing the hotness. I think OPP should just go away.
That's a good
On Mon, Mar 19, 2012 at 4:24 PM, Sam Overton s...@acunu.com wrote:
For OPP the problem of load balancing is more profound. Now you need
vnodes per keyspace because you cannot expect each keyspace to have
the same distribution. With three keyspaces you are now unsure as to
which one is causing
I think if we could go back and rebuild Cassandra from scratch, vnodes
would likely be implemented from the beginning. However, I'm concerned that
implementing them now could be a big distraction from more productive uses
of all of our time and introduce major potential stability issues into what
a) a virtual node partitioning scheme (to support heterogeneity and
management simplicity)
b) topology aware replication
c) topology aware routing
I would add (d) limiting the distribution factor to decrease the
probability of data loss/multiple failures within a replica set.
First of all,
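My reading of (d), as a toy model rather than a proposed API: cap the number of distinct peers (the DF) that may hold replicas of any one node's data, so that concurrent failures can only lose a range when they all land inside one small window. Assumes df >= rf - 1 and clusterSize > df:

    import java.util.ArrayList;
    import java.util.List;

    public class BoundedDF {
        /** Replica partners for one vnode, drawn only from a DF-sized window. */
        static List<Integer> partnersFor(int owner, int vnode, int clusterSize, int df, int rf) {
            List<Integer> partners = new ArrayList<>();
            for (int i = 0; i < rf - 1; i++) {
                int offset = 1 + (vnode + i) % df; // offsets stay within 1..df
                partners.add((owner + offset) % clusterSize);
            }
            return partners;
        }

        public static void main(String[] args) {
            // With df=4 and rf=3, node 0's replicas only ever land on nodes 1..4,
            // no matter how many vnodes it carries.
            for (int v = 0; v < 6; v++)
                System.out.println("vnode " + v + " -> " + partnersFor(0, v, 100, 4, 3));
        }
    }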
Using this ring bucket in the CRUSH topology (with the hash function
being the identity function) would give the exact same distribution
properties as the virtual node strategy that I suggested previously,
but of course with much better topology awareness.
I will have to re-read your
(I may comment on other things more later)
As a side note: vnodes fail to provide solutions to node-based limitations
that seem to me to cause a substantial portion of operational issues such
as impact of node restarts / upgrades, GC and compaction induced latency. I
Actually, it does. At
I also did create a ticket
https://issues.apache.org/jira/browse/CASSANDRA-3768 with some of the
reason why I would like to see vnodes in cassandra.
It can also potentially reduce the number of SSTable seeks a node has to do
to query data in SizeTieredCompaction if extended to the filesystem.
But 110%
On Mon, Mar 19, 2012 at 4:45 PM, Peter Schuller peter.schul...@infidyne.com wrote:
As a side note: vnodes fail to provide solutions to node-based limitations
that seem to me to cause a substantial portion of operational issues such
as impact of node restarts / upgrades, GC and compaction
On Mon, Mar 19, 2012 at 9:37 PM, Vijay vijay2...@gmail.com wrote:
I also did create a ticket
https://issues.apache.org/jira/browse/CASSANDRA-3768 with some of the
reason why I would like to see vnodes in cassandra.
It can also potentially reduce the number of SSTable seeks a node has to do to
On Mon, Mar 19, 2012 at 8:24 PM, Eric Evans eev...@acunu.com wrote:
I'm guessing you're referring to Rick's proposal about ranges per node?
Maybe what I mean is a little simpler than that... We can consider
every node having multiple conservative ranges and moving those ranges
for
I don't like that every node will have the same portion of data.
1. We are using nodes with different HW sizes (number of disks).
2. Especially with the ordered partitioner there tend to be hotspots, and
you must assign a smaller portion of data to nodes holding hotspots.
On 17 March 2012 11:15, Radim Kolar h...@filez.com wrote:
I don't like that every node will have the same portion of data.
1. We are using nodes with different HW sizes (number of disks).
2. Especially with the ordered partitioner there tend to be hotspots and you
must assign a smaller portion of
On Sat, Mar 17, 2012 at 7:38 AM, Sam Overton s...@acunu.com wrote:
Hello cassandra-dev,
This is a long email. It concerns a significant change to Cassandra, so
deserves a thorough introduction.
*The summary is*: we believe virtual nodes are the way forward. We would
like to add virtual
On Sat, Mar 17, 2012 at 11:15 AM, Radim Kolar h...@filez.com wrote:
I don't like that every node will have the same portion of data.
1. We are using nodes with different HW sizes (number of disks).
2. Especially with the ordered partitioner there tend to be hotspots and you
must assign a smaller
I agree having smaller regions would help the rebalancing situation both
with RP and BOP. However, I am not sure if dividing tables across disks
will give any better performance. You will have more seeking spindles and
can possibly subdivide token ranges into separate files. But FS cache will
On Sat, Mar 17, 2012 at 3:22 PM, Zhu Han schumi@gmail.com wrote:
On Sat, Mar 17, 2012 at 7:38 AM, Sam Overton s...@acunu.com wrote:
This is a long email. It concerns a significant change to Cassandra, so
deserves a thorough introduction.
*The summary is*: we believe virtual nodes are the
*The summary is*: we believe virtual nodes are the way forward. We would
like to add virtual nodes to Cassandra and we are asking for comments,
criticism and collaboration!
I am very happy to see some momentum on this, and I would like to go
even further than what you propose. The main reasons
Point of clarification: My use of the term bucket is completely
unrelated to the term bucket used in the CRUSH paper.
--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)