Re: Quantifying Virtual Node Impact on Cassandra Availability

2018-04-17 Thread Richard Low
I'm also not convinced the problems listed in the paper with removenode are
so serious. With lots of vnodes per node, removenode causes data to be
streamed into all other nodes in parallel, so it is (n-1) times quicker than
replacement for n nodes. For R=3, the failure rate during recovery goes up
with vnodes by a factor of (n-1)/4: without vnodes, after the first failure,
only a failure of one of 4 neighbouring nodes loses quorum, but with vnodes
any other node failure loses quorum. The increase in speed more than offsets
this, so in fact vnodes with removenode give theoretically 4x higher
availability than no vnodes.
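The arithmetic above can be sketched as a toy model (the model and all names
are mine, purely for illustration: failure risk taken as proportional to the
rebuild window times the number of nodes whose failure would lose quorum):

```python
# Toy model of the removenode argument above: risk ~ rebuild window
# multiplied by the number of nodes whose failure during that window
# would lose quorum (R=3 assumed, as in the text).

def quorum_loss_risk(n, vnodes, single_node_rebuild=1.0):
    if vnodes:
        window = single_node_rebuild / (n - 1)  # streams into n-1 nodes in parallel
        vulnerable = n - 1                      # any other node failure loses quorum
    else:
        window = single_node_rebuild            # rebuild limited by one replacement node
        vulnerable = 4                          # only 4 neighbouring nodes matter
    return window * vulnerable

for n in (10, 100, 1000):
    print(n, quorum_loss_risk(n, False) / quorum_loss_risk(n, True))
# the ratio is ~4 for every n: vnodes + removenode come out 4x better
```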

If anyone is interested in using vnodes in large clusters I'd strongly
suggest testing this out to see if the concerns in section 4.3.3 are valid.

Richard.

On 17 April 2018 at 08:29, Jeff Jirsa  wrote:

> There are two huge advantages
>
> 1) during expansion / replacement / decom, you stream from far more
> ranges. Since streaming is single threaded per stream, this enables you to
> max out machines during streaming where single token doesn’t
>
> 2) when adjusting the size of a cluster, you can often grow incrementally
> without rebalancing
>
> Streaming entire wholly covered/contained/owned sstables during range
> movements is probably a huge benefit in many use cases that may make the
> single threaded streaming implementation less of a concern, and likely
> works reasonably well without major changes to LCS in particular  - I’m
> fairly confident there’s a JIRA for this, if not it’s been discussed in
> person among various operators for years as an obvious future improvement.
>
> --
> Jeff Jirsa
>
>
> > On Apr 17, 2018, at 8:17 AM, Carl Mueller 
> wrote:
> >
> > Do Vnodes address anything besides alleviating cluster planners from
> > doing token range management on nodes manually? Do we have a centralized
> > list of advantages they provide beyond that?
> >
> > There seem to be lots of downsides. 2i index performance, the above
> > availability, etc.
> >
> > I also wonder if in vnodes (and manually managed tokens... I'll return to
> > this) the node recovery scenarios are being hampered by sstables having
> > the hash ranges of the vnodes intermingled in the same set of sstables. I
> > wondered in another thread why sstables aren't separated into sets by the
> > vnode ranges they represent. For a manually managed contiguous token
> > range, you could separate the sstables into a fixed number of sets, kind
> > of vnode-light.
> >
> > So if there was rebalancing or reconstruction, you could sneakernet or
> > reliably send entire sstable sets that would belong in a range.
> >
> > I also think this would improve compactions and repairs. Compactions
> > would be naturally parallelizable in all compaction schemes, and repairs
> > would have natural subsets to do merkle tree calculations.
> >
> > Granted sending sstables might result in "overstreaming" due to data
> > replication across the sstables, but you wouldn't have CPU and random I/O
> > to look up the data. Just sequential transfers.
> >
> > For manually managed tokens with subdivided sstables, if there was
> > rebalancing, you would have the "fringe" edges of the hash range
> > subdivided already, and you would only need to deal with the data in the
> > border areas of the token range, and again could sneakernet / dumb
> > transfer the tables and then let the new node remove the unneeded data in
> > future repairs. (Compaction does not remove data that is no longer
> > managed by a node, only repair does? Or does only nodetool cleanup do
> > that?)
> >
> > Pre-subdivided sstables for manually managed tokens would REALLY pay big
> > dividends in large-scale cluster expansion. Say you wanted to double or
> > triple the cluster. Since the sstables are already split by some numeric
> > factor that has lots of even divisors (60 for RF 2, 3, 4, 5), you simply
> > bulk copy the already-subdivided sstables for the new nodes' hash ranges
> > and you'd basically be done. In AWS EBS volumes, that could just be a
> > drive detach / drive attach.
> >
> >
> >
> >
> >> On Tue, Apr 17, 2018 at 7:37 AM, kurt greaves 
> wrote:
> >>
> >> Great write up. Glad someone finally did the math for us. I don't think
> >> this will come as a surprise for many of the developers. Availability is
> >> only one issue raised by vnodes. Load distribution and performance are
> >> also pretty big concerns.
> >>
> >> I'm always a proponent for fixing vnodes, and removing them as a default
> >> until we do. Happy to help on this and we have ideas in mind that at
> >> some point I'll create tickets for...
> >>
> >>> On Tue., 17 Apr. 2018, 06:16 Joseph Lynch, 
> wrote:
> >>>
> >>> If the blob link on github doesn't work for the pdf (looks like mobile
> >>> might not like it), try:
> >>>
> >>>
> >>> https://github.com/jolynch/python_performance_toolkit/
> >> 

Re: [VOTE] Release Apache Cassandra 2.0.8

2014-05-06 Thread Richard Low
There's a small mistake in CHANGES.txt - these changes from 1.2 branch were
already in 2.0.7:

 * Continue assassinating even if the endpoint vanishes (CASSANDRA-6787)
 * Schedule schema pulls on change (CASSANDRA-6971)
 * Non-droppable verbs shouldn't be dropped from OTC (CASSANDRA-6980)
 * Shutdown batchlog executor in SS#drain() (CASSANDRA-7025)

Richard.

On 6 May 2014 02:12, Sylvain Lebresne sylv...@datastax.com wrote:

 Since a fair amount of bug fixes have been committed since 2.0.7 I propose
 the
 following artifacts for release as 2.0.8.

 sha1: 7dbbe9233ce83c2a473ba2510c827a661de99400
 Git:

 http://git-wip-us.apache.org/repos/asf?p=cassandra.git;a=shortlog;h=refs/tags/2.0.8-tentative
 Artifacts:

 https://repository.apache.org/content/repositories/orgapachecassandra-1011/org/apache/cassandra/apache-cassandra/2.0.8/
 Staging repository:
 https://repository.apache.org/content/repositories/orgapachecassandra-1011/

 The artifacts as well as the debian package are also available here:
 http://people.apache.org/~slebresne/

 The vote will be open for 72 hours (longer if needed).

 [1]: http://goo.gl/G3O7pF (CHANGES.txt)
 [2]: http://goo.gl/xBvQJU (NEWS.txt)



Re: [VOTE] Release Apache Cassandra 2.0.7

2014-04-16 Thread Richard Low
+1 (non-binding) on getting 2.0.7 out soon


On 16 April 2014 07:44, Sylvain Lebresne sylv...@datastax.com wrote:

 On Mon, Apr 14, 2014 at 8:32 PM, Pavel Yaskevich pove...@gmail.com
 wrote:

  Can I push new release of the thrift-server before we roll 2.0.7?
 

 When the vote email goes out, the artifacts are already rolled per se, so
 pushing anything means re-rolling the artifacts and the vote. So I guess
 the question is, does this fix some regression compared to 2.0.6? If it
 doesn't, then I'd say that 2.0.7 has been long enough in the coming as it
 is that I'd rather get it out; it has enough important fixes over 2.0.6.

 --
 Sylvain


 
 
  On Mon, Apr 14, 2014 at 10:57 AM, Jonathan Ellis jbel...@gmail.com
  wrote:
 
   +1
   On Apr 14, 2014 10:39 AM, Sylvain Lebresne sylv...@datastax.com
  wrote:
  
sha1: 7dbbe9233ce83c2a473ba2510c827a661de99400
Git:
   
   
  
 
 http://git-wip-us.apache.org/repos/asf?p=cassandra.git;a=shortlog;h=refs/tags/2.0.7-tentative
Artifacts:
   
   
  
 
 https://repository.apache.org/content/repositories/orgapachecassandra-1009/org/apache/cassandra/apache-cassandra/2.0.7/
Staging repository:
   
  
 
 https://repository.apache.org/content/repositories/orgapachecassandra-1009/
   
The artifacts as well as the debian package are also available here:
http://people.apache.org/~slebresne/
   
The vote will be open for 72 hours (longer if needed).
   
[1]: http://goo.gl/6yg6Xh (CHANGES.txt)
[2]: http://goo.gl/GxmBC9 (NEWS.txt)
   
  
 



Re: Proposal: freeze Thrift starting with 2.1.0

2014-03-11 Thread Richard Low
+1. Although lots of people are still using Thrift, it's not a good use of
time to maintain two interfaces when one is clearly better. But, yes,
retaining Thrift for some time is important.


On 11 March 2014 17:27, sankalp kohli kohlisank...@gmail.com wrote:

 RIP Thrift :)
 +1 with We will retain it for backwards compatibility. Hopefully most
 people will move out of thrift by 2.1


 On Tue, Mar 11, 2014 at 10:18 AM, Brandon Williams dri...@gmail.com
 wrote:

  As someone who has written a thrift wrapper, +1
 
 
  On Tue, Mar 11, 2014 at 12:00 PM, Jonathan Ellis jbel...@gmail.com
  wrote:
 
   CQL3 is almost two years old now and has proved to be the better API
   that Cassandra needed.  CQL drivers have caught up with and passed the
   Thrift ones in terms of features, performance, and usability.  CQL is
   easier to learn and more productive than Thrift.
  
   With static columns and LWT batch support [1] landing in 2.0.6, and
   UDT in 2.1 [2], I don't know of any use cases for Thrift that can't be
    done in CQL.  Contrariwise, CQL makes many things easy that are
   difficult to impossible in Thrift.  New development is overwhelmingly
   done using CQL.
  
    To date we have had an unofficial and poorly defined policy of "add
    support for new features to Thrift when that is easy."  However,
    even relatively simple Thrift changes can create subtle complications
    for the rest of the server; for instance, allowing Thrift range
    tombstones would make filter conversion for CASSANDRA-6506 more
    difficult.
  
   Thus, I think it's time to officially close the book on Thrift.  We
   will retain it for backwards compatibility, but we will commit to
   adding no new features or changes to the Thrift API after 2.1.0.  This
   will help send an unambiguous message to users and eliminate any
   remaining confusion from supporting two APIs.  If any new use cases
   come to light that can be done with Thrift but not CQL, we will commit
   to supporting those in CQL.
  
   (To a large degree, this merely formalizes what is already de facto
   reality.  Most thrift clients have not even added support for
   atomic_batch_mutate and cas from 2.0, and popular clients like
   Astyanax are migrating to the native protocol.)
  
   Reasonable?
  
   [1] https://issues.apache.org/jira/browse/CASSANDRA-6561
   [2] https://issues.apache.org/jira/browse/CASSANDRA-5590
  
   --
   Jonathan Ellis
   Project Chair, Apache Cassandra
   co-founder, http://www.datastax.com
   @spyced
  
 



Re: RFC: Cassandra Virtual Nodes

2012-03-22 Thread Richard Low
On 22 March 2012 05:48, Zhu Han schumi@gmail.com wrote:

 I second it.

 Are there any goals we missed which cannot be achieved by assigning
 multiple tokens to a single node?

This is exactly the proposed solution.  The discussion is about how to
implement this, and the methods of choosing tokens and replication
strategy.

Richard.


Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Richard Low
On 20 March 2012 14:50, Rick Branson rbran...@datastax.com wrote:

 To support a form of DF, I think some tweaking of the replica placement could
 achieve this effect quite well. We could introduce a variable into replica
 placement, which I'm going to incorrectly call DF for the purposes of
 illustration. The key range for a node would be sub-divided by DF (1 by
 default) and this would be used to further distribute replica selection
 based on this sub-partition.

 Currently, the offset formula works out to be something like this:

 offset = replica

 For RandomPartitioner, DF placement might look something like:

 offset = replica + (token % DF)

 Now, I realize replica selection is actually much more complicated than this, 
 but these formulas are for illustration purposes.

 Modifying replica placement and the partitioners to support this seems
 straightforward, but I'm unsure of what's required to get it working for
 ring management operations. On the surface, it does seem like this could be
 added without any kind of difficult migration support.

 Thoughts?

This solution increases the DF, which has the advantage of providing
some balancing when a node is down temporarily.  The reads and writes
it would have served are now distributed around ~DF nodes.
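Rick's offset formula can be sketched on a toy ring like this (this is not
Cassandra's actual placement code; the function and all names are mine):

```python
# Toy illustration of DF-based replica placement: the token's position
# within a node's sub-divided range (token % DF) shifts the replica chain,
# spreading a down node's reads/writes over ~DF distinct neighbour sets.

from bisect import bisect_left

def replicas(token, ring, rf=3, df=1):
    """ring: sorted node tokens; returns indices of the rf replica nodes."""
    primary = bisect_left(ring, token) % len(ring)  # first node token >= token
    offset = token % df                             # Rick's sub-partition shift
    return [(primary + offset + r) % len(ring) for r in range(rf)]

ring = list(range(0, 100, 10))   # 10 nodes, evenly spaced tokens
print(replicas(25, ring, df=1))  # every token in (20,30] -> same 3 replicas
print(replicas(25, ring, df=4))  # neighbouring tokens spread across chains
```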

However, it doesn't have any distributed rebuild.  In fact, any
distribution mechanism with one token per node cannot have distributed
rebuild.  Should a node fail, the next node in the ring has twice the
token range so must have twice the data.  This node will limit the
rebuild time - 'nodetool removetoken' will have to replicate the data
of the failed node onto this node.

Increasing the distribution factor without speeding up rebuild
increases the failure probability - both for data loss or being unable
to reach required consistency levels.  The failure probability is a
trade-off between rebuild time and distribution factor.  Lower rebuild
time helps, and lower distribution factor helps.

Cassandra as it is now has the longest rebuild time and lowest
possible distribution factor.  The original vnodes scheme is the other
extreme - shortest rebuild time and largest possible distribution
factor.  It turns out that the rebuild time is more important, so this
decreases failure probability (with some assumptions you can show it
decreases by a factor RF! - I'll spare you the math but can send it if
you're interested).
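As a rough illustration only (a toy model of my own, not Richard's omitted
derivation): for RF=3, data loss needs RF-1 of the DF vulnerable nodes to
fail within the rebuild window, so shrinking the window enters squared while
growing DF enters only combinatorially.

```python
# Toy model (mine, not Richard's math): after one failure, data is lost
# if RF-1 of the DF vulnerable nodes also fail within the rebuild window;
# each does so with probability ~ lam * t_rebuild for failure rate lam.

from math import comb

def p_loss(df, t_rebuild, rf=3, lam=1e-4):
    return comb(df, rf - 1) * (lam * t_rebuild) ** (rf - 1)

n = 100
single_token = p_loss(df=4, t_rebuild=1.0)               # slow rebuild, few vulnerable nodes
full_vnodes = p_loss(df=n - 1, t_rebuild=1.0 / (n - 1))  # fast rebuild, many vulnerable
print(full_vnodes < single_token)  # shorter rebuild wins despite larger DF
```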

This scheme has the longest rebuild time and a (tuneable) distribution
factor larger than the lowest possible.  That necessarily increases the
failure probability over both Cassandra as it is now and the vnode
schemes, so I'd be very careful about choosing it.

Richard.


Announcing Acunu

2011-01-31 Thread Richard Low
Hello,

Just thought I'd drop everyone a quick line to let you know that Acunu
are looking for some talented devs to work on Cassandra.

Acunu are working on a storage platform for Big Data, including a
modified version of Cassandra on top of a native in-kernel key-value
store, with a bunch of deployment, management and monitoring tools.
In the coming months we're looking to open source our core storage
engine and submit patches back to the project.

Acunu's not ready for production use yet, but we're expanding our beta
right now, and are looking for people to put it through its paces. You
can read more at http://www.acunu.com/

Thanks

Richard

--
Richard Low
Acunu | http://www.acunu.com | @acunu