Re: Repair scheduling tools

2018-04-05 Thread benjamin roth
I'm not saying Reaper is the problem, and I don't want to do Reaper an
injustice, but in the end it is "just" instrumentation around CS's built-in
repairs that slices and schedules them, right?
The problem I see is that the built-in repairs are rather inefficient (for
many, though maybe not all, use cases) for a number of reasons. To name some
of them:

- Overstreaming, as only whole partitions are repaired, not single mutations
- Race conditions in merkle tree calculation on nodes taking part in a
repair session
- Every stream creates an SSTable, which then needs to be compacted
- Possible SSTable creation floods can even kill a node due to "too many
open files" - yes, we had that
- Incremental repairs have issues

Today we had a super simple case where I first ran 'nodetool repair' on a
super small system keyspace and then ran a 'scrape repair':
- nodetool took 4 minutes on a single node
- scraping took 1 second to repair all nodes together

In the beginning I was racking my brain over how this could be optimized in
CS - in the end, going with scraping solved every problem we had.
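
For illustration, a rough sketch of what I mean by "scraping" (this is not
our actual tool; my_ks, my_cf and pk are placeholder names, and it assumes
the DataStax Java driver 3.x and the default Murmur3Partitioner): walk the
token ring in slices and read every slice back at CL_ALL, so that read
repair streams only the rows that actually differ between replicas.

import com.datastax.driver.core.*;
import java.math.BigInteger;

public class ScrapeRepair {
    public static void main(String[] args) throws Exception {
        int slices = 4096;  // number of token slices to walk; tune for partition sizes / timeouts
        BigInteger min = BigInteger.valueOf(Long.MIN_VALUE);
        BigInteger max = BigInteger.valueOf(Long.MAX_VALUE);
        BigInteger width = max.subtract(min).divide(BigInteger.valueOf(slices));

        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            for (int i = 0; i < slices; i++) {
                long start = min.add(width.multiply(BigInteger.valueOf(i))).longValue();
                long end = (i == slices - 1)
                        ? Long.MAX_VALUE
                        : min.add(width.multiply(BigInteger.valueOf(i + 1))).longValue() - 1;
                Statement scrape = new SimpleStatement(
                        "SELECT * FROM my_ks.my_cf WHERE token(pk) >= ? AND token(pk) <= ?",
                        start, end)
                        .setConsistencyLevel(ConsistencyLevel.ALL);
                // Iterating the result is all that is needed: the CL_ALL read contacts
                // every replica, and a digest mismatch triggers read repair for exactly
                // the rows in this slice that diverge.
                for (Row ignored : session.execute(scrape)) { }
            }
        }
    }
}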

2018-04-05 20:32 GMT+02:00 Jonathan Haddad <j...@jonhaddad.com>:

> To be fair, reaper in 2016 only worked with 2.0 and was just sitting
> around, more or less.
>
> Since then we've had 401 commits changing tens of thousands of lines of
> code, dealing with fault tolerance, repair retries, scalability, etc.
> We've had 1 reaper node managing repairs across dozens of clusters and
> thousands of nodes.  It's a totally different situation today.
>
>
> On Thu, Apr 5, 2018 at 11:17 AM benjamin roth <brs...@gmail.com> wrote:
>
> > That would be totally awesome!
> >
> > Not sure if it helps here but for completeness:
> > We completely "dumped" regular repairs - no matter if 'nodetool repair'
> or
> > reaper - and run our own tool that does simply CL_ALL scraping over the
> > whole cluster.
> > It runs now for over a year in production and the only problem we
> > encountered was that we got timeouts when scraping (too) large /
> tombstoned
> > partitions. It turned out that the large partitions weren't even readable
> > with CQL / cqlsh / DevCenter. So that wasn't a problem of the repair. It
> > was rather a design problem. Storing data that can't be read doesn't make
> > sense anyway.
> >
> > What I can tell from our experience:
> > - It works much more reliable than what we had before - also more
> reliable
> > than reaper (state of 2016)
> > - It runs totally smooth and much faster than regular repairs as it only
> > streams what needs to be streamed
> > - It's easily manageable, interruptible, resumable on a very fine-grained
> > level. The only thing you need to do is to store state (KS/CF/Last Token)
> > in a simple storage like redis
> > - It works even pretty well when populating a empty node e.g. when
> changing
> > RFs / bootstrapping DCs
> > - You can easily control the cluster-load by tuning the concurrency of
> the
> > scrape process
> >
> > I don't see a reason for us to ever go back to built-in repairs if they
> > don't improve immensely. In many cases (especially with MVs) they are
> true
> > resource killers.
> >
> > Just my 2 cent and experience.
> >
> > 2018-04-04 17:00 GMT+02:00 Ben Bromhead <b...@instaclustr.com>:
> >
> > > +1 to including the implementation in Cassandra itself. Makes managed
> > > repair a first-class citizen, it nicely rounds out Cassandra's
> > consistency
> > > story and makes it 1000x more likely that repairs will get run.
> > >
> > >
> > >
> > >
> > > On Wed, Apr 4, 2018 at 10:45 AM Jon Haddad <j...@jonhaddad.com> wrote:
> > >
> > > > Implementation details aside, I’m firmly in the “it would be nice of
> C*
> > > > could take care of it” camp.  Reaper is pretty damn easy to use and
> > > people
> > > > *still* don’t put it in prod.
> > > >
> > > >
> > > > > On Apr 4, 2018, at 4:16 AM, Rahul Singh <
> > rahul.xavier.si...@gmail.com>
> > > > wrote:
> > > > >
> > > > > I understand the merits of both approaches. In working with other
> DBs
> > > In
> > > > the “old country” of SQL, we often had to write indexing sequences
> > > manually
> > > > for important tables. It was “built into the product” but in order to
> > > > leverage the maximum benefits of indices we had to have different
> > indices
> > > > other than the clustered (physical index). The process still sucked.
> > It’s
> > > > never perfect.

Re: Repair scheduling tools

2018-04-05 Thread benjamin roth
That would be totally awesome!

Not sure if it helps here, but for completeness:
We completely "dumped" regular repairs - no matter if 'nodetool repair' or
Reaper - and run our own tool that simply does CL_ALL scraping over the
whole cluster.
It has now been running for over a year in production, and the only problem
we encountered was timeouts when scraping (too) large / heavily tombstoned
partitions. It turned out that those large partitions weren't even readable
with CQL / cqlsh / DevCenter, so that wasn't a problem of the repair - it
was rather a design problem. Storing data that can't be read doesn't make
sense anyway.

What I can tell from our experience:
- It works much more reliably than what we had before - also more reliably
than Reaper (as of 2016)
- It runs totally smoothly and much faster than regular repairs, as it only
streams what needs to be streamed
- It's easily manageable, interruptible and resumable on a very fine-grained
level. The only thing you need to do is store state (KS/CF/last token)
in a simple storage like Redis - see the sketch after this list
- It works pretty well even when populating an empty node, e.g. when changing
RFs / bootstrapping DCs
- You can easily control the cluster load by tuning the concurrency of the
scrape process
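
To make the "resumable" point concrete, a minimal sketch of the bookkeeping
(again, not our actual tool; it assumes the Jedis client and a made-up key
scheme): the scraper just remembers the last token it finished per
keyspace/table and continues from there after an interruption.

import redis.clients.jedis.Jedis;

public class ScrapeCheckpoint {

    private final Jedis redis = new Jedis("localhost");  // assumes a local Redis instance

    private String key(String keyspace, String table) {
        return "scrape:" + keyspace + ":" + table;       // made-up key scheme: KS/CF -> last finished token
    }

    /** Token to resume from, or the minimum Murmur3 token if this table was never scraped. */
    public long lastToken(String keyspace, String table) {
        String value = redis.get(key(keyspace, table));
        return value == null ? Long.MIN_VALUE : Long.parseLong(value);
    }

    /** Called after every successfully scraped token slice, so an aborted run resumes right here. */
    public void commit(String keyspace, String table, long token) {
        redis.set(key(keyspace, table), Long.toString(token));
    }
}

Controlling the cluster load is then just a question of how many token
slices you scrape in parallel.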

I don't see a reason for us to ever go back to built-in repairs if they
don't improve immensely. In many cases (especially with MVs) they are true
resource killers.

Just my 2 cents and experience.

2018-04-04 17:00 GMT+02:00 Ben Bromhead :

> +1 to including the implementation in Cassandra itself. Makes managed
> repair a first-class citizen, it nicely rounds out Cassandra's consistency
> story and makes it 1000x more likely that repairs will get run.
>
>
>
>
> On Wed, Apr 4, 2018 at 10:45 AM Jon Haddad  wrote:
>
> > Implementation details aside, I’m firmly in the “it would be nice of C*
> > could take care of it” camp.  Reaper is pretty damn easy to use and
> people
> > *still* don’t put it in prod.
> >
> >
> > > On Apr 4, 2018, at 4:16 AM, Rahul Singh 
> > wrote:
> > >
> > > I understand the merits of both approaches. In working with other DBs
> In
> > the “old country” of SQL, we often had to write indexing sequences
> manually
> > for important tables. It was “built into the product” but in order to
> > leverage the maximum benefits of indices we had to have different indices
> > other than the clustered (physical index). The process still sucked. It’s
> > never perfect.
> > >
> > > The JVM is already fraught with GC issues and putting another process
> > being managed in the same heapspace is what I’m worried about.
> Technically
> > the process could be in the same binary but started as a side Car or in
> the
> > same main process.
> > >
> > > Consider a process called “cassandra-agent” that’s sitting around with
> a
> > scheduler based on config or a Cassandra table. Distributed in the same
> > release. Shell / service scripts would start it. The end user knows it
> only
> > by examining the .sh files. This opens possibilities of including a GUI
> > hosted in the same process without cluttering the core coolness of
> > Cassandra.
> > >
> > > Best,
> > >
> > > --
> > > Rahul Singh
> > > rahul.si...@anant.us
> > >
> > > Anant Corporation
> > >
> > > On Apr 4, 2018, 2:50 AM -0400, Dor Laor , wrote:
> > >> We at Scylla, implemented repair in a similar way to the Cassandra
> > reaper.
> > >> We do
> > >> that using an external application, written in go that manages repair
> > for
> > >> multiple clusters
> > >> and saves the data in an external Scylla cluster. The logic resembles
> > the
> > >> reaper one with
> > >> some specific internal sharding optimizations and uses the Scylla rest
> > api.
> > >>
> > >> However, I have doubts it's the ideal way. After playing a bit with
> > >> CockroachDB, I realized
> > >> it's super nice to have a single binary that repairs itself, provides
> a
> > GUI
> > >> and is the core DB.
> > >>
> > >> Even while distributed, you can elect a leader node to manage the
> > repair in
> > >> a consistent
> > >> way so the complexity can be reduced to a minimum. Repair can write
> its
> > >> status to the
> > >> system tables and to provide an api for progress, rate control, etc.
> > >>
> > >> The big advantage for repair to embedded in the core is that there is
> no
> > >> need to expose
> > >> internal state to the repair logic. So an external program doesn't
> need
> > to
> > >> deal with different
> > >> version of Cassandra, different repair capabilities of the core (such
> as
> > >> incremental on/off)
> > >> and so forth. A good database should schedule its own repair, it knows
> > >> whether the shreshold
> > >> of hintedhandoff was cross or not, it knows whether nodes where
> > replaced,
> > >> etc,
> > >>
> > >> My 2 cents. Dor
> > >>
> > >> On Tue, Apr 3, 2018 at 11:13 PM, Dinesh Joshi <
> > >> dinesh.jo...@yahoo.com.invalid> wrote:
> > >>
> > >>> Simon,
> > >>> 

Re: State of Materialized Views

2017-07-24 Thread benjamin roth
Hi Josh,

Who is "we" in this case?

Best,
Ben

2017-07-24 15:41 GMT+02:00 Josh McKenzie :

> >
> > The initial contributors turned their back on MVs
>
>
> We're working on the following MV-related issues in the 4.0 time-frame:
> CASSANDRA-13162
> CASSANDRA-13547
> CASSANDRA-13127
> CASSANDRA-13409
> CASSANDRA-12952
> CASSANDRA-13069
> CASSANDRA-12888
>
> We're also keeping our eye on CASSANDRA-13657
>
> This is by no means an exhaustive list, but we're hoping it'll help take
> care of some of the more pressing / critical issues with the feature.
> Automated de-normalization on a Dynamo EC architecture is a Hard Problem.
>
>
> On Thu, Jul 20, 2017 at 9:56 PM, kurt greaves 
> wrote:
>
> > I'm going to do my best to review all the changes Zhao is making under
> > CASSANDRA-11500 ,
> > but yeah definitely need a committer nominee as well. On that note, Zhao
> is
> > going to try address a lot of the current issues I listed above in
> #11500.​
> > Thanks Zhao!
> >
>


Re: State of Materialized Views

2017-07-17 Thread benjamin roth
Hi Kurt,

First of all thanks for this elaborate post.

At this moment I don't want to come up with a solution for all MV issues,
but I would like to point out why I was quite active some time ago and why
I have since pulled back.

As you also mentioned in different words, it seems to me that MVs are an
orphan in CS. They started out as a shiny and promising feature, but ... .
When I came to CS, MVs were one of the reasons why I gave CS in general and
3.0 in particular a try. But when I started to work with MVs in production -
willing to overcome the "little obstacles" and the fact that they are "not
quite stable" - I started to realize that there is almost no support from
the community. The initial contributors turned their back on MVs. All that
remained is a 95% ready feature, a lot of public documentation, but no
disclaimer that says "Please Do Not Use MVs". And every time a discussion
pops up around MVs, the bottom line is:

- All or most of the people involved don't have much experience with MVs
- The original contributors are not involved
- It seems to me that discussions are based more on assumptions or superficial
knowledge than on real knowledge/experience/research/proofs
- Bringing in code changes is difficult for the same reasons. Nobody likes
to take over the "old heritage" or take responsibility for it. And it
seems that nobody feels confident enough to bring in critical changes
- I don't want to touch this critical part of the code path, I know we have
tests but ...

Initially I was very eager to contribute and to help MVs mature, but
over time it turned out to be very cumbersome and frustrating. Additionally,
I have very little time left in my daily routine to work on CS. So I
decided to work on a solution that solved our specific problems with CS and
MVs. I am not really happy with it, but it actually works quite well.

To be honest, I also had it in the back of my head to write a posting similar
to yours. I would really like to contribute and bring MVs forward, but not
at all costs. I see many problems with MVs, even some that haven't
been mentioned yet. But I do not want to come up with half-baked
assumptions. What really lacks for MVs is a reproducible, code-based proof of
what works and what does not. One example is the question "Why can I add
only a single column to an MV PK?". I have read arguments that I think
are not quite right or "somehow incomplete". There are a lot of
arguments and discussions that are totally scattered across JIRA, and it
seems to me that every contributor knows a little bit of this and a little
bit of that and remembers this post or that post. I was already thinking of
setting up a super-reduced "storage mock" to prove / find edge cases in MV
fail-and-repair scenarios, to answer questions like these with code instead
of sentences like "I think that ..." or "I can remember a comment by ...".
Unfortunately, dtests are super painful for things like that because a) they
are f* slow and b) it is super complicated to simulate a certain situation. I
also did not see a simple way to do this with the CS unit test suite, as I
didn't see a way to boot and control multiple storages there.

*What I miss is a central consensus about "MV business rules" + a central
set of proofs and/or tests that support these rules and prove or falsify
assumptions in a reproducible way.*

The reasons why I have not already come up with something like that:
- Time
- Frustration

If I can see that there are more people who feel the same way and are willing
to work together to find a solid solution, my level of frustration could
turn into motivation again.

--
Last but not least, for those who care:
One of the solutions I created was to implement our own version of the Tickler
(full table scans with CL_ALL to enforce read repair) to get rid of these
damned built-in repairs, which simply don't work well (especially) for MVs.
To name only a few numbers:
- We could bring down the repair time of a KS with RF=5 from 5 hours to 5
minutes. Really. I could not believe it.
- No more "compaction storms", piling-up compaction queues or compactions
falling behind
- No more SSTables piling up. Before, it was normal that the number of
SSTables went up from 300-400 to 5000 and more. After: no noticeable
change. (Btw, that was the reason for CASSANDRA-12730. This isn't even bound
to MVs; they maybe only amplify the impact of the underlying design.)
- We now repair the whole cluster in 16h (10 nodes, 400-450 GB load each,
14 KS). Before, we had single keyspaces that took more than a day to finish.
Sometimes they even took 3 days with Reaper because of "too many
compactions"
- It showed us problems in our model. We had data that was not readable at
all due to massive tombstones + read timeouts
... if someone is interested in more details, just ping me.

- Benjamin


2017-07-17 6:22 GMT+02:00 kurt greaves :

> wall of text inc.
> *tl;dr: *Aiming to come to some conclusions about what we are doing with
> MV's and how we are going 

Re: Integrating vendor-specific code and developing plugins

2017-05-15 Thread benjamin roth
Absolutely.

+ Separate repos for separate modules also separate responsibilities. IMHO it
makes a heterogeneous structure more manageable, both in a technical and in a
human or institutional way.

2017-05-15 13:54 GMT+02:00 Jonathan Haddad :

> There's a handful of issues I can think of with shipping everything
> in-tree.  I'll try to brain dump them here.
>
> * What's included when shipped in tree?
>
> Does every idea get merged in? Do we need 30 different Seed providers?  Who
> judges what's valuable enough to add?  Who maintains it when it needs
> updating?  If the maintainer can't be found, is it removed?  Shipped
> broken?  Does the contributed plugins go through the same review process?
> Do the contributors need to be committers?  Would CASSANDRA-12627 be merged
> in even if nobody saw the value?
>
> * Language support
>
> Cassandra is based on Java 8.  Do we now merge in Scala, Kotlin, Jython?
>
> * Plugins are now tied directly to cassandra release cycle
>
> This one bugs me quite a bit.  With a new plugin, there's going to be a lot
> of rapid iterations.  Spark releases once every 3 months - a lot of the
> packages seem to be released at a much higher frequency.
>
> * In Python, the standard lib is where modules go to die
>
> I forgot where I heard this, but it's pretty accurate.  Including
> everything, "batteries includes", just ends up shipping some pretty
> terrible batteries.  The best stuff for python is on pypi.
>
> Rust deliberately made the decision to limit the std to avoid this
> problem.  There's a "nursery" [1] area for ideas to evolve independently,
> and when some code reaches a high enough level of maturity, it can get
> merged in.  There's also a packages system for third party, non std
> destined code.
>
> Anyways - I'm very +1 on a package system where codebases can independently
> evolve without needing to be part of the project itself.  It's a proven
> model for shipping high quality, useful code, and sometimes is even one of
> the best aspects of a project.  That said, it's quite a bit of work just to
> get going and someone would have to manage that.
>
> Jon
>
> [1] https://github.com/rust-lang-nursery
>
>
> On Sun, May 14, 2017 at 9:03 PM Jeff Jirsa  wrote:
>
> > On Fri, May 12, 2017 at 9:31 PM, J. D. Jordan  >
> > wrote:
> >
> > > In tree would foster more committers which is a good thing.
> >
> >
> > Definitely.
> >
> > But I also agree that not being able to actually run unit tests is a bad
> > > thing. What if we asked people who want to contribute these types of
> > > optimizations to provide the ASF with a Jenkins slave we could test
> them
> > > on, if they want them in tree?
> > >
> >
> > I think SOME FORM of jenkins/unit/dtests need to exist, whether it's ASF
> > puppet controlled or test output explicitly provided by the maintainer.
> >
> >
> > > Otherwise one good thing about out of tree is that the maintainer can
> > > specify "this plugin has been tested and works with Apache Cassandra
> > > version X.Y.Z". If it is in tree it is assumed it will work. If it is
> out
> > > of tree then the plugin can more easily notify a user what version it
> was
> > > last tested with.  And users won't be surprised if they upgrade and
> > things
> > > break.
> > >
> >
> > My main desire is that I'd like to see us do better at helping third
> party
> > contributors be successful in contributing, and to me that means
> something
> > more official. I like the spark packages model. I like the apache httpd
> > model (with LOTS of official extensions in-tree, but a lot externally
> > packaged as well). I'm not a fan of telling people to build and
> distribute
> > their own JARs - it doesn't help the contributors, it doesn't help the
> > users, and it doesn't help the project.
> >
> > - Jeff
> >
>


Re: Documentation contributors guide

2017-03-17 Thread benjamin roth
Isn't there a way to script that with just a few lines of python or
whatever?

2017-03-17 21:03 GMT+01:00 Jeff Jirsa :

>
>
> On 2017-03-17 12:33 (-0700), Stefan Podkowinski  wrote:
>
> > As you can see there's a large part about using GitHub for editing on
> > the page. I'd like to know what you think about that and if you'd agree
> > to accept PRs for such purposes.
> >
>
> The challenge of github PRs isn't that we don't want them, it's that we
> can't merge them - the apache github repo is a read only mirror (the master
> is on ASF infrastructure).
>
> Personally, I'd rather have a Github PR than no patch, but I'd much rather
> have a JIRA patch than a Github PR, because ultimately the committer is
> going to have to manually transform the Github PR into a .patch file and
> commit it with a special commit message to close the Github PR (or hope
> that the contributor closes it for us, because committers can't even close
> PRs at this point).
>
> > I'd also like to add another section for committers that describes the
> > required steps to actually publish the latest trunk to our website. I
> > know that svn has been mentioned somewhere, but I would appreciate if
> > someone either adds that section or just shares some details in this
> thread.
>
> The repo is at https://svn.apache.org/repos/asf/cassandra/ - there's a
> doc at https://svn.apache.org/repos/asf/cassandra/site/src/README that
> describes it.
>


Re: Code quality, principles and rules

2017-03-17 Thread benjamin roth
I think you can refactor any project with little risk and increase test
coverage.
What is needed:
Rules. Discipline. Perseverance. Small iterations. Small iterations. Small
iterations.

   - Refactor in the smallest possible unit
   - Split large classes into smaller ones. Remove god classes by pulling
   out single methods or aspects. Maybe factor out method by method.
   - Maintain compatibility. Build facades, adapters, proxy objects for
   compatibility during the refactoring process. Do not break interfaces if
   it is not really necessary, or if doing so is risky.
   - Push states into corners. E.g. when refactoring a single method, pass
   global state as a parameter, so this single method becomes testable
   (see the sketch below).

If you iterate like this maybe 1000 times, you will most likely break far
fewer things than with a big-bang refactor. You make code testable in
small steps.
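
To make the "push states into corners" rule concrete, a minimal sketch
(CompactionPolicy, Config and its fields are made-up placeholders, not
Cassandra classes): the decision logic is extracted into a method that takes
the global state as parameters, so it becomes a pure function a plain unit
test can exercise, while existing callers keep using the old entry point.

public class CompactionPolicy {

    /** Stand-in for some global, mutable configuration singleton. */
    static class Config {
        static final Config instance = new Config();
        volatile boolean compactionEnabled = true;
        volatile long quietUntilMillis = 0;
    }

    /** Old entry point: callers don't change, it just delegates. */
    boolean shouldCompact() {
        return shouldCompact(Config.instance, System.currentTimeMillis());
    }

    /** Extracted logic: no hidden global state, trivially testable. */
    static boolean shouldCompact(Config config, long nowMillis) {
        return config.compactionEnabled && nowMillis > config.quietUntilMillis;
    }
}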

Global state is the biggest disease the history of programming has ever seen.
Singletons are also not super great to test, and static methods should be
avoided at all costs if they contain state.
Tested, idempotent static methods should not be a problem.

From my experience, you don't need a bloated DI framework to make a class
testable that somehow depends on static methods or singletons.
You just have to push the bad guys into a corner where they do no harm and
can be killed without risk in the very end.
E.g. instead of calling SomeClass.instance.doWhatEver() spread here and
there, it can be encapsulated in a single method like
TestableClass.doWhatever() { SomeClass.instance.doWhatEver(); }
Or the whole singleton is retrieved through TestableClass.getSomeClass().
So you can either mock the hell out of it or inject a non-singleton
instance of that class at test runtime.
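
As a minimal sketch of that encapsulation (SomeClass and TestableClass are
just the placeholder names from above, not real classes): production code
falls through to the singleton by default, while a test can inject a mock or
a plain non-singleton instance.

import java.util.function.Supplier;

public class TestableClass {

    /** Stand-in for a global singleton we cannot get rid of yet. */
    static class SomeClass {
        static final SomeClass instance = new SomeClass();
        void doWhatEver() { /* the actual work */ }
    }

    // Default: delegate to the singleton. Tests overwrite this supplier with a
    // mock or a freshly constructed instance.
    private Supplier<SomeClass> someClass = () -> SomeClass.instance;

    void setSomeClass(Supplier<SomeClass> supplier) {
        this.someClass = supplier;
    }

    void doWhatever() {
        // the only place in this class that knows about the singleton
        someClass.get().doWhatEver();
    }
}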


2017-03-17 19:19 GMT+01:00 Jason Brown :

> As someone who spent a lot of time looking at the singletons topic in the
> past, Blake brings a great perspective here. Figuring out and communicating
> how best to test with the system we have (and of course incrementally
> making that system easier to work with/test) seems like an achievable goal.
>
> On Fri, Mar 17, 2017 at 10:17 AM, Edward Capriolo 
> wrote:
>
> > On Fri, Mar 17, 2017 at 12:33 PM, Blake Eggleston 
> > wrote:
> >
> > > I think we’re getting a little ahead of ourselves talking about DI
> > > frameworks. Before that even becomes something worth talking about,
> we’d
> > > need to have made serious progress on un-spaghettifying Cassandra in
> the
> > > first place. It’s an extremely tall order. Adding a DI framework right
> > now
> > > would be like throwing gasoline on a raging tire fire.
> > >
> > > Removing singletons seems to come up every 6-12 months, and usually
> > > abandoned once people figure out how difficult they are to remove
> > properly.
> > > I do think removing them *should* be a long term goal, but we really
> need
> > > something more immediately actionable. Otherwise, nothing’s going to
> > > happen, and we’ll be having this discussion again in a year or so when
> > > everyone’s angry that Cassandra 5.0 still isn’t ready for production, a
> > > year after it’s release.
> > >
> > > That said, the reason singletons regularly get brought up is because
> > doing
> > > extensive testing of anything in Cassandra is pretty much impossible,
> > since
> > > the code is basically this big web of interconnected global state.
> > Testing
> > > anything in isolation can’t be done, which, for a distributed database,
> > is
> > > crazy. It’s a chronic problem that handicaps our ability to release a
> > > stable database.
> > >
> > > At this point, I think a more pragmatic approach would be to draft and
> > > enforce some coding standards that can be applied in day to day
> > development
> > > that drive incremental improvement of the testing and testability of
> the
> > > project. What should be tested, how it should be tested. How to write
> new
> > > code that talks to the rest of Cassandra and is testable. How to fix
> bugs
> > > in old code in a way that’s testable. We should also have some
> guidelines
> > > around refactoring the wildly untested sections, how to get started,
> what
> > > to do, what not to do, etc.
> > >
> > > Thoughts?
> >
> >
> > To make the conversation practical. There is one class I personally
> really
> > want to refactor so it can be tested:
> >
> > https://github.com/apache/cassandra/blob/trunk/src/java/
> > org/apache/cassandra/net/OutboundTcpConnection.java
> >
> > There is little coverage here. Questions like:
> > what errors cause the connection to restart?
> > when are undropable messages are dropped?
> > what happens when the queue fills up?
> > Infamous throw new AssertionError(ex); (which probably bubble up to
> > nowhere)
> > what does the COALESCED strategy do in case XYZ.
> > A nifty label (wow a label you just never see those much!)
> > outer:
> > while (!isStopped)
> >
> > Comments to jira's that probably are not explicitly 

Re: Contribute to the Cassandra wiki

2017-03-13 Thread benjamin roth
Contribution Guide +1
Github WebUI +1
Pull requests +1

Rest: Inspect + Adapt

2017-03-13 19:38 GMT+01:00 Stefan Podkowinski :

> Agreed. Let's not give up on this as quickly. My suggestion is to at
> least provide a getting started guide for writing docs, before
> complaining about too few contributions. I'll try to draft something up
> this week.
>
> What people are probably not aware of is how easy it is to contribute
> docs through github. Just clone our repo, create a document and add your
> content. It's all possible through the github web UI including
> reStructuredText support for the viewer/editor. I'd even say to lower
> the barrier for contributing docs even further by accepting pull
> requests for them, so we can have a fully github based workflow for
> casual contributors.
>
>
> On 03/13/2017 05:55 PM, Jonathan Haddad wrote:
> > Ugh... Let's put a few facts out in the open before we start pushing to
> > move back to the wiki.
> >
> > First off, take a look at CASSANDRA-8700.  There's plenty of reasoning
> for
> > why the docs are now located in tree.  The TL;DR is:
> >
> > 1. Nobody used the wiki.  Like, ever.  A handful of edits per year.
> > 2. Docs in the wiki were out of sync w/ cassandra.  Trying to outline the
> > difference in implementations w/ nuanced behavior was difficult /
> > impossible.  With in-tree, you just check the docs that come w/ the
> version
> > you installed.  And you get them locally.  Huzzah!
> > 3. The in-tree docs are a million times better quality than the wiki
> *ever*
> > was.
> >
> > I urge you to try giving the in-tree docs a chance.  It may not be the
> way
> > *you* want it but I have to point out that they're the best we've seen in
> > Cassandra world.  Making them prettier won't help anything.
> >
> > I do agree that the process needs to be a bit smoother for people to add
> > stuff to the in tree side.  For instance, maybe for every features that's
> > written we start creating a corresponding JIRA for the documentation.
> Not
> > every developer wants to write docs, and that's fair.  The accompanying
> > JIRA would serve as a way for 2 or more people to collaborate on the
> > feature & the docs in tandem.  It may also be beneficial to use the
> dev-ml
> > to say "hey, i'm working on feature X, anyone want to help me write the
> > docs for it?  check out CASSANDRA-XYZ"
> >
> > Part of CASSANDRA-8700 was to shut down the wiki.  I still advocate for
> > this. At the very minimum we should make it read only with a big notice
> > that points people to the in-tree docs.
> >
> > On Mon, Mar 13, 2017 at 8:49 AM Jeremy Hanna  >
> > wrote:
> >
> >> The moinmoin wiki was preferred but because of spam, images couldn’t be
> >> attached.  The options were to use confluence or have a moderated list
> of
> >> individuals be approved to update the wiki.  The decision was made to go
> >> with the latter because of the preference to stick with moinmoin rather
> >> than confluence.  That’s my understanding of the history there.  I don’t
> >> know if people would like to revisit using one or the other at this
> point,
> >> though it would take a bit of work to convert.
> >>
> >>> On Mar 13, 2017, at 9:42 AM, Nate McCall  wrote:
> >>>
>  Isn't there a way to split tech docs (aka reference) and more
>  user-generated and use-case related/content oriented docs? And maybe
> to
> >> use
>  a more modern WIKI software or scheme. The CS wiki looks like 1998.
> >>> The wiki is what ASF Infra provides by default. Agree that it is a bit
> >>> "old-school."
> >>>
> >>> I'll ask around about what other projects are doing (or folks who are
> >>> involved in other ASF projects, please chime in).
> >>
>
>


Re: Contribute to the Cassandra wiki

2017-03-13 Thread benjamin roth
First: I am positively surprised how many people would like to contribute to
docs.

Some days ago I posted to the dev list about doc contributions. I think this
applies here again. From my point of view, "in-tree docs" are a good choice
for technical references that go closely with the code versioning.
But for content-oriented docs like tutorials, FAQs or a knowledge base, I
think this is not a good place, especially if the doc contributors are not
that deeply involved in dev/code.
For that purpose, Stefan Podkowinski created a repo for collaboration that
"proxies" access to the CS repo. That's a nice gesture, but IMHO it can
only work as an intermediate solution. "User docs" do not require a CI or a
complex build + publishing process. They require a simple and "beautiful"
way to contribute, especially if you wish to encourage more "outside" users
to contribute.

Isn't there a way to split tech docs (aka reference) and more
user-generated, use-case-related/content-oriented docs? And maybe to use
a more modern wiki software or scheme. The CS wiki looks like 1998.


2017-03-12 23:26 GMT+01:00 Jeff Jirsa :

> We're trying to use the in-tree docs. Those are preferred, updating the
> wiki is OK, but the wiki is VERY out of date.
>
> --
> Jeff Jirsa
>
>
> > On Mar 12, 2017, at 3:21 PM, Long Quanzheng  wrote:
> >
> > Is the wiki still being used?
> > https://wiki.apache.org/cassandra
> > says:
> > Cassandra is moving away from this wiki for user-facing documentation in
> > favor of in-tree docs, linked below. (Pull requests welcome
> > !)
> >
> >
> > 2017-03-12 14:21 GMT-07:00 Brandon Williams :
> >
> >> I've added you.
> >>
> >> On Sun, Mar 12, 2017 at 1:43 PM, ThisHosting.Rocks! <
> >> contact@thishosting.rocks> wrote:
> >>
> >>> Hi,
> >>>
> >>>
> >>> My username is NickReiner and I'd like to contribute to the Cassandra
> >> wiki.
> >>>
> >>> Please. :)
> >>>
> >>> Nick Reiner
> >>> THR Support.
> >>>
> >>
>


Re: State of triggers

2017-03-05 Thread benjamin roth
There is a German saying:

Sometimes you can't see the forest for all the trees.

On 05.03.2017 09:25, "DuyHai Doan" <doanduy...@gmail.com> wrote:

> No problem, distributed systems are hard to reason about, I got caught many
> times in the past
>
> On Sun, Mar 5, 2017 at 9:23 AM, benjamin roth <brs...@gmail.com> wrote:
>
> > Sorry. Answer was to fast. Maybe you are right.
> >
> > Am 05.03.2017 09:21 schrieb "benjamin roth" <brs...@gmail.com>:
> >
> > > No. You just change the partitioner. That's all
> > >
> > > Am 05.03.2017 09:15 schrieb "DuyHai Doan" <doanduy...@gmail.com>:
> > >
> > >> "How can that be achieved? I haven't done "scientific researches" yet
> > but
> > >> I
> > >> guess a "MV partitioner" could do the trick. Instead of applying the
> > >> regular partitioner, an MV partitioner would calculate the PK of the
> > base
> > >> table (which is always possible) and then apply the regular
> > partitioner."
> > >>
> > >> The main purpose of MV is to avoid the drawbacks of 2nd index
> > >> architecture,
> > >> e.g. to scan a lot of nodes to fetch the results.
> > >>
> > >> With MV, since you give the partition key, the guarantee is that
> you'll
> > >> hit
> > >> a single node.
> > >>
> > >> Now if you put MV data on the same node as base table data, you're
> doing
> > >> more-or-less the same thing as 2nd index.
> > >>
> > >> Let's take a dead simple example
> > >>
> > >> CREATE TABLE user (user_id uuid PRIMARY KEY, email text);
> > >> CREATE MV user_by_email AS SELECT * FROM user WHERE user_id IS NOT
> NULL
> > >> AND
> > >> email IS NOT NULL PRIMARY KEY((email),user_id);
> > >>
> > >> SELECT * FROM user_by_email WHERE email = xxx;
> > >>
> > >> With this query, how can you find the user_id that corresponds to
> email
> > >> 'xxx' so that your MV partitioner idea can work ?
> > >>
> > >>
> > >>
> > >> On Sun, Mar 5, 2017 at 9:05 AM, benjamin roth <brs...@gmail.com>
> wrote:
> > >>
> > >> > While I was reading the MV paragraph in your post, an idea popped
> up:
> > >> >
> > >> > The problem with MV inconsistencies and inconsistent range movement
> is
> > >> that
> > >> > the "MV contract" is broken. This only happens because base data and
> > >> > replica data reside on different hosts. If base data + replicas
> would
> > >> stay
> > >> > on the same host then a rebuild/remove would always stream both
> > matching
> > >> > parts of a base table + mv.
> > >> >
> > >> > So my idea:
> > >> > Why not make a replica ALWAYS stay local regardless where the token
> of
> > >> a MV
> > >> > would point at. That would solve these problems:
> > >> > 1. Rebuild / remove node would not break MV contract
> > >> > 2. A write always stays local:
> > >> >
> > >> > a) That means replication happens sync. That means a quorum write to
> > the
> > >> > base table guarantees instant data availability with quorum read on
> a
> > >> view
> > >> >
> > >> > b) It saves network roundtrips + request/response handling and helps
> > to
> > >> > keep a cluster healthier in case of bulk operations (like repair
> > >> streams or
> > >> > rebuild stream). Write load stays local and is not spread across the
> > >> whole
> > >> > cluster. I think it makes the load in these situations more
> > predictable.
> > >> >
> > >> > How can that be achieved? I haven't done "scientific researches" yet
> > >> but I
> > >> > guess a "MV partitioner" could do the trick. Instead of applying the
> > >> > regular partitioner, an MV partitioner would calculate the PK of the
> > >> base
> > >> > table (which is always possible) and then apply the regular
> > partitioner.
> > >> >
> > >> > I'll create a proper Jira for it on monday. Currently it's sunday
> here
> > >> and
> > >> > my family wants me back so just a few thoughts on this right now.

Re: State of triggers

2017-03-05 Thread benjamin roth
Not maybe. You are absolutely right. Bad idea. Hmpf.

On 05.03.2017 09:23, "benjamin roth" <brs...@gmail.com> wrote:

> Sorry. Answer was to fast. Maybe you are right.
>
> Am 05.03.2017 09:21 schrieb "benjamin roth" <brs...@gmail.com>:
>
>> No. You just change the partitioner. That's all
>>
>> Am 05.03.2017 09:15 schrieb "DuyHai Doan" <doanduy...@gmail.com>:
>>
>>> "How can that be achieved? I haven't done "scientific researches" yet
>>> but I
>>> guess a "MV partitioner" could do the trick. Instead of applying the
>>> regular partitioner, an MV partitioner would calculate the PK of the base
>>> table (which is always possible) and then apply the regular partitioner."
>>>
>>> The main purpose of MV is to avoid the drawbacks of 2nd index
>>> architecture,
>>> e.g. to scan a lot of nodes to fetch the results.
>>>
>>> With MV, since you give the partition key, the guarantee is that you'll
>>> hit
>>> a single node.
>>>
>>> Now if you put MV data on the same node as base table data, you're doing
>>> more-or-less the same thing as 2nd index.
>>>
>>> Let's take a dead simple example
>>>
>>> CREATE TABLE user (user_id uuid PRIMARY KEY, email text);
>>> CREATE MV user_by_email AS SELECT * FROM user WHERE user_id IS NOT NULL
>>> AND
>>> email IS NOT NULL PRIMARY KEY((email),user_id);
>>>
>>> SELECT * FROM user_by_email WHERE email = xxx;
>>>
>>> With this query, how can you find the user_id that corresponds to email
>>> 'xxx' so that your MV partitioner idea can work ?
>>>
>>>
>>>
>>> On Sun, Mar 5, 2017 at 9:05 AM, benjamin roth <brs...@gmail.com> wrote:
>>>
>>> > While I was reading the MV paragraph in your post, an idea popped up:
>>> >
>>> > The problem with MV inconsistencies and inconsistent range movement is
>>> that
>>> > the "MV contract" is broken. This only happens because base data and
>>> > replica data reside on different hosts. If base data + replicas would
>>> stay
>>> > on the same host then a rebuild/remove would always stream both
>>> matching
>>> > parts of a base table + mv.
>>> >
>>> > So my idea:
>>> > Why not make a replica ALWAYS stay local regardless where the token of
>>> a MV
>>> > would point at. That would solve these problems:
>>> > 1. Rebuild / remove node would not break MV contract
>>> > 2. A write always stays local:
>>> >
>>> > a) That means replication happens sync. That means a quorum write to
>>> the
>>> > base table guarantees instant data availability with quorum read on a
>>> view
>>> >
>>> > b) It saves network roundtrips + request/response handling and helps to
>>> > keep a cluster healthier in case of bulk operations (like repair
>>> streams or
>>> > rebuild stream). Write load stays local and is not spread across the
>>> whole
>>> > cluster. I think it makes the load in these situations more
>>> predictable.
>>> >
>>> > How can that be achieved? I haven't done "scientific researches" yet
>>> but I
>>> > guess a "MV partitioner" could do the trick. Instead of applying the
>>> > regular partitioner, an MV partitioner would calculate the PK of the
>>> base
>>> > table (which is always possible) and then apply the regular
>>> partitioner.
>>> >
>>> > I'll create a proper Jira for it on monday. Currently it's sunday here
>>> and
>>> > my family wants me back so just a few thoughts on this right now.
>>> >
>>> > Any feedback is appreciated!
>>> >
>>> > 2017-03-05 6:34 GMT+01:00 Edward Capriolo <edlinuxg...@gmail.com>:
>>> >
>>> > > On Sat, Mar 4, 2017 at 10:26 AM, Jeff Jirsa <jji...@gmail.com>
>>> wrote:
>>> > >
>>> > > >
>>> > > >
>>> > > >
>>> > > > > On Mar 4, 2017, at 7:06 AM, Edward Capriolo <
>>> edlinuxg...@gmail.com>
>>> > > > wrote:
>>> > > > >
>>> > > > >> On Fri, Mar 3, 2017 at 12:04 PM, Jeff Jirsa <jji...@gmail.com>
>>> > wrote:
>>> > > > >>
>>> > > 

Re: State of triggers

2017-03-05 Thread benjamin roth
Sorry. Answer was too fast. Maybe you are right.

On 05.03.2017 09:21, "benjamin roth" <brs...@gmail.com> wrote:

> No. You just change the partitioner. That's all
>
> Am 05.03.2017 09:15 schrieb "DuyHai Doan" <doanduy...@gmail.com>:
>
>> "How can that be achieved? I haven't done "scientific researches" yet but
>> I
>> guess a "MV partitioner" could do the trick. Instead of applying the
>> regular partitioner, an MV partitioner would calculate the PK of the base
>> table (which is always possible) and then apply the regular partitioner."
>>
>> The main purpose of MV is to avoid the drawbacks of 2nd index
>> architecture,
>> e.g. to scan a lot of nodes to fetch the results.
>>
>> With MV, since you give the partition key, the guarantee is that you'll
>> hit
>> a single node.
>>
>> Now if you put MV data on the same node as base table data, you're doing
>> more-or-less the same thing as 2nd index.
>>
>> Let's take a dead simple example
>>
>> CREATE TABLE user (user_id uuid PRIMARY KEY, email text);
>> CREATE MV user_by_email AS SELECT * FROM user WHERE user_id IS NOT NULL
>> AND
>> email IS NOT NULL PRIMARY KEY((email),user_id);
>>
>> SELECT * FROM user_by_email WHERE email = xxx;
>>
>> With this query, how can you find the user_id that corresponds to email
>> 'xxx' so that your MV partitioner idea can work ?
>>
>>
>>
>> On Sun, Mar 5, 2017 at 9:05 AM, benjamin roth <brs...@gmail.com> wrote:
>>
>> > While I was reading the MV paragraph in your post, an idea popped up:
>> >
>> > The problem with MV inconsistencies and inconsistent range movement is
>> that
>> > the "MV contract" is broken. This only happens because base data and
>> > replica data reside on different hosts. If base data + replicas would
>> stay
>> > on the same host then a rebuild/remove would always stream both matching
>> > parts of a base table + mv.
>> >
>> > So my idea:
>> > Why not make a replica ALWAYS stay local regardless where the token of
>> a MV
>> > would point at. That would solve these problems:
>> > 1. Rebuild / remove node would not break MV contract
>> > 2. A write always stays local:
>> >
>> > a) That means replication happens sync. That means a quorum write to the
>> > base table guarantees instant data availability with quorum read on a
>> view
>> >
>> > b) It saves network roundtrips + request/response handling and helps to
>> > keep a cluster healthier in case of bulk operations (like repair
>> streams or
>> > rebuild stream). Write load stays local and is not spread across the
>> whole
>> > cluster. I think it makes the load in these situations more predictable.
>> >
>> > How can that be achieved? I haven't done "scientific researches" yet
>> but I
>> > guess a "MV partitioner" could do the trick. Instead of applying the
>> > regular partitioner, an MV partitioner would calculate the PK of the
>> base
>> > table (which is always possible) and then apply the regular partitioner.
>> >
>> > I'll create a proper Jira for it on monday. Currently it's sunday here
>> and
>> > my family wants me back so just a few thoughts on this right now.
>> >
>> > Any feedback is appreciated!
>> >
>> > 2017-03-05 6:34 GMT+01:00 Edward Capriolo <edlinuxg...@gmail.com>:
>> >
>> > > On Sat, Mar 4, 2017 at 10:26 AM, Jeff Jirsa <jji...@gmail.com> wrote:
>> > >
>> > > >
>> > > >
>> > > >
>> > > > > On Mar 4, 2017, at 7:06 AM, Edward Capriolo <
>> edlinuxg...@gmail.com>
>> > > > wrote:
>> > > > >
>> > > > >> On Fri, Mar 3, 2017 at 12:04 PM, Jeff Jirsa <jji...@gmail.com>
>> > wrote:
>> > > > >>
>> > > > >> On Fri, Mar 3, 2017 at 5:40 AM, Edward Capriolo <
>> > > edlinuxg...@gmail.com>
>> > > > >> wrote:
>> > > > >>
>> > > > >>>
>> > > > >>> I used them. I built do it yourself secondary indexes with them.
>> > They
>> > > > >> have
>> > > > >>> there gotchas, but so do all the secondary index
>> implementations.
>> > > Just
>> > > > >>> because datastax does not write a

Re: State of triggers

2017-03-05 Thread benjamin roth
No. You just change the partitioner. That's all

On 05.03.2017 09:15, "DuyHai Doan" <doanduy...@gmail.com> wrote:

> "How can that be achieved? I haven't done "scientific researches" yet but I
> guess a "MV partitioner" could do the trick. Instead of applying the
> regular partitioner, an MV partitioner would calculate the PK of the base
> table (which is always possible) and then apply the regular partitioner."
>
> The main purpose of MV is to avoid the drawbacks of 2nd index architecture,
> e.g. to scan a lot of nodes to fetch the results.
>
> With MV, since you give the partition key, the guarantee is that you'll hit
> a single node.
>
> Now if you put MV data on the same node as base table data, you're doing
> more-or-less the same thing as 2nd index.
>
> Let's take a dead simple example
>
> CREATE TABLE user (user_id uuid PRIMARY KEY, email text);
> CREATE MV user_by_email AS SELECT * FROM user WHERE user_id IS NOT NULL AND
> email IS NOT NULL PRIMARY KEY((email),user_id);
>
> SELECT * FROM user_by_email WHERE email = xxx;
>
> With this query, how can you find the user_id that corresponds to email
> 'xxx' so that your MV partitioner idea can work ?
>
>
>
> On Sun, Mar 5, 2017 at 9:05 AM, benjamin roth <brs...@gmail.com> wrote:
>
> > While I was reading the MV paragraph in your post, an idea popped up:
> >
> > The problem with MV inconsistencies and inconsistent range movement is
> that
> > the "MV contract" is broken. This only happens because base data and
> > replica data reside on different hosts. If base data + replicas would
> stay
> > on the same host then a rebuild/remove would always stream both matching
> > parts of a base table + mv.
> >
> > So my idea:
> > Why not make a replica ALWAYS stay local regardless where the token of a
> MV
> > would point at. That would solve these problems:
> > 1. Rebuild / remove node would not break MV contract
> > 2. A write always stays local:
> >
> > a) That means replication happens sync. That means a quorum write to the
> > base table guarantees instant data availability with quorum read on a
> view
> >
> > b) It saves network roundtrips + request/response handling and helps to
> > keep a cluster healthier in case of bulk operations (like repair streams
> or
> > rebuild stream). Write load stays local and is not spread across the
> whole
> > cluster. I think it makes the load in these situations more predictable.
> >
> > How can that be achieved? I haven't done "scientific researches" yet but
> I
> > guess a "MV partitioner" could do the trick. Instead of applying the
> > regular partitioner, an MV partitioner would calculate the PK of the base
> > table (which is always possible) and then apply the regular partitioner.
> >
> > I'll create a proper Jira for it on monday. Currently it's sunday here
> and
> > my family wants me back so just a few thoughts on this right now.
> >
> > Any feedback is appreciated!
> >
> > 2017-03-05 6:34 GMT+01:00 Edward Capriolo <edlinuxg...@gmail.com>:
> >
> > > On Sat, Mar 4, 2017 at 10:26 AM, Jeff Jirsa <jji...@gmail.com> wrote:
> > >
> > > >
> > > >
> > > >
> > > > > On Mar 4, 2017, at 7:06 AM, Edward Capriolo <edlinuxg...@gmail.com
> >
> > > > wrote:
> > > > >
> > > > >> On Fri, Mar 3, 2017 at 12:04 PM, Jeff Jirsa <jji...@gmail.com>
> > wrote:
> > > > >>
> > > > >> On Fri, Mar 3, 2017 at 5:40 AM, Edward Capriolo <
> > > edlinuxg...@gmail.com>
> > > > >> wrote:
> > > > >>
> > > > >>>
> > > > >>> I used them. I built do it yourself secondary indexes with them.
> > They
> > > > >> have
> > > > >>> there gotchas, but so do all the secondary index implementations.
> > > Just
> > > > >>> because datastax does not write about something. Lets see like 5
> > > years
> > > > >> ago
> > > > >>> there was this: https://github.com/hmsonline/cassandra-triggers
> > > > >>>
> > > > >>>
> > > > >> Still in use? How'd it work? Production ready? Would you still do
> it
> > > > that
> > > > >> way in 2017?
> > > > >>
> > > > >>
> > > > >>> There is a fairly large divergence to what actual users do and
> wha

Re: State of triggers

2017-03-05 Thread benjamin roth
While I was reading the MV paragraph in your post, an idea popped up:

The problem with MV inconsistencies and inconsistent range movements is that
the "MV contract" is broken. This only happens because base data and
replica data reside on different hosts. If base data + replicas stayed
on the same host, then a rebuild/remove would always stream both matching
parts of a base table + MV.

So my idea:
Why not make a replica ALWAYS stay local, regardless of where the token of an
MV would point? That would solve these problems:
1. Rebuild / remove node would not break the MV contract
2. A write always stays local:

a) That means replication happens synchronously, so a quorum write to the
base table guarantees instant data availability with a quorum read on a view

b) It saves network round trips + request/response handling and helps to
keep a cluster healthier in case of bulk operations (like repair streams or
rebuild streams). Write load stays local and is not spread across the whole
cluster. I think it makes the load in these situations more predictable.

How can that be achieved? I haven't done "scientific research" yet, but I
guess an "MV partitioner" could do the trick. Instead of applying the
regular partitioner, an MV partitioner would calculate the PK of the base
table (which is always possible) and then apply the regular partitioner.

I'll create a proper JIRA for it on Monday. Currently it's Sunday here and
my family wants me back, so just a few thoughts on this right now.

Any feedback is appreciated!

2017-03-05 6:34 GMT+01:00 Edward Capriolo :

> On Sat, Mar 4, 2017 at 10:26 AM, Jeff Jirsa  wrote:
>
> >
> >
> >
> > > On Mar 4, 2017, at 7:06 AM, Edward Capriolo 
> > wrote:
> > >
> > >> On Fri, Mar 3, 2017 at 12:04 PM, Jeff Jirsa  wrote:
> > >>
> > >> On Fri, Mar 3, 2017 at 5:40 AM, Edward Capriolo <
> edlinuxg...@gmail.com>
> > >> wrote:
> > >>
> > >>>
> > >>> I used them. I built do it yourself secondary indexes with them. They
> > >> have
> > >>> there gotchas, but so do all the secondary index implementations.
> Just
> > >>> because datastax does not write about something. Lets see like 5
> years
> > >> ago
> > >>> there was this: https://github.com/hmsonline/cassandra-triggers
> > >>>
> > >>>
> > >> Still in use? How'd it work? Production ready? Would you still do it
> > that
> > >> way in 2017?
> > >>
> > >>
> > >>> There is a fairly large divergence to what actual users do and what
> > other
> > >>> groups 'say' actual users do in some cases.
> > >>>
> > >>
> > >> A lot of people don't share what they're doing (for business reasons,
> or
> > >> because they don't think it's important, or because they don't know
> > >> how/where), and that's fine but it makes it hard for anyone to know
> what
> > >> features are used, or how well they're really working in production.
> > >>
> > >> I've seen a handful of "how do we use triggers" questions in IRC, and
> > they
> > >> weren't unreasonable questions, but seemed like a lot of pain, and
> more
> > >> than one of those people ultimately came back and said they used some
> > other
> > >> mechanism (and of course, some of them silently disappear, so we have
> no
> > >> idea if it worked or not).
> > >>
> > >> If anyone's actively using triggers, please don't keep it a secret.
> > Knowing
> > >> that they're being used would be a great way to justify continuing to
> > >> maintain them.
> > >>
> > >> - Jeff
> > >>
> > >
> > > "Still in use? How'd it work? Production ready? Would you still do it
> > that way in 2017?"
> > >
> > > I mean that is a loaded question. How long has cassandra had Secondary
> > > Indexes? Did they work well? Would you use them? How many times were
> > they re-written?
> >
> > It wasn't really meant to be a loaded question; I was being sincere
> >
> > But I'll answer: secondary indexes suck for many use cases, but they're
> > invaluable for their actual intended purpose, and I have no idea how many
> > times they've been rewritten but they're production ready for their
> narrow
> > use case (defined by cardinality).
> >
> > Is there a real triggers use case still? Alternative to MVs? Alternative
> > to CDC? I've never implemented triggers - since you have, what's the
> level
> > of surprise for the developer?
>
>
> :) You mention alternatives/: Lets break them down.
>
> MV:
> They seem to have a lot pf promise. IE you can use them for things other
> then equality searches, and I do think the CQL example with the top N high
> scores is pretty useful. Then again our buddy Mr Roth has a thread named
> "Rebuild / remove node with MV is inconsistent". I actually think a lot of
> the use case for mv falls into the category of "something you should
> actually be doing with storm". I can vibe with the concept of not needing a
> streaming platform, but i KNOW storm would do this correctly. I don't want
> to land on something like 2x index v1 v2 where there was 

Consistent vs inconsistent range movements

2017-03-03 Thread benjamin roth
Hi,

Can anyone explain the difference between consistent and inconsistent range
movements?
What exactly makes them consistent or inconsistent?
In what situations can each of them occur?

It would be great to get a correct and deep understanding of that for
further MV improvements. My intuition tells me that a rebuild / removenode
can break MV consistency, but to prove it I need more information.
I am also happy to get code references - it's just very tedious to read
through all the code to get an overview of all that without some prose
information.

Thanks in advance


Re: Lots of dtest errors on local machine

2017-03-02 Thread benjamin roth
Sorry - it was my fault. I introduced a bug and was blinded by the tons of
debug output of dtests.

2017-03-01 17:20 GMT+01:00 benjamin roth <brs...@gmail.com>:

> Hi again,
>
> I wanted to run some dtests (e.g. from materialized_views_test.py) to
> check my changes. A while ago, everything worked fine but today I ran into
> a lot of errors like this:
> https://gist.github.com/brstgt/114d76769d97dc72059f9252330c4142
>
> This happened on 2 different machines (macos + linux).
> CS Version: 4.0 (trunk of today)
> CCM: 2.6.0
>
> I deleted and reinstalled all python deps.
> I also checked the ccm logs (as long as I was able to access them, because
> dtests delete them after the test)
> In the attached log, 127.0.0.2 caused. In ccm logs I saw, that the node
> was stopped by the test, started again and seemed to boot up again
> correctly.
>
> Running some of these tests against 3.11 worked.
> Switching back to trunk/4.0 - ERROR
>
> Is this a known issue - maybe caused by removed RPC support? Am I maybe
> doing sth wrong?
>


Lots of dtest errors on local machine

2017-03-01 Thread benjamin roth
Hi again,

I wanted to run some dtests (e.g. from materialized_views_test.py) to check
my changes. A while ago everything worked fine, but today I ran into a lot
of errors like this:
https://gist.github.com/brstgt/114d76769d97dc72059f9252330c4142

This happened on 2 different machines (macOS + Linux).
CS version: 4.0 (today's trunk)
CCM: 2.6.0

I deleted and reinstalled all Python deps.
I also checked the ccm logs (as long as I was able to access them, because
the dtests delete them after the test).
In the attached log, 127.0.0.2 caused the errors. In the ccm logs I saw that
the node was stopped by the test, started again and seemed to boot up
correctly.

Running some of these tests against 3.11 worked.
Switching back to trunk/4.0 - ERROR.

Is this a known issue - maybe caused by the removed RPC support? Am I maybe
doing something wrong?


Need feedback on CASSANDRA-13066

2017-03-01 Thread benjamin roth
Hi guys,

I started working on CASSANDRA-13066. My intention is to offer a table
setting that allows an operator to optimize MV streaming in some cases, or
simply say "on purpose - I know what I do".

MV write path streaming can be omitted, e.g. if:
- the data is append-only
- no PK column is added to the MV, so no stale data can be created by race
conditions

This is a first patch:
https://github.com/Jaumo/cassandra/commit/0d4ce966f129e1b29098f194b5951a86dc8c585a

Please don't consider it final. Some tests and some logic are still missing.

When introducing a table option, which would be preferable:
- mv_fast_stream: does what it says, maybe even with a more verbose name?
- append_only: tells how the data is filled. This could also be a hint for
future optimizations like CASSANDRA-9779, but would not allow me to just
tell CS to do that kind of streaming no matter how I treat my data

Also still to be considered in this ticket:
- With "fast streaming", MVs MUST be repaired separately and explicitly
- With "write path repairs", MVs MUST NOT be included in KS repairs. Not
only is this unnecessary repair work - it could (or probably will)
break the local consistency between base table and MV.
- Manual repair of views that are normally repaired through the write path
of the base table should at least log a warning like "Manually repairing a
materialized view may lead to inconsistencies"

I'd really love to get some feedback before putting more effort in.
Thanks!


Re: Pluggable throttling of read and write queries

2017-02-20 Thread Benjamin Roth
Thanks.

Depending on the whole infrastructure and business requirements, isn't it
easier to implement throttling on the client side?
I did this once to throttle bulk inserts when migrating whole CFs from other
DBs.
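
A minimal sketch of what I mean by client-side throttling (keyspace, table
and the rate are made up; it assumes the DataStax Java driver 3.x and
Guava's RateLimiter, which most driver users have on the classpath anyway):

import com.datastax.driver.core.*;
import com.google.common.util.concurrent.RateLimiter;

public class ThrottledBulkInsert {
    public static void main(String[] args) throws Exception {
        RateLimiter limiter = RateLimiter.create(500.0);  // cap the migration at ~500 writes/second
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("my_ks")) { // hypothetical keyspace/table
            PreparedStatement insert =
                    session.prepare("INSERT INTO my_cf (pk, value) VALUES (?, ?)");
            for (int i = 0; i < 1_000_000; i++) {
                limiter.acquire();  // blocks until the rate limiter hands out a permit
                session.execute(insert.bind("key-" + i, "value-" + i));
            }
        }
    }
}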

2017-02-21 7:43 GMT+01:00 Jeff Jirsa <jji...@apache.org>:

>
>
> On 2017-02-20 21:35 (-0800), Benjamin Roth <benjamin.r...@jaumo.com>
> wrote:
> > Stupid question:
> > Why do you rate limit a database, especially writes. Wouldn't that cause
> a
> > lot of new issues like back pressure on the rest of your system or
> timeouts
> > in case of blocking requests?
> > Also rate limiting has to be based on per coordinator calculations and
> not
> > cluster wide. It reminds me on hinted handoff throttling.
> >
>
> If you're sharing one cluster with 10 (or 20, or 100) applications,
> breaking one application may be better than slowing down 10/20/100. In many
> cases, workloads can be throttled and still meet business goals - nightly
> analytics jobs, for example, may be fine running over the course of 3 hours
> instead of 15 minutes, especially if the slightly-higher-response-latency
> over 3 hours is better than much-worse-response-latency for that 15 minute
> window.
>



-- 
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer


MV improvements

2017-02-15 Thread Benjamin Roth
Hi there,

I'd like to start working on some MV improvements the next days. I created
several tickets for that some weeks ago.

Which is the right branch to start working on? The tick-tock discussion
about different branches lately was a bit confusing.

Second question:
Is there anybody out there who would like to assist me or would like to
work together on that?
I have several ideas for improvements and I'd love to work on them but I
know only a small part of the whole code base and also have little
experience with the history of the code base. If not, I will start alone
but I'd appreciate any kind of support.

Thanks folks!

-- 
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer


Re: If reading from materialized view with a consistency level of quorum am I guaranteed to have the most recent view?

2017-02-11 Thread Benjamin Roth
For MVs, regarding this thread's question, only the partition key matters.
Different primary keys can have the same partition key, which is the case
in the example in your last comment.

Am 10.02.2017 20:26 schrieb "Kant Kodali" <k...@peernova.com>:

@Benjamin Roth: How do you say something is a different PRIMARY KEY now?
looks like you are saying

The below is same partition key and same primary key?

PRIMARY KEY ((a, b), c, d) and
PRIMARY KEY ((a, b), d, c)

@Russell Great to see you here! As always that is spot on!

On Fri, Feb 10, 2017 at 11:13 AM, Benjamin Roth <benjamin.r...@jaumo.com>
wrote:

> Thanks a lot for that post. If I read the code right, then there is one
> case missing in your post.
> According to StorageProxy.mutateMV, local updates are NOT put into a batch
> and are instantly applied locally. So a batch is only created if remote
> mutations have to be applied and only for those mutations.
>
> 2017-02-10 19:58 GMT+01:00 DuyHai Doan <doanduy...@gmail.com>:
>
> > See my blog post to understand how MV is implemented:
> > http://www.doanduyhai.com/blog/?p=1930
> >
> > On Fri, Feb 10, 2017 at 7:48 PM, Benjamin Roth <benjamin.r...@jaumo.com>
> > wrote:
> >
> > > Same partition key:
> > >
> > > PRIMARY KEY ((a, b), c, d) and
> > > PRIMARY KEY ((a, b), d, c)
> > >
> > > PRIMARY KEY ((a), b, c) and
> > > PRIMARY KEY ((a), c, b)
> > >
> > > Different partition key:
> > >
> > > PRIMARY KEY ((a, b), c, d) and
> > > PRIMARY KEY ((a), b, d, c)
> > >
> > > PRIMARY KEY ((a), b) and
> > > PRIMARY KEY ((b), a)
> > >
> > >
> > > 2017-02-10 19:46 GMT+01:00 Kant Kodali <k...@peernova.com>:
> > >
> > > > Okies now I understand what you mean by "same" partition key.  I
> think
> > > you
> > > > are saying
> > > >
> > > > PRIMARY KEY(col1, col2, col3) == PRIMARY KEY(col2, col1, col3) // so
> > far
> > > I
> > > > assumed they are different partition keys.
> > > >
> > > > On Fri, Feb 10, 2017 at 10:36 AM, Benjamin Roth <
> > benjamin.r...@jaumo.com
> > > >
> > > > wrote:
> > > >
> > > > > There are use cases where the partition key is the same. For
> example
> > if
> > > > you
> > > > > need a sorting within a partition or a filtering different from
the
> > > > > original clustering keys.
> > > > > We actually use this for some MVs.
> > > > >
> > > > > If you want "dumb" denormalization with simple append only cases
> (or
> > > more
> > > > > general cases that don't require a read before write on update)
you
> > are
> > > > > maybe better off with batched denormalized atomics writes.
> > > > >
> > > > > The main benefit of MVs is if you need denormalization to sort or
> > > filter
> > > > by
> > > > > a non-primary key field.
> > > > >
> > > > > 2017-02-10 19:31 GMT+01:00 Kant Kodali <k...@peernova.com>:
> > > > >
> > > > > > yes thanks for the clarification.  But why would I ever have MV
> > with
> > > > the
> > > > > > same partition key? if it is the same partition key I could just
> > read
> > > > > from
> > > > > > the base table right? our MV Partition key contains the columns
> > from
> > > > the
> > > > > > base table partition key but in a different order plus an
> > additional
> > > > > column
> > > > > > (which is allowed as of today)
> > > > > >
> > > > > > On Fri, Feb 10, 2017 at 10:23 AM, Benjamin Roth <
> > > > benjamin.r...@jaumo.com
> > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > It depends on your model.
> > > > > > > If the base table + MV have the same partition key, then the
MV
> > > > > mutations
> > > > > > > are applied synchronously, so they are written as soon the
> write
> > > > > request
> > > > > > > returns.
> > > > > > > => In this case you can rely on the R+F > RF
> > > > > > >
> > > > > > > If the partition key of the MV is different, the partition of
> the
> > > MV
> > > > is

Re: If reading from materialized view with a consistency level of quorum am I guaranteed to have the most recent view?

2017-02-10 Thread Benjamin Roth
Thanks a lot for that post. If I read the code right, then there is one
case missing in your post.
According to StorageProxy.mutateMV, local updates are NOT put into a batch
and are instantly applied locally. So a batch is only created if remote
mutations have to be applied and only for those mutations.

2017-02-10 19:58 GMT+01:00 DuyHai Doan <doanduy...@gmail.com>:

> See my blog post to understand how MV is implemented:
> http://www.doanduyhai.com/blog/?p=1930
>
> On Fri, Feb 10, 2017 at 7:48 PM, Benjamin Roth <benjamin.r...@jaumo.com>
> wrote:
>
> > Same partition key:
> >
> > PRIMARY KEY ((a, b), c, d) and
> > PRIMARY KEY ((a, b), d, c)
> >
> > PRIMARY KEY ((a), b, c) and
> > PRIMARY KEY ((a), c, b)
> >
> > Different partition key:
> >
> > PRIMARY KEY ((a, b), c, d) and
> > PRIMARY KEY ((a), b, d, c)
> >
> > PRIMARY KEY ((a), b) and
> > PRIMARY KEY ((b), a)
> >
> >
> > 2017-02-10 19:46 GMT+01:00 Kant Kodali <k...@peernova.com>:
> >
> > > Okies now I understand what you mean by "same" partition key.  I think
> > you
> > > are saying
> > >
> > > PRIMARY KEY(col1, col2, col3) == PRIMARY KEY(col2, col1, col3) // so
> far
> > I
> > > assumed they are different partition keys.
> > >
> > > On Fri, Feb 10, 2017 at 10:36 AM, Benjamin Roth <
> benjamin.r...@jaumo.com
> > >
> > > wrote:
> > >
> > > > There are use cases where the partition key is the same. For example
> if
> > > you
> > > > need a sorting within a partition or a filtering different from the
> > > > original clustering keys.
> > > > We actually use this for some MVs.
> > > >
> > > > If you want "dumb" denormalization with simple append only cases (or
> > more
> > > > general cases that don't require a read before write on update) you
> are
> > > > maybe better off with batched denormalized atomics writes.
> > > >
> > > > The main benefit of MVs is if you need denormalization to sort or
> > filter
> > > by
> > > > a non-primary key field.
> > > >
> > > > 2017-02-10 19:31 GMT+01:00 Kant Kodali <k...@peernova.com>:
> > > >
> > > > > yes thanks for the clarification.  But why would I ever have MV
> with
> > > the
> > > > > same partition key? if it is the same partition key I could just
> read
> > > > from
> > > > > the base table right? our MV Partition key contains the columns
> from
> > > the
> > > > > base table partition key but in a different order plus an
> additional
> > > > column
> > > > > (which is allowed as of today)
> > > > >
> > > > > On Fri, Feb 10, 2017 at 10:23 AM, Benjamin Roth <
> > > benjamin.r...@jaumo.com
> > > > >
> > > > > wrote:
> > > > >
> > > > > > It depends on your model.
> > > > > > If the base table + MV have the same partition key, then the MV
> > > > mutations
> > > > > > are applied synchronously, so they are written as soon the write
> > > > request
> > > > > > returns.
> > > > > > => In this case you can rely on the R+F > RF
> > > > > >
> > > > > > If the partition key of the MV is different, the partition of the
> > MV
> > > is
> > > > > > probably placed on a different host (or said differently it
> cannot
> > be
> > > > > > guaranteed that it is on the same host). In this case, the MV
> > updates
> > > > are
> > > > > > executed async in a logged batch. So it can be guaranteed they
> will
> > > be
> > > > > > applied eventually but not at the time the write request returns.
> > > > > > => You cannot rely and there is no possibility to absolutely
> > > guarantee
> > > > > > anything, not matter what CL you choose. A MV update may always
> > > "arrive
> > > > > > late". I guess it has been implemented like this to not block in
> > case
> > > > of
> > > > > > remote request to prefer the cluster sanity over consistency.
> > > > > >
> > > > > > Is it now 100% clear?
> > > > > >
> > > > > > 2017-02-10 19:17 GMT+01:00 Kant Kodali <

Re: If reading from materialized view with a consistency level of quorum am I guaranteed to have the most recent view?

2017-02-10 Thread Benjamin Roth
No, your example has different PRIMARY keys but the same PARTITION key. The
same partition key generates the same token and will always go to the same host.
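
To illustrate, a minimal CQL sketch with two invented tables t1 and t2 that
only differ in clustering order - for equal (a, b) values the partitioner
returns the same token, so the rows end up on the same replicas:

CREATE TABLE t1 (a int, b int, c int, d int, PRIMARY KEY ((a, b), c, d));
CREATE TABLE t2 (a int, b int, c int, d int, PRIMARY KEY ((a, b), d, c));

-- Same partition key values => same token => same replica set:
SELECT token(a, b) FROM t1 WHERE a = 1 AND b = 2;
SELECT token(a, b) FROM t2 WHERE a = 1 AND b = 2;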

2017-02-10 19:58 GMT+01:00 Kant Kodali <k...@peernova.com>:

> In that case I can't even say same partition key == same row key
>
> The below would be different partition keys according to you right?
>
> PRIMARY KEY ((a, b), c, d) and
> PRIMARY KEY ((a, b), d, c, e)
>
> On Fri, Feb 10, 2017 at 10:48 AM, Benjamin Roth <benjamin.r...@jaumo.com>
> wrote:
>
> > Same partition key:
> >
> > PRIMARY KEY ((a, b), c, d) and
> > PRIMARY KEY ((a, b), d, c)
> >
> > PRIMARY KEY ((a), b, c) and
> > PRIMARY KEY ((a), c, b)
> >
> > Different partition key:
> >
> > PRIMARY KEY ((a, b), c, d) and
> > PRIMARY KEY ((a), b, d, c)
> >
> > PRIMARY KEY ((a), b) and
> > PRIMARY KEY ((b), a)
> >
> >
> > 2017-02-10 19:46 GMT+01:00 Kant Kodali <k...@peernova.com>:
> >
> > > Okies now I understand what you mean by "same" partition key.  I think
> > you
> > > are saying
> > >
> > > PRIMARY KEY(col1, col2, col3) == PRIMARY KEY(col2, col1, col3) // so
> far
> > I
> > > assumed they are different partition keys.
> > >
> > > On Fri, Feb 10, 2017 at 10:36 AM, Benjamin Roth <
> benjamin.r...@jaumo.com
> > >
> > > wrote:
> > >
> > > > There are use cases where the partition key is the same. For example
> if
> > > you
> > > > need a sorting within a partition or a filtering different from the
> > > > original clustering keys.
> > > > We actually use this for some MVs.
> > > >
> > > > If you want "dumb" denormalization with simple append only cases (or
> > more
> > > > general cases that don't require a read before write on update) you
> are
> > > > maybe better off with batched denormalized atomics writes.
> > > >
> > > > The main benefit of MVs is if you need denormalization to sort or
> > filter
> > > by
> > > > a non-primary key field.
> > > >
> > > > 2017-02-10 19:31 GMT+01:00 Kant Kodali <k...@peernova.com>:
> > > >
> > > > > yes thanks for the clarification.  But why would I ever have MV
> with
> > > the
> > > > > same partition key? if it is the same partition key I could just
> read
> > > > from
> > > > > the base table right? our MV Partition key contains the columns
> from
> > > the
> > > > > base table partition key but in a different order plus an
> additional
> > > > column
> > > > > (which is allowed as of today)
> > > > >
> > > > > On Fri, Feb 10, 2017 at 10:23 AM, Benjamin Roth <
> > > benjamin.r...@jaumo.com
> > > > >
> > > > > wrote:
> > > > >
> > > > > > It depends on your model.
> > > > > > If the base table + MV have the same partition key, then the MV
> > > > mutations
> > > > > > are applied synchronously, so they are written as soon the write
> > > > request
> > > > > > returns.
> > > > > > => In this case you can rely on the R+F > RF
> > > > > >
> > > > > > If the partition key of the MV is different, the partition of the
> > MV
> > > is
> > > > > > probably placed on a different host (or said differently it
> cannot
> > be
> > > > > > guaranteed that it is on the same host). In this case, the MV
> > updates
> > > > are
> > > > > > executed async in a logged batch. So it can be guaranteed they
> will
> > > be
> > > > > > applied eventually but not at the time the write request returns.
> > > > > > => You cannot rely and there is no possibility to absolutely
> > > guarantee
> > > > > > anything, not matter what CL you choose. A MV update may always
> > > "arrive
> > > > > > late". I guess it has been implemented like this to not block in
> > case
> > > > of
> > > > > > remote request to prefer the cluster sanity over consistency.
> > > > > >
> > > > > > Is it now 100% clear?
> > > > > >
> > > > > > 2017-02-10 19:17 GMT+01:00 Kant Kodali <k...@peernova.com>:
>

Re: If reading from materialized view with a consistency level of quorum am I guaranteed to have the most recent view?

2017-02-10 Thread Benjamin Roth
Same partition key:

PRIMARY KEY ((a, b), c, d) and
PRIMARY KEY ((a, b), d, c)

PRIMARY KEY ((a), b, c) and
PRIMARY KEY ((a), c, b)

Different partition key:

PRIMARY KEY ((a, b), c, d) and
PRIMARY KEY ((a), b, d, c)

PRIMARY KEY ((a), b) and
PRIMARY KEY ((b), a)


2017-02-10 19:46 GMT+01:00 Kant Kodali <k...@peernova.com>:

> Okies now I understand what you mean by "same" partition key.  I think you
> are saying
>
> PRIMARY KEY(col1, col2, col3) == PRIMARY KEY(col2, col1, col3) // so far I
> assumed they are different partition keys.
>
> On Fri, Feb 10, 2017 at 10:36 AM, Benjamin Roth <benjamin.r...@jaumo.com>
> wrote:
>
> > There are use cases where the partition key is the same. For example if
> you
> > need a sorting within a partition or a filtering different from the
> > original clustering keys.
> > We actually use this for some MVs.
> >
> > If you want "dumb" denormalization with simple append only cases (or more
> > general cases that don't require a read before write on update) you are
> > maybe better off with batched denormalized atomics writes.
> >
> > The main benefit of MVs is if you need denormalization to sort or filter
> by
> > a non-primary key field.
> >
> > 2017-02-10 19:31 GMT+01:00 Kant Kodali <k...@peernova.com>:
> >
> > > yes thanks for the clarification.  But why would I ever have MV with
> the
> > > same partition key? if it is the same partition key I could just read
> > from
> > > the base table right? our MV Partition key contains the columns from
> the
> > > base table partition key but in a different order plus an additional
> > column
> > > (which is allowed as of today)
> > >
> > > On Fri, Feb 10, 2017 at 10:23 AM, Benjamin Roth <
> benjamin.r...@jaumo.com
> > >
> > > wrote:
> > >
> > > > It depends on your model.
> > > > If the base table + MV have the same partition key, then the MV
> > mutations
> > > > are applied synchronously, so they are written as soon the write
> > request
> > > > returns.
> > > > => In this case you can rely on the R+F > RF
> > > >
> > > > If the partition key of the MV is different, the partition of the MV
> is
> > > > probably placed on a different host (or said differently it cannot be
> > > > guaranteed that it is on the same host). In this case, the MV updates
> > are
> > > > executed async in a logged batch. So it can be guaranteed they will
> be
> > > > applied eventually but not at the time the write request returns.
> > > > => You cannot rely and there is no possibility to absolutely
> guarantee
> > > > anything, not matter what CL you choose. A MV update may always
> "arrive
> > > > late". I guess it has been implemented like this to not block in case
> > of
> > > > remote request to prefer the cluster sanity over consistency.
> > > >
> > > > Is it now 100% clear?
> > > >
> > > > 2017-02-10 19:17 GMT+01:00 Kant Kodali <k...@peernova.com>:
> > > >
> > > > > So R+W > RF doesnt apply for reads on MV right because say I set
> > QUORUM
> > > > > level consistency for both reads and writes then there can be a
> > > scenario
> > > > > where a write is successful to the base table and then say
> > immediately
> > > I
> > > > do
> > > > > a read through MV but prior to MV getting the update from the base
> > > table.
> > > > > so there isn't any way to make sure to read after MV had been
> > > > successfully
> > > > > updated. is that correct?
> > > > >
> > > > > On Fri, Feb 10, 2017 at 6:30 AM, Benjamin Roth <
> > > benjamin.r...@jaumo.com>
> > > > > wrote:
> > > > >
> > > > > > Hi Kant
> > > > > >
> > > > > > Is it clear now?
> > > > > > Sorry for the confusion!
> > > > > >
> > > > > > Have a nice one
> > > > > >
> > > > > > Am 10.02.2017 09:17 schrieb "Kant Kodali" <k...@peernova.com>:
> > > > > >
> > > > > > thanks!
> > > > > >
> > > > > > On Thu, Feb 9, 2017 at 8:51 PM, Benjamin Roth <
> > > benjamin.r...@jaumo.com
> > > > >
> > > > > > wrote:
> > > > > >
> 

Re: If reading from materialized view with a consistency level of quorum am I guaranteed to have the most recent view?

2017-02-10 Thread Benjamin Roth
There are use cases where the partition key is the same. For example if you
need sorting within a partition or filtering different from the
original clustering keys.
We actually use this for some MVs.

If you want "dumb" denormalization with simple append-only cases (or more
general cases that don't require a read-before-write on update) you are
maybe better off with batched, denormalized atomic writes.

The main benefit of MVs is if you need denormalization to sort or filter by
a non-primary-key field.
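
As a sketch of that use case (table and view names are invented, not from any
real schema): a view that keeps the base table's partition key but re-sorts the
partition by a non-PK column:

CREATE TABLE posts (
    user_id int,
    post_id timeuuid,
    score   int,
    PRIMARY KEY ((user_id), post_id)
);

-- Same partition key (user_id), but sorted by score within the partition:
CREATE MATERIALIZED VIEW posts_by_score AS
    SELECT * FROM posts
    WHERE user_id IS NOT NULL AND post_id IS NOT NULL AND score IS NOT NULL
    PRIMARY KEY ((user_id), score, post_id)
    WITH CLUSTERING ORDER BY (score DESC, post_id ASC);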

2017-02-10 19:31 GMT+01:00 Kant Kodali <k...@peernova.com>:

> yes thanks for the clarification.  But why would I ever have MV with the
> same partition key? if it is the same partition key I could just read from
> the base table right? our MV Partition key contains the columns from the
> base table partition key but in a different order plus an additional column
> (which is allowed as of today)
>
> On Fri, Feb 10, 2017 at 10:23 AM, Benjamin Roth <benjamin.r...@jaumo.com>
> wrote:
>
> > It depends on your model.
> > If the base table + MV have the same partition key, then the MV mutations
> > are applied synchronously, so they are written as soon the write request
> > returns.
> > => In this case you can rely on the R+F > RF
> >
> > If the partition key of the MV is different, the partition of the MV is
> > probably placed on a different host (or said differently it cannot be
> > guaranteed that it is on the same host). In this case, the MV updates are
> > executed async in a logged batch. So it can be guaranteed they will be
> > applied eventually but not at the time the write request returns.
> > => You cannot rely and there is no possibility to absolutely guarantee
> > anything, not matter what CL you choose. A MV update may always "arrive
> > late". I guess it has been implemented like this to not block in case of
> > remote request to prefer the cluster sanity over consistency.
> >
> > Is it now 100% clear?
> >
> > 2017-02-10 19:17 GMT+01:00 Kant Kodali <k...@peernova.com>:
> >
> > > So R+W > RF doesnt apply for reads on MV right because say I set QUORUM
> > > level consistency for both reads and writes then there can be a
> scenario
> > > where a write is successful to the base table and then say immediately
> I
> > do
> > > a read through MV but prior to MV getting the update from the base
> table.
> > > so there isn't any way to make sure to read after MV had been
> > successfully
> > > updated. is that correct?
> > >
> > > On Fri, Feb 10, 2017 at 6:30 AM, Benjamin Roth <
> benjamin.r...@jaumo.com>
> > > wrote:
> > >
> > > > Hi Kant
> > > >
> > > > Is it clear now?
> > > > Sorry for the confusion!
> > > >
> > > > Have a nice one
> > > >
> > > > Am 10.02.2017 09:17 schrieb "Kant Kodali" <k...@peernova.com>:
> > > >
> > > > thanks!
> > > >
> > > > On Thu, Feb 9, 2017 at 8:51 PM, Benjamin Roth <
> benjamin.r...@jaumo.com
> > >
> > > > wrote:
> > > >
> > > > > Yes it is
> > > > >
> > > > > Am 10.02.2017 00:46 schrieb "Kant Kodali" <k...@peernova.com>:
> > > > >
> > > > > > If reading from materialized view with a consistency level of
> > quorum
> > > am
> > > > I
> > > > > > guaranteed to have the most recent view? other words is w + r > n
> > > > > contract
> > > > > > maintained for MV's as well for both reads and writes?
> > > > > >
> > > > > > Thanks!
> > > > > >
> > > > >
> > > >
> > >
> >
> >
> >
> > --
> > Benjamin Roth
> > Prokurist
> >
> > Jaumo GmbH · www.jaumo.com
> > Wehrstraße 46 · 73035 Göppingen · Germany
> > Phone +49 7161 304880-6 · Fax +49 7161 304880-1
> > AG Ulm · HRB 731058 · Managing Director: Jens Kammerer
> >
>



-- 
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer


Re: If reading from materialized view with a consistency level of quorum am I guaranteed to have the most recent view?

2017-02-10 Thread Benjamin Roth
It depends on your model.
If the base table + MV have the same partition key, then the MV mutations
are applied synchronously, so they are written by the time the write request
returns.
=> In this case you can rely on R+W > RF.

If the partition key of the MV is different, the partition of the MV is
probably placed on a different host (or, said differently, it cannot be
guaranteed that it is on the same host). In this case, the MV updates are
executed asynchronously in a logged batch. So it can be guaranteed they will be
applied eventually, but not at the time the write request returns.
=> You cannot rely on it, and there is no way to absolutely guarantee
anything, no matter what CL you choose. An MV update may always "arrive
late". I guess it has been implemented like this so as not to block on
remote requests, preferring cluster sanity over consistency.
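
A small cqlsh-style sketch of that second case (keyspace, table and view names
are invented): even at QUORUM on both sides, the read from the view may not see
the write yet, because the view mutation is shipped asynchronously when the
partition keys differ:

CONSISTENCY QUORUM;
INSERT INTO ks.base (a, b, c) VALUES (1, 2, 3);
-- ks.mv_by_c is assumed to have partition key (c), different from the base's (a):
SELECT * FROM ks.mv_by_c WHERE c = 3;   -- may briefly miss the new row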

Is it now 100% clear?

2017-02-10 19:17 GMT+01:00 Kant Kodali <k...@peernova.com>:

> So R+W > RF doesnt apply for reads on MV right because say I set QUORUM
> level consistency for both reads and writes then there can be a scenario
> where a write is successful to the base table and then say immediately I do
> a read through MV but prior to MV getting the update from the base table.
> so there isn't any way to make sure to read after MV had been successfully
> updated. is that correct?
>
> On Fri, Feb 10, 2017 at 6:30 AM, Benjamin Roth <benjamin.r...@jaumo.com>
> wrote:
>
> > Hi Kant
> >
> > Is it clear now?
> > Sorry for the confusion!
> >
> > Have a nice one
> >
> > Am 10.02.2017 09:17 schrieb "Kant Kodali" <k...@peernova.com>:
> >
> > thanks!
> >
> > On Thu, Feb 9, 2017 at 8:51 PM, Benjamin Roth <benjamin.r...@jaumo.com>
> > wrote:
> >
> > > Yes it is
> > >
> > > Am 10.02.2017 00:46 schrieb "Kant Kodali" <k...@peernova.com>:
> > >
> > > > If reading from materialized view with a consistency level of quorum
> am
> > I
> > > > guaranteed to have the most recent view? other words is w + r > n
> > > contract
> > > > maintained for MV's as well for both reads and writes?
> > > >
> > > > Thanks!
> > > >
> > >
> >
>



-- 
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer


Re: If reading from materialized view with a consistency level of quorum am I guaranteed to have the most recent view?

2017-02-10 Thread Benjamin Roth
Hi Kant

Is it clear now?
Sorry for the confusion!

Have a nice one

Am 10.02.2017 09:17 schrieb "Kant Kodali" <k...@peernova.com>:

thanks!

On Thu, Feb 9, 2017 at 8:51 PM, Benjamin Roth <benjamin.r...@jaumo.com>
wrote:

> Yes it is
>
> Am 10.02.2017 00:46 schrieb "Kant Kodali" <k...@peernova.com>:
>
> > If reading from materialized view with a consistency level of quorum am
I
> > guaranteed to have the most recent view? other words is w + r > n
> contract
> > maintained for MV's as well for both reads and writes?
> >
> > Thanks!
> >
>


Re: If reading from materialized view with a consistency level of quorum am I guaranteed to have the most recent view?

2017-02-10 Thread Benjamin Roth
See my latest comment

2017-02-10 14:33 GMT+01:00 Salih Gedik <m...@salih.xyz>:

> I agree with Brian. As far as I am concerned an update of materialized
> view is an async operation. Therefore I don't believe that you'd get most
> up to date data.
>
> Salih Gedik
>
>
> > On 10 Feb 2017, at 16:11, Brian Hess <brianmh...@gmail.com> wrote:
> >
> > This is not true.
> >
> > You cannot provide a ConsistencyLevel for the Materialized Views on a
> table when you do a write. That is, you do not explicitly write to a
> Materialized View, but implicitly write to it via the base table. There is
> not consistency guarantee other than eventual  between the base table and
> the Materialized View. That is, the coordinator only acknowledges the write
> when the proper number of replicas in the base table have acknowledged
> successful writing. There is no waiting or acknowledgement for any
> Materialized Views on that table.
> >
> > Therefore, while you can specify a Consistency Level on read since you
> are reading directly from the Materialized View as a table, you cannot
> specify a Consistency Level on wrote for the Materialized View. So, you
> cannot apply the R+W>RF formula.
> >
> > >Brian
> >
> >> On Feb 10, 2017, at 3:17 AM, Kant Kodali <k...@peernova.com> wrote:
> >>
> >> thanks!
> >>
> >> On Thu, Feb 9, 2017 at 8:51 PM, Benjamin Roth <benjamin.r...@jaumo.com>
> >> wrote:
> >>
> >>> Yes it is
> >>>
> >>> Am 10.02.2017 00:46 schrieb "Kant Kodali" <k...@peernova.com>:
> >>>
> >>>> If reading from materialized view with a consistency level of quorum
> am I
> >>>> guaranteed to have the most recent view? other words is w + r > n
> >>> contract
> >>>> maintained for MV's as well for both reads and writes?
> >>>>
> >>>> Thanks!
> >>>
>
>


-- 
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer


Re: If reading from materialized view with a consistency level of quorum am I guaranteed to have the most recent view?

2017-02-10 Thread Benjamin Roth
Basically MV updates are processed when a mutation is applied. For MV
replicas that stay local, MV mutations are also applied directly (see
StorageProxy.mutateMV). For remote MV mutations, a batch is created.
Maybe I am wrong, but the code is somewhat contradictory:

TableView.pushViewReplicaUpdates:

// now actually perform the writes and wait for them to complete
> asyncWriteBatchedMutations(wrappers, localDataCenter, Stage.VIEW_MUTATION);


The comment says the caller waits for the updates, but the code executes them
asynchronously. So you are right in the case where the MV has a different PK
than the base table and the replica goes to a remote host. If the base PK ==
the MV PK, then updates are always local and synchronous.

Did I still miss something?

2017-02-10 14:11 GMT+01:00 Brian Hess <brianmh...@gmail.com>:

> This is not true.
>
> You cannot provide a ConsistencyLevel for the Materialized Views on a
> table when you do a write. That is, you do not explicitly write to a
> Materialized View, but implicitly write to it via the base table. There is
> not consistency guarantee other than eventual  between the base table and
> the Materialized View. That is, the coordinator only acknowledges the write
> when the proper number of replicas in the base table have acknowledged
> successful writing. There is no waiting or acknowledgement for any
> Materialized Views on that table.
>
> Therefore, while you can specify a Consistency Level on read since you are
> reading directly from the Materialized View as a table, you cannot specify
> a Consistency Level on wrote for the Materialized View. So, you cannot
> apply the R+W>RF formula.
>
> >Brian
>
> > On Feb 10, 2017, at 3:17 AM, Kant Kodali <k...@peernova.com> wrote:
> >
> > thanks!
> >
> > On Thu, Feb 9, 2017 at 8:51 PM, Benjamin Roth <benjamin.r...@jaumo.com>
> > wrote:
> >
> >> Yes it is
> >>
> >> Am 10.02.2017 00:46 schrieb "Kant Kodali" <k...@peernova.com>:
> >>
> >>> If reading from materialized view with a consistency level of quorum
> am I
> >>> guaranteed to have the most recent view? other words is w + r > n
> >> contract
> >>> maintained for MV's as well for both reads and writes?
> >>>
> >>> Thanks!
> >>>
> >>
>



-- 
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer


Re: If reading from materialized view with a consistency level of quorum am I guaranteed to have the most recent view?

2017-02-09 Thread Benjamin Roth
Yes it is

Am 10.02.2017 00:46 schrieb "Kant Kodali" :

> If reading from materialized view with a consistency level of quorum am I
> guaranteed to have the most recent view? other words is w + r > n contract
> maintained for MV's as well for both reads and writes?
>
> Thanks!
>


Re: Why does CockroachDB github website say Cassandra has no Availability on datacenter failure?

2017-02-07 Thread Benjamin Roth
Btw this isn't the Bronx either. It's not incorrect to be polite.

Am 07.02.2017 13:45 schrieb "Bernardo Sanchez" <
bernard...@pointclickcare.com>:

> guys this isn't twitter. stop your stupid posts
>
> From: benjamin.le...@datastax.com
> Sent: February 7, 2017 7:43 AM
> To: dev@cassandra.apache.org
> Reply-to: dev@cassandra.apache.org
> Subject: Re: Why does CockroachDB github website say Cassandra has no
> Availability on datacenter failure?
>
>
> Do not get angry for that. It does not worth it. :-)
>
> On Tue, Feb 7, 2017 at 1:11 PM, Kant Kodali  wrote:
>
> > lol. But seriously are they even allowed to say something that is not
> true
> > about another product ?
> >
> > On Tue, Feb 7, 2017 at 4:05 AM, kurt greaves 
> wrote:
> >
> > > Marketing never lies. Ever
> > >
> >
>


Re: Wrapping up tick-tock

2017-01-14 Thread Benjamin Roth
> > all
> >> died
> >> > > out without reaching a robust consensus.
> >> > >
> >> > > In those threads we saw several reasonable options proposed, but
> > from
> >> my
> >> > > perspective they all operated in a kind of theoretical fantasy
> > land of
> >> > > testing and development resources.  In particular, it takes
> > around a
> >> > > person-week of effort to verify that a release is ready.  That
> > is,
> >> going
> >> > > through all the test suites, inspecting and re-running failing
> > tests to
> >> > see
> >> > > if there is a product problem or a flaky test.
> >> > >
> >> > > (I agree that in a perfect world this wouldn’t be necessary
> > because
> >> your
> >> > > test ci is always green, but see my previous framing of the
> > perfect
> >> world
> >> > > as a fantasy land.  It’s also worth noting that this is a common
> >> problem
> >> > > for large OSS projects, not necessarily something to beat
> > ourselves up
> >> > > over, but in any case, that's our reality right now.)
> >> > >
> >> > > I submit that any process that assumes a monthly release cadence
> > is not
> >> > > realistic from a resourcing standpoint for this validation.
> > Notably,
> >> we
> >> > > have struggled to marshal this for 3.10 for two months now.
> >> > >
> >> > > Therefore, I suggest first that we collectively roll up our
> > sleeves to
> >> > vet
> >> > > 3.10 as the last tick-tock release.  Stick a fork in it, it’s
> > done.  No
> >> > > more tick-tock.
> >> > >
> >> > > I further suggest that in place of tick tock we go back to our
> > old
> >> model
> >> > of
> >> > > yearly-ish releases with as-needed bug fix releases on stable
> > branches,
> >> > > probably bi-monthly.  This amortizes the release validation
> > problem
> >> over
> >> > a
> >> > > longer development period.  And of course we remain free to ramp
> > back
> >> up
> >> > to
> >> > > the more rapid cadence envisioned by the other proposals if we
> > increase
> >> > our
> >> > > pool of QA effort or we are able to eliminate flakey tests to
> > the point
> >> > > that a long validation process becomes unnecessary.
> >> > >
> >> > > (While a longer dev period could mean a correspondingly more
> > painful
> >> test
> >> > > validation process at the end, my experience is that most of the
> >> > validation
> >> > > cost is “fixed” in the form of flaky tests and thus does not
> > increase
> >> > > proportionally to development time.)
> >> > >
> >> > > Thoughts?
> >> > >
> >> > > --
> >> > > Jonathan Ellis
> >> > > co-founder, http://www.datastax.com
> >> > > @spyced
> >> > >
> >> >
> >>
> >
> >
> >
> >
>



-- 
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer


Re: [VOTE] 3.X branch feature freeze

2017-01-13 Thread Benjamin Roth
Progress:
Yes and no. I made a patch that made our cluster stable performance-wise,
but it introduces a consistency issue I am aware of. We can deal with it and I
prefer this over severe performance problems. But this is nothing you can
offer to regular users. I created a bunch of tickets related to this.
Making bootstrap + decommission performant + consistent should not be much
effort.
Making repairs performant + consistent and fixing incremental repairs will
probably be more effort. I wanted to investigate more before Xmas but did not
find the time until now. It is on my agenda and I appreciate any support.

Tickets I created recently:
https://issues.apache.org/jira/browse/CASSANDRA-13073
https://issues.apache.org/jira/browse/CASSANDRA-13066
https://issues.apache.org/jira/browse/CASSANDRA-13065
https://issues.apache.org/jira/browse/CASSANDRA-13064

Also important:
https://issues.apache.org/jira/browse/CASSANDRA-12888


2017-01-13 18:55 GMT+01:00 Jonathan Haddad <j...@jonhaddad.com>:

> +1 (non binding) to feature freeze.
>
> I also like the idea of stabilizing MVs.  Ben, you've probably been the
> most vocal about the issues, have you made any progress towards making them
> work any better during bootstrap / etc?  Any idea of fixing them is a major
> undertaking?
>
> Jon
>
> On Fri, Jan 13, 2017 at 9:39 AM Benjamin Roth <benjamin.r...@jaumo.com>
> wrote:
>
> +1 also I appreciate any effort on MV stability. It is an official 3.x
> feature but not production ready for the masses.
>
> Am 13.01.2017 18:34 schrieb "Jonathan Ellis" <jbel...@gmail.com>:
>
> > +1
> >
> > On Fri, Jan 13, 2017 at 11:21 AM, Aleksey Yeschenko <alek...@apache.org>
> > wrote:
> >
> > > Hi all!
> > >
> > > It seems like we have a general consensus on ending tick-tock at 3.11,
> > and
> > > moving
> > > on to stabilisation-only for 3.11.x series.
> > >
> > > In light of this, I suggest immediate feature freeze in the 3.X branch.
> > >
> > > Meaning that only bug fixes go to the 3.11/3.X branch from now on.
> > >
> > > All new features that haven’t be committed yet should go to trunk only
> > > (4.0), if the vote passes.
> > >
> > > What do you think?
> > >
> > > Thanks.
> > >
> > > --
> > > AY
> >
> >
> >
> >
> > --
> > Jonathan Ellis
> > co-founder, http://www.datastax.com
> > @spyced
> >
>



-- 
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer


Re: [VOTE] 3.X branch feature freeze

2017-01-13 Thread Benjamin Roth
+1. Also, I appreciate any effort on MV stability. It is an official 3.x
feature but not production-ready for the masses.

Am 13.01.2017 18:34 schrieb "Jonathan Ellis" :

> +1
>
> On Fri, Jan 13, 2017 at 11:21 AM, Aleksey Yeschenko 
> wrote:
>
> > Hi all!
> >
> > It seems like we have a general consensus on ending tick-tock at 3.11,
> and
> > moving
> > on to stabilisation-only for 3.11.x series.
> >
> > In light of this, I suggest immediate feature freeze in the 3.X branch.
> >
> > Meaning that only bug fixes go to the 3.11/3.X branch from now on.
> >
> > All new features that haven’t be committed yet should go to trunk only
> > (4.0), if the vote passes.
> >
> > What do you think?
> >
> > Thanks.
> >
> > --
> > AY
>
>
>
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
>


Re: CASSANDRA-12888: Streaming and MVs

2016-12-07 Thread Benjamin Roth
Grmpf! 1000+ consecutive repairs must be wrong. I guess I mixed something up.
But it repaired over and over again for 1 or 2 days.

2016-12-07 9:01 GMT+01:00 Benjamin Roth <benjamin.r...@jaumo.com>:

> Hi Paolo,
>
> First of all thanks for your review!
>
> I had the same concerns as you but I thought it is beeing handled
> correctly (which does in some situations) but I found one that creates the
> inconsistencies you mentioned. That is kind of split brain syndrom, when
> multiple nodes fail between repairs. See here: https://cl.ly/3t0X1c0q1L1h.
>
> I am not happy about it but I support your decision. We should then add
> another dtest to test this scenario as existing dtests don't.
>
> Some issues unfortunately remain:
> - 12888 is not resolved
> - MV repairs may be still f slow. Imagine an inconsistency of a single
> cell (also may be due to a validation race condition, see CASSANDRA-12991)
> on a big partition. I had issues with reaper and a 30min timeout leading to
> 1000+ (yes!) consecutive repairs of a single subrange because it always
> timed out and I recognized very late. When I deployed 12888 on my system,
> this remaining subrange was repaired in a snap
> - I guess rebuild works the same as repair and has to go through the write
> path, right?
>
> => The MV repair may induce so much overhead that it is maybe cheaper to
> kill and replace a inconsistent node than to repair it. But that may
> introduce inconsistencies again. All in all it is not perfect. All this
> does not really un-frustrate me a 100%.
>
> Do you have any more thoughts?
>
> Unfortunately I have very little time these days as my second child was
> born on monday. So thanks for your support so far. Maybe I have some ideas
> on this issues during the next days and I will work on that ticket probably
> next week to come to a solution that is at least deployable. I'd also
> appreciate your opinion on CASSANDRA-12991.
>
> 2016-12-07 2:53 GMT+01:00 Paulo Motta <pauloricard...@gmail.com>:
>
>> Hello Benjamin,
>>
>> Thanks for your effort on this investigation! For bootstraps and range
>> transfers, I think we can indeed simplify and stream base tables and MVs
>> as
>> ordinary tables, unless there is some caveat I'm missing (I didn't find
>> any
>> special case for bootstrap/range transfers on CASSANDRA-6477 or in the MV
>> design doc, please correct me if I'm wrong).
>>
>> Regarding repair of base tables, applying mutations via the write path is
>> a
>> matter of correctness, given that the base table updates needs to
>> potentially remove previously referenced keys in the views, so repairing
>> only the base table may leave unreferenced keys in the views, breaking the
>> MV contract. Furthermore, these unreferenced keys may be propagated to
>> other replicas and never removed if you repair only the view. If you don't
>> do overwrites in the base table, this is probably not a problem but the DB
>> cannot ensure this (at least not before CASSANDRA-9779). Furthermore, as
>> you already noticed repairing only the base table is probably faster so I
>> don't see a reason to repair the base and MVs separately since this is
>> potentially more costly. I believe your frustration is mostly due to the
>> bug described on CASSANDRA-12905, but after that and CASSANDRA-12888 are
>> fixed repair on base table should work just fine.
>>
>> Based on this, I propose:
>> - Fix CASSANDRA-12905 with your original patch that retries acquiring the
>> MV lock instead of throwing WriteTimeoutException during streaming, since
>> this is blocking 3.10.
>> - Fix CASSANDRA-12888 by doing sstable-based streaming for base tables
>> while still applying MV updates in the paired replicas.
>> - Create new ticket to use ordinary streaming for non-repair MV stream
>> sessions and keep current behavior for MV streaming originating from
>> repair.
>> - Create new ticket to include only the base tables and not MVs in
>> keyspace-level repair, since repairing the base already repairs the views
>> to avoid people shooting themselves in the foot.
>>
>> Please let me know what do you think. Any suggestions or feedback is
>> appreciated.
>>
>> Cheers,
>>
>> Paulo
>>
>> 2016-12-02 8:27 GMT-02:00 Benjamin Roth <benjamin.r...@jaumo.com>:
>>
>> > As I haven't received a single reply on that, I went over to implement
>> and
>> > test it on my own with our production cluster. I had a real pain with
>> > bringing up a new node, so I had to move on.
>> >
>> > Result:
>> > Works like a charm. I ran many dtests that relate in any

Re: CASSANDRA-12888: Streaming and MVs

2016-12-07 Thread Benjamin Roth
Hi Paolo,

First of all thanks for your review!

I had the same concerns as you, but I thought it was being handled correctly
(which it is in some situations); however, I found one case that creates the
inconsistencies you mentioned. That is a kind of split-brain syndrome, when
multiple nodes fail between repairs. See here: https://cl.ly/3t0X1c0q1L1h.

I am not happy about it but I support your decision. We should then add
another dtest to test this scenario as existing dtests don't.

Some issues unfortunately remain:
- 12888 is not resolved
- MV repairs may still be very slow. Imagine an inconsistency of a single
cell (which may also be due to a validation race condition, see CASSANDRA-12991)
on a big partition. I had issues with reaper and a 30min timeout leading to
1000+ (yes!) consecutive repairs of a single subrange because it always
timed out and I recognized it very late. When I deployed 12888 on my system,
this remaining subrange was repaired in a snap.
- I guess rebuild works the same as repair and has to go through the write
path, right?

=> The MV repair may induce so much overhead that it is maybe cheaper to
kill and replace an inconsistent node than to repair it. But that may
introduce inconsistencies again. All in all it is not perfect. All this
does not really un-frustrate me 100%.

Do you have any more thoughts?

Unfortunately I have very little time these days as my second child was
born on Monday. So thanks for your support so far. Maybe I will have some ideas
on these issues during the next days, and I will work on that ticket probably
next week to come to a solution that is at least deployable. I'd also
appreciate your opinion on CASSANDRA-12991.

2016-12-07 2:53 GMT+01:00 Paulo Motta <pauloricard...@gmail.com>:

> Hello Benjamin,
>
> Thanks for your effort on this investigation! For bootstraps and range
> transfers, I think we can indeed simplify and stream base tables and MVs as
> ordinary tables, unless there is some caveat I'm missing (I didn't find any
> special case for bootstrap/range transfers on CASSANDRA-6477 or in the MV
> design doc, please correct me if I'm wrong).
>
> Regarding repair of base tables, applying mutations via the write path is a
> matter of correctness, given that the base table updates needs to
> potentially remove previously referenced keys in the views, so repairing
> only the base table may leave unreferenced keys in the views, breaking the
> MV contract. Furthermore, these unreferenced keys may be propagated to
> other replicas and never removed if you repair only the view. If you don't
> do overwrites in the base table, this is probably not a problem but the DB
> cannot ensure this (at least not before CASSANDRA-9779). Furthermore, as
> you already noticed repairing only the base table is probably faster so I
> don't see a reason to repair the base and MVs separately since this is
> potentially more costly. I believe your frustration is mostly due to the
> bug described on CASSANDRA-12905, but after that and CASSANDRA-12888 are
> fixed repair on base table should work just fine.
>
> Based on this, I propose:
> - Fix CASSANDRA-12905 with your original patch that retries acquiring the
> MV lock instead of throwing WriteTimeoutException during streaming, since
> this is blocking 3.10.
> - Fix CASSANDRA-12888 by doing sstable-based streaming for base tables
> while still applying MV updates in the paired replicas.
> - Create new ticket to use ordinary streaming for non-repair MV stream
> sessions and keep current behavior for MV streaming originating from
> repair.
> - Create new ticket to include only the base tables and not MVs in
> keyspace-level repair, since repairing the base already repairs the views
> to avoid people shooting themselves in the foot.
>
> Please let me know what do you think. Any suggestions or feedback is
> appreciated.
>
> Cheers,
>
> Paulo
>
> 2016-12-02 8:27 GMT-02:00 Benjamin Roth <benjamin.r...@jaumo.com>:
>
> > As I haven't received a single reply on that, I went over to implement
> and
> > test it on my own with our production cluster. I had a real pain with
> > bringing up a new node, so I had to move on.
> >
> > Result:
> > Works like a charm. I ran many dtests that relate in any way with
> storage,
> > stream, bootstrap, ... with good results.
> > The bootstrap finished in under 5:30h, not a single error log during
> > bootstrap. Also afterwards, repairs run smooth, cluster seems to operate
> > quite well.
> >
> > I still need:
> >
> >- Reviews (see 12888, 12905, 12984)
> >- Some opinion if I did the CDC case right. IMHO CDC is not required
> on
> >bootstrap and we don't need to send the mutations through the write
> path
> >just to write the commit log. This will also break 

Re: Collecting slow queries

2016-12-05 Thread Benjamin Roth
That should be exactly what this guy was asking for.

Btw.: This is really a great feature! Stumbled across it by accident when
watching logs. Thanks Yoshi!

2016-12-05 22:41 GMT+01:00 Yoshi Kimoto <yy.kim...@gmail.com>:

> This? : https://issues.apache.org/jira/browse/CASSANDRA-12403
>
> 2016-12-06 6:36 GMT+09:00 Jeff Jirsa <jeff.ji...@crowdstrike.com>:
>
> > Should we reopen 6226? Tracing 0.1% doesn’t help find the outliers that
> > are slow but don’t time out (slow query log could help find large
> > partitions for users with infrequent but painful large partitions, far
> > easier than dumping sstables to json to identify them).
> >
> >
> > On 12/5/16, 1:28 PM, "sankalp kohli" <kohlisank...@gmail.com> wrote:
> >
> > >This is duped by a JIRA which is fixed in 3.2
> > >
> > > https://issues.apache.org/jira/browse/CASSANDRA-6226
> > >
> > >On Mon, Dec 5, 2016 at 12:15 PM, Jan <cne...@yahoo.com.invalid> wrote:
> > >
> > >> HI Folks;
> > >> is there a way for 'Collecting slow queries'  in the Apache Cassandra.
> > ?I
> > >> am aware of the DSE product offering such an option, but need the
> > solution
> > >> on Apache Cassandra.
> > >> ThanksJan
> >
>



-- 
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer


Re: Failed Dtest will block cutting releases

2016-12-04 Thread Benjamin Roth
Hi Michael,

Thanks for this update. As a newbie it helped me to understand the
organization and processes a little bit better.

I don't know how many CS-devs know this but I love this rule (actually the
whole book):
http://programmer.97things.oreilly.com/wiki/index.php/The_Boy_Scout_Rule

I personally, to be honest, am not the kind of guy that walks through lists
and looks for issues that could be picked up and done, but if I encounter
anything (test, some weird code, design, whatever) that deserves to be
improved, analyzed or fixed and I have a little time left, I try to improve
or fix it.

At this point I am still quite new around here and in the process of
understanding the whole picture of Cassandra's behaviour, code, processes
and organization. I hope you can forgive me if I don't perfectly get the
point every time right now - but I am eager to learn and improve.

Thanks for your patience!

2016-12-04 19:33 GMT+01:00 Michael Shuler <mich...@pbandjelly.org>:

> Thanks for your thoughts on testing Apache Cassandra, I share them.
>
> I just wanted to note that the known_failure() annotations were recently
> removed from cassandra-dtest [0], due to lack of annotation removal when
> bugs fixed, and the internal webapp that we were using to parse has been
> broken for quite some time, with no fix in sight. The webapp was removed
> and we dropped all the known_failure() annotations.
>
> The test-failure JIRA label [1] is what we've been using during test run
> triage. Those tickets assigned to 'DS Test Eng' need figuring out if
> it's a test problem or Cassandra problem. Typically, the Unassigned
> tickets were determined to be possibly a Cassandra issue. If you enjoy
> test analysis and fixing them, please, jump in and analyze/fix them!
>
> [0] https://github.com/riptano/cassandra-dtest/pull/1399
> [1]
> https://issues.apache.org/jira/issues/?jql=project%20%
> 3D%20CASSANDRA%20AND%20labels%20%3D%20test-failure%20AND%
> 20resolution%20%3D%20unresolved
>
> --
> Kind regards,
> Michael Shuler
>
> On 12/04/2016 02:07 AM, Benjamin Roth wrote:
> > Sorry for jumping in so boldly before.
> >
> > TL;DR:
> >
> >- I didn't mean to delete every flaky test just like that
> >- To improve quality, each failing test has to be analyzed
> individually
> >for release
> >
> > More thoughts on that:
> >
> > I had a closer look on some of the tests tagged as flaky and realized
> that
> > the situation here is more complex than I thought before.
> > Of course I didn't mean to delete all the flaky tests just like that.
> Maybe
> > I should rephrase it a bit to "If a (flaky) test can't really prove
> > something, then it is better not to have it". If a test does prove
> > something depends on its intention, its implementation and on how flaky
> it
> > really is and first of all: Why.
> >
> > These dtests are maybe blessing and curse at the same time. On the one
> hand
> > there are things you cannot test with a unit test, so you need them for
> > certain cases. On the other hand, dtest do not only test the desired
> case.
> >
> >- They test the test environment (ccm, server hickups) and more or
> less
> >all components of the CS daemon that are somehow involved as well.
> >- This exposes the test to many more error sources than the bare test
> >case and that creates of course a lot of "unreliability" in general
> and
> >causes flaky results.
> >- It makes it hard to pin down the failures to a certain cause like
> >   - Flaky test implementation
> >   - Flaky bugs in SUT
> >   - Unreliable test environment
> >- Analyzing every failure is a pain. But a simple "retry and skip
> over"
> >_may_ mask a real problem.
> >
> > => Difficult situation!
> >
> > From my own projects and non-CS experience I can tell:
> > Flaky tests give me a bad feeling and always leave a certain smell. I've
> > also just skipped them with that reason "Yes, I know it's flaky, I don't
> > really care about it". But it simply does not feel right.
> >
> > A real life example from another project:
> > Some weeks ago I wrote functional tests to test the integration of
> > SeaweedFS as a blob store backend in an image upload process. Test case
> was
> > roughly to upload an image, check if it exists on both old and new image
> > storage, delete it, check it again. The test existed for years. I simply
> > added some assertions to check the existance of the uploaded files on the
> > new storage. Funnyhow, I must have hit some corner case by that and from
> > that moment on

Re: Failed Dtest will block cutting releases

2016-12-04 Thread Benjamin Roth
Sorry for jumping in so boldly before.

TL;DR:

   - I didn't mean to delete every flaky test just like that
   - To improve quality, each failing test has to be analyzed individually
   for release

More thoughts on that:

I had a closer look at some of the tests tagged as flaky and realized that
the situation here is more complex than I thought before.
Of course I didn't mean to delete all the flaky tests just like that. Maybe
I should rephrase it a bit to "If a (flaky) test can't really prove
something, then it is better not to have it". Whether a test does prove
something depends on its intention, its implementation and on how flaky it
really is - and first of all: why.

These dtests are maybe a blessing and a curse at the same time. On the one hand
there are things you cannot test with a unit test, so you need them for
certain cases. On the other hand, dtests do not only test the desired case.

   - They test the test environment (ccm, server hiccups) and more or less
   all components of the CS daemon that are somehow involved as well.
   - This exposes the test to many more error sources than the bare test
   case and that creates of course a lot of "unreliability" in general and
   causes flaky results.
   - It makes it hard to pin down the failures to a certain cause like
  - Flaky test implementation
  - Flaky bugs in SUT
  - Unreliable test environment
   - Analyzing every failure is a pain. But a simple "retry and skip over"
   _may_ mask a real problem.

=> Difficult situation!

From my own projects and non-CS experience I can tell:
Flaky tests give me a bad feeling and always leave a certain smell. I've
also just skipped them with that reason "Yes, I know it's flaky, I don't
really care about it". But it simply does not feel right.

A real life example from another project:
Some weeks ago I wrote functional tests to test the integration of
SeaweedFS as a blob store backend in an image upload process. Test case was
roughly to upload an image, check if it exists on both old and new image
storage, delete it, check it again. The test existed for years. I simply
added some assertions to check the existance of the uploaded files on the
new storage. Funnyhow, I must have hit some corner case by that and from
that moment on, the test was flaky. Simple URL checks started to time out
from time to time. That made me really curios. To cut a long story short:
After having checked a whole lot of things, it turned out that not the test
was flaky and also not the shiny new storagy, it was the LVS loadbalancer.
The loadbalancer dropped connections reproducibly which happened more
likely with increasing concurrency. Finally we removed LVS completely and
replaced it by DNS-RR + VRRP, which completely solved the problem and the
tests ran happily ever after.

Usually there is no pure black and white.

   - Sometimes testing whole systems reveals problems you'd never
   have found without them
   - Sometimes they cause false alerts
   - Sometimes, skipping them masks real problems
   - Sometimes it sucks if a false alert blocks your release

If you want to be really safe, you have to analyze every single failure and
decide what kind of failure it is or could be and whether a retry will prove
something or not. At least when you are at a release gate. I think this should
be worth it.

There's a reason for this thread and there's a reason why people ask every
few days which CS version is production stable. Things have to improve over
time. This applies to test implementations, test environments, release
processes, and so on. One way to do this is to become a little bit stricter
(and a bit better) with every release. Making all tests pass at least once
before a release should be a rather low hanging fruit. Reducing the total
number of flaky tests or the "flaky-fail-rate" may be another future goal.

Btw, the fact of the day:
I grepped through dtests and found out that roughly 11% of all tests are
flagged with "known_failure" and roughly 8% of all tests are flagged with
"flaky". Quite impressive.


2016-12-03 15:52 GMT+01:00 Edward Capriolo <edlinuxg...@gmail.com>:

> I think it is fair to run a flaky test again. If it is determined that it
> flaked out due to a conflict with another test or something ephemeral in a
> long process, it is not worth blocking a release.
>
> Just deleting it is probably not a good path.
>
> I actually enjoy writing, fixing and tweaking tests, so ping me offline or
> whatever.
>
> On Saturday, December 3, 2016, Benjamin Roth <benjamin.r...@jaumo.com>
> wrote:
>

Re: Failed Dtest will block cutting releases

2016-12-02 Thread Benjamin Roth
Excuse me if I jump into an old thread, but from my experience, I have a
very clear opinion about situations like that as I encountered them before:

Tests are there to give *certainty*.
*Would you like to pass a crossing with a green light if you cannot be sure
if green really means green?*
Do you want to rely on tests that are green, red, green, red? What if a red
is a real red and you missed it because you simply ignored it as flaky?

IMHO there are only 3 options for dealing with broken/red tests:
- Fix the underlying issue
- Fix the test
- Delete the test

If I cannot trust a test, it is better not to have it at all. Otherwise
people stare at red lights and start driving anyway.

This causes:
- Uncertainty
- Loss of trust
- Confusion
- More work
- *Less quality*

Just as an example:
A few days ago I created a patch, then ran the utests, and 1 test failed.
Hmmm, did I break it? I had to double-check by checking out the former state
and running the tests again, just to find out that it wasn't me who made it
fail. That's annoying.

Sorry again, I'm rather new here, but what I just read reminded me a lot of
situations I found myself in years ago.
So: +1, John

2016-12-03 7:48 GMT+01:00 sankalp kohli <kohlisank...@gmail.com>:

> Hi,
> I don't see any update on this thread. We will go ahead and make
> dtests a blocker for cutting releases for anything after 3.10.
>
> Please respond if anyone has an objection to this.
>
> Thanks,
> Sankalp
>
>
>
> On Mon, Nov 21, 2016 at 11:57 AM, Josh McKenzie <jmcken...@apache.org>
> wrote:
>
> > Caveat: I'm strongly in favor of us blocking a release on a non-green
> test
> > board of either utest or dtest.
> >
> >
> > > put something in prod which is known to be broken in obvious ways
> >
> > In my experience the majority of fixes are actually shoring up
> low-quality
> > / flaky tests or fixing tests that have been invalidated by a commit but
> do
> > not indicate an underlying bug. Inferring "tests are failing so we know
> > we're asking people to put things in prod that are broken in obvious
> ways"
> > is hyperbolic. A more correct statement would be: "Tests are failing so
> we
> > know we're shipping with a test that's failing" which is not helpful.
> >
> > Our signal to noise ratio with tests has been very poor historically;
> we've
> > been trying to address that through aggressive triage and assigning out
> > test failures however we need far more active and widespread community
> > involvement if we want to truly *fix* this problem long-term.
> >
> > On Mon, Nov 21, 2016 at 2:33 PM, Jonathan Haddad <j...@jonhaddad.com>
> > wrote:
> >
> > > +1.  Kind of silly to advise people to put something in prod which is
> > > known to be broken in obvious ways
> > >
> > > On Mon, Nov 21, 2016 at 11:31 AM sankalp kohli <kohlisank...@gmail.com
> >
> > > wrote:
> > >
> > > > Hi,
> > > > We should not cut a release if dtests are not passing. I won't block
> > > > 3.10 on this since we are just discussing it.
> > > >
> > > > Please provide feedback on this.
> > > >
> > > > Thanks,
> > > > Sankalp
> > > >
> > >
> >
>



-- 
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer


CASSANDRA-12888: Streaming and MVs

2016-12-02 Thread Benjamin Roth
As I haven't received a single reply on this, I went ahead and implemented
and tested it on my own against our production cluster. Bringing up a new
node was a real pain, so I had to move on.

Result:
It works like a charm. I ran many dtests that relate in any way to storage,
streaming, bootstrap, ... with good results.
The bootstrap finished in under 5:30h without a single error log. Afterwards,
repairs ran smoothly and the cluster seems to operate quite well.

I still need:

   - Reviews (see 12888, 12905, 12984)
   - An opinion on whether I handled the CDC case right. IMHO CDC is not
   required on bootstrap, and we don't need to send the mutations through the
   write path just to write the commit log; that would also break incremental
   repairs. Instead, for CDC the sstables are streamed as usual but the
   mutations are additionally written to the commit log (a sketch follows
   below this list). The worst case I see is that the node crashes and the
   commit logs for those streams are replayed, leading to duplicate writes,
   which is neither critical nor a regular case. Any better ideas?
   - Docs have to be updated (12985) if the patch is accepted
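
To make the CDC point concrete, the behaviour I implemented looks roughly
like the sketch below. All type and method names here are made up purely for
illustration and do not match the actual classes or signatures in the patch;
see the tickets above for the real code.

    // All types below are stand-ins invented for this sketch; they do not
    // exist in the Cassandra codebase under these names.
    interface Mutation {}

    interface StreamedSSTable
    {
        Iterable<Mutation> mutations();
    }

    interface Table
    {
        void addSSTable(StreamedSSTable sstable);
        boolean isCdcEnabled();
        void appendToCommitLog(Mutation mutation);
    }

    final class StreamReceiveSketch
    {
        static void onSSTableReceived(StreamedSSTable sstable, Table table)
        {
            // Streamed data is never detoured through the regular write path:
            // no batchlog, no read-before-write, no redundant view updates.
            table.addSSTable(sstable);

            if (table.isCdcEnabled())
            {
                // CDC consumers read from the commit log, so the streamed
                // mutations are appended there additionally. Worst case after a
                // crash: the segments are replayed and produce duplicate writes,
                // which is not critical and not the regular case.
                for (Mutation m : sstable.mutations())
                    table.appendToCommitLog(m);
            }
        }
    }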

I really appreciate ANY feedback. IMHO the impact of these fixes is immense,
and they may be a huge step towards getting MVs production-ready.

Thank you very much,
Benjamin



Streaming and MVs

2016-11-29 Thread Benjamin Roth
I don't know where else to discuss this issue, so I post it here.

I have been trying to get CS to run stably with MVs since the beginning of
July. Normal reads + writes work as expected, but when it comes to repairs or
bootstrapping it still feels far, far away from what I would call fast and
stable. The other day I just wanted to bootstrap a new node. I tried it twice.
The first time the bootstrap failed due to WTEs. I fixed that by not timing
out in streams, but then it turned out that the bootstrap (a load of roughly
250-300 GB) didn't even finish in 24h. What if I really had a problem and had
to bring up some nodes fast? No way!

I think the root cause of it all is the way streams are handled on tables
with MVs.
Sending them through the regular write path introduces many bottlenecks and
sometimes also redundant writes. Let me explain:

1. Bootstrap
During a bootstrap, all ranges from all KS and all CFs that will belong to
the new node will be streamed. MVs are treated like all other CFs and all
ranges that will move to the new node will also be streamed during
bootstrap.
Sending streams of the base tables through the write path has the following
negative impacts (a simplified sketch of the per-mutation cost follows after
this list):

   - Writes are sent to the commit log. Not necessary: if the node is stopped
   during bootstrap, the bootstrap simply starts over, so there is no need to
   recover from commit logs. Non-MV tables don't get commit log entries for
   streamed data anyway.
   - MV mutations are not applied instantly but sent to the batchlog. This is
   of course necessary during the range movement (if the PK of the MV differs
   from the base table), but the consequence is that the batchlog gets
   completely flooded. This leads to ridiculously large batchlogs (I observed
   batchlogs of 60GB), zillions of compactions and quadrillions of tombstones.
   It is a pure resource killer, especially because the batchlog uses a CF as
   a queue.
   - Applying every mutation separately causes read-before-writes during MV
   mutation. This is of course an order of magnitude slower than simply
   streaming down an SSTable. The effect becomes even worse as the bootstrap
   progresses and creates more and more (uncompacted) SSTables; many of them
   won't ever be compacted because the batchlog eats all the resources
   available for compaction.
   - Streaming down the MV tables AND applying the mutations of the base
   tables leads to redundant writes. Redundant writes are local if the PK of
   the MV == the PK of the base table and - even worse - remote if not. Remote
   MV updates will impact nodes that aren't even part of the bootstrap.
   - CDC should also not be necessary during bootstrap, should it? TBD
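
To make the difference concrete, here is a heavily simplified sketch of what
"through the write path" costs per streamed mutation, compared to just adding
the sstable. All type and method names are invented for illustration; they do
not correspond to the actual stream-receive or view write-path code.

    import java.util.List;

    // Stand-in types invented for this sketch; none of these names exist in
    // the Cassandra codebase in this form.
    interface Mutation {}
    interface Row {}
    interface SSTable { Iterable<Mutation> mutations(); }

    interface BaseTable
    {
        Row readExistingRow(Mutation m);                        // read-before-write
        List<Mutation> buildViewUpdates(Row existing, Mutation m);
        void writeCommitLog(Mutation m);
        void storeInBatchlog(List<Mutation> viewUpdates);
        void applyLocallyOrForwardToViewReplica(Mutation viewUpdate);
        void applyLocally(Mutation m);
        void addSSTableDirectly(SSTable sstable);
    }

    final class StreamCostSketch
    {
        // Current behaviour (simplified): every streamed mutation of an MV base
        // table is pushed through the normal write path.
        static void viaWritePath(BaseTable table, SSTable streamed)
        {
            for (Mutation m : streamed.mutations())
            {
                table.writeCommitLog(m);                          // unnecessary for bootstrap
                Row existing = table.readExistingRow(m);          // one read per mutation
                List<Mutation> viewUpdates = table.buildViewUpdates(existing, m);
                table.storeInBatchlog(viewUpdates);               // batchlog grows with the stream
                for (Mutation vu : viewUpdates)
                    table.applyLocallyOrForwardToViewReplica(vu); // may even hit remote nodes
                table.applyLocally(m);
            }
        }

        // Proposed behaviour: base table and views are streamed as sstables and
        // simply added - no commit log, no batchlog, no reads, no remote writes.
        static void directSSTableAdd(BaseTable table, SSTable streamed)
        {
            table.addSSTableDirectly(streamed);
        }
    }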

2. Repair
The negative impact is similar to bootstrap, but ...

   - Sending repair streams through the write path does not mark the streamed
   sstables as repaired, see CASSANDRA-12888. NOT doing so instantly solves
   that issue, and much more simply than any other solution.
   - It changes the "repair design" a bit: repairing a base table no longer
   automatically repairs the MV. But is that bad at all? To be honest, as a
   newbie it was very hard for me to understand what I had to do to be sure
   that everything is repaired correctly. Recently I was told NOT to repair MV
   CFs but only the base tables. This means one cannot just call
   "nodetool repair $keyspace" - this is complicated, not transparent and it
   sucks. I changed the behaviour in my own branch and ran the dtests for MVs.
   2 tests failed:
  - base_replica_repair_test of course fails due to the design change
  - really_complex_repair_test fails because it intentionally times out
  the batchlog. IMHO this is a bearable situation. It is comparable to
  resurrected tombstones when running a repair after gc_grace_seconds has
  expired - you would not expect that to be magically fixed either. The
  gc_grace_seconds default is 10 days, and I can expect that anybody also
  repairs their MVs within that period, not only the base tables.

3. Rebuild
Same as for bootstrap, isn't it?

Did I forget any cases?
What do you think?

-- 
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer