Re: [DISCUSS] CASSANDRA-12126: LWTs correcteness and performance

2020-11-24 Thread Ekaterina Dimitrova
I am +1 on Benjamin’s proposal
and less interruptions during upgrades. For more visibility maybe we can
also write a short article about the options and the tradeoffs, further to
NEWS.txt (that’s not something to decide now, of course :-) )


On Tue, 24 Nov 2020 at 9:13, Benjamin Lerer 
wrote:

> Paulo, what you propose with the yaml seems different from default to
> *correctness*. It means to me that we are forcing the user to choose
> between *correctness *and *performance*. Most of us have a good
> understanding of the problem and it is a hard choice for us. I imagine that
> most of the users do not fully understand LWTs and will not know what to
> choose. Some might not even use LWTs and will suddenly be forced to make a
> choice that they do not understand. It does not feel right to me to push
> them to make that choice.
>
> I also agree with Benedict and Mick that it is a risky thing to do.
>
> something that can bring a cluster down upon an unprepared user.
>
>
> I do not think that it will be the case (feel free to correct me Benedict).
> The impact will probably be an increase in the number of write/read
> timeouts for the LWTs read/writes. For a heavy load that would cause the
> services depending on those queries to become unreliable. On the other hand
> the impact of the current problem is that we can hit some correctness issue
> without even knowing it.
>
> We need to choose between two imperfect solutions and we have some
> difficulties to agree on which one to choose.
>
> Benedict suggested that Sylvain and I made the choice. Sylvain did not want
> to make the final call.
> I chose correctness. If it is a problem and people prefer to vote. It is
> perfectly fine for me too :-)
>
> I just want us to move forward.
>
>
>
> On Tue, Nov 24, 2020 at 12:52 PM Mick Semb Wever  wrote:
>
> > > I think the keyword there is "normally" - if we can't say _certainly_,
> > > then this is probably an unsafe change to make.
> > >
> > > I can imagine any number of hacky upgrade processes that would be
> > > dangerous with this change.
> > >
> >
> >
> > I agree. We just don't know what users are doing, this is risky.
> >
> > IMO the same applies to a performance degradation, i.e. something that
> can
> > bring a cluster down upon an unprepared user. Despite our best efforts
> with
> > NEWS.txt we should still look after such users. IMHO the imperfection of
> > LWTs on past branches we have to carry. I'm well aware this is easier
> said
> > than done, even for far simpler changes. Having the flag there to switch
> to
> > "correct LWT" is still a huge win for users.
> >
>


Re: [DISCUSS] CASSANDRA-12126: LWTs correcteness and performance

2020-11-24 Thread Michael Semb Wever


> Benedict suggested that Sylvain and I made the choice. Sylvain did not want
> to make the final call.
> I chose correctness. If it is a problem and people prefer to vote. It is
> perfectly fine for me too :-)


+1
Appreciate it having been raised for exposure and discussion Benjamin, and 
happy to leave the final say to those carrying the work on, especially in this 
case :-)

-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org



Re: [DISCUSS] CASSANDRA-12126: LWTs correcteness and performance

2020-11-24 Thread Paulo Motta
Fair points. I retract the yaml suggestion and +1 to go with the
correctness route.

Em ter., 24 de nov. de 2020 às 11:13, Benjamin Lerer <
benjamin.le...@datastax.com> escreveu:

> Paulo, what you propose with the yaml seems different from default to
> *correctness*. It means to me that we are forcing the user to choose
> between *correctness *and *performance*. Most of us have a good
> understanding of the problem and it is a hard choice for us. I imagine that
> most of the users do not fully understand LWTs and will not know what to
> choose. Some might not even use LWTs and will suddenly be forced to make a
> choice that they do not understand. It does not feel right to me to push
> them to make that choice.
>
> I also agree with Benedict and Mick that it is a risky thing to do.
>
> something that can bring a cluster down upon an unprepared user.
>
>
> I do not think that it will be the case (feel free to correct me Benedict).
> The impact will probably be an increase in the number of write/read
> timeouts for the LWTs read/writes. For a heavy load that would cause the
> services depending on those queries to become unreliable. On the other hand
> the impact of the current problem is that we can hit some correctness issue
> without even knowing it.
>
> We need to choose between two imperfect solutions and we have some
> difficulties to agree on which one to choose.
>
> Benedict suggested that Sylvain and I made the choice. Sylvain did not want
> to make the final call.
> I chose correctness. If it is a problem and people prefer to vote. It is
> perfectly fine for me too :-)
>
> I just want us to move forward.
>
>
>
> On Tue, Nov 24, 2020 at 12:52 PM Mick Semb Wever  wrote:
>
> > > I think the keyword there is "normally" - if we can't say _certainly_,
> > > then this is probably an unsafe change to make.
> > >
> > > I can imagine any number of hacky upgrade processes that would be
> > > dangerous with this change.
> > >
> >
> >
> > I agree. We just don't know what users are doing, this is risky.
> >
> > IMO the same applies to a performance degradation, i.e. something that
> can
> > bring a cluster down upon an unprepared user. Despite our best efforts
> with
> > NEWS.txt we should still look after such users. IMHO the imperfection of
> > LWTs on past branches we have to carry. I'm well aware this is easier
> said
> > than done, even for far simpler changes. Having the flag there to switch
> to
> > "correct LWT" is still a huge win for users.
> >
>


Re: [DISCUSS] CASSANDRA-12126: LWTs correcteness and performance

2020-11-24 Thread Benjamin Lerer
Paulo, what you propose with the yaml seems different from default to
*correctness*. It means to me that we are forcing the user to choose
between *correctness *and *performance*. Most of us have a good
understanding of the problem and it is a hard choice for us. I imagine that
most of the users do not fully understand LWTs and will not know what to
choose. Some might not even use LWTs and will suddenly be forced to make a
choice that they do not understand. It does not feel right to me to push
them to make that choice.

I also agree with Benedict and Mick that it is a risky thing to do.

something that can bring a cluster down upon an unprepared user.


I do not think that it will be the case (feel free to correct me Benedict).
The impact will probably be an increase in the number of write/read
timeouts for the LWTs read/writes. For a heavy load that would cause the
services depending on those queries to become unreliable. On the other hand
the impact of the current problem is that we can hit some correctness issue
without even knowing it.

We need to choose between two imperfect solutions and we have some
difficulties to agree on which one to choose.

Benedict suggested that Sylvain and I made the choice. Sylvain did not want
to make the final call.
I chose correctness. If it is a problem and people prefer to vote. It is
perfectly fine for me too :-)

I just want us to move forward.



On Tue, Nov 24, 2020 at 12:52 PM Mick Semb Wever  wrote:

> > I think the keyword there is "normally" - if we can't say _certainly_,
> > then this is probably an unsafe change to make.
> >
> > I can imagine any number of hacky upgrade processes that would be
> > dangerous with this change.
> >
>
>
> I agree. We just don't know what users are doing, this is risky.
>
> IMO the same applies to a performance degradation, i.e. something that can
> bring a cluster down upon an unprepared user. Despite our best efforts with
> NEWS.txt we should still look after such users. IMHO the imperfection of
> LWTs on past branches we have to carry. I'm well aware this is easier said
> than done, even for far simpler changes. Having the flag there to switch to
> "correct LWT" is still a huge win for users.
>


Re: [DISCUSS] CASSANDRA-12126: LWTs correcteness and performance

2020-11-24 Thread Mick Semb Wever
> I think the keyword there is "normally" - if we can't say _certainly_,
> then this is probably an unsafe change to make.
>
> I can imagine any number of hacky upgrade processes that would be
> dangerous with this change.
>


I agree. We just don't know what users are doing, this is risky.

IMO the same applies to a performance degradation, i.e. something that can
bring a cluster down upon an unprepared user. Despite our best efforts with
NEWS.txt we should still look after such users. IMHO the imperfection of
LWTs on past branches we have to carry. I'm well aware this is easier said
than done, even for far simpler changes. Having the flag there to switch to
"correct LWT" is still a huge win for users.


Re: [DISCUSS] CASSANDRA-12126: LWTs correcteness and performance

2020-11-24 Thread Benedict Elliott Smith
I think the keyword there is "normally" - if we can't say _certainly_, then 
this is probably an unsafe change to make.

I can imagine any number of hacky upgrade processes that would be dangerous 
with this change.

But, happy to defer to the consensus of others.



On 24/11/2020, 11:04, "Paulo Motta"  wrote:

 In this case the breaking change is a feature, not a bug. The exact
intention of this is to require manual intervention to raise awareness
about the potential performance degradation. This sounds reasonable, once
we already broke the contract of not introducing performance regressions in
a minor.

I don't see how this can pose an outage risk to the cluster given upgrades
are normally performed in a rolling restart fashion, so the worst that
could happen is the first node in the sequence not starting, so the upgrade
would not proceed. In my view this would be far less harmful than figuring
out about a performance regression after all your nodes are upgraded.

Nevertheless, I'm pretty fine on retracting the suggestion to move forward
with the proposal if you feel strongly about it.

Em ter., 24 de nov. de 2020 às 07:26, Benedict Elliott Smith <
bened...@apache.org> escreveu:

> In my parlance the config property would be a breaking change, whereas the
> LWT behaviour would be a performance regression.  This latter might cause
> partial outages or service degradation, but refusing to start a prod
> cluster without manual intervention is potentially a much worse situation,
> and even more surprising for a patch upgrade.
>
> On 24/11/2020, 01:05, "Paulo Motta"  wrote:
>
> Isn't the plan to change LWT implementation (and performance
> expectation)
> in a patch version? This is a breaking change by itself, I'm just
> proposing
> to make the trade-off choice explicit in the yaml to prevent 
unexpected
> performance degradation during upgrade (for users who are not aware of
> the
> change).
>
> Just to make it clear, I'm proposing having a "lwt_legacy_mode: false"
> uncommented in the default yaml with a descriptive comment about
> CASSANDRA-12126, so new users will always get the new behavior, but
> users
> using a yaml template based on a previous 3.X version will not be able
> to
> start the node because this property will be missing. I believe the
> majority of operators will just update their yaml with
> "lwt_legacy_mode:
> false" and move on with their upgrades, but people wanting to keep the
> previous performance will become aware of the breaking change and set
> it to
> true.
>
> Em seg., 23 de nov. de 2020 às 21:07, Benedict Elliott Smith <
> bened...@apache.org> escreveu:
>
> > What do you mean by minor upgrade? We can't break patch upgrades for
> any
> > of 3.x, as this could also cause surprise outages.
> >
> > On 23/11/2020, 23:51, "Paulo Motta" 
> wrote:
> >
> >  I was thinking about the YAML requirement during the 3.X minor
> > upgrade to
> > make the decision explicit (need to update yaml) rather than
> implicit
> > (by
> > upgrading you agree with the change), since the latter can go
> > unnoticed by
> > those who don't pay attention to NEWS.txt
> >
> > Em seg., 23 de nov. de 2020 às 20:03, Benedict Elliott Smith <
> > bened...@apache.org> escreveu:
> >
> > > What's the value of the yaml? The user is likely to have
> upgraded to
> > > latest 3.x as part of the upgrade process to 4.0, so they'll
> already
> > have
> > > had a decision made for them. If correctness didn't break
> anything,
> > there
> > > doesn't any longer seem much point in offering a choice?
> > >
> > > On 23/11/2020, 22:45, "Brandon Williams" 
> wrote:
> > >
> > > +1 to both as well.
> > >
> > > On Mon, Nov 23, 2020, 4:42 PM Blake Eggleston
> > > 
> > > wrote:
> > >
> > > > +1 to correctness, and I like the yaml idea
> > > >
> > > > > On Nov 23, 2020, at 4:20 AM, Paulo Motta <
> > pauloricard...@gmail.com
> > > >
> > > > wrote:
> > > > >
> > > > > +1 to defaulting for correctness.
> > > > >
> > > > > In addition to that, how about making it a mandatory
> > cassandra.yaml
> > > > > property defaulting to correctness? This would make
> upgrades
> > with
> > > an old
> > > > > cassandra.yaml fail 

Re: [DISCUSS] CASSANDRA-12126: LWTs correcteness and performance

2020-11-24 Thread Paulo Motta
 In this case the breaking change is a feature, not a bug. The exact
intention of this is to require manual intervention to raise awareness
about the potential performance degradation. This sounds reasonable, once
we already broke the contract of not introducing performance regressions in
a minor.

I don't see how this can pose an outage risk to the cluster given upgrades
are normally performed in a rolling restart fashion, so the worst that
could happen is the first node in the sequence not starting, so the upgrade
would not proceed. In my view this would be far less harmful than figuring
out about a performance regression after all your nodes are upgraded.

Nevertheless, I'm pretty fine on retracting the suggestion to move forward
with the proposal if you feel strongly about it.

Em ter., 24 de nov. de 2020 às 07:26, Benedict Elliott Smith <
bened...@apache.org> escreveu:

> In my parlance the config property would be a breaking change, whereas the
> LWT behaviour would be a performance regression.  This latter might cause
> partial outages or service degradation, but refusing to start a prod
> cluster without manual intervention is potentially a much worse situation,
> and even more surprising for a patch upgrade.
>
> On 24/11/2020, 01:05, "Paulo Motta"  wrote:
>
> Isn't the plan to change LWT implementation (and performance
> expectation)
> in a patch version? This is a breaking change by itself, I'm just
> proposing
> to make the trade-off choice explicit in the yaml to prevent unexpected
> performance degradation during upgrade (for users who are not aware of
> the
> change).
>
> Just to make it clear, I'm proposing having a "lwt_legacy_mode: false"
> uncommented in the default yaml with a descriptive comment about
> CASSANDRA-12126, so new users will always get the new behavior, but
> users
> using a yaml template based on a previous 3.X version will not be able
> to
> start the node because this property will be missing. I believe the
> majority of operators will just update their yaml with
> "lwt_legacy_mode:
> false" and move on with their upgrades, but people wanting to keep the
> previous performance will become aware of the breaking change and set
> it to
> true.
>
> Em seg., 23 de nov. de 2020 às 21:07, Benedict Elliott Smith <
> bened...@apache.org> escreveu:
>
> > What do you mean by minor upgrade? We can't break patch upgrades for
> any
> > of 3.x, as this could also cause surprise outages.
> >
> > On 23/11/2020, 23:51, "Paulo Motta" 
> wrote:
> >
> >  I was thinking about the YAML requirement during the 3.X minor
> > upgrade to
> > make the decision explicit (need to update yaml) rather than
> implicit
> > (by
> > upgrading you agree with the change), since the latter can go
> > unnoticed by
> > those who don't pay attention to NEWS.txt
> >
> > Em seg., 23 de nov. de 2020 às 20:03, Benedict Elliott Smith <
> > bened...@apache.org> escreveu:
> >
> > > What's the value of the yaml? The user is likely to have
> upgraded to
> > > latest 3.x as part of the upgrade process to 4.0, so they'll
> already
> > have
> > > had a decision made for them. If correctness didn't break
> anything,
> > there
> > > doesn't any longer seem much point in offering a choice?
> > >
> > > On 23/11/2020, 22:45, "Brandon Williams" 
> wrote:
> > >
> > > +1 to both as well.
> > >
> > > On Mon, Nov 23, 2020, 4:42 PM Blake Eggleston
> > > 
> > > wrote:
> > >
> > > > +1 to correctness, and I like the yaml idea
> > > >
> > > > > On Nov 23, 2020, at 4:20 AM, Paulo Motta <
> > pauloricard...@gmail.com
> > > >
> > > > wrote:
> > > > >
> > > > > +1 to defaulting for correctness.
> > > > >
> > > > > In addition to that, how about making it a mandatory
> > cassandra.yaml
> > > > > property defaulting to correctness? This would make
> upgrades
> > with
> > > an old
> > > > > cassandra.yaml fail unless an option is explicitly
> specified,
> > > making
> > > > > operators aware of the issue and forcing them to make a
> > choice.
> > > > >
> > > > >> Em seg., 23 de nov. de 2020 às 07:30, Benjamin Lerer <
> > > > >> benjamin.le...@datastax.com> escreveu:
> > > > >>
> > > > >> Thank you very much to everybody that provided
> feedback. It
> > > helped a
> > > > lot to
> > > > >> limit our options.
> > > > >>
> > > > >> Unfortunately, it seems that some poor soul (me,
> really!!!)
> > will
> > > have to
> > > > >> make the final call between #3 and #4.
>  

Re: [DISCUSS] CASSANDRA-12126: LWTs correcteness and performance

2020-11-24 Thread Benedict Elliott Smith
In my parlance the config property would be a breaking change, whereas the LWT 
behaviour would be a performance regression.  This latter might cause partial 
outages or service degradation, but refusing to start a prod cluster without 
manual intervention is potentially a much worse situation, and even more 
surprising for a patch upgrade. 

On 24/11/2020, 01:05, "Paulo Motta"  wrote:

Isn't the plan to change LWT implementation (and performance expectation)
in a patch version? This is a breaking change by itself, I'm just proposing
to make the trade-off choice explicit in the yaml to prevent unexpected
performance degradation during upgrade (for users who are not aware of the
change).

Just to make it clear, I'm proposing having a "lwt_legacy_mode: false"
uncommented in the default yaml with a descriptive comment about
CASSANDRA-12126, so new users will always get the new behavior, but users
using a yaml template based on a previous 3.X version will not be able to
start the node because this property will be missing. I believe the
majority of operators will just update their yaml with "lwt_legacy_mode:
false" and move on with their upgrades, but people wanting to keep the
previous performance will become aware of the breaking change and set it to
true.

Em seg., 23 de nov. de 2020 às 21:07, Benedict Elliott Smith <
bened...@apache.org> escreveu:

> What do you mean by minor upgrade? We can't break patch upgrades for any
> of 3.x, as this could also cause surprise outages.
>
> On 23/11/2020, 23:51, "Paulo Motta"  wrote:
>
>  I was thinking about the YAML requirement during the 3.X minor
> upgrade to
> make the decision explicit (need to update yaml) rather than implicit
> (by
> upgrading you agree with the change), since the latter can go
> unnoticed by
> those who don't pay attention to NEWS.txt
>
> Em seg., 23 de nov. de 2020 às 20:03, Benedict Elliott Smith <
> bened...@apache.org> escreveu:
>
> > What's the value of the yaml? The user is likely to have upgraded to
> > latest 3.x as part of the upgrade process to 4.0, so they'll already
> have
> > had a decision made for them. If correctness didn't break anything,
> there
> > doesn't any longer seem much point in offering a choice?
> >
> > On 23/11/2020, 22:45, "Brandon Williams"  wrote:
> >
> > +1 to both as well.
> >
> > On Mon, Nov 23, 2020, 4:42 PM Blake Eggleston
> > 
> > wrote:
> >
> > > +1 to correctness, and I like the yaml idea
> > >
> > > > On Nov 23, 2020, at 4:20 AM, Paulo Motta <
> pauloricard...@gmail.com
> > >
> > > wrote:
> > > >
> > > > +1 to defaulting for correctness.
> > > >
> > > > In addition to that, how about making it a mandatory
> cassandra.yaml
> > > > property defaulting to correctness? This would make upgrades
> with
> > an old
> > > > cassandra.yaml fail unless an option is explicitly 
specified,
> > making
> > > > operators aware of the issue and forcing them to make a
> choice.
> > > >
> > > >> Em seg., 23 de nov. de 2020 às 07:30, Benjamin Lerer <
> > > >> benjamin.le...@datastax.com> escreveu:
> > > >>
> > > >> Thank you very much to everybody that provided feedback. It
> > helped a
> > > lot to
> > > >> limit our options.
> > > >>
> > > >> Unfortunately, it seems that some poor soul (me, really!!!)
> will
> > have to
> > > >> make the final call between #3 and #4.
> > > >>
> > > >> If I reformulate the question to: Do we default to
> *correctness
> > *or to
> > > >> *performance*?
> > > >>
> > > >> I would choose to default to *correctness*.
> > > >>
> > > >> Of course the situation is more complex than that but it
> seems
> > that
> > > >> somebody has to make a call and live with it. It seems to
> me that
> > being
> > > >> blamed for choosing correctness is easier to live with ;-)
> > > >>
> > > >> Benjamin
> > > >>
> > > >> PS: I tried to push the choice on Sylvain but he dodged the
> > bullet.
> > > >>
> > > >> On Sat, Nov 21, 2020 at 12:30 AM Benedict Elliott Smith <
> > > >> bened...@apache.org>
> > > >> wrote:
> > > >>
> > > >>> I think I meant #4 __‍♂️
> > > >>>
> > > >>> On 20/11/2020, 21:11, "Blake Eggleston"
> >  > >