Re: [DISCUSS] CASSANDRA-12126: LWTs correcteness and performance
I am +1 on Benjamin’s proposal and less interruptions during upgrades. For more visibility maybe we can also write a short article about the options and the tradeoffs, further to NEWS.txt (that’s not something to decide now, of course :-) ) On Tue, 24 Nov 2020 at 9:13, Benjamin Lerer wrote: > Paulo, what you propose with the yaml seems different from default to > *correctness*. It means to me that we are forcing the user to choose > between *correctness *and *performance*. Most of us have a good > understanding of the problem and it is a hard choice for us. I imagine that > most of the users do not fully understand LWTs and will not know what to > choose. Some might not even use LWTs and will suddenly be forced to make a > choice that they do not understand. It does not feel right to me to push > them to make that choice. > > I also agree with Benedict and Mick that it is a risky thing to do. > > something that can bring a cluster down upon an unprepared user. > > > I do not think that it will be the case (feel free to correct me Benedict). > The impact will probably be an increase in the number of write/read > timeouts for the LWTs read/writes. For a heavy load that would cause the > services depending on those queries to become unreliable. On the other hand > the impact of the current problem is that we can hit some correctness issue > without even knowing it. > > We need to choose between two imperfect solutions and we have some > difficulties to agree on which one to choose. > > Benedict suggested that Sylvain and I made the choice. Sylvain did not want > to make the final call. > I chose correctness. If it is a problem and people prefer to vote. It is > perfectly fine for me too :-) > > I just want us to move forward. > > > > On Tue, Nov 24, 2020 at 12:52 PM Mick Semb Wever wrote: > > > > I think the keyword there is "normally" - if we can't say _certainly_, > > > then this is probably an unsafe change to make. > > > > > > I can imagine any number of hacky upgrade processes that would be > > > dangerous with this change. > > > > > > > > > I agree. We just don't know what users are doing, this is risky. > > > > IMO the same applies to a performance degradation, i.e. something that > can > > bring a cluster down upon an unprepared user. Despite our best efforts > with > > NEWS.txt we should still look after such users. IMHO the imperfection of > > LWTs on past branches we have to carry. I'm well aware this is easier > said > > than done, even for far simpler changes. Having the flag there to switch > to > > "correct LWT" is still a huge win for users. > > >
Re: [DISCUSS] CASSANDRA-12126: LWTs correcteness and performance
> Benedict suggested that Sylvain and I made the choice. Sylvain did not want > to make the final call. > I chose correctness. If it is a problem and people prefer to vote. It is > perfectly fine for me too :-) +1 Appreciate it having been raised for exposure and discussion Benjamin, and happy to leave the final say to those carrying the work on, especially in this case :-) - To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org For additional commands, e-mail: dev-h...@cassandra.apache.org
Re: [DISCUSS] CASSANDRA-12126: LWTs correcteness and performance
Fair points. I retract the yaml suggestion and +1 to go with the correctness route. Em ter., 24 de nov. de 2020 às 11:13, Benjamin Lerer < benjamin.le...@datastax.com> escreveu: > Paulo, what you propose with the yaml seems different from default to > *correctness*. It means to me that we are forcing the user to choose > between *correctness *and *performance*. Most of us have a good > understanding of the problem and it is a hard choice for us. I imagine that > most of the users do not fully understand LWTs and will not know what to > choose. Some might not even use LWTs and will suddenly be forced to make a > choice that they do not understand. It does not feel right to me to push > them to make that choice. > > I also agree with Benedict and Mick that it is a risky thing to do. > > something that can bring a cluster down upon an unprepared user. > > > I do not think that it will be the case (feel free to correct me Benedict). > The impact will probably be an increase in the number of write/read > timeouts for the LWTs read/writes. For a heavy load that would cause the > services depending on those queries to become unreliable. On the other hand > the impact of the current problem is that we can hit some correctness issue > without even knowing it. > > We need to choose between two imperfect solutions and we have some > difficulties to agree on which one to choose. > > Benedict suggested that Sylvain and I made the choice. Sylvain did not want > to make the final call. > I chose correctness. If it is a problem and people prefer to vote. It is > perfectly fine for me too :-) > > I just want us to move forward. > > > > On Tue, Nov 24, 2020 at 12:52 PM Mick Semb Wever wrote: > > > > I think the keyword there is "normally" - if we can't say _certainly_, > > > then this is probably an unsafe change to make. > > > > > > I can imagine any number of hacky upgrade processes that would be > > > dangerous with this change. > > > > > > > > > I agree. We just don't know what users are doing, this is risky. > > > > IMO the same applies to a performance degradation, i.e. something that > can > > bring a cluster down upon an unprepared user. Despite our best efforts > with > > NEWS.txt we should still look after such users. IMHO the imperfection of > > LWTs on past branches we have to carry. I'm well aware this is easier > said > > than done, even for far simpler changes. Having the flag there to switch > to > > "correct LWT" is still a huge win for users. > > >
Re: [DISCUSS] CASSANDRA-12126: LWTs correcteness and performance
Paulo, what you propose with the yaml seems different from default to *correctness*. It means to me that we are forcing the user to choose between *correctness *and *performance*. Most of us have a good understanding of the problem and it is a hard choice for us. I imagine that most of the users do not fully understand LWTs and will not know what to choose. Some might not even use LWTs and will suddenly be forced to make a choice that they do not understand. It does not feel right to me to push them to make that choice. I also agree with Benedict and Mick that it is a risky thing to do. something that can bring a cluster down upon an unprepared user. I do not think that it will be the case (feel free to correct me Benedict). The impact will probably be an increase in the number of write/read timeouts for the LWTs read/writes. For a heavy load that would cause the services depending on those queries to become unreliable. On the other hand the impact of the current problem is that we can hit some correctness issue without even knowing it. We need to choose between two imperfect solutions and we have some difficulties to agree on which one to choose. Benedict suggested that Sylvain and I made the choice. Sylvain did not want to make the final call. I chose correctness. If it is a problem and people prefer to vote. It is perfectly fine for me too :-) I just want us to move forward. On Tue, Nov 24, 2020 at 12:52 PM Mick Semb Wever wrote: > > I think the keyword there is "normally" - if we can't say _certainly_, > > then this is probably an unsafe change to make. > > > > I can imagine any number of hacky upgrade processes that would be > > dangerous with this change. > > > > > I agree. We just don't know what users are doing, this is risky. > > IMO the same applies to a performance degradation, i.e. something that can > bring a cluster down upon an unprepared user. Despite our best efforts with > NEWS.txt we should still look after such users. IMHO the imperfection of > LWTs on past branches we have to carry. I'm well aware this is easier said > than done, even for far simpler changes. Having the flag there to switch to > "correct LWT" is still a huge win for users. >
Re: [DISCUSS] CASSANDRA-12126: LWTs correcteness and performance
> I think the keyword there is "normally" - if we can't say _certainly_, > then this is probably an unsafe change to make. > > I can imagine any number of hacky upgrade processes that would be > dangerous with this change. > I agree. We just don't know what users are doing, this is risky. IMO the same applies to a performance degradation, i.e. something that can bring a cluster down upon an unprepared user. Despite our best efforts with NEWS.txt we should still look after such users. IMHO the imperfection of LWTs on past branches we have to carry. I'm well aware this is easier said than done, even for far simpler changes. Having the flag there to switch to "correct LWT" is still a huge win for users.
Re: [DISCUSS] CASSANDRA-12126: LWTs correcteness and performance
I think the keyword there is "normally" - if we can't say _certainly_, then this is probably an unsafe change to make. I can imagine any number of hacky upgrade processes that would be dangerous with this change. But, happy to defer to the consensus of others. On 24/11/2020, 11:04, "Paulo Motta" wrote: In this case the breaking change is a feature, not a bug. The exact intention of this is to require manual intervention to raise awareness about the potential performance degradation. This sounds reasonable, once we already broke the contract of not introducing performance regressions in a minor. I don't see how this can pose an outage risk to the cluster given upgrades are normally performed in a rolling restart fashion, so the worst that could happen is the first node in the sequence not starting, so the upgrade would not proceed. In my view this would be far less harmful than figuring out about a performance regression after all your nodes are upgraded. Nevertheless, I'm pretty fine on retracting the suggestion to move forward with the proposal if you feel strongly about it. Em ter., 24 de nov. de 2020 às 07:26, Benedict Elliott Smith < bened...@apache.org> escreveu: > In my parlance the config property would be a breaking change, whereas the > LWT behaviour would be a performance regression. This latter might cause > partial outages or service degradation, but refusing to start a prod > cluster without manual intervention is potentially a much worse situation, > and even more surprising for a patch upgrade. > > On 24/11/2020, 01:05, "Paulo Motta" wrote: > > Isn't the plan to change LWT implementation (and performance > expectation) > in a patch version? This is a breaking change by itself, I'm just > proposing > to make the trade-off choice explicit in the yaml to prevent unexpected > performance degradation during upgrade (for users who are not aware of > the > change). > > Just to make it clear, I'm proposing having a "lwt_legacy_mode: false" > uncommented in the default yaml with a descriptive comment about > CASSANDRA-12126, so new users will always get the new behavior, but > users > using a yaml template based on a previous 3.X version will not be able > to > start the node because this property will be missing. I believe the > majority of operators will just update their yaml with > "lwt_legacy_mode: > false" and move on with their upgrades, but people wanting to keep the > previous performance will become aware of the breaking change and set > it to > true. > > Em seg., 23 de nov. de 2020 às 21:07, Benedict Elliott Smith < > bened...@apache.org> escreveu: > > > What do you mean by minor upgrade? We can't break patch upgrades for > any > > of 3.x, as this could also cause surprise outages. > > > > On 23/11/2020, 23:51, "Paulo Motta" > wrote: > > > > I was thinking about the YAML requirement during the 3.X minor > > upgrade to > > make the decision explicit (need to update yaml) rather than > implicit > > (by > > upgrading you agree with the change), since the latter can go > > unnoticed by > > those who don't pay attention to NEWS.txt > > > > Em seg., 23 de nov. de 2020 às 20:03, Benedict Elliott Smith < > > bened...@apache.org> escreveu: > > > > > What's the value of the yaml? The user is likely to have > upgraded to > > > latest 3.x as part of the upgrade process to 4.0, so they'll > already > > have > > > had a decision made for them. If correctness didn't break > anything, > > there > > > doesn't any longer seem much point in offering a choice? > > > > > > On 23/11/2020, 22:45, "Brandon Williams" > wrote: > > > > > > +1 to both as well. > > > > > > On Mon, Nov 23, 2020, 4:42 PM Blake Eggleston > > > > > > wrote: > > > > > > > +1 to correctness, and I like the yaml idea > > > > > > > > > On Nov 23, 2020, at 4:20 AM, Paulo Motta < > > pauloricard...@gmail.com > > > > > > > > wrote: > > > > > > > > > > +1 to defaulting for correctness. > > > > > > > > > > In addition to that, how about making it a mandatory > > cassandra.yaml > > > > > property defaulting to correctness? This would make > upgrades > > with > > > an old > > > > > cassandra.yaml fail
Re: [DISCUSS] CASSANDRA-12126: LWTs correcteness and performance
In this case the breaking change is a feature, not a bug. The exact intention of this is to require manual intervention to raise awareness about the potential performance degradation. This sounds reasonable, once we already broke the contract of not introducing performance regressions in a minor. I don't see how this can pose an outage risk to the cluster given upgrades are normally performed in a rolling restart fashion, so the worst that could happen is the first node in the sequence not starting, so the upgrade would not proceed. In my view this would be far less harmful than figuring out about a performance regression after all your nodes are upgraded. Nevertheless, I'm pretty fine on retracting the suggestion to move forward with the proposal if you feel strongly about it. Em ter., 24 de nov. de 2020 às 07:26, Benedict Elliott Smith < bened...@apache.org> escreveu: > In my parlance the config property would be a breaking change, whereas the > LWT behaviour would be a performance regression. This latter might cause > partial outages or service degradation, but refusing to start a prod > cluster without manual intervention is potentially a much worse situation, > and even more surprising for a patch upgrade. > > On 24/11/2020, 01:05, "Paulo Motta" wrote: > > Isn't the plan to change LWT implementation (and performance > expectation) > in a patch version? This is a breaking change by itself, I'm just > proposing > to make the trade-off choice explicit in the yaml to prevent unexpected > performance degradation during upgrade (for users who are not aware of > the > change). > > Just to make it clear, I'm proposing having a "lwt_legacy_mode: false" > uncommented in the default yaml with a descriptive comment about > CASSANDRA-12126, so new users will always get the new behavior, but > users > using a yaml template based on a previous 3.X version will not be able > to > start the node because this property will be missing. I believe the > majority of operators will just update their yaml with > "lwt_legacy_mode: > false" and move on with their upgrades, but people wanting to keep the > previous performance will become aware of the breaking change and set > it to > true. > > Em seg., 23 de nov. de 2020 às 21:07, Benedict Elliott Smith < > bened...@apache.org> escreveu: > > > What do you mean by minor upgrade? We can't break patch upgrades for > any > > of 3.x, as this could also cause surprise outages. > > > > On 23/11/2020, 23:51, "Paulo Motta" > wrote: > > > > I was thinking about the YAML requirement during the 3.X minor > > upgrade to > > make the decision explicit (need to update yaml) rather than > implicit > > (by > > upgrading you agree with the change), since the latter can go > > unnoticed by > > those who don't pay attention to NEWS.txt > > > > Em seg., 23 de nov. de 2020 às 20:03, Benedict Elliott Smith < > > bened...@apache.org> escreveu: > > > > > What's the value of the yaml? The user is likely to have > upgraded to > > > latest 3.x as part of the upgrade process to 4.0, so they'll > already > > have > > > had a decision made for them. If correctness didn't break > anything, > > there > > > doesn't any longer seem much point in offering a choice? > > > > > > On 23/11/2020, 22:45, "Brandon Williams" > wrote: > > > > > > +1 to both as well. > > > > > > On Mon, Nov 23, 2020, 4:42 PM Blake Eggleston > > > > > > wrote: > > > > > > > +1 to correctness, and I like the yaml idea > > > > > > > > > On Nov 23, 2020, at 4:20 AM, Paulo Motta < > > pauloricard...@gmail.com > > > > > > > > wrote: > > > > > > > > > > +1 to defaulting for correctness. > > > > > > > > > > In addition to that, how about making it a mandatory > > cassandra.yaml > > > > > property defaulting to correctness? This would make > upgrades > > with > > > an old > > > > > cassandra.yaml fail unless an option is explicitly > specified, > > > making > > > > > operators aware of the issue and forcing them to make a > > choice. > > > > > > > > > >> Em seg., 23 de nov. de 2020 às 07:30, Benjamin Lerer < > > > > >> benjamin.le...@datastax.com> escreveu: > > > > >> > > > > >> Thank you very much to everybody that provided > feedback. It > > > helped a > > > > lot to > > > > >> limit our options. > > > > >> > > > > >> Unfortunately, it seems that some poor soul (me, > really!!!) > > will > > > have to > > > > >> make the final call between #3 and #4. >
Re: [DISCUSS] CASSANDRA-12126: LWTs correcteness and performance
In my parlance the config property would be a breaking change, whereas the LWT behaviour would be a performance regression. This latter might cause partial outages or service degradation, but refusing to start a prod cluster without manual intervention is potentially a much worse situation, and even more surprising for a patch upgrade. On 24/11/2020, 01:05, "Paulo Motta" wrote: Isn't the plan to change LWT implementation (and performance expectation) in a patch version? This is a breaking change by itself, I'm just proposing to make the trade-off choice explicit in the yaml to prevent unexpected performance degradation during upgrade (for users who are not aware of the change). Just to make it clear, I'm proposing having a "lwt_legacy_mode: false" uncommented in the default yaml with a descriptive comment about CASSANDRA-12126, so new users will always get the new behavior, but users using a yaml template based on a previous 3.X version will not be able to start the node because this property will be missing. I believe the majority of operators will just update their yaml with "lwt_legacy_mode: false" and move on with their upgrades, but people wanting to keep the previous performance will become aware of the breaking change and set it to true. Em seg., 23 de nov. de 2020 às 21:07, Benedict Elliott Smith < bened...@apache.org> escreveu: > What do you mean by minor upgrade? We can't break patch upgrades for any > of 3.x, as this could also cause surprise outages. > > On 23/11/2020, 23:51, "Paulo Motta" wrote: > > I was thinking about the YAML requirement during the 3.X minor > upgrade to > make the decision explicit (need to update yaml) rather than implicit > (by > upgrading you agree with the change), since the latter can go > unnoticed by > those who don't pay attention to NEWS.txt > > Em seg., 23 de nov. de 2020 às 20:03, Benedict Elliott Smith < > bened...@apache.org> escreveu: > > > What's the value of the yaml? The user is likely to have upgraded to > > latest 3.x as part of the upgrade process to 4.0, so they'll already > have > > had a decision made for them. If correctness didn't break anything, > there > > doesn't any longer seem much point in offering a choice? > > > > On 23/11/2020, 22:45, "Brandon Williams" wrote: > > > > +1 to both as well. > > > > On Mon, Nov 23, 2020, 4:42 PM Blake Eggleston > > > > wrote: > > > > > +1 to correctness, and I like the yaml idea > > > > > > > On Nov 23, 2020, at 4:20 AM, Paulo Motta < > pauloricard...@gmail.com > > > > > > wrote: > > > > > > > > +1 to defaulting for correctness. > > > > > > > > In addition to that, how about making it a mandatory > cassandra.yaml > > > > property defaulting to correctness? This would make upgrades > with > > an old > > > > cassandra.yaml fail unless an option is explicitly specified, > > making > > > > operators aware of the issue and forcing them to make a > choice. > > > > > > > >> Em seg., 23 de nov. de 2020 às 07:30, Benjamin Lerer < > > > >> benjamin.le...@datastax.com> escreveu: > > > >> > > > >> Thank you very much to everybody that provided feedback. It > > helped a > > > lot to > > > >> limit our options. > > > >> > > > >> Unfortunately, it seems that some poor soul (me, really!!!) > will > > have to > > > >> make the final call between #3 and #4. > > > >> > > > >> If I reformulate the question to: Do we default to > *correctness > > *or to > > > >> *performance*? > > > >> > > > >> I would choose to default to *correctness*. > > > >> > > > >> Of course the situation is more complex than that but it > seems > > that > > > >> somebody has to make a call and live with it. It seems to > me that > > being > > > >> blamed for choosing correctness is easier to live with ;-) > > > >> > > > >> Benjamin > > > >> > > > >> PS: I tried to push the choice on Sylvain but he dodged the > > bullet. > > > >> > > > >> On Sat, Nov 21, 2020 at 12:30 AM Benedict Elliott Smith < > > > >> bened...@apache.org> > > > >> wrote: > > > >> > > > >>> I think I meant #4 __♂️ > > > >>> > > > >>> On 20/11/2020, 21:11, "Blake Eggleston" > > > >