Re: [DISCUSS] CASSANDRA-12126: LWTs correcteness and performance

Benjamin Lerer Mon, 23 Nov 2020 02:30:55 -0800

Thank you very much to everybody who provided feedback. It helped a lot to
narrow down our options.


Unfortunately, it seems that some poor soul (me, really!!!) will have to
make the final call between #3 and #4.

If I reformulate the question to: Do we default to *correctness* or to
*performance*?

I would choose to default to *correctness*.

Of course the situation is more complex than that, but it seems that
somebody has to make a call and live with it. It seems to me that being
blamed for choosing correctness is easier to live with ;-)

Benjamin

PS: I tried to push the choice on Sylvain but he dodged the bullet.

On Sat, Nov 21, 2020 at 12:30 AM Benedict Elliott Smith <[email protected]>
wrote:

> I think I meant #4
>
> On 20/11/2020, 21:11, "Blake Eggleston" <[email protected]>
> wrote:
>
>     I’d also prefer #3 over #4
>
>     > On Nov 20, 2020, at 10:03 AM, Benedict Elliott Smith <
> [email protected]> wrote:
>     >
>     > Well, I expressed a preference for #3 over #4, particularly for the
>     > 3.x series.  However at this point, I think the lack of a clear
>     > project decision means we can punt it back to you and Sylvain to make
>     > the final call.
>     >
>     > On 20/11/2020, 16:23, "Benjamin Lerer" <[email protected]>
> wrote:
>     >
>     >    I will try to summarize the discussion to clarify the outcome.
>     >
>     >    Mick is in favor of #4
>     >    Summanth is in favor of #4
>     >    Sylvain's answer was not clear to me. I understood it as "I prefer
>     >    #3 to #4 and I am also fine with #1"
>     >    Jeff is in favor of #3 and will understand #4
>     >    David is in favor of #3 (fix bug and add flag to roll back to old
>     >    behavior) in 4.0 and #4 in 3.0 and 3.11
>     >
>     >    Do not hesitate to correct me if I misunderstood your answer.
>     >
>     >    Based on these answers it seems clear that most people prefer to
>     >    go for #3 or #4.
>     >
>     >    The choice between #3 (fix correctness, opt-in to current
>     >    behavior) and #4 (current behavior, opt-in to correctness) is a
>     >    bit less clear, especially if we consider the 3.X branches or 4.0.
>     >
>     >    Does anybody have some idea on how to choose between those 2
>     >    choices or some extra opinions on #3 versus #4?
>     >
>     >
>     >
>     >
>     >
>     >
>     >>    On Wed, Nov 18, 2020 at 9:45 PM David Capwell <
> [email protected]> wrote:
>     >>
>     >> I feel that #4 (fix bug and add flag to roll back to old behavior)
>     >> is best.
>     >>
>     >> About the alternative implementation, I am fine adding it to 3.x and
>     >> 4.0, but we should treat it as a different path disabled by default
>     >> that you can opt into, with a plan to opt in by default "eventually".
>     >>
>     >> On Wed, Nov 18, 2020 at 11:10 AM Benedict Elliott Smith <
>     >> [email protected]>
>     >> wrote:
>     >>
>     >>> Perhaps there might be broader appetite to weigh in on which major
>     >>> releases we might target for work that fixes the correctness bug
>     >>> without serious performance regression?
>     >>>
>     >>> i.e., if we were to fix the correctness bug now, introducing a
>     >>> serious performance regression (either opt-in or opt-out), but were
>     >>> to land work without this problem for 5.0, would there be appetite
>     >>> to backport this work to any of 4.0, 3.11 or 3.0?
>     >>>
>     >>>
>     >>> On 18/11/2020, 18:31, "Jeff Jirsa" <[email protected]> wrote:
>     >>>
>     >>>    This is complicated and relatively few people on earth
>     >>>    understand it, so having little feedback is mostly expected,
>     >>>    unfortunately.
>     >>>
>     >>>    My normal emotional response is "correctness is required, opt-in
>     >>>    to performance improvements that sacrifice strict correctness",
>     >>>    but I'm also sure this is going to surprise people, and would
>     >>>    understand / accept #4 (default to current, opt-in to correct).
>     >>>
>     >>>
>     >>>    On Wed, Nov 18, 2020 at 4:54 AM Benedict Elliott Smith <
>     >>> [email protected]>
>     >>>    wrote:
>     >>>
>     >>>> It doesn't seem like there's much enthusiasm for any of the
>     >>>> options available here...
>     >>>>
>     >>>> On 12/11/2020, 14:37, "Benedict Elliott Smith" <
>     >> [email protected]
>     >>>>
>     >>>> wrote:
>     >>>>
>     >>>>> Is the new implementation a separate, distinctly modularized new
>     >>>>> body of work
>     >>>>
>     >>>>    It’s primarily a distinct, modularised and new body of work,
>     >>>>    however there is some shared code that has been modified - namely
>     >>>>    PaxosState, in which legacy code is maintained but modified for
>     >>>>    compatibility, and the system.paxos table (which receives a new
>     >>>>    column, and slightly modified serialization code).  It is
>     >>>>    conceptually an optimised version of the existing algorithm.
>     >>>>
>     >>>>    If there's a chance of being of value to 4.0, I can try to put up
>     >>>>    a patch next week alongside a high level description of the
>     >>>>    changes.
>     >>>>
>     >>>>> But a performance regression is a regression, I'm not shrugging it
>     >>>>> off.
>     >>>>
>     >>>>    I don't want to give the impression I'm shrugging off the
>     >>>>    correctness issue either. It's a serious issue to fix, but since
>     >>>>    all successful updates to the database are linearizable, I think
>     >>>>    it's likely that many applications behave correctly with the
>     >>>>    present semantics, or at least encounter only transient errors.
>     >>>>    No doubt many also do not, but I have no idea of the ratio.
>     >>>>
>     >>>>    The regression isn't itself a simple issue either - depending on
>     >>>>    the topology and message latencies it is not difficult to produce
>     >>>>    inescapable contention, i.e. guaranteed timeouts - that might
>     >>>>    persist as long as clients continue to retry. It could be quite a
>     >>>>    serious degradation of service to impose on our users.
>     >>>>
>     >>>>    I don't pretend to know the correct way to make a decision
>     >>>>    balancing these considerations, but I am perhaps more concerned
>     >>>>    about imposing service outages than I am temporarily maintaining
>     >>>>    semantics our users have apparently accepted for years - though I
>     >>>>    absolutely share your embarrassment there.
>     >>>>
>     >>>>
>     >>>>    On 12/11/2020, 12:41, "Joshua McKenzie" <[email protected]
>     >>>
>     >>> wrote:
>     >>>>
>     >>>>        Is the new implementation a separate, distinctly modularized
>     >>>>        new body of work or does it make substantial changes to the
>     >>>>        existing implementation and subsume it?
>     >>>>
>     >>>>        On Thu, Nov 12, 2020 at 3:56 AM Sylvain Lebresne <
>     >>>> [email protected]> wrote:
>     >>>>
>     >>>>> Regarding option #4, I'll remark that experience tends to suggest
>     >>>>> users don't consistently read the `NEWS.txt` file on upgrade, so
>     >>>>> option #4 will likely essentially mean "LWT has a correctness
>     >>>>> issue, but once it broke your data enough that you'll notice,
>     >>>>> you'll be able to dig up the proper flag to fix it for next time".
>     >>>>> I guess it's better than nothing, of course, but I'll admit that
>     >>>>> defaulting to "opt-in correctness", especially for a feature (LWT)
>     >>>>> that exists uniquely to provide additional guarantees, is
>     >>>>> something I have a hard time rallying behind.
>     >>>>>
>     >>>>> But a performance regression is a regression, I'm not shrugging
>     >>>>> it off. Still, I feel we shouldn't leave LWT with a fairly serious
>     >>>>> known correctness bug and I frankly feel bad for "the project"
>     >>>>> that this has been known for so long without action, so I'm a bit
>     >>>>> biased in wanting to get it fixed asap.
>     >>>>>
>     >>>>> But maybe I'm overstating the urgency here, and maybe option #1 is
>     >>>>> a better way forward.
>     >>>>>
>     >>>>> --
>     >>>>> Sylvain
>     >>>>>
>     >>>>
>     >>>>
>     >>>>
>     >>>>
>     >>>
> ---------------------------------------------------------------------
>     >>>>    To unsubscribe, e-mail: [email protected]
>     >>>>    For additional commands, e-mail: [email protected]
