Re: [DISCUSS] CEP-10: Cluster and Code Simulations

Jeremiah D Jordan Tue, 13 Jul 2021 07:06:48 -0700

Just because it is a feature for users who are developers does not mean it is 
not a new feature.  Adding this capability adds new functionality to what 
developers can do with Apache Cassandra.  How is that not a new feature?


Semver has been brought up a lot in conversations around what can go where.  If 
we look at how semver defines such things:

1. MAJOR version when you make incompatible API changes,
2. MINOR version when you add functionality in a backwards compatible manner, and
3. PATCH version when you make backwards compatible bug fixes.

This change to me sounds like 2.  Adding new functionality in a backwards 
compatible manner.  I guess our issue here is that we have never actually done 
MINOR releases in the C* project, we only make MAJOR releases and PATCH 
releases.  So we need to decide where things that in semver would go in a MINOR 
version should go.  In my mind it was always that such things should only go to 
a MAJOR, as it seems less safe to relax what goes in a PATCH and allow them 
there.
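
To make the mapping concrete, here is a minimal sketch in Java. It is not anything that exists in the Cassandra codebase; the class, enum and method names are all made up for illustration. It picks the semver component a set of changes would require, and then applies the assumption described above: a project that only ships MAJOR and PATCH releases defers anything semver would call MINOR to the next MAJOR.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical illustration only; none of these names come from Cassandra.
final class SemverBump {
    // Kinds of change, named after the three semver rules quoted above.
    enum Change { INCOMPATIBLE_API_CHANGE, BACKWARD_COMPATIBLE_FUNCTIONALITY, BACKWARD_COMPATIBLE_BUG_FIX }

    enum Bump { MAJOR, MINOR, PATCH }

    // The most significant change in the set decides which component to increment.
    static Bump requiredBump(Set<Change> changes) {
        if (changes.contains(Change.INCOMPATIBLE_API_CHANGE)) return Bump.MAJOR;
        if (changes.contains(Change.BACKWARD_COMPATIBLE_FUNCTIONALITY)) return Bump.MINOR;
        return Bump.PATCH;
    }

    // Assumed policy for a project with no MINOR releases:
    // whatever semver would put in a MINOR waits for the next MAJOR.
    static Bump withoutMinorReleases(Bump bump) {
        return bump == Bump.MINOR ? Bump.MAJOR : bump;
    }

    public static void main(String[] args) {
        Bump semver = requiredBump(EnumSet.of(
                Change.BACKWARD_COMPATIBLE_FUNCTIONALITY,
                Change.BACKWARD_COMPATIBLE_BUG_FIX));
        // Prints "MINOR -> MAJOR"
        System.out.println(semver + " -> " + withoutMinorReleases(semver));
    }
}
```

The alternative policy, relaxing PATCH to accept such changes, would map MINOR to PATCH in withoutMinorReleases instead.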

-Jeremiah

> On Jul 13, 2021, at 8:47 AM, [email protected] wrote:
> 
>> I do think adding the ability to do “Cluster and Code Simulations” is a new 
>> feature.
> 
> I don’t. I understand a feature to be a user-visible change, such as new 
> functionality, and it was on this basis I endorsed the release lifecycle 
> document. I do not believe that all improvement should stop to patch 
> releases, as I do not believe this produces the highest quality outcome.
> 
> 
> 
> 
> From: Jeremiah D Jordan <[email protected]>
> Date: Tuesday, 13 July 2021 at 14:41
> To: Cassandra DEV <[email protected]>
> Subject: Re: [DISCUSS] CEP-10: Cluster and Code Simulations
> I do not think fixing CASSANDRA-12126 is a new feature.  I do think 
> adding the ability to do “Cluster and Code Simulations” is a new feature.
> 
> -Jeremiah
> 
>> On Jul 13, 2021, at 8:37 AM, [email protected] wrote:
>> 
>> Nothing we’re discussing constitutes a feature. We’re discussing stability 
>> enhancements, and important bug fixes.
>> 
>> I think this disagreement is to some extent founded on our different 
>> premises about what a patch release should contain, and this seems to be the 
>> fault of incompletely specified documentation.
>> 
>> 1. The release lifecycle only forbids feature work from being developed in a 
>> patch release, and only expressly includes bug fixes. Note that the 
>> document even has a comment by the author suggesting that features may be 
>> backported to a patch release from trunk (not something I agree with, but it 
>> demonstrates the ambiguity of the definition).
>> 2. There seems to be some conflation of size-of-change with 
>> admissibility wrt the release lifecycle – I don’t think there are any criteria 
>> here, and it’s open to the community’s case-by-case assessment. Whatever we 
>> do to fix the bug in question will necessarily be a very significant piece 
>> of work itself, for instance.
>> 
>> My interpretation of the release lifecycle document is that it is acceptable 
>> to include this work in a patch release. My belief about its impact is that 
>> it would contribute positively to the stability of the project’s 4.0 
>> releases over the lifecycle, and improve project velocity.
>> 
>> With respect to whether we can ship a fix to 12126 without validation, I 
>> would be strongly opposed to this, and certainly would not produce a patch 
>> myself in this way. Not only would it be burdensome (given the divergences 
>> in the codebase), but I would not consider it acceptably safe (given the 
>> divergence).
>> 
>> 
>> From: Jeremiah D Jordan <[email protected]>
>> Date: Tuesday, 13 July 2021 at 14:15
>> To: Cassandra DEV <[email protected]>
>> Subject: Re: [DISCUSS] CEP-10: Cluster and Code Simulations
>> I tend to agree with Paulo that a major refactoring of some internal 
>> interfaces sounds like something to be explicitly avoided in a patch 
>> release.  I thought this was the type of change we all agreed we should stop 
>> letting in to patch releases, and that we would attempt to release more 
>> often (once a year) so changes that only go to trunk would get out faster?  
>> Are we really wanting to break that promise to ourselves before we even 
>> release 4.0?  To me “I think we need this feature released faster” is not a 
>> reason to put it in 4.0, it could be a reason to release 4.1 sooner.  This 
>> is where having a releasable trunk helps, as if we decided as a project that 
>> some change was worth a new major being released early the effort of doing 
>> that release is much smaller when trunk is releasable.
>> 
>> Any fix we make in 4.0 would be merged forward into trunk and could be fully 
>> verified there?  Probably not the best, but would give more confidence in a 
>> fix than otherwise without adding other major changes to 4.0?
>> 
>> -Jeremiah
>> 
>>> On Jul 13, 2021, at 7:59 AM, Benjamin Lerer <[email protected]> wrote:
>>> 
>>>> 
>>>> Furthermore, we introduced a significant performance regression in all
>>>> lines of the software by increasing the number of LWT round-trips. Unless
>>>> we intend to leave this regression for a further year without _any_ release
>>>> offering a solution, we will need suitable verification mechanisms for
>>>> whatever fixes we deliver.
>>>> 
>>>> My view is that it is unacceptable to leave such a significant regression
>>>> unaddressed in all lines of software we intend to release for the
>>>> foreseeable future.
>>> 
>>> 
>>> I would like to expand a bit on this as I believe it might be important for
>>> people to have the full picture. The fix for  CASSANDRA-12126
>>> <https://issues.apache.org/jira/browse/CASSANDRA-12126> introduced a
>>> regression by increasing the number of LWT round-trips. Nevertheless, the
>>> patch introduced a flag to allow users to revert to the previous behavior
>>> (previous performance + consistency issue).
>>> 
>>> Also, the patch did not address all Paxos consistency issues. There are
>>> still some issues during topology changes (and maybe in some other scenarios).
>>> 
>>> My understanding of Benedict's proposal is to fix paxos once and for all
>>> without any performance regression.
>>> 
>>> That goal makes total sense to me. "Where do we do that?" is a more tricky
>>> question.
>>> 
>>> On Tue, 13 Jul 2021 at 14:46, [email protected] <[email protected]> wrote:
>>> 
>>>> Hmm. It occurs to me I’m not entirely sure how our new release process is
>>>> going to work.
>>>> 
>>>> Will we be releasing 4.1 builds immediately, as part of shippable trunk?
>>>> Or will 4.0 be our only active line of software for the next year?
>>>> 
>>>> Either way, I bet my bottom dollar there will come some regret if we
>>>> introduce such divergence between the two most active branches we maintain,
>>>> so early in their lifecycles. If we invest significant resources in
>>>> improved testing using this framework (which I very much expect) then
>>>> branches that are not compatible will not benefit, likely reducing their
>>>> quality; and the risk of backports will increase, due to divergence.
>>>> 
>>>> Altogether, I think it would be a huge mistake. But if we will be shipping
>>>> releases soon that can fix these aforementioned regressions, I won’t
>>>> campaign for it.
>>>> 
>>>> 
>>>> 
>>>> From: [email protected] <[email protected]>
>>>> Date: Tuesday, 13 July 2021 at 13:31
>>>> To: [email protected] <[email protected]>
>>>> Subject: Re: [DISCUSS] CEP-10: Cluster and Code Simulations
>>>> No change is without risk; we have introduced serious regressions with bug
>>>> fixes to patch releases. The overall risk to the release lifecycle is
>>>> reduced significantly in my opinion, as we reduce the likelihood of
>>>> introducing regressions, and can use the same test infrastructure across
>>>> all of the actively developed releases, increasing our confidence in 4.0.x
>>>> releases.
>>>> 
>>>> Furthermore, we introduced a significant performance regression in all
>>>> lines of the software by increasing the number of LWT round-trips. Unless
>>>> we intend to leave this regression for a further year without _any_ release
>>>> offering a solution, we will need suitable verification mechanisms for
>>>> whatever fixes we deliver.
>>>> 
>>>> My view is that it is unacceptable to leave such a significant regression
>>>> unaddressed in all lines of software we intend to release for the
>>>> foreseeable future.
>>>> 
>>>> 
>>>> From: Paulo Motta <[email protected]>
>>>> Date: Tuesday, 13 July 2021 at 13:21
>>>> To: Cassandra DEV <[email protected]>
>>>> Subject: Re: [DISCUSS] CEP-10: Cluster and Code Simulations
>>>>> No, in my opinion the target should be 4.0.x. We are reaching for a
>>>>> shippable trunk and this has no public API impacts. This work is IMO
>>>>> central to achieving a shippable trunk, either way. The only reason I do
>>>>> not target 3.x is that it would be too burdensome.
>>>> 
>>>> In my limited view of the proposal, a major refactor of internal
>>>> concurrency APIs to support the testing facility potentially risks the
>>>> stability of a minor release, something we've been wanting to avoid with
>>>> our focus on stability. So I'd prefer this to go in trunk/4.1; otherwise
>>>> we will set a precedent for including non-bugfix changes in minor
>>>> versions, something I think we should avoid.
>>>> 
>>>> In the past we've been lenient about including seemingly harmless internal
>>>> changes that caused client impact and we should be careful to avoid this in
>>>> the future. To prevent this I think we should take a strict approach and
>>>> only accept bug fixes in minor (i.e. 4.0.x) versions moving forward.
>>>> 
>>>> I'd go one step further and propose that any CEPs, which are generally
>>>> about new features, major API changes or internal refactorings, should only
>>>> be allowed in subsequent major versions, unless an explicit exception is
>>>> granted.
>>>> 
>>>> On Tue, 13 Jul 2021 at 07:11, [email protected] <[email protected]> wrote:
>>>> 
>>>>> Perhaps it’s worth looking forward at the roadmap that we plan to develop,
>>>>> and consider whether such a facility would be welcome for proving their
>>>>> safety, and we can then worry about evolving the specifics of any API(s)
>>>>> together as we deploy the capability? Looking ahead, there are very few
>>>>> major features I wouldn’t want to see exercised with this approach, given
>>>>> the choice.
>>>>> 
>>>>> The LWT Verifier by itself is an integration test that covers many of the
>>>>> affected subsystems, including sstables, memtables and repair. But we will
>>>>> have the ability to introduce dedicated verification for each of these
>>>>> features and systems, and we will necessarily produce more robust code
>>>>> (repair is a great example of a brittle system that would be impossible to
>>>>> produce with such an adversarial test system).
>>>>> 
>>>>> 
>>>>> *Query side improvements:*
>>>>> 
>>>>> * Storage Attached Index or SAI. The CEP can be found at
>>>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-7%3A+Storage+Attached+Index
>>>>> * Add support for OR predicates in the CQL where clause
>>>>> * Allow to aggregate by time intervals (CASSANDRA-11871) and allow UDFs
>>>>> in GROUP BY clause
>>>>> * Ability to read the TTL and WRITE TIME of an element in a collection
>>>>> (CASSANDRA-8877)
>>>>> * Multi-Partition LWTs
>>>>> * Materialized views hardening: Addressing the different Materialized
>>>>> Views issues (see CASSANDRA-15921 and [1] for some of the work involved)
>>>>> 
>>>>> *Security improvements:*
>>>>> 
>>>>> * SSTables encryption (CASSANDRA-9633)
>>>>> * Add support for Dynamic Data Masking (CEP pending)
>>>>> * Allow the creation of roles that have the ability to assign arbitrary
>>>>> privileges, or scoped privileges without also granting those roles access
>>>>> to database objects.
>>>>> * Filter rows from system and system_schema based on users permissions
>>>>> (CASSANDRA-15871)
>>>>> 
>>>>> *Performance improvements:*
>>>>> 
>>>>> * Trie-based index format (CEP pending)
>>>>> * Trie-based memtables (CEP pending)
>>>>> * Paxos improvements: Paxos / LWT implementation that would enable the
>>>>> database to serve serial writes with two round-trips and serial reads with
>>>>> one round-trip in the uncontended case
>>>>> 
>>>>> *Safety/Usability improvements:*
>>>>> 
>>>>> * Guardrails. The CEP can be found at
>>>>> https://cwiki.apache.org/confluence/display/CASSANDRA/%28DRAFT%29+-+CEP-3%3A+Guardrails
>>>>> * Add ability to track state in repair (CASSANDRA-15399)
>>>>> * Repair coordinator improvements (CASSANDRA-15399)
>>>>> * Make incremental backup configurable per keyspace and table
>>>>> (CASSANDRA-15402)
>>>>> * Add ability to blacklist a CQL partition so all requests are ignored
>>>>> (CASSANDRA-12106)
>>>>> * Add default and required keyspace replication options (CASSANDRA-14557)
>>>>> * Transactional Cluster Metadata: Use of transactions to propagate
>>>>> cluster metadata
>>>>> * Downgrade-ability: Ability to downgrade in the event that a serious
>>>>> issue has been identified
>>>>> 
>>>>> *Pluggability improvements:*
>>>>> 
>>>>> * Pluggable schema manager (CEP pending)
>>>>> * Pluggable filesystem (CEP pending)
>>>>> * Pluggable authenticator for CQLSH (CASSANDRA-16456). A CEP draft can be
>>>>> found at
>>>>> https://docs.google.com/document/d/1_G-OZCAEmDyuQuAN2wQUYUtZBEJpMkHWnkYELLhqvKc/edit
>>>>> * Memtable API (CEP pending). The goal being to allow improvements such
>>>>> as CASSANDRA-13981 to be easily plugged into Cassandra
>>>>> 
>>>>> *Memtable pluggable implementation:*
>>>>> 
>>>>> * Enable Cassandra for Persistent Memory (CASSANDRA-13981)
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> From: [email protected] <[email protected]>
>>>>> Date: Tuesday, 13 July 2021 at 10:51
>>>>> To: [email protected] <[email protected]>
>>>>> Subject: Re: [DISCUSS] CEP-10: Cluster and Code Simulations
>>>>> Ach, editing code in the email editor isn’t smart when editors all have
>>>>> different meanings for key combinations (accidentally hit send), but you
>>>>> get the idea. The simulator would intercept these thread executions, the
>>>>> memory accesses for the annotated field, and evaluate them so that in some
>>>>> cases the assertions would fail.
>>>>> 
>>>>> This is obviously a toy example that is not very interesting, but the main
>>>>> real example we have is too complicated to produce a snippet to
>>>>> demonstrate. In my view, the long term outcome of this work is likely the
>>>>> enablement of many unit tests that are a little more complicated than this,
>>>>> on less obvious code.
>>>>> 
>>>>> But the headline goal of the CEP is not. By itself, the LWT Verifier
>>>>> demonstrates the power and utility of the work. I don’t believe it is
>>>>> terribly helpful to focus on secondary justifications like the example I
>>>>> gave. For me, the _ability_ to prove the correctness of difficult but
>>>>> critical systems is justification enough, whether or not we deliver a
>>>>> simple API as part of the CEP.
>>>>> 
>>>>> 
>>>>> 
>>>>> From: [email protected] <[email protected]>
>>>>> Date: Tuesday, 13 July 2021 at 10:43
>>>>> To: [email protected] <[email protected]>
>>>>> Subject: Re: [DISCUSS] CEP-10: Cluster and Code Simulations
>>>>>> Should target release be 4.1. (not 4.0.x) ?
>>>>> 
>>>>> 
>>>>> 
>>>>> No, in my opinion the target should be 4.0.x. We are reaching for a
>>>>> shippable trunk and this has no public API impacts. This work is IMO
>>>>> central to achieving a shippable trunk, either way. The only reason I do
>>>>> not target 3.x is that it would be too burdensome.
>>>>> 
>>>>>> My concern is that changing code and tests at the same time risks
>>>>>> regressions…
>>>>> 
>>>>> 
>>>>> 
>>>>> I’ve never heard this position before. Would you care to elaborate? It is
>>>>> quite normal for us to update tests alongside changes to the code.
>>>>> 
>>>>>> And seconding Benjamin's comments… some documentation on how to write a
>>>>>> test, and a simple test example, that this CEP then allows us to write
>>>>>> would help a lot (a la "working backwards").
>>>>> 
>>>>> 1) This work is to _enable_ the development of tests, with the only test
>>>>> originally planned to arrive alongside it being the fairly sophisticated
>>>>> LWT Verifier. This is something we have sorely needed as a project, as we
>>>>> have had serious correctness violations for multiple years. This broad
>>>>> category of integrated test for verifying correctness is the main goal of
>>>>> the work and is not easily condensed into an example snippet.
>>>>> 2) It is _possible_ that some simple and fluid APIs will be introduced in
>>>>> a later phase of this work, but they haven’t been designed yet, so I cannot
>>>>> share snippets.
>>>>> 
>>>>> In principle, however, you would be able to do something like:
>>>>> 
>>>>> @Nemesis volatile int x = 0;
>>>>> int foo() {
>>>>>  // unsynchronised read-modify-write of the annotated field
>>>>>  x = x + 1;
>>>>>  return x;
>>>>> }
>>>>> 
>>>>> @Test
>>>>> void test() throws Exception {
>>>>>  Future<Integer> f1 = executor.submit(() -> foo());
>>>>>  Future<Integer> f2 = executor.submit(() -> foo());
>>>>>  Assert.assertTrue(f1.get() == 1 || f2.get() == 1);
>>>>> }
>>>>> 
>>>>> 
>>>>> From: Mick Semb Wever <[email protected]>
>>>>> Date: Tuesday, 13 July 2021 at 10:28
>>>>> To: [email protected] <[email protected]>
>>>>> Subject: Re: [DISCUSS] CEP-10: Cluster and Code Simulations
>>>>>> 
>>>>>> To achieve this, significant modifications will be required to the
>>>>>> codebase, mostly cleaning up existing abstractions. Specifically, we will
>>>>>> need to be able to mock executors, any blocking concurrency primitives,
>>>>>> time, filesystem access and internode streaming.
>>>>>> 
>>>>>> The work is – in large part – already complete, with JIRA and PRs to
>>>>>> follow in the coming weeks. Of course, the work is subject to the usual
>>>>>> community input and review, so this does not preclude changes to the work
>>>>>> (even significant ones, if they are warranted). I know a lot of incoming
>>>>>> CEP are likely to be backed up by significant off-list development as a
>>>>>> result of the focus on a shippable 4.0. Hopefully this is just a temporary
>>>>>> growing pain, particularly as we move towards a shippable trunk.
>>>>>> 
>>>>>> I hope this work will be of huge value to the project, particularly as
>>>>>> we race to catch up on years of limited feature development.
>>>>>> 
>>>>>> JIRA and PRs will follow, but I wanted to kick-off discussion in advance.
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> Should target release be 4.1. (not 4.0.x) ?
>>>>> 
>>>>> I'd be interested in seeing a rough timeline/plan of how the proposed
>>>>> changes are to be defined in JIRAs and ordered.
>>>>> 
>>>>> I'd like to hear a bit more about the test plan. Not so much about how
>>>>> the CEP itself improves testability of the project, but for example
>>>>> the testing required to be in place to introduce the changes of the
>>>>> CEP (and if it already exists, where). My concern is that changing
>>>>> code and tests at the same time risks regressions…
>>>>> 
>>>>> And seconding Benjamin's comments… some documentation on how to write
>>>>> a test, and a simple test example, that this CEP then allows us to
>>>>> write would help a lot (a la "working backwards").
>>>>> 
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: [email protected]
>>>>> For additional commands, e-mail: [email protected]
>>>>> 
>>>> 