Re: CEP-15 multi key transaction syntax

2022-06-15 Thread bened...@apache.org
Ok, so I am not a huge fan of static rows, but I disagree with your analysis.

First some history: static rows are an efficiency sop to those who migrated 
from the historical wide row world, where you could have “global” partition 
state fetched with every query, and to support the deprecation of thrift and 
its horrible data model something needed to give – static rows were the result.

However, is the concept generally consistent? I think so. At least, your 
examples seem fine to me, and I can’t see how they violate the “relational 
model” (whatever that may be). If it helps, you can think of the static columns 
actually creating a second table, so that you now have two separate tables with 
the same partition key. These tables are implicitly related via a “full outer 
join” on the partition key, and you can imagine that you are generally querying 
a view of this relation.
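To make that mental model concrete, here is a minimal sketch in plain Python (hypothetical names, not Cassandra internals), modelling static columns as a second implicit table that is full-outer-joined to the regular rows on the partition key:

```python
# Sketch of the "two tables joined on the partition key" mental model.
# Schema mirrors: CREATE TABLE t (p int, c int, r int, s int static,
#                                 PRIMARY KEY (p, c))

static_rows = {1: {"s": 1}}   # partition 1 has a static row with s = 1
regular_rows = []             # no regular rows yet

def query(p):
    """Return the joined view of partition p."""
    rows = [dict(r) for r in regular_rows if r["p"] == p]
    s = static_rows.get(p, {}).get("s")
    if rows:
        # Regular rows exist: each one picks up the partition's static value.
        for r in rows:
            r["s"] = s
        return rows
    if p in static_rows:
        # No regular rows: the static side of the outer join still matches,
        # yielding one row with null regular columns (hence a count of 1).
        return [{"p": p, "c": None, "r": None, "s": s}]
    return []

print(query(1))   # [{'p': 1, 'c': None, 'r': None, 's': 1}]
```

With no regular rows the view contains exactly the single static row with null regular columns; once a regular row is inserted, the view shows that row carrying the static value instead.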

In this case, you would expect the outcome you see AFAICT. If you have no 
restriction on the results, and you have no regular rows and one static row, 
you would see a single static row result with null regular columns (and a count 
of 1 row). If you imposed a restriction on regular columns, you would not see 
the static column as the null regular columns would not match the condition.

> In LWT, a static row appears to exist when there is no regular row matching 
> WHERE

I assume you mean the IF clause matches against a static row if you UPDATE tbl 
SET v = a WHERE p = b IF s = c. This could be an inconsistency, but I think it 
is not. Recall, UPDATE in CQL is not UPDATE in SQL. SQL would do nothing if the 
row doesn’t exist, whatever the IF clause might say. CQL is really performing 
UPSERT.

So, what happens when the WHERE clause doesn’t match a primary key with UPSERT? 
A row is created. In this case, if you consider that this empty nascent row is 
used to join with the static “table” for evaluating the IF condition, to decide 
what you UPSERT, then it all makes sense – to me, anyway.
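One way to see this is a hedged sketch in plain Python (hypothetical names; not how Cassandra actually implements LWT): the conditional UPSERT starts from an empty nascent row when no row matches the primary key, joins it with the static “table”, evaluates the IF condition against that join, and applies the write only if it passes:

```python
# Sketch of "UPDATE tbl SET v = a WHERE p = b IF s = c" under UPSERT
# semantics: a missing row becomes an empty nascent row, which is joined
# with the static "table" before the IF clause is evaluated.

static_rows = {"b": {"s": "c"}}   # static column s for partition p = 'b'
regular_rows = {}                  # primary key -> regular columns

def conditional_upsert(p, new_v, expected_s):
    # Start from the existing row, or an empty nascent row if none matches.
    row = regular_rows.get(p, {"v": None})
    # Join with the static table to evaluate the IF condition.
    if static_rows.get(p, {}).get("s") == expected_s:
        row["v"] = new_v
        regular_rows[p] = row
        return True    # [applied] = True
    return False       # [applied] = False

print(conditional_upsert("b", "a", "c"))   # True: the static row satisfies IF
```

Under this reading, the IF clause matching a static row even when no regular row exists is not an inconsistency; it is the nascent row joining with the static side.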

> NULLs are first-class values, distinguishable from unset values

Could you give an example?


From: Konstantin Osipov 
Date: Wednesday, 15 June 2022 at 20:56
To: dev@cassandra.apache.org 
Subject: Re: CEP-15 multi key transaction syntax
* bened...@apache.org  [22/06/15 18:38]:
> I expect LET to behave like SELECT, and I don’t expect this work to modify 
> the behaviour of normal CQL expressions. Do you think there is something 
> wrong or inconsistent about the behaviours you mention?
>
> Static columns are a bit weird, but at the very least the following would 
> permit the user to reliably obtain a static value, if it exists:
>
> LET x = some_static_column FROM table WHERE partitionKey = someKey LIMIT 1
>
> This could be mixed with a clustering key query
>
> LET y = some_regular_column FROM table WHERE partitionKey = someKey AND 
> clusteringKey = someOtherKey

I think static rows should not be selectable outside clustering
rows. This violates the relational model. Unfortunately, they currently
sometimes are.

Here's an example:


> create table t (p int, c int, r int, s int static, primary key(p, c));
OK
> insert into t (p, s) values (1,1) if not exists;
+-+--+--+--+--+
| [applied]   | p| c| s| r|
|-+--+--+--+--|
| True| null | null | null | null |
+-+--+--+--+--+
> -- that's right, there is a row now; what row though?
> select count(*) from t;
+-+
|   count |
|-|
|   1 |
+-+
> -- let's add more rows
> insert into t (p, c, s) values (1,1,1) if not exists;
+-+--+--+--+--+
| [applied]   | p| c| s| r|
|-+--+--+--+--|
| True| null | null | null | null |
+-+--+--+--+--+
> -- we did not add more rows?
> select count(*) from t;
+-+
|   count |
|-|
|   1 |
+-+

In LWT, a static row appears to exist when there is no regular row
matching WHERE. It would be nice to somehow either be consistent
in LET with existing SELECTs, or, so to speak, be consistently
inconsistent, i.e. consistent with some other vendor, and not come
up with a whole new semantics for static rows, different from LWT
and SELECTs.

This is why I was making all these comments about missing rows 
- there is no incongruence in classic SQL, any vendor, because a)
there are no static rows b) NULLs are first-class values,
distinguishable from unset values.


--
Konstantin Osipov, Moscow, Russia


Re: Cassandra project biweekly status update 2022-06-14

2022-06-15 Thread bened...@apache.org
> I agree a broader consensus beyond those on the jira ticket should be sought 
> before committing the patch that bumps a new major.

Broader consensus should be sought on any ticket that breaks backwards 
compatibility – even if we already have bumped major version.

A major version bump should NOT be taken as carte blanche to break users, we 
should determine it for each case on a balance of benefit/cost.



From: Mick Semb Wever 
Date: Wednesday, 15 June 2022 at 17:44
To: dev@cassandra.apache.org 
Subject: Re: Cassandra project biweekly status update 2022-06-14
I'm going to jump off the email list to JIRA for this one - we've had a 
discussion ongoing about when we cut a Major vs. a Minor, what qualifies as an 
API, etc on CASSANDRA-16844 
(https://issues.apache.org/jira/browse/CASSANDRA-16844). Expect something to 
formally hit the dev mailing list about this soon, but until then we can keep 
going on the JIRA ticket.


I need to take some blame here, for leading people down a bit of a garden path.

The idea was that trunk is by default the next minor, and that when a patch 
lands that warrants a bump to the next major then the patch includes that 
change to build.xml's base.version.

The devil is in the detail here, and it becomes a lot clearer when reading 
CASSANDRA-16844.  I'm appreciative when we can tackle these things in a lazy 
manner as they arise, since real examples often bring that extra clarity.

I agree a broader consensus beyond those on the jira ticket should be sought 
before committing the patch that bumps a new major. The broader audience may 
also help propose better solutions that don't require a major change (as was 
done in 16844), and help coordinate with other tickets also warranting a new 
major…


Re: CEP-15 multi key transaction syntax

2022-06-15 Thread bened...@apache.org
I expect LET to behave like SELECT, and I don’t expect this work to modify the 
behaviour of normal CQL expressions. Do you think there is something wrong or 
inconsistent about the behaviours you mention?

Static columns are a bit weird, but at the very least the following would 
permit the user to reliably obtain a static value, if it exists:

LET x = some_static_column FROM table WHERE partitionKey = someKey LIMIT 1

This could be mixed with a clustering key query

LET y = some_regular_column FROM table WHERE partitionKey = someKey AND 
clusteringKey = someOtherKey


From: Konstantin Osipov 
Date: Wednesday, 15 June 2022 at 14:04
To: dev@cassandra.apache.org 
Subject: Re: CEP-15 multi key transaction syntax
* bened...@apache.org  [22/06/15 10:00]:
> It sounds like we’re zeroing in on a solution.
>
> To draw attention back to Jon’s email, I think the last open question at this 
> point is the scope of identifiers declared by LET, and how we handle name 
> clashes with table columns in an UPDATE.
>
> I think we have basically two options:
>
> 1. Require LET for all input parameters to an assignment in UPDATE
> 2. Add some additional syntax to local variables to identify them, e.g. 
> 


I'm curious, regardless of the syntax you choose, will LET or
SELECT return the static row if there is no match for the
clustering key, or return NULL row?

I am asking because SELECT currently does not return any rows if
there is no clustering key matching the WHERE clause, but a conditional UPDATE
chooses the static row to check conditions instead, if it's present.

--
Konstantin Osipov, Moscow, Russia


Re: CEP-15 multi key transaction syntax

2022-06-14 Thread bened...@apache.org
+1

From: Blake Eggleston 
Date: Tuesday, 14 June 2022 at 21:46
To: dev@cassandra.apache.org 
Subject: Re: CEP-15 multi key transaction syntax
I'd lean towards 3, where the statement doesn't parse because `miles` is 
ambiguous


On Jun 14, 2022, at 1:40 PM, bened...@apache.org wrote:

To be clear, the concerning situation is

BEGIN TRANSACTION
  LET miles = miles_driven, running=is_running FROM cars WHERE model=’pinto’
  IF running THEN
UPDATE cars SET miles_driven = miles + 30 WHERE model='pinto';
  END IF
COMMIT TRANSACTION

But where there’s some additional column also called miles in cars


From: bened...@apache.org
Date: Tuesday, 14 June 2022 at 21:37
To: dev@cassandra.apache.org
Subject: Re: CEP-15 multi key transaction syntax
Duplicate declarations are usually rejected by languages, so I think that’s 
fine?

Option 1 would involve something like

BEGIN TRANSACTION
  LET car_miles = miles_driven, running=is_running FROM cars WHERE model=’pinto’
  LET user_miles = miles_driven FROM users WHERE name=’blake’
  SELECT running, car_miles, user_miles
  IF running THEN
UPDATE users SET miles_driven = user_miles + 30 WHERE name='blake';
UPDATE cars SET miles_driven = car_miles + 30 WHERE model='pinto';
  END IF
COMMIT TRANSACTION



From: Derek Chen-Becker
Date: Tuesday, 14 June 2022 at 21:27
To: dev@cassandra.apache.org
Subject: Re: CEP-15 multi key transaction syntax
Just to make sure I'm understanding correctly, I've been thinking of LET like a 
variable declaration and assignment, but is that the right mental model? For 
example, this is a valid statement:

BEGIN TRANSACTION
  LET miles = miles_driven, running=is_running FROM cars WHERE model=’pinto’
  SELECT running, miles   # let the user know if the transaction takes any 
action
  IF running THEN
UPDATE users SET miles_driven = miles_driven + 30 WHERE name='blake';
UPDATE cars SET miles_driven = miles_driven + 30 WHERE model='pinto';
  END IF
COMMIT TRANSACTION

But this isn't, because we're trying to bind to "miles" twice

BEGIN TRANSACTION
  LET miles = miles_driven, running=is_running FROM cars WHERE model=’pinto’
  LET miles = miles_driven FROM users WHERE name=’blake’ # duplicate binding 
for "miles"
  SELECT running, miles   # let the user know if the transaction takes any 
action
  IF running THEN
UPDATE users SET miles_driven = miles_driven + 30 WHERE name='blake';
UPDATE cars SET miles_driven = miles_driven + 30 WHERE model='pinto';
  END IF
COMMIT TRANSACTION

I think that's option #1, but I'm a little confused now that I'm looking at 
some of the examples.

Cheers,

Derek

On Tue, Jun 14, 2022 at 1:58 PM bened...@apache.org wrote:
It sounds like we’re zeroing in on a solution.

To draw attention back to Jon’s email, I think the last open question at this 
point is the scope of identifiers declared by LET, and how we handle name 
clashes with table columns in an UPDATE.

I think we have basically two options:

1. Require LET for all input parameters to an assignment in UPDATE
2. Add some additional syntax to local variables to identify them, e.g. 


Any other ideas?



From: Derek Chen-Becker
Date: Tuesday, 14 June 2022 at 20:31
To: dev@cassandra.apache.org
Subject: Re: CEP-15 multi key transaction syntax
Sorry, that was in reference to the "Would you require a LIMIT 1 clause if the 
key did not fully specify a row?" question, so I think we're in agreement here.

Cheers,

Derek

On Tue, Jun 14, 2022 at 1:27 PM bened...@apache.org wrote:
> It seems like we would want to start with restrictions on number of rows, 
> uniqueness, homogeneity of results, etc

I am not keen on any hard limit on the number of rows, I anticipate a 
configurable guardrail for rejecting queries that are too expensive. I think 
the normal CQL restrictions are likely to apply (must include partition key), 
plus (initially) no range scans, and the aforementioned restrictions on what 
order statements must occur in the transaction.


From: Derek Chen-Becker
Date: Tuesday, 14 June 2022 at 18:42
To: dev@cassandra.apache.org
Subject: Re: CEP-15 multi key transaction syntax
"MIXED" means, "hey, this might not be my standard PGSQL transaction" :)

I do think that surprise is a meaningful measure, from the perspective of an 

Re: CEP-15 multi key transaction syntax

2022-06-14 Thread bened...@apache.org
(or 3. Let schema updates break the statement – this might actually be 
preferable, so long as it fails-fast rather than corrupts behaviour)

From: bened...@apache.org 
Date: Tuesday, 14 June 2022 at 20:58
To: dev@cassandra.apache.org 
Subject: Re: CEP-15 multi key transaction syntax
It sounds like we’re zeroing in on a solution.

To draw attention back to Jon’s email, I think the last open question at this 
point is the scope of identifiers declared by LET, and how we handle name 
clashes with table columns in an UPDATE.

I think we have basically two options:

1. Require LET for all input parameters to an assignment in UPDATE
2. Add some additional syntax to local variables to identify them, e.g. 


Any other ideas?



From: Derek Chen-Becker 
Date: Tuesday, 14 June 2022 at 20:31
To: dev@cassandra.apache.org 
Subject: Re: CEP-15 multi key transaction syntax
Sorry, that was in reference to the "Would you require a LIMIT 1 clause if the 
key did not fully specify a row?" question, so I think we're in agreement here.

Cheers,

Derek

On Tue, Jun 14, 2022 at 1:27 PM bened...@apache.org wrote:
> It seems like we would want to start with restrictions on number of rows, 
> uniqueness, homogeneity of results, etc

I am not keen on any hard limit on the number of rows, I anticipate a 
configurable guardrail for rejecting queries that are too expensive. I think 
the normal CQL restrictions are likely to apply (must include partition key), 
plus (initially) no range scans, and the aforementioned restrictions on what 
order statements must occur in the transaction.


From: Derek Chen-Becker
Date: Tuesday, 14 June 2022 at 18:42
To: dev@cassandra.apache.org
Subject: Re: CEP-15 multi key transaction syntax
"MIXED" means, "hey, this might not be my standard PGSQL transaction" :)

I do think that surprise is a meaningful measure, from the perspective of an 
individual developer coming to Cassandra from any arbitrary RDBMS. My own 
experience is that a non-trivial number of developers are essentially blindly 
following guidance given to them by someone else when it comes to features like 
transactions, so making syntax that looks superficially similar to SQL 
transactions but acts subtly different (or uses slightly different syntax) is 
going to be surprising. I think we get diminishing marginal returns on "it 
looks just like SQL!" when we start to venture further into territory where 
even different RDBMSs disagree. I would rather use some syntax that is clearly 
Cassandra-specific, even if the structure would be similar to a SQL 
transaction, just to ensure that developers understand that it's different and 
actually look at the docs.

I completely agree on focusing on clarity and consistency, and I think 
considering how we think it might evolve is good, but that can't be an 
open-ended exercise. My primary concern is how we can start getting incremental 
improvements into end users' hands more quickly, since the alternative right 
now is to basically roll your own, right?

Cheers,

Derek

On Mon, Jun 13, 2022 at 4:16 PM bened...@apache.org wrote:
What on earth does MIXED mean?

I agree with the sentiment we should minimise surprise, but everyone is 
surprised differently so it becomes a sort of pointless rubric, everyone 
claiming it supports their view. I think it is only useful in cases where there 
is clear agreement that something is surprising, but unhelpful when choosing 
between subtle variations on approach.

The main goal IMO should be clarity and consistency, so that the user can 
reason about the constructs easily, and so we can evolve them.

For instance, we should be sure to consider how the syntax will look if we *do* 
offer interactive transactions, or JOINs, or anything else we might add in 
future.


From: Derek Chen-Becker
Date: Monday, 13 June 2022 at 23:09
To: dev@cassandra.apache.org
Subject: Re: CEP-15 multi key transaction syntax
On Mon, Jun 13, 2022 at 1:57 PM Blake Eggleston wrote:
I prefer an approach that supports an accurate mental model of what’s happening 
behind the scenes. I think that should be a design priority for the syntax. 
We’ll be able to build things on top of accord, but the core multi-key cas 
operation isn’t going to change too much.

+1, the principle of least surprise tells me that if this doesn't behave 
exactly like SQL transactions (for whatever SQL actually means), it could be 
more clear to not try and emulate it halfway

BEGIN MIXED TRANSACTION?

Derek



On Jun 13, 2022, at 12:14 PM, Blake Eggleston

Re: CEP-15 multi key transaction syntax

2022-06-14 Thread bened...@apache.org
It sounds like we’re zeroing in on a solution.

To draw attention back to Jon’s email, I think the last open question at this 
point is the scope of identifiers declared by LET, and how we handle name 
clashes with table columns in an UPDATE.

I think we have basically two options:

1. Require LET for all input parameters to an assignment in UPDATE
2. Add some additional syntax to local variables to identify them, e.g. 


Any other ideas?



From: Derek Chen-Becker 
Date: Tuesday, 14 June 2022 at 20:31
To: dev@cassandra.apache.org 
Subject: Re: CEP-15 multi key transaction syntax
Sorry, that was in reference to the "Would you require a LIMIT 1 clause if the 
key did not fully specify a row?" question, so I think we're in agreement here.

Cheers,

Derek

On Tue, Jun 14, 2022 at 1:27 PM bened...@apache.org wrote:
> It seems like we would want to start with restrictions on number of rows, 
> uniqueness, homogeneity of results, etc

I am not keen on any hard limit on the number of rows, I anticipate a 
configurable guardrail for rejecting queries that are too expensive. I think 
the normal CQL restrictions are likely to apply (must include partition key), 
plus (initially) no range scans, and the aforementioned restrictions on what 
order statements must occur in the transaction.


From: Derek Chen-Becker
Date: Tuesday, 14 June 2022 at 18:42
To: dev@cassandra.apache.org
Subject: Re: CEP-15 multi key transaction syntax
"MIXED" means, "hey, this might not be my standard PGSQL transaction" :)

I do think that surprise is a meaningful measure, from the perspective of an 
individual developer coming to Cassandra from any arbitrary RDBMS. My own 
experience is that a non-trivial number of developers are essentially blindly 
following guidance given to them by someone else when it comes to features like 
transactions, so making syntax that looks superficially similar to SQL 
transactions but acts subtly different (or uses slightly different syntax) is 
going to be surprising. I think we get diminishing marginal returns on "it 
looks just like SQL!" when we start to venture further into territory where 
even different RDBMSs disagree. I would rather use some syntax that is clearly 
Cassandra-specific, even if the structure would be similar to a SQL 
transaction, just to ensure that developers understand that it's different and 
actually look at the docs.

I completely agree on focusing on clarity and consistency, and I think 
considering how we think it might evolve is good, but that can't be an 
open-ended exercise. My primary concern is how we can start getting incremental 
improvements into end users' hands more quickly, since the alternative right 
now is to basically roll your own, right?

Cheers,

Derek

On Mon, Jun 13, 2022 at 4:16 PM bened...@apache.org wrote:
What on earth does MIXED mean?

I agree with the sentiment we should minimise surprise, but everyone is 
surprised differently so it becomes a sort of pointless rubric, everyone 
claiming it supports their view. I think it is only useful in cases where there 
is clear agreement that something is surprising, but unhelpful when choosing 
between subtle variations on approach.

The main goal IMO should be clarity and consistency, so that the user can 
reason about the constructs easily, and so we can evolve them.

For instance, we should be sure to consider how the syntax will look if we *do* 
offer interactive transactions, or JOINs, or anything else we might add in 
future.


From: Derek Chen-Becker
Date: Monday, 13 June 2022 at 23:09
To: dev@cassandra.apache.org
Subject: Re: CEP-15 multi key transaction syntax
On Mon, Jun 13, 2022 at 1:57 PM Blake Eggleston wrote:
I prefer an approach that supports an accurate mental model of what’s happening 
behind the scenes. I think that should be a design priority for the syntax. 
We’ll be able to build things on top of accord, but the core multi-key cas 
operation isn’t going to change too much.

+1, the principle of least surprise tells me that if this doesn't behave 
exactly like SQL transactions (for whatever SQL actually means), it could be 
more clear to not try and emulate it halfway

BEGIN MIXED TRANSACTION?

Derek



On Jun 13, 2022, at 12:14 PM, Blake Eggleston wrote:

Does the IF <...> ABORT simplify reasoning though? If you restrict it to only 
dealing with the most recent row it would, but referencing the name implies 
you’d be able to include references from other operations, in which case you’d 
have the sa

Re: CEP-15 multi key transaction syntax

2022-06-14 Thread bened...@apache.org
> It seems like we would want to start with restrictions on number of rows, 
> uniqueness, homogeneity of results, etc

I am not keen on any hard limit on the number of rows, I anticipate a 
configurable guardrail for rejecting queries that are too expensive. I think 
the normal CQL restrictions are likely to apply (must include partition key), 
plus (initially) no range scans, and the aforementioned restrictions on what 
order statements must occur in the transaction.


From: Derek Chen-Becker 
Date: Tuesday, 14 June 2022 at 18:42
To: dev@cassandra.apache.org 
Subject: Re: CEP-15 multi key transaction syntax
"MIXED" means, "hey, this might not be my standard PGSQL transaction" :)

I do think that surprise is a meaningful measure, from the perspective of an 
individual developer coming to Cassandra from any arbitrary RDBMS. My own 
experience is that a non-trivial number of developers are essentially blindly 
following guidance given to them by someone else when it comes to features like 
transactions, so making syntax that looks superficially similar to SQL 
transactions but acts subtly different (or uses slightly different syntax) is 
going to be surprising. I think we get diminishing marginal returns on "it 
looks just like SQL!" when we start to venture further into territory where 
even different RDBMSs disagree. I would rather use some syntax that is clearly 
Cassandra-specific, even if the structure would be similar to a SQL 
transaction, just to ensure that developers understand that it's different and 
actually look at the docs.

I completely agree on focusing on clarity and consistency, and I think 
considering how we think it might evolve is good, but that can't be an 
open-ended exercise. My primary concern is how we can start getting incremental 
improvements into end users' hands more quickly, since the alternative right 
now is to basically roll your own, right?

Cheers,

Derek

On Mon, Jun 13, 2022 at 4:16 PM bened...@apache.org wrote:
What on earth does MIXED mean?

I agree with the sentiment we should minimise surprise, but everyone is 
surprised differently so it becomes a sort of pointless rubric, everyone 
claiming it supports their view. I think it is only useful in cases where there 
is clear agreement that something is surprising, but unhelpful when choosing 
between subtle variations on approach.

The main goal IMO should be clarity and consistency, so that the user can 
reason about the constructs easily, and so we can evolve them.

For instance, we should be sure to consider how the syntax will look if we *do* 
offer interactive transactions, or JOINs, or anything else we might add in 
future.


From: Derek Chen-Becker
Date: Monday, 13 June 2022 at 23:09
To: dev@cassandra.apache.org
Subject: Re: CEP-15 multi key transaction syntax
On Mon, Jun 13, 2022 at 1:57 PM Blake Eggleston wrote:
I prefer an approach that supports an accurate mental model of what’s happening 
behind the scenes. I think that should be a design priority for the syntax. 
We’ll be able to build things on top of accord, but the core multi-key cas 
operation isn’t going to change too much.

+1, the principle of least surprise tells me that if this doesn't behave 
exactly like SQL transactions (for whatever SQL actually means), it could be 
more clear to not try and emulate it halfway

BEGIN MIXED TRANSACTION?

Derek



On Jun 13, 2022, at 12:14 PM, Blake Eggleston wrote:

Does the IF <...> ABORT simplify reasoning though? If you restrict it to only 
dealing with the most recent row it would, but referencing the name implies 
you’d be able to include references from other operations, in which case you’d 
have the same problem.

> return instead an exception if the transaction is aborted

Since the txn is not actually interactive, I think it would be better to 
receive values instead of an exception, to understand why the operation was 
rolled back.

On Jun 13, 2022, at 10:32 AM, Aaron Ploetz wrote:

Benedict,

I'm really excited about this feature.  I've been observing this conversation 
for a while now, and I'm happy to add some thoughts.

We must balance the fact we cannot afford to do everything (yet), against the 
need to make sure what we do is reasonably intuitive (to both CQL and SQL 
users) and consistent – including with whatever we do in future.

I think taking small steps forward, to build a few complete features as close 
to SQL as possible is a good approach.

question we are currently asking: do we want to have a more LWT-like 
approach... or do we want a more SQL-like approach

For years now we've been fighting this notion that Cassandra is difficult t

Re: CEP-15 multi key transaction syntax

2022-06-14 Thread bened...@apache.org
> I … couldn't find an implementation that wasn't vendor specific.

I’ve fallen into the same trap as others. You’re right, all control flow is 
vendor specific it turns out. So, we either need to consciously pick an SQL 
dialect to mimic (probably the safest would be pgsql), or make sure we are 
distinct.

I should say that I’m *not* opposed to the COMMIT IF syntax (and this factoid 
above further cements that lack of opposition), I just want to be sure our 
syntax can handle evolution of the feature without getting confused/confusing.

If we rule out ever offering the succinct UPDATE x … AS syntax then I don’t 
think it will ever be confusing, it might simply become defunct (not ideal, but 
not the end of the world).

We have a few more things to figure out:

1) Do we automatically turn SELECT statements with a single row into something 
addressable? I like the brevity offered, but I’m not aware of other SQL-like 
languages where this happens. I think the norm is IF (SELECT x FROM…) THEN; or 
you can declare variables.

This also makes it hard right now to declare SELECT statements that return to 
the user and those that do not without introducing additional non-standard 
modifications to COMMIT or SELECT.

Alternatives might be to either require a full SELECT inside the IF for now; or 
to introduce a LET x=, y= FROM… AS z, to make clear we’re declaring some 
variables we can use in expressions.

2) How do we return success/failure. The IF (X) THEN… approach would naturally 
return nothing, nor throw an exception, so we might want to offer the user the 
ability to perform SELECT within the IF so that the presence of a resultset 
defines success/failure – we could even offer the user SELECT ? or to 
return the value of the boolean SELECT X; IF (X) THEN UPDATE y…; END IF

3) The AS syntax – do we want this to look more like SQL, i.e. SELECT X FROM 
tbl AS mytableref?


From: Blake Eggleston 
Date: Tuesday, 14 June 2022 at 00:33
To: dev@cassandra.apache.org 
Subject: Re: CEP-15 multi key transaction syntax
> It’s something to hammer out in more detail once we get these other questions 
> pinned down, as I think we can figure out a good compromise.

+1

> self contained, one-off statement like these

I meant with the if statements inline? I hadn't encountered them before myself, 
and couldn't find an implementation that wasn't vendor specific.

Regarding commit if, I'd be totally fine settling on this:

BEGIN TRANSACTION
IF (X) THEN BEGIN
UPDATE someothertable SET anotherval=14 WHERE key=10;
UPDATE someothertable SET anotherval=13 WHERE key=10;
UPDATE someothertable SET anotherval=12 WHERE key=10;
END
COMMIT TRANSACTION

I prefer it to if...abort and commit if ... isn't popular.


On Jun 13, 2022, at 4:14 PM, bened...@apache.org wrote:

> Like I mentioned in my earlier email, the if/abort syntax throwing an 
> exception would, at least as described, limit useful data returned to the 
> client

Right, I agree. I think this is orthogonal to the other syntax questions. I 
think it is also preferable not to mix success/failure with data results, and 
that might be preferable for both syntaxes. It’s something to hammer out in 
more detail once we get these other questions pinned down, as I think we can 
figure out a good compromise.

> At a higher level, what I meant was that SQL doesn’t have a self contained, 
> one-off statement like these

I’m not sure what you mean? It definitely does? In fact, this was how I most 
often used SQL when I worked with it – non-interactively, with explicit 
transactions as part of a single submission to the server, as this reduced the 
number of round-trips but kept the SQL in version control. Stored procedures 
are just a way of doing this with the SQL saved server-side, and accepting 
explicit parameters, but they’re just a convenience?

> and Cassandra doesn’t have interactive transactions

Yet!

> Incidentally, I think it would be useful to eventually have multiple IF 
> branches inline, and had meant the COMMIT IF as a shorthand for it

I agree it would be nice to support more general IF statements, for both 
positive and negative control flow (i.e. IF (X) THEN UPDATE Y, but also IF (X) 
THEN ABORT/ROLLBACK/RAISERROR).

I’m not sure if COMMIT IF really works as syntactic sugar for the more complex 
construct you outlined, though? Perhaps we could instead offer

IF (X) THEN BEGIN
UPDATE someothertable SET anotherval=14 WHERE key=10;
UPDATE someothertable SET anotherval=13 WHERE key=10;
UPDATE someothertable SET anotherval=12 WHERE key=10;
END

For now we could require that at most one such statement occurs per 
transaction, and encapsulates the whole transaction, e.g.

BEGIN TRANSACTION
IF (X) THEN BEGIN
UPDATE someothertable SET anotherval=14 WHERE key=10;
UPDATE someothertable SET anotherval=13 WHERE key=10;
UPDATE someothertable SET anotherval=12 WHERE key=10;
END
COMMIT TRANSACTION

It would be q

Re: CEP-15 multi key transaction syntax

2022-06-13 Thread bened...@apache.org
> Like I mentioned in my earlier email, the if/abort syntax throwing an 
> exception would, at least as described, limit useful data returned to the 
> client

Right, I agree. I think this is orthogonal to the other syntax questions. I 
think it is also preferable not to mix success/failure with data results, and 
that might be preferable for both syntaxes. It’s something to hammer out in 
more detail once we get these other questions pinned down, as I think we can 
figure out a good compromise.

> At a higher level, what I meant was that SQL doesn’t have a self contained, 
> one-off statement like these

I’m not sure what you mean? It definitely does? In fact, this was how I most 
often used SQL when I worked with it – non-interactively, with explicit 
transactions as part of a single submission to the server, as this reduced the 
number of round-trips but kept the SQL in version control. Stored procedures 
are just a way of doing this with the SQL saved server-side, and accepting 
explicit parameters, but they’re just a convenience?

> and Cassandra doesn’t have interactive transactions

Yet!

> Incidentally, I think it would be useful to eventually have multiple IF 
> branches inline, and had meant the COMMIT IF as a shorthand for it

I agree it would be nice to support more general IF statements, for both 
positive and negative control flow (i.e. IF (X) THEN UPDATE Y, but also IF (X) 
THEN ABORT/ROLLBACK/RAISERROR).

I’m not sure if COMMIT IF really works as syntactic sugar for the more complex 
construct you outlined, though? Perhaps we could instead offer

IF (X) THEN BEGIN
UPDATE someothertable SET anotherval=14 WHERE key=10;
UPDATE someothertable SET anotherval=13 WHERE key=10;
UPDATE someothertable SET anotherval=12 WHERE key=10;
END

For now we could require that at most one such statement occurs per 
transaction, and encapsulates the whole transaction, e.g.

BEGIN TRANSACTION
IF (X) THEN BEGIN
UPDATE someothertable SET anotherval=14 WHERE key=10;
UPDATE someothertable SET anotherval=13 WHERE key=10;
UPDATE someothertable SET anotherval=12 WHERE key=10;
END
COMMIT TRANSACTION

It would be quite easy to relax this (maybe even before release), but it gets 
us off the starting block without planned obsolescence.

From: Blake Eggleston 
Date: Monday, 13 June 2022 at 23:57
To: dev@cassandra.apache.org 
Subject: Re: CEP-15 multi key transaction syntax
> I think it is far more problematic to introduce a syntax that would not be 
> consistent with future enhancements to transactional functionality. Then we 
> would have to introduce a third syntax, and more syntaxes makes for a messy 
> language IMO.
> I have a very strong preference for choosing a syntax we can evolve 
> consistently, so that users just gain additional keywords or have 
> restrictions relaxed as the feature evolves.

I think our views and goals are pretty strongly aligned here.

> How so? I think all we’re really considering is *not* introducing the IF part 
> of the COMMIT syntax, which is not-SQL-like

Like I mentioned in my earlier email, the if/abort syntax throwing an exception 
would, at least as described, limit useful data returned to the client. 
Solvable depending on how we settle on what data is returned to the client 
though.

At a higher level, what I meant was that SQL doesn’t have a self contained, 
one-off statement like these (stored procedures/functions are close[1], but 
different), and Cassandra doesn’t have interactive transactions. So the 
argument that something is more SQL like when putting syntax meant for 
interactive transactions into Cassandra’s atomic txns isn’t very convincing imo.

Incidentally, I think it would be useful to eventually have multiple IF 
branches inline, and had meant the COMMIT IF as a shorthand for it. Something 
like

BEGIN TRANSACTION;
SELECT * FROM sometable WHERE key=5 AS sel;
UPDATE sometable SET lastread=now() WHERE key=5;
IF sel.someval = 3 THEN
UPDATE someothertable SET anotherval=14 WHERE key=10;
ELSE IF sel.someval = 4 THEN
UPDATE someothertable SET anotherval=13 WHERE key=10;
ELSE
UPDATE someothertable SET anotherval=12 WHERE key=10;
ENDIF;
COMMIT TRANSACTION;
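If it helps to pin down the intended semantics, here is a toy Python model of the branch selection above (purely an illustration — the dict-based row and first-match evaluation order are my assumptions, not part of the proposal):

```python
# Toy model of the inline IF / ELSE IF / ELSE above: the SELECT result is
# bound to a name ("sel"), and the first condition that matches decides
# which value is written to someothertable.anotherval for key=10.
def choose_anotherval(sel):
    if sel.get("someval") == 3:
        return 14                 # IF sel.someval = 3
    elif sel.get("someval") == 4:
        return 13                 # ELSE IF sel.someval = 4
    else:
        return 12                 # ELSE

print(choose_anotherval({"someval": 3}))  # 14
print(choose_anotherval({"someval": 9}))  # 12
```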

And for extra fun, here’s an early mockup I did based on the Postgres function 
syntax: https://gist.github.com/bdeggleston/51d5510450a1d7549f725e06d871cc60


> Do we require these to be declared first? If so, the problem of ambiguity 
> goes away at least, ignoring everything else.
> Perhaps we can do that initially either way? It makes both syntaxes easier to 
> implement, so we get our MVP more easily. But if we settle what our preferred 
> syntax is, we can see if there’s time to deliver it before a release. Either 
> way, the syntax evolves on a consistent path.

Yes, that’s the idea.


On Jun 13, 2022, at 1:21 PM, bened...@apache.org wrote:

> Don’t call these transactions, the term implies things accord do

Re: CEP-15 multi key transaction syntax

2022-06-13 Thread bened...@apache.org
> is there a subset … that could be implemented as an initial version and then 
> grown over time to include more powerful features?

This is what I would like to aim for, but it's hard, as we probably don't agree 
on what direction the feature will develop.

My view is that we are more likely than not to develop creeping SQL-like 
functionality over time, in which case it is perhaps good to plan for this 
intentionally from the start.

SQL has decades of work behind it, so we run less risk of heading down a design 
dead end, and finding ourselves in a bind when further evolving the language.

I think the way to approach that is to ensure that we do a mix of the following:

1) Ensure any keywords we copy from SQL work very similarly to their SQL 
counterpart, with only some additional restrictions (esp. when we expect to be 
able to later relax them)
2) Where we can’t reasonably do that, introduce new keywords that look and feel 
like SQL but aren’t, so there is no confusion


From: Derek Chen-Becker 
Date: Monday, 13 June 2022 at 23:07
To: dev@cassandra.apache.org 
Subject: Re: CEP-15 multi key transaction syntax
I'm coming to this thread fresh and admittedly I'm still trying to catch up and 
wrap my head around it. I think it's already been called out, but what looked 
superficially simple at the beginning of the thread has quickly become 
something that I'm having to take notes on to make sure I understand the 
semantics. I'm a little worried that there are complexities here that we might 
not realize. I like the idea, and I think it's a really powerful addition to 
CQL, but I think we need to make sure we're not setting up users for confusion. 
CQL is great because it leverages knowledge of SQL, but the devil is in the 
differences.

Also, related to complexity, is there a subset of what's being discussed that 
could be implemented as an initial version and then grown over time to include 
more powerful features?

In terms of things that have been discussed so far, in no particular order, the 
AS keyword seems to give the user reasonable control over whether they get the 
pre- or post-update version of the record. Similarly, I think the IF...ABORT 
syntax is much clearer if using AS, since that keyword then decides which 
version of the row to use for the condition. Consider the following (possibly 
incorrect) example:

BEGIN TRANSACTION
SELECT * from cars where ... AS car
IF car.miles > 10 ROLLBACK TRANSACTION
UPDATE cars SET car.next_service = 10 WHERE ...
COMMIT TRANSACTION

vs

BEGIN TRANSACTION
SELECT * FROM cars WHERE ... AS current_car
IF current_car.miles > 10 ROLLBACK TRANSACTION
UPDATE cars SET car.next_service = 10 WHERE ... AS car
COMMIT TRANSACTION

Cheers,

Derek


Re: CEP-15 multi key transaction syntax

2022-06-13 Thread bened...@apache.org
What on earth does MIXED mean?

I agree with the sentiment we should minimise surprise, but everyone is 
surprised differently, so it becomes a sort of pointless rubric, everyone 
claiming it supports their view. I think it is only useful in cases where there 
is clear agreement that something is surprising, but unhelpful when choosing 
between subtle variations on approach.

The main goal IMO should be clarity and consistency, so that the user can 
reason about the constructs easily, and so we can evolve them.

For instance, we should be sure to consider how the syntax will look if we *do* 
offer interactive transactions, or JOINs, or anything else we might add in 
future.


From: Derek Chen-Becker 
Date: Monday, 13 June 2022 at 23:09
To: dev@cassandra.apache.org 
Subject: Re: CEP-15 multi key transaction syntax
On Mon, Jun 13, 2022 at 1:57 PM Blake Eggleston <beggles...@apple.com> wrote:
I prefer an approach that supports an accurate mental model of what’s happening 
behind the scenes. I think that should be a design priority for the syntax. 
We’ll be able to build things on top of accord, but the core multi-key cas 
operation isn’t going to change too much.

+1, the principle of least surprise tells me that if this doesn't behave 
exactly like SQL transactions (for whatever SQL actually means), it could be 
more clear to not try and emulate it halfway

BEGIN MIXED TRANSACTION?

Derek



On Jun 13, 2022, at 12:14 PM, Blake Eggleston <beggles...@apple.com> wrote:

Does the IF <...> ABORT simplify reasoning though? If you restrict it to only 
dealing with the most recent row it would, but referencing the name implies 
you’d be able to include references from other operations, in which case you’d 
have the same problem.

> return instead an exception if the transaction is aborted

Since the txn is not actually interactive, I think it would be better to 
receive values instead of an exception, to understand why the operation was 
rolled back.


On Jun 13, 2022, at 10:32 AM, Aaron Ploetz <aaronplo...@gmail.com> wrote:

Benedict,

I'm really excited about this feature.  I've been observing this conversation 
for a while now, and I'm happy to add some thoughts.

We must balance the fact we cannot afford to do everything (yet), against the 
need to make sure what we do is reasonably intuitive (to both CQL and SQL 
users) and consistent – including with whatever we do in future.

I think taking small steps forward, to build a few complete features as close 
to SQL as possible is a good approach.

question we are currently asking: do we want to have a more LWT-like 
approach... or do we want a more SQL-like approach

For years now we've been fighting this notion that Cassandra is difficult to 
use.  Coming up with specialized syntax isn't going to bridge that divide.  
From a (new?) user perspective, the best plan is to stay as consistent with SQL 
as possible.

I believe that is a MySQL specific concept. This is one problem with mimicking 
SQL – it’s not one thing!

Right?!?!  As if this needed to be more complex.

I think we have evidence that it is fine to interpret NULL as “false” for the 
evaluation of IF conditions.

Agree.  Null == false isn't too much of a leap.

Thanks for taking up the charge on this one.  Glad to see it moving forward!

Thanks,

Aaron




Re: CEP-15 multi key transaction syntax

2022-06-13 Thread bened...@apache.org



On Sun, Jun 12, 2022 at 10:33 AM bened...@apache.org wrote:
Welcome Li, and thanks for your input

> When I first saw the syntax, I took it for granted that the condition was 
> evaluated against the state AFTER the updates

Depending what you mean, I think this is one of the options being considered. 
At least, it seems this syntax is most likely to be evaluated against the 
values written by preceding statements in the batch, but not the statement 
itself (or later ones), as this could lead to nonsensical statements like

BEGIN TRANSACTION
UPDATE tbl SET v = 1 WHERE key = 1 AS tbl
COMMIT TRANSACTION IF tbl.v = 0

Where v is never 0 afterwards, so this never succeeds. I take it in this simple 
case you would expect the condition to be evaluated against the state prior to 
the statement (i.e. the initial state)?

But we have a blank slate, so every option is available to us! We just need to 
make sure it makes sense to the user, even in uncommon cases.
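To make the two candidate semantics concrete, here is a small Python sketch (an illustration of the design question only, not of how Accord behaves): the condition `tbl.v = 0` is evaluated either against the initial snapshot or against the state after the batch's writes.

```python
# "UPDATE tbl SET v = 1" followed by "COMMIT TRANSACTION IF tbl.v = 0":
# does the condition see the initial snapshot, or the post-write state?
def commits(initial_v, semantics):
    writes = {"v": 1}                       # UPDATE tbl SET v = 1
    if semantics == "initial":
        state = {"v": initial_v}            # state prior to the statement
    else:                                   # "post": read-your-writes
        state = {"v": initial_v, **writes}
    return state["v"] == 0                  # IF tbl.v = 0

# Against the initial state the txn can commit (when v started at 0);
# against the post-write state it never can, since v is always 1.
print(commits(0, "initial"))  # True
print(commits(0, "post"))     # False
```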

> The IF (Boolean expr) ABORT TRANSACTION would suffer less because users may 
> tend to put the condition closer to the related SELECT statement.

This is probably not going to matter in practice. The SELECTs all happen 
upfront no matter what the CQL might look like, and the UPDATE all happen only 
after the IF conditions are evaluated. This is all just a question of how the 
user expresses things.

In future we may offer interactive transactions, or transactions that are 
multi-step, in which case this would be more relevant and could have an 
efficiency impact.

> Would you consider allowing users to start a read-only transaction explicitly 
> like BEGIN TRANSACTION READONLY?

Good question. I would be OK with this, for sure, and will defer to the 
opinions of others here. There won’t be any optimisation impact, as we simply 
check if the transaction contains any updates, but some validation could be 
helpful for the user.
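A minimal sketch of what that validation might look like (the line-based statement classification here is a naive assumption, for illustration only):

```python
# Reject a READONLY transaction that contains any mutating statement.
MUTATING = ("UPDATE", "INSERT", "DELETE")

def validate_readonly(statements):
    for stmt in statements:
        if stmt.strip().upper().startswith(MUTATING):
            raise ValueError("read-only transaction contains a write: " + stmt)
    return True

print(validate_readonly(["SELECT * FROM tbl WHERE k = 1"]))   # True
```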

> Finally, I wonder if the community would be interested in idempotency support.

This is something that has been considered, and that Accord is able to support 
(in a couple of ways), but as an end-to-end feature this requires client 
support and other scaffolding that is not currently planned/scheduled. The 
simplest (least robust) approach is for the server to include the transaction's 
identifier in its timeout response, so that it can be queried by the client to 
establish whether the transaction has been made durable. This should be quite 
easy to deliver on the 
server-side, but would require some application or client integration, and is 
unreliable in the face of coordinator failure (in which case the transaction id 
is unknown to the client). The more complete approach is for the client to include an 
idempotency token in its submission to the server, and for C* to record this 
alongside the transaction id, and for some bounded time window to either reject 
re-submissions of this token or to evaluate it as a no-op. This requires much 
tighter integration from the clients, and more work server-side.

Which is simply to say, this is on our radar but I can’t make promises about 
what form it will take, or when it will arrive, only that it has been planned 
for enough to ensure we can achieve it when resources permit.
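For illustration, the token-based approach described above might be sketched like this in Python (the storage scheme and window length are my assumptions; Accord's eventual mechanism may differ entirely):

```python
import time

# Sketch of the "more complete" approach: the client attaches an
# idempotency token, and the server remembers tokens for a bounded
# window, treating a re-submission as a no-op that returns the
# original transaction id.
class IdempotencyWindow:
    def __init__(self, window_seconds=600, clock=time.monotonic):
        self.window = window_seconds
        self.clock = clock
        self.seen = {}  # token -> (txn_id, first_seen)

    def submit(self, token, txn_id):
        now = self.clock()
        # Drop tokens that have fallen out of the window.
        self.seen = {t: v for t, v in self.seen.items()
                     if now - v[1] < self.window}
        if token in self.seen:
            return ("no-op", self.seen[token][0])
        self.seen[token] = (txn_id, now)
        return ("applied", txn_id)

w = IdempotencyWindow()
print(w.submit("tok-1", "txn-a"))  # ('applied', 'txn-a')
print(w.submit("tok-1", "txn-b"))  # ('no-op', 'txn-a')
```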

From: Li Boxuan <libox...@connect.hku.hk>
Date: Sunday, 12 June 2022 at 16:14
To: dev@cassandra.apache.org
Subject: Re: CEP-15 multi key transaction syntax
Correcting my typo:

>  I took it for granted that the condition was evaluated against the state 
> before the updates

I took it for granted that the condition was evaluated against the state AFTER 
the updates

On Jun 12, 2022, at 11:07 AM, Li Boxuan <libox...@connect.hku.hk> wrote:

Thank you team for this exciting update! I just joined the dev mailing list to 
take part in this discussion. I am not a Cassandra developer and haven’t 
understood Accord myself yet, so my questions are more from a user’s standpoint.

> The COMMIT IF syntax is more succinct, but ambiguity isn’t ideal and we only 
> get one chance to make this API right.

I agree that COMMIT IF syntax is ambiguous. When I first saw the syntax, I took 
it for granted that the condition was evaluated against the state after the 
updates, but it turned out to be the opposite. Thus I prefer the IF (Boolean 
expr) ABORT TRANSACTION idea. In addition, when the transaction is large and 
there are many conditions, using the COMMIT IF syntax migh

Re: CEP-15 multi key transaction syntax

2022-06-13 Thread bened...@apache.org
I believe that is a MySQL specific concept. This is one problem with mimicking 
SQL – it’s not one thing!

In T-SQL, a Boolean expression is TRUE, FALSE or UNKNOWN[1], and a NULL value 
submitted to a Boolean operator yields UNKNOWN.

IF (X) THEN Y does not run Y if X is UNKNOWN;
IF (X) THEN Y ELSE Z does run Z if X is UNKNOWN.

So, I think we have evidence that it is fine to interpret NULL as “false” for 
the evaluation of IF conditions.

[1] 
https://docs.microsoft.com/en-us/sql/t-sql/language-elements/else-if-else-transact-sql?view=sql-server-ver16
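The three-valued behaviour can be modelled directly; this Python sketch (my illustration of the T-SQL rules above, not of any Cassandra code) uses None for UNKNOWN:

```python
# Kleene-style three-valued comparison: a NULL operand yields UNKNOWN
# (modelled as None). IF..THEN runs its branch only when the condition
# is TRUE; the ELSE branch runs for both FALSE and UNKNOWN.
def eq(a, b):
    if a is None or b is None:
        return None               # UNKNOWN
    return a == b

def if_then_else(cond, then_branch, else_branch=None):
    if cond is True:
        return then_branch()
    if else_branch is not None:
        return else_branch()      # taken for FALSE and UNKNOWN alike
    return None                   # no ELSE: nothing runs

print(if_then_else(eq(None, 3), lambda: "Y"))               # Y does not run
print(if_then_else(eq(None, 3), lambda: "Y", lambda: "Z"))  # Z
```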



From: Konstantin Osipov 
Date: Monday, 13 June 2022 at 14:57
To: dev@cassandra.apache.org 
Subject: Re: CEP-15 multi key transaction syntax
> IF (X) THEN
> ROLLBACK
> RETURN (ERRCODE)
> END IF
>
> or
>
> IF (X) THEN RAISERROR
>
> So, that is in essence the question we are currently asking: do
> we want to have a more LWT-like approach (and if so, how do we
> address this complexity for the user), or do we want a more
> SQL-like approach (and if so, how do we modify it to make
> non-interactive transactions convenient, and implementation
> tractable)
>
> * This is anyway a shortcoming of existing batches, I think? So
> it might be we can sweep it under the rug, but I think it will
> be more relevant here as people execute more complex
> transactions, and we should ideally have semantics that will
> work well into the future – including if we later introduce
> interactive transactions.

I'd start with answering the question how the syntax should handle
NOT FOUND condition. In SQL, that would trigger activation of a
CONTINUE handler.

It's hard to see how one can truly branch the logic without it.
Relying on NULL content of a cell would be full of gotchas.

--
Konstantin Osipov, Moscow, Russia


Re: CEP-15 multi key transaction syntax

2022-06-12 Thread bened...@apache.org
…developers' bugs like unintentional updates. This might also give Cassandra a 
hint for optimization.

Finally, I wonder if the community would be interested in idempotency support. 
DynamoDB has this interesting feature 
(https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/transaction-apis.html#transaction-apis-txwriteitems),
 which guards the situation where the same transaction is submitted multiple 
times due to a connection time-out or other connectivity issue. I have no idea 
how that is implemented under the hood and I don’t even know if this is 
technically possible with the Accord design, but I thought it would be 
interesting to think about.

Best regards,
Boxuan




Re: CEP-15 multi key transaction syntax

2022-06-12 Thread bened...@apache.org
> I would love hearing from people on what they think.

^^ It would be great to have more participants in this conversation

> For context, my questions earlier were based on my 20+ years of using SQL 
> transactions across different systems.

We probably don’t come from a very different place. I spent too many years with 
T-SQL.

> When you start a SQL transaction, you are creating a branch of your data that 
> you can operate with until you reach your desired state and then merge it 
> back with a commit.

That’s the essential complexity we’re grappling with: how much do we permit 
your “branch” to do, how do we let you express it, and how do we let you 
express conditions?

We must balance the fact we cannot afford to do everything (yet), against the 
need to make sure what we do is reasonably intuitive (to both CQL and SQL 
users) and consistent – including with whatever we do in future.

Right now, we have the issue that read-your-writes introduces some complexity 
to the semantics, particularly around the conditions of execution.

LWTs impose conditions on the state of all records prior to execution, but 
their API has a lot of shortcomings. The proposal of COMMIT IF (Boolean expr) 
is most consistent with this approach. This can be confusing, though, if the 
condition is evaluated on a value that has been updated by a prior statement in 
the batch – what value does this global condition get evaluated against?*

SQL has no such concept, but also SQL is designed to be interactive. Depending 
on the dialect there’s probably a lot of ways to do this non-interactively in 
SQL, but we probably cannot cheaply replicate the functionality exactly as we 
do not (yet) support interactive transactions that they were designed for. To 
submit a conditional non-interactive transaction in SQL, you would likely use

IF (X) THEN
ROLLBACK
RETURN (ERRCODE)
END IF

or

IF (X) THEN RAISERROR

So, that is in essence the question we are currently asking: do we want to have 
a more LWT-like approach (and if so, how do we address this complexity for the 
user), or do we want a more SQL-like approach (and if so, how do we modify it 
to make non-interactive transactions convenient, and implementation tractable)?

* This is anyway a shortcoming of existing batches, I think? So it might be we 
can sweep it under the rug, but I think it will be more relevant here as people 
execute more complex transactions, and we should ideally have semantics that 
will work well into the future – including if we later introduce interactive 
transactions.





From: Patrick McFadin 
Date: Saturday, 11 June 2022 at 15:33
To: dev 
Subject: Re: CEP-15 multi key transaction syntax
I think the syntax is evolving into something pretty complicated, which may be 
warranted but I wanted to take a step back and be a bit more reflective on what 
we are trying to accomplish.

For context, my questions earlier were based on my 20+ years of using SQL 
transactions across different systems. That's my personal bias when I see the 
word "database transaction" in this case. When you start a SQL transaction, you 
are creating a branch of your data that you can operate with until you reach 
your desired state and then merge it back with a commit. Or if you don't like 
what you see, use a rollback and act like it never happened. That was the 
thinking when I asked about interactive sessions. If you are using a driver, 
that all happens in a batch. I realize that is out of scope here, but that's 
probably knowledge that is pre-installed in the majority of the user community.

Getting to the point, which is developer experience. I'm seeing a philosophical 
fork in the road which hopefully will generate some comments in the larger user 
community.

Path 1)
Mimic what's already been available in the SQL community, using existing CQL 
syntax. (SQL Example using JDBC: https://www.baeldung.com/java-jdbc-auto-commit)

Path 2)
Chart a new direction with new syntax

I genuinely don't have a clear answer, but I would love hearing from people on 
what they think.

Patrick

On Fri, Jun 10, 2022 at 12:07 PM 
bened...@apache.org<mailto:bened...@apache.org> 
mailto:bened...@apache.org>> wrote:
This might also permit us to remove one result set (the success/failure one) 
and return instead an exception if the transaction is aborted. This is also 
more consistent with SQL, if memory serves. That might conflict with returning 
the other result sets in the event of abort (though that’s up to us 
ultimately), but it feels like a nicer API for the user – depending on how 
these exceptions are surfaced in client APIs.

From: bened...@apache.org
Date: Friday, 10 June 2022 at 19:59
To: dev@cassandra.apache.org
Subject: Re: CEP-15 multi key transaction syntax
So, thinking on it myself some more, I think an option that *doesn’t* require 
the user to reason about the point at which the read happens in order to 
understand how the condition is applied would probably be better.

Re: CEP-15 multi key transaction syntax

2022-06-10 Thread bened...@apache.org
This might also permit us to remove one result set (the success/failure one) 
and return instead an exception if the transaction is aborted. This is also 
more consistent with SQL, if memory serves. That might conflict with returning 
the other result sets in the event of abort (though that’s up to us 
ultimately), but it feels like a nicer API for the user – depending on how 
these exceptions are surfaced in client APIs.

From: bened...@apache.org 
Date: Friday, 10 June 2022 at 19:59
To: dev@cassandra.apache.org 
Subject: Re: CEP-15 multi key transaction syntax
So, thinking on it myself some more, I think an option that *doesn’t* require 
the user to reason about the point at which the read happens in order to 
understand how the condition is applied would probably be better.

What do you think of the IF (Boolean expr) ABORT TRANSACTION idea?

It’s compatible with more advanced IF functionality later, and probably not 
much trickier to implement?
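For illustration, it might read something like the following (table, column, 
and label names are invented here, and the exact form is of course unsettled):

BEGIN TRANSACTION
SELECT v FROM tbl WHERE k = 1 AS q1
IF q1.v != 0 ABORT TRANSACTION
UPDATE tbl SET v = 1 WHERE k = 1
COMMIT TRANSACTION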

The COMMIT IF syntax is more succinct, but ambiguity isn’t ideal and we only 
get one chance to make this API right.


From: Blake Eggleston 
Date: Friday, 10 June 2022 at 18:56
To: dev@cassandra.apache.org 
Subject: Re: CEP-15 multi key transaction syntax
Yeah I think that’s intuitive enough. I had been thinking about multiple 
condition branches, but was thinking about something closer to

IF select.column=5
  UPDATE ... SET ... WHERE key=1;
ELSE IF select.column=6
  UPDATE ... SET ... WHERE key=2;
ELSE
  UPDATE ... SET ... WHERE key=3;
ENDIF
COMMIT TRANSACTION;

Which would make the proposed COMMIT IF we're talking about now a shorthand. Of 
course this would be follow on work.



On Jun 8, 2022, at 1:20 PM, bened...@apache.org 
wrote:

I imagine that conditions would be evaluated against the state prior to the 
execution of the statement against which they are being evaluated, but after 
the prior statements. I think that should be OK to reason about.

i.e. we might have a contrived example like:

BEGIN TRANSACTION
UPDATE tbl SET a = 1 WHERE k = 1 AS q1
UPDATE tbl SET a = q1.a + 1 WHERE k = 1 AS q2
COMMIT TRANSACTION IF q1.a = 0 AND q2.a = 1

So q1 would read a = 0, but q2 would read a = 1 and set a = 2.

I think this is probably adequately intuitive? It is a bit atypical to have 
conditions that wrap the whole transaction though.

We have another option, of course, which is to offer IF x ROLLBACK TRANSACTION, 
which is closer to SQL, which would translate the above to:

BEGIN TRANSACTION
SELECT a FROM tbl WHERE k = 1 AS q0
IF q0.a != 0 ROLLBACK TRANSACTION
UPDATE tbl SET a = 1 WHERE k = 1 AS q1
IF q1.a != 1 ROLLBACK TRANSACTION
UPDATE tbl SET a = q1.a + 1 WHERE k = 1 AS q2
COMMIT TRANSACTION

This is less succinct, but might be more familiar to users. We could also 
eschew the ability to read from UPDATE statements entirely in this scheme, as 
this would then look very much like SQL.


From: Blake Eggleston <beggles...@apple.com>
Date: Wednesday, 8 June 2022 at 20:59
To: dev@cassandra.apache.org
Subject: Re: CEP-15 multi key transaction syntax
> It affects not just RETURNING but also conditions that are evaluated against 
> the row, and if we in future permit using the values from one select in a 
> function call / write to another table (which I imagine we will).

I hadn’t thought about that... using intermediate or even post update values in 
condition evaluation or function calls seems like it would make it difficult to 
understand why a condition is or is not applying. On the other hand, it would 
be powerful, especially when using things like database generated values in 
queries (auto incrementing integer clustering keys or server generated 
timeuuids being examples that come to mind). Additionally, if we return these 
values, I guess that would solve the visibility issues I’m worried about.

Agreed intermediate values would be straightforward to calculate though.




On Jun 6, 2022, at 4:33 PM, bened...@apache.org 
wrote:

It affects not just RETURNING but also conditions that are evaluated against 
the row, and if we in future permit using the values from one select in a 
function call / write to another table (which I imagine we will).

I think that for it to be intuitive we need it to make sense sequentially, 
which means either calculating it or restricting what can be stated (or 
abandoning the syntax).

If we initially forbade multiple UPDATE/INSERT to the same key, but permitted 
overlapping DELETE (and as many SELECT as you like) that would perhaps make it 
simple enough? Require for now that SELECTS go first, then DELETE and then 
INSERT/UPDATE (or vice versa, depending what we want to make simple)?

FWIW, I don’t think this is terribly onerous to calculate either, since it’s 
restricted to single rows we are updating, so we could simply maintain a 
collection of rows and upsert into them as we process the execution.

Re: CEP-15 multi key transaction syntax

2022-06-10 Thread bened...@apache.org
So, thinking on it myself some more, I think an option that *doesn’t* require 
the user to reason about the point at which the read happens in order to 
understand how the condition is applied would probably be better.

What do you think of the IF (Boolean expr) ABORT TRANSACTION idea?

It’s compatible with more advanced IF functionality later, and probably not 
much trickier to implement?

The COMMIT IF syntax is more succinct, but ambiguity isn’t ideal and we only 
get one chance to make this API right.


From: Blake Eggleston 
Date: Friday, 10 June 2022 at 18:56
To: dev@cassandra.apache.org 
Subject: Re: CEP-15 multi key transaction syntax
Yeah I think that’s intuitive enough. I had been thinking about multiple 
condition branches, but was thinking about something closer to

IF select.column=5
  UPDATE ... SET ... WHERE key=1;
ELSE IF select.column=6
  UPDATE ... SET ... WHERE key=2;
ELSE
  UPDATE ... SET ... WHERE key=3;
ENDIF
COMMIT TRANSACTION;

Which would make the proposed COMMIT IF we're talking about now a shorthand. Of 
course this would be follow on work.


On Jun 8, 2022, at 1:20 PM, bened...@apache.org 
wrote:

I imagine that conditions would be evaluated against the state prior to the 
execution of the statement against which they are being evaluated, but after 
the prior statements. I think that should be OK to reason about.

i.e. we might have a contrived example like:

BEGIN TRANSACTION
UPDATE tbl SET a = 1 WHERE k = 1 AS q1
UPDATE tbl SET a = q1.a + 1 WHERE k = 1 AS q2
COMMIT TRANSACTION IF q1.a = 0 AND q2.a = 1

So q1 would read a = 0, but q2 would read a = 1 and set a = 2.

I think this is probably adequately intuitive? It is a bit atypical to have 
conditions that wrap the whole transaction though.

We have another option, of course, which is to offer IF x ROLLBACK TRANSACTION, 
which is closer to SQL, which would translate the above to:

BEGIN TRANSACTION
SELECT a FROM tbl WHERE k = 1 AS q0
IF q0.a != 0 ROLLBACK TRANSACTION
UPDATE tbl SET a = 1 WHERE k = 1 AS q1
IF q1.a != 1 ROLLBACK TRANSACTION
UPDATE tbl SET a = q1.a + 1 WHERE k = 1 AS q2
COMMIT TRANSACTION

This is less succinct, but might be more familiar to users. We could also 
eschew the ability to read from UPDATE statements entirely in this scheme, as 
this would then look very much like SQL.


From: Blake Eggleston <beggles...@apple.com>
Date: Wednesday, 8 June 2022 at 20:59
To: dev@cassandra.apache.org
Subject: Re: CEP-15 multi key transaction syntax
> It affects not just RETURNING but also conditions that are evaluated against 
> the row, and if we in future permit using the values from one select in a 
> function call / write to another table (which I imagine we will).

I hadn’t thought about that... using intermediate or even post update values in 
condition evaluation or function calls seems like it would make it difficult to 
understand why a condition is or is not applying. On the other hand, it would 
be powerful, especially when using things like database generated values in 
queries (auto incrementing integer clustering keys or server generated 
timeuuids being examples that come to mind). Additionally, if we return these 
values, I guess that would solve the visibility issues I’m worried about.

Agreed intermediate values would be straightforward to calculate though.



On Jun 6, 2022, at 4:33 PM, bened...@apache.org 
wrote:

It affects not just RETURNING but also conditions that are evaluated against 
the row, and if we in future permit using the values from one select in a 
function call / write to another table (which I imagine we will).

I think that for it to be intuitive we need it to make sense sequentially, 
which means either calculating it or restricting what can be stated (or 
abandoning the syntax).

If we initially forbade multiple UPDATE/INSERT to the same key, but permitted 
overlapping DELETE (and as many SELECT as you like) that would perhaps make it 
simple enough? Require for now that SELECTS go first, then DELETE and then 
INSERT/UPDATE (or vice versa, depending what we want to make simple)?

FWIW, I don’t think this is terribly onerous to calculate either, since it’s 
restricted to single rows we are updating, so we could simply maintain a 
collection of rows and upsert into them as we process the execution. Most 
transactions won’t need it, I suspect, so we don’t need to worry about perfect 
efficiency.
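To make that concrete, here is a hedged sketch in plain Java (invented names 
and simplified types, not Cassandra internals) of maintaining such a 
collection of rows and upserting into it as statements execute, so later 
statements and conditions observe intermediate state:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: an in-memory overlay of rows modified so far
// during transaction execution. Real primary keys are composite; a plain
// string stands in for them here.
public class TxnOverlay {
    private final Map<String, Map<String, Object>> modified = new HashMap<>();

    // An UPDATE/INSERT merges its column values over any earlier ones.
    public void upsert(String primaryKey, Map<String, Object> columns) {
        modified.computeIfAbsent(primaryKey, k -> new HashMap<>()).putAll(columns);
    }

    // Reads prefer the overlay; otherwise fall back to the snapshot value
    // read at the start of transaction execution.
    public Object read(String primaryKey, String column, Object snapshotValue) {
        Map<String, Object> row = modified.get(primaryKey);
        return (row != null && row.containsKey(column)) ? row.get(column) : snapshotValue;
    }

    public static void main(String[] args) {
        TxnOverlay txn = new TxnOverlay();
        // q1: UPDATE tbl SET a = 1 WHERE k = 1 (its read of a sees the snapshot, 0)
        System.out.println(txn.read("k=1", "a", 0));
        txn.upsert("k=1", Map.of("a", 1));
        // q2: UPDATE tbl SET a = q1.a + 1 WHERE k = 1 (sees intermediate a = 1)
        int a = (Integer) txn.read("k=1", "a", 0);
        txn.upsert("k=1", Map.of("a", a + 1));
        System.out.println(txn.read("k=1", "a", 0));
    }
}
```

Transactions that never touch the same row twice would leave the overlay 
empty, which matches the observation that most transactions won’t need it.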


From: Blake Eggleston <beggles...@apple.com>
Date: Tuesday, 7 June 2022 at 00:21
To: dev@cassandra.apache.org
Subject: Re: CEP-15 multi key transaction syntax
That's a good question. I'd lean towards returning the final state of things, 
although I could understand expecting to see intermediate state. Regarding 
range tombstones, we could require them to precede any updates like selects, 
but there's still the question of how to handle multiple updates to the same 
cell when the user has requested we return the post-update state of the cell.

Re: CEP-15 multi key transaction syntax

2022-06-08 Thread bened...@apache.org
I imagine that conditions would be evaluated against the state prior to the 
execution of the statement against which they are being evaluated, but after 
the prior statements. I think that should be OK to reason about.

i.e. we might have a contrived example like:

BEGIN TRANSACTION
UPDATE tbl SET a = 1 WHERE k = 1 AS q1
UPDATE tbl SET a = q1.a + 1 WHERE k = 1 AS q2
COMMIT TRANSACTION IF q1.a = 0 AND q2.a = 1

So q1 would read a = 0, but q2 would read a = 1 and set a = 2.

I think this is probably adequately intuitive? It is a bit atypical to have 
conditions that wrap the whole transaction though.

We have another option, of course, which is to offer IF x ROLLBACK TRANSACTION, 
which is closer to SQL, which would translate the above to:

BEGIN TRANSACTION
SELECT a FROM tbl WHERE k = 1 AS q0
IF q0.a != 0 ROLLBACK TRANSACTION
UPDATE tbl SET a = 1 WHERE k = 1 AS q1
IF q1.a != 1 ROLLBACK TRANSACTION
UPDATE tbl SET a = q1.a + 1 WHERE k = 1 AS q2
COMMIT TRANSACTION

This is less succinct, but might be more familiar to users. We could also 
eschew the ability to read from UPDATE statements entirely in this scheme, as 
this would then look very much like SQL.


From: Blake Eggleston 
Date: Wednesday, 8 June 2022 at 20:59
To: dev@cassandra.apache.org 
Subject: Re: CEP-15 multi key transaction syntax
> It affects not just RETURNING but also conditions that are evaluated against 
> the row, and if we in future permit using the values from one select in a 
> function call / write to another table (which I imagine we will).

I hadn’t thought about that... using intermediate or even post update values in 
condition evaluation or function calls seems like it would make it difficult to 
understand why a condition is or is not applying. On the other hand, it would 
be powerful, especially when using things like database generated values in 
queries (auto incrementing integer clustering keys or server generated 
timeuuids being examples that come to mind). Additionally, if we return these 
values, I guess that would solve the visibility issues I’m worried about.

Agreed intermediate values would be straightforward to calculate though.


On Jun 6, 2022, at 4:33 PM, bened...@apache.org 
wrote:

It affects not just RETURNING but also conditions that are evaluated against 
the row, and if we in future permit using the values from one select in a 
function call / write to another table (which I imagine we will).

I think that for it to be intuitive we need it to make sense sequentially, 
which means either calculating it or restricting what can be stated (or 
abandoning the syntax).

If we initially forbade multiple UPDATE/INSERT to the same key, but permitted 
overlapping DELETE (and as many SELECT as you like) that would perhaps make it 
simple enough? Require for now that SELECTS go first, then DELETE and then 
INSERT/UPDATE (or vice versa, depending what we want to make simple)?

FWIW, I don’t think this is terribly onerous to calculate either, since it’s 
restricted to single rows we are updating, so we could simply maintain a 
collection of rows and upsert into them as we process the execution. Most 
transactions won’t need it, I suspect, so we don’t need to worry about perfect 
efficiency.


From: Blake Eggleston <beggles...@apple.com>
Date: Tuesday, 7 June 2022 at 00:21
To: dev@cassandra.apache.org
Subject: Re: CEP-15 multi key transaction syntax
That's a good question. I'd lean towards returning the final state of things, 
although I could understand expecting to see intermediate state. Regarding 
range tombstones, we could require them to precede any updates like selects, 
but there's still the question of how to handle multiple updates to the same 
cell when the user has requested we return the post-update state of the cell.



On Jun 6, 2022, at 4:00 PM, bened...@apache.org 
wrote:

> if multiple updates end up touching the same cell, I’d expect the last one to 
> win

Hmm, yes I suppose range tombstones are a plausible and reasonable thing to mix 
with inserts over the same key range.

What’s your present thinking about the idea of handling returning the values as 
of a given point in the sequential execution then?

The succinct syntax is I think highly desirable for user experience, but this 
does complicate it a bit if we want to remain intuitive.




From: Blake Eggleston <beggles...@apple.com>
Date: Monday, 6 June 2022 at 23:17
To: dev@cassandra.apache.org
Subject: Re: CEP-15 multi key transaction syntax
Hi all,

Thanks for all the input and questions so far. Glad people are excited about 
this!

I didn’t have any free time to respond this weekend, although it looks like 
Benedict has responded to most of the questions so far, so if I don’t respond 
to a question you asked here, you can interpret that as “what Benedict said” :).

Re: [DISCUSS] Change code style guide WRT to @Override in subclasses / interface implementations

2022-06-08 Thread bened...@apache.org
I’ve opened a PR: https://github.com/apache/cassandra-website/pull/137

Not sure what our commit norms are for the website, but I’m assuming we would 
normally expect a +1 from somebody else.

From: Dinesh Joshi 
Date: Saturday, 4 June 2022 at 19:59
To: dev@cassandra.apache.org 
Subject: Re: [DISCUSS] Change code style guide WRT to @Override in subclasses / 
interface implementations
sounds good. Lazy consensus it is.
On Jun 4, 2022, at 11:09 AM, bened...@apache.org wrote:

I think lazy consensus is good enough here, since there has been no dissent so 
far as I can tell. It’s easier to modify if we assume lazy consensus until a 
dispute arises. If anyone wants to escalate to a formal vote, feel free to say 
so.

I’ll update the wiki in a couple of days; we can always roll back if a 
dissenting voice appears.


From: Dinesh Joshi 
Date: Friday, 3 June 2022 at 18:34
To: dev@cassandra.apache.org 
Subject: Re: [DISCUSS] Change code style guide WRT to @Override in subclasses / 
interface implementations
Let’s bring it to vote? We can update the docs as we evolve the guidance but I 
think it’s in a good enough shape to publish.

On Jun 3, 2022, at 9:07 AM, bened...@apache.org wrote:

I always ask if we’re ready, get a few acks, then one or two new queries come 
out of the woodwork.

Perhaps I will just publish, and we can start addressing these queries in a 
follow-up process.

From: Dinesh Joshi 
Date: Friday, 3 June 2022 at 16:57
To: dev@cassandra.apache.org 
Subject: Re: [DISCUSS] Change code style guide WRT to @Override in subclasses / 
interface implementations
I don’t think the guide has yet been published to the official website, has it? 
Maybe we should just get it out there.
On Jun 3, 2022, at 8:54 AM, bened...@apache.org wrote:

Somebody hasn’t looked at the new style guide*, the conversation for which 
keeps rolling on and so it never quite gets promoted to the wiki. It says:

Always use @Override annotations when implementing abstract or interface 
methods or overriding a parent method.

* 
https://docs.google.com/document/d/1sjw0crb0clQin2tMgZLt_ob4hYfLJYaU4lRX722htTo


From: Josh McKenzie 
Date: Friday, 3 June 2022 at 16:14
To: dev@cassandra.apache.org 
Subject: Re: [DISCUSS] Change code style guide WRT to @Override in subclasses / 
interface implementations
> Avoid redundant @Override annotations when implementing abstract or interface 
> methods.
I'd argue they're not redundant. We're humans and infinitely fallible. :)

+1 to changing this to just always annotate for all the reasons you enumerate.

On Fri, Jun 3, 2022, at 10:16 AM, Alex Petrov wrote:
Right, my thinking matches what David has mentioned:

https://issues.apache.org/jira/browse/CASSANDRA-16096
https://lists.apache.org/thread/mkskwxn921t5bkfmnog032qvnyjk82t7

I'll make sure to update the style guide itself, too, since it looks like there 
was a vote, and intellij file is updated, just need to fixup the website.


On Fri, Jun 3, 2022, at 4:02 PM, Dinesh Joshi wrote:
So your proposal is to always add override annotation? Or are there situations 
where you don’t want to add them?


On Jun 3, 2022, at 6:53 AM, Alex Petrov  wrote:

Hi everyone,

In our style guide [1], we have a following statement:

> Avoid redundant @Override annotations when implementing abstract or interface 
> methods.

I'd like to suggest we change this.

@Override annotation in subclasses might be annoying when you're writing the 
code for the first time, or reading already familiar code, but when you're 
working on large changes and have complex class hierarchies, or multiple 
overloads for the method, it's easy to overlook methods that were not marked as 
overrides, and leave a wrong method in the code, or misinterpret the call chain.

I think @Override annotations are extremely useful and serve their purpose, 
especially when refactoring: I can change the interface, and will not only be 
pointed to all classes that do not implement the new version (which compiler 
will do anyways), but also will be pointed to the classes that, to the human 
eye, may look like they're overriding the method, but in fact they do not.

More concrete example: there is an abstract class between the interface and a 
concrete implementation: you change the interface, modify the method in the 
abstract class, but then forget to change the signature in the overridden 
implementation of the concrete class, and get the behaviour from the abstract 
class rather than the concrete implementation.
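A minimal, self-contained Java illustration of that hazard (names invented): 
the interface method was refactored from int to long and the abstract class 
was updated, but the concrete class kept the old signature, which silently 
becomes an overload instead of an override:

```java
// Illustrative only. Suppose Store.get was refactored from get(int) to get(long).
interface Store {
    String get(long key);
}

abstract class BaseStore implements Store {
    public String get(long key) { return "base"; }
}

class ConcreteStore extends BaseStore {
    // Signature not updated: this is now an overload, not an override.
    // An @Override annotation here would turn the mistake into a compile error.
    public String get(int key) { return "concrete"; }
}

public class OverrideDemo {
    public static void main(String[] args) {
        Store s = new ConcreteStore();
        // Surprising behaviour: the abstract class's method runs.
        System.out.println(s.get(1L)); // prints "base"
    }
}
```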

The question is not about taste or code aesthetics, but about making 
maintaining a large codebase that has a lot of complexity and that was evolving 
over many years simpler. If you could provide an example where @Override would 
be counter-productive or overly burdensome, we could compare this cost of 
maintenance with the cost of potential errors.

Thank you,
--Alex

[1] https://cassandra.apache.org/_/development/code_style.html




Re: CEP-15 multi key transaction syntax

2022-06-06 Thread bened...@apache.org
It affects not just RETURNING but also conditions that are evaluated against 
the row, and if we in future permit using the values from one select in a 
function call / write to another table (which I imagine we will).

I think that for it to be intuitive we need it to make sense sequentially, 
which means either calculating it or restricting what can be stated (or 
abandoning the syntax).

If we initially forbade multiple UPDATE/INSERT to the same key, but permitted 
overlapping DELETE (and as many SELECT as you like) that would perhaps make it 
simple enough? Require for now that SELECTS go first, then DELETE and then 
INSERT/UPDATE (or vice versa, depending what we want to make simple)?

FWIW, I don’t think this is terribly onerous to calculate either, since it’s 
restricted to single rows we are updating, so we could simply maintain a 
collections of rows and upsert into them as we process the execution. Most 
transactions won’t need it, I suspect, so we don’t need to worry about perfect 
efficiency.


From: Blake Eggleston 
Date: Tuesday, 7 June 2022 at 00:21
To: dev@cassandra.apache.org 
Subject: Re: CEP-15 multi key transaction syntax
That's a good question. I'd lean towards returning the final state of things, 
although I could understand expecting to see intermediate state. Regarding 
range tombstones, we could require them to precede any updates like selects, 
but there's still the question of how to handle multiple updates to the same 
cell when the user has requested we return the post-update state of the cell.


On Jun 6, 2022, at 4:00 PM, bened...@apache.org 
wrote:

> if multiple updates end up touching the same cell, I’d expect the last one to 
> win

Hmm, yes I suppose range tombstones are a plausible and reasonable thing to mix 
with inserts over the same key range.

What’s your present thinking about the idea of handling returning the values as 
of a given point in the sequential execution then?

The succinct syntax is I think highly desirable for user experience, but this 
does complicate it a bit if we want to remain intuitive.




From: Blake Eggleston <beggles...@apple.com>
Date: Monday, 6 June 2022 at 23:17
To: dev@cassandra.apache.org
Subject: Re: CEP-15 multi key transaction syntax
Hi all,

Thanks for all the input and questions so far. Glad people are excited about 
this!

I didn’t have any free time to respond this weekend, although it looks like 
Benedict has responded to most of the questions so far, so if I don’t respond 
to a question you asked here, you can interpret that as “what Benedict said” :).


Jeff,

> Is there a new keyword for “partition (not) exists” or is it inferred by the 
> select?

I'd intended this to be worked out from the select statement, ie: if the 
read/reference is null/empty, then it doesn't exist, whether you're interested 
in the partition, row, or cell. So I don't think we'd need an additional 
keyword there. I think that would address partition exists / not exists use 
cases?
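For illustration, a "not exists" style guard might then read as follows (names 
invented; this assumes null comparisons are expressible in the condition, 
borrowing the v = null convention from today's LWT IF clauses):

BEGIN TRANSACTION
SELECT v FROM tbl WHERE k = 1 AS q1
UPDATE tbl SET v = 1 WHERE k = 1
COMMIT TRANSACTION IF q1.v = null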

> And would you allow a transaction that had > 1 named select and no 
> modification statements, but commit if 1=1 ?

Yes, an unconditional commit (ie: just COMMIT TRANSACTION; without an IF) would 
be part of the syntax. Also, running a txn that doesn’t contain updates 
wouldn’t be a problem.

Patrick, I think Benedict answered your questions? Glad you got the joke :)

Alex,

> 1. Dependant SELECTs
> 2. Dependant UPDATEs
> 3. UPDATE from secondary index (or SASI)
> 5. UPDATE with predicate on non-primary key

The full primary key must be defined as part of the statement, and you can’t 
use column references to define them, so you wouldn’t be able to run these.
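In other words (illustrative, invented names), the distinction would be:

-- fine: the full primary key is given explicitly
UPDATE tbl SET v = q1.v + 1 WHERE k = 5;
-- not runnable initially: the primary key is derived from a column reference
UPDATE tbl SET v = 1 WHERE k = q1.other_k;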

> MVs

To prevent being spread too thin, both in syntax design and implementation 
work, I’d like to limit read and write operations in the initial implementation 
to vanilla selects, updates, inserts, and deletes. Once we have a solid 
implementation of multi-key/table transactions supporting foundational 
operations, we can start figuring out how the more advanced pieces can be best 
supported. Not a great answer to your question, but a related tangent I should 
have included in my initial email.

> ... RETURNING ...

I like the idea of the returning statement, but to echo what Benedict said, I 
think any scheme for specifying data to be returned should apply the same to 
select and update statements, since updates can have underlying reads that the 
user may be interested in. I’d mentioned having an optional RETURN statement in 
addition to automatically returning selects in my original email.

> ... WITH ...

I like the idea of defining statement names at the beginning of a statement, 
since I could imagine mapping names to selects might get difficult if there are 
a lot of columns in the select or update, but beginning each statement w

Re: CEP-15 multi key transaction syntax

2022-06-06 Thread bened...@apache.org
since a txn won’t always result in an 
update, in which case we’d just return the select.

Thanks,

Blake




On Jun 6, 2022, at 9:41 AM, Henrik Ingo <henrik.i...@datastax.com> wrote:

On Mon, Jun 6, 2022 at 5:28 PM bened...@apache.org wrote:
> One way to make it obvious is to require the user to explicitly type the 
> SELECTs and then to require that all SELECTs appear before 
> UPDATE/INSERT/DELETE.

Yes, I agree that SELECT statements should be required to go first.

However, I think this is sufficient and we can retain the shorter format for 
RETURNING. There only remains the issue of conditions imposed upon 
UPDATE/INSERT/DELETE statements when there are multiple statements that affect 
the same primary key. I think we can (and should) simply reject such queries 
for now, as it doesn’t make much sense to have multiple statements for the same 
primary key in the same transaction.

I guess I was thinking ahead to a future where an UPDATE write set may or may 
not intersect with a previous update due to allowing the WHERE clause to use 
secondary keys, etc.

That said, I'm not saying we SHOULD require explicit SELECT statements for 
every update. I'm sure that would be annoying more than useful. I was just 
following a train of thought.



> Returning the "result" from an UPDATE presents the question should it be the 
> data at the start of the transaction or end state?

I am inclined to only return the new values (as proposed by Alex) for the 
purpose of returning new auto-increment values etc. If you require the prior 
value, SELECT is available to express this.

That's a great point!


> I was thinking the following coordinator-side implementation would allow to 
> use also old drivers

I am inclined to return just the first result set to old clients. I think it’s 
fine to require a client upgrade to get multiple result sets.

Possibly. I just wanted to share an idea for consideration. IMO the temp table 
idea might not be too hard to implement*, but sure the syntax does feel a bit 
bolted on.

*) I'm maybe the wrong person to judge that, of course :-)

henrik

--
Henrik Ingo
+358 40 569 7354



Re: CEP-15 multi key transaction syntax

2022-06-06 Thread bened...@apache.org
> One way to make it obvious is to require the user to explicitly type the 
> SELECTs and then to require that all SELECTs appear before 
> UPDATE/INSERT/DELETE.

Yes, I agree that SELECT statements should be required to go first.

However, I think this is sufficient and we can retain the shorter format for 
RETURNING. There only remains the issue of conditions imposed upon 
UPDATE/INSERT/DELETE statements when there are multiple statements that affect 
the same primary key. I think we can (and should) simply reject such queries 
for now, as it doesn’t make much sense to have multiple statements for the same 
primary key in the same transaction.

> Returning the "result" from an UPDATE presents the question should it be the 
> data at the start of the transaction or end state?

I am inclined to only return the new values (as proposed by Alex) for the 
purpose of returning new auto-increment values etc. If you require the prior 
value, SELECT is available to express this.

> I was thinking the following coordinator-side implementation would allow to 
> use also old drivers

I am inclined to return just the first result set to old clients. I think it’s 
fine to require a client upgrade to get multiple result sets.


From: Henrik Ingo 
Date: Monday, 6 June 2022 at 15:18
To: dev@cassandra.apache.org 
Subject: Re: CEP-15 multi key transaction syntax
Thank you Blake and team!

Just some personal reactions and thoughts...

First instinct is to support the shorter format where UPDATE ... AS car  is 
also its own implicit select.

However, a subtle thing to note is that a reasonable user might expect that in 
a sequence of multiple UPDATEs, each of them is also read at the position where 
the UPDATE is in the list of statements. The fact that Accord executes all 
reads first is not at all obvious from the syntax. One way to make it obvious 
is to require the user to explicitly type the SELECTs and then to require that 
all SELECTs appear before UPDATE/INSERT/DELETE.


I like the idea of a RETURN or RETURNING keyword to specify what exactly you 
want to return. This would allow to also return results from UPDATE/INSERT 
since the user explicitly told us to do so.

Returning the "result" from an UPDATE presents the question should it be the 
data at the start of the transaction or end state? Interestingly the MongoDB 
$findAndModify operation allows you to choose between both options. There seems 
to be a valid use case for both. The obvious examples are:

  UPDATE t SET c=100 WHERE id=1 AS t RETURNING BEFORE c;
COMMIT TRANSACTION IF t.c <= 100;

I want to know the value of what c was before I replaced with a new value.

  INSERT INTO t (c) VALUES (100) AS t RETURNING AFTER d;
COMMIT TRANSACTION IF t.c <= 100;

I want to know the defaulted value of d. (...as was already pointed out in 
another email.)

  UPDATE t SET c+=1 WHERE id=1 AS t RETURNING AFTER c;
COMMIT TRANSACTION IF t.c <= 100;

I want to know the result of c after the transaction. (Which I know will be at 
most 100, but I want to know exactly.)

I kind of sympathize with the intuitive opinion that we should return the 
values from the start of the transaction, since that's how Accord works: reads 
first, updates second.


Finally, I wanted to share a thought on how to implement returning multiple 
result sets. While you don't address it, I'm assuming the driver API will gain 
new functionality for retrieving a specific result set out of many.

I was thinking the following coordinator-side implementation would also allow 
old drivers to be used:

BEGIN TRANSACTION;
   SELECT * FROM table1 WHERE  AS t1;
   SELECT * FROM table2 WHERE  AS t2;
   UPDATE something...
COMMIT TRANSACTION;
SELECT * FROM t1;
SELECT * FROM t2;

The coordinator-level implementation here would be to store the results of the 
SELECTs inside a transaction into temporary tables that the client can then 
read from after the transaction. Even though those later SELECTs are outside 
the transaction, their contents would be a constant snapshot representing the 
state of those rows at the time of the transaction. The tables should be 
visible only to the same client session, and only until the start of the next 
transaction or a timeout, whichever comes first.
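A minimal Java sketch of that coordinator-side staging idea (all class and method names here are hypothetical, not Cassandra APIs): named result sets are snapshotted per client session and invalidated when the next transaction begins. A timeout would be layered on top in a real implementation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Per-session stash of named SELECT results, so an old driver can fetch
// them after the transaction with a plain "SELECT * FROM t1".
public class SessionResultStash {
    private final Map<String, List<String>> resultsByAlias = new HashMap<>();

    // Called when a transaction commits: snapshot the rows produced by
    // each named SELECT (rows modeled as strings for simplicity).
    public void stage(String alias, List<String> rows) {
        resultsByAlias.put(alias, List.copyOf(rows)); // immutable snapshot
    }

    // A later "SELECT * FROM <alias>" on the same session reads the snapshot.
    public List<String> read(String alias) {
        return resultsByAlias.getOrDefault(alias, List.of());
    }

    // Starting the next transaction invalidates the previous stash.
    public void beginTransaction() {
        resultsByAlias.clear();
    }
}
```

Usage would be: stage each named result set at commit time, serve follow-up reads from the stash, and clear it on the next BEGIN TRANSACTION.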

henrik




On Fri, Jun 3, 2022 at 6:39 PM Blake Eggleston <beggles...@apple.com> wrote:
Hi dev@,

I’ve been working on a draft syntax for Accord transactions and wanted to bring 
what I have to the dev list to solicit feedback and build consensus before 
moving forward with it. The proposed transaction syntax is intended to be an 
extended batch syntax. Basically batches with selects, and an optional 
condition at the end. To facilitate conditions against an arbitrary number of 
select statements, you can also name the statements, and reference columns in 
the results. To cut down on the number of operations needed, select values can 
also be used in updates, including some math operations. 

Re: CEP-15 multi key transaction syntax

2022-06-05 Thread bened...@apache.org
> In the case that the condition is met, is the mutation applied at that point, 
> or has it already happened and there is something like a rollback segment?

The condition is a part of the transaction execution, so no mutation is applied 
until it has been evaluated – there is no rollback.

> What is the case when the condition is not met and what is presented to the 
> end-user?

I think you can expect to have any SELECT/RETURN (whatever we settle on) 
results returned, along with FALSE for the executed result set.

> More importantly, what happens with respect to the A & I in ACID when the 
> transaction is applied?

Not sure what you mean? They’re maintained at all times, but would be happy to 
explain more if I can understand the question better.

> If UPDATE is used, returning the number of rows changed would be helpful.

Do we support updates that affect an uncertain number of rows at the moment? 
Besides DELETE, for which we don’t want to calculate it, as it’s costlier.

> Is this something that can be done interactively in cqlsh or does it all have 
> to be submitted in one statement block?

These are non-interactive, so it needs to be declared in a single statement. I 
think Accord can be extended to natively support interactive transactions in 
future, in a manner consistent with its fast non-interactive transactions, but 
that’s a whole other endeavour.

From: Patrick McFadin 
Date: Sunday, 5 June 2022 at 01:47
To: dev 
Subject: Re: CEP-15 multi key transaction syntax
I've been waiting for this email! I'll echo what Jeff said about how exciting 
this is for the project.

On the SELECT inside the transaction:

In the first example, I'm making an assumption that you are doing a select on a 
partition key and only expect one result but is any valid CQL SELECT allowed 
here? If 'model' were a non-partition key column name and was indexed, then you 
could potentially have multiple rows returned and that isn't an allowed 
operation. Are only partition key lookups allowed or is there some logic 
looking for only one row?

I'm asking because I can see reverse time-series models where you select the 
latest temperature:

  SELECT temperature FROM weather_station WHERE id=1234 AND DATE='2022-06-04' 
LIMIT 1;

(also, horrible example. Everyone knows that the return value for a 
Pinto.is_running will always evaluate to FALSE)

On COMMIT TRANSACTION:

So much to unpack here. In the case that the condition is met, is the mutation 
applied at that point, or has it already happened and there is something like a 
rollback segment? What is the case when the condition is not met and what is 
presented to the end-user? More importantly, what happens with respect to the A 
& I in ACID when the transaction is applied?

If UPDATE is used, returning the number of rows changed would be helpful.

Is this something that can be done interactively in cqlsh or does it all have 
to be submitted in one statement block?

I'll stop here for now.

Patrick

On Sat, Jun 4, 2022 at 3:34 PM bened...@apache.org wrote:
> The returned result set is after the updates are applied?
Returning the prior values is probably more powerful, as you can perform 
unconditional updates and respond to the prior state, that you otherwise would 
not know. It’s also simpler to implement.

My inclination is to require that SELECT statements are declared first, so that 
we leave open the option of (in future) supporting SELECT statements in any 
place in the transaction, returning the values as of their position in a 
sequential execution of the statements.

> And would you allow a transaction that had > 1 named select and no 
> modification statements, but commit if 1=1 ?

My preference is that the IF condition is anyway optional, as it is much more 
obvious to a user than concocting some always-true condition. But yes, 
read-only transactions involving multiple tables will definitely be supported.


From: Jeff Jirsa <jji...@gmail.com>
Date: Saturday, 4 June 2022 at 22:49
To: dev@cassandra.apache.org
Subject: Re: CEP-15 multi key transaction syntax

And would you allow a transaction that had > 1 named select and no modification 
statements, but commit if 1=1 ?

> On Jun 4, 2022, at 2:45 PM, Jeff Jirsa <jji...@gmail.com> wrote:
>
> 
>
>> On Jun 3, 2022, at 8:39 AM, Blake Eggleston <beggles...@apple.com> wrote:
>>
>> Hi dev@,
>
> First, I’m ridiculously excited to see this.
>
>>
>> I’ve been working on a draft syntax for Accord transactions and wanted to 
>> bring what I have to the dev list to solicit feedback and build consensus 
>> before moving forward with it. The proposed transaction syntax is intended 
>> to be an exte

Re: CEP-15 multi key transaction syntax

2022-06-05 Thread bened...@apache.org
> 1. Dependant SELECTs
> 2. Dependant UPDATEs
> 3. UPDATE from secondary index (or SASI)
> 5. UPDATE with predicate on non-primary key

So, I think these are all likely to be rejected the same way they are today, as 
the individual statements would not parse [1,2] or be validated [3,5], as I’m 
fairly sure UPDATE and INSERT require a primary key to be specified and that 
only SELECT supports secondary indexes.

It could be nice to have dedicated messages explaining the limitation for 
[1,2], at least until the restriction is lifted.

> 4. The presence of a materialized view

This is a bit more complex. I think in principle MVs could function as they do 
today, i.e. with eventually consistent update. MVs remain experimental however, 
with known shortcomings, and I am not keen to validate them with Accord.

Since I think our plan is to opt tables into transactional behaviour (to 
minimise the potential for misusing them, unlike LWTs, which are easily used 
unsafely), I would prefer to ensure that MVs are mutually exclusive with 
transactions for now.

I anticipate follow up work will deliver global secondary indexes on top of 
Accord. I’ve no idea if that will replace or coexist with MVs as they exist 
today, perhaps it will be possible to create MVs and specify their consistency 
properties on creation once the existing MVs are reliable.

> 6. Large SELECTs Are Actually Okay But Look Like They Shouldn't Be

I’m not sure what our plans are around aggregations and transactions, perhaps 
Blake can speak more to his thoughts. Since aggregations are relatively new I 
am inclined to exclude them initially, at least for write transactions, since 
LWTs do not support them.

Otherwise we will need some deterministic measure for aborting transactions – 
even after we have agreed to execute them. E.g. a 5000 row limit on live rows 
read as input before a transaction is converted to a no-op. We will have to be 
especially careful here for unconditional transactions without any 
SELECT/RETURN, as these must still wait for the result of execution before 
notifying the user of the outcome, if it may be aborted.

Suggestions welcome here.

> 7. Triggers

Good question!

It looks like LWTs don’t integrate with triggers today, so I guess we can 
ignore them too. I don’t know how stable triggers are, or how widely they are 
used. I’m sure we have some use cases, but I’m not aware of any community 
members that use them so it is likely sparse.

In principle a trigger could modify the transaction submitted by a client to 
include additional updates, but this would likely require changes to the 
trigger API. I anticipate ignoring them until we have community demand.

> Random Syntax Thoughts

I like the RETURNING syntax, and consistency with SQL dialects is a plus. I'm 
concerned about consistency with SELECT statements, though: these already 
imply RETURNING, but we might use them to compute constraint clauses on tables 
we are not updating, and this would leave no consistent way of doing so 
without returning all of a table's fields to the user, at least not without 
multiple SELECT statements over the same data.

We could introduce a new keyword such as CONSTRAIN in this case, with syntax 
equivalent to UPDATE/DELETE but supporting RETURNING and by default not 
returning any fields?

The idea of a RETURNING syntax on the transaction itself was previously floated 
and is nice, but I worry about having multiple inconsistent ways of returning 
data that can be co-mingled. How would you envisage these keywords interacting?


From: Alex Miller 
Date: Sunday, 5 June 2022 at 03:39
To: dev@cassandra.apache.org 
Subject: Re: CEP-15 multi key transaction syntax
All of my text below largely extends your question of syntax in a few
directions:
 - What is the user experience of trying to run different statements
with this syntax?
 - How do transactions interact with other Cassandra constructs?
 - What are the execution semantics of these statements?
which I do acknowledge is a moderate re-scoping of the question.

Also, please take my understanding of existing CQL and DDL constructs with
an impractically large grain of salt.


Undesirable Transactions
------------------------

I tried to match CQL docs up against a number of ways of writing statements
which Accord wouldn't like, or users might not like the effect of running.
I'm assuming it'd be good to think through how one would express the error
message or guidance given to users?  Or at least just making sure I
understand correctly what is writable but not executable or desirable.

=== Likely Unexecutable

All the cases here are predicated on the lack of automatic reconnaissance
transaction support.

1. Dependant SELECTs

CREATE TABLE users (name text primary key, home_state text);
CREATE TABLE states (name text primary key, population int);

BEGIN TRANSACTION;
  /*1*/ SELECT home_state FROM users WHERE name='blake' AS user;
  /*2*/ SELECT population FROM states 

Re: CEP-15 multi key transaction syntax

2022-06-04 Thread bened...@apache.org
> The returned result set is after the updates are applied?

Returning the prior values is probably more powerful, as you can perform 
unconditional updates and respond to the prior state, that you otherwise would 
not know. It’s also simpler to implement.

My inclination is to require that SELECT statements are declared first, so that 
we leave open the option of (in future) supporting SELECT statements in any 
place in the transaction, returning the values as of their position in a 
sequential execution of the statements.

> And would you allow a transaction that had > 1 named select and no 
> modification statements, but commit if 1=1 ?

My preference is that the IF condition is anyway optional, as it is much more 
obvious to a user than concocting some always-true condition. But yes, 
read-only transactions involving multiple tables will definitely be supported.


From: Jeff Jirsa 
Date: Saturday, 4 June 2022 at 22:49
To: dev@cassandra.apache.org 
Subject: Re: CEP-15 multi key transaction syntax

And would you allow a transaction that had > 1 named select and no modification 
statements, but commit if 1=1 ?

> On Jun 4, 2022, at 2:45 PM, Jeff Jirsa  wrote:
>
> 
>
>> On Jun 3, 2022, at 8:39 AM, Blake Eggleston  wrote:
>>
>> Hi dev@,
>
> First, I’m ridiculously excited to see this.
>
>>
>> I’ve been working on a draft syntax for Accord transactions and wanted to 
>> bring what I have to the dev list to solicit feedback and build consensus 
>> before moving forward with it. The proposed transaction syntax is intended 
>> to be an extended batch syntax. Basically batches with selects, and an 
>> optional condition at the end. To facilitate conditions against an arbitrary 
>> number of select statements, you can also name the statements, and reference 
>> columns in the results. To cut down on the number of operations needed, 
>> select values can also be used in updates, including some math operations. 
>> Parameterization of literals is supported the same as other statements.
>>
>> Here's an example selecting a row from 2 tables, and issuing updates for 
>> each row if a condition is met:
>>
>> BEGIN TRANSACTION;
>> SELECT * FROM users WHERE name='blake' AS user;
>> SELECT * from cars WHERE model='pinto' AS car;
>> UPDATE users SET miles_driven = user.miles_driven + 30 WHERE name='blake';
>> UPDATE cars SET miles_driven = car.miles_driven + 30 WHERE model='pinto';
>> COMMIT TRANSACTION IF car.is_running;
>>
>> This can be simplified by naming the updates with an AS [name] syntax. If 
>> updates are named, a corresponding read is generated behind the scenes and 
>> its values inform the update.
>>
>> Here's an example, the query is functionally identical to the previous 
>> query. In the case of the user update, a read is still performed behind the 
>> scenes to enable the calculation of miles_driven + 30, but doesn't need to 
>> be named since it's not referenced anywhere else.
>>
>> BEGIN TRANSACTION;
>> UPDATE users SET miles_driven += 30 WHERE name='blake';
>> UPDATE cars SET miles_driven += 30 WHERE model='pinto' AS car;
>> COMMIT TRANSACTION IF car.is_running;
>>
>> Here’s another example, performing the canonical bank transfer:
>>
>> BEGIN TRANSACTION;
>> UPDATE accounts SET balance += 100 WHERE name='blake' AS blake;
>> UPDATE accounts SET balance -= 100 WHERE name='benedict' AS benedict;
>> COMMIT TRANSACTION IF blake EXISTS AND benedict.balance >= 100;
>>
>> As you can see from the examples, column values can be referenced via a dot 
>> syntax, ie: [statement name].[column] -> select1.value. Since the read portion 
>> of the transaction is performed before evaluating conditions or applying 
>> updates, values read can be freely applied to non-primary key values in 
>> updates. Select statements used either in checking a condition or creating 
>> an update must be restricted to a single row, either by specifying the full 
>> primary key or a limit of 1. Multi-row selects are allowed, but only for 
>> returning data to the client (see below).
>>
>> For evaluating conditions, = & != are available for all types, <, <=, >, >= 
>> are available for numerical types, and EXISTS, NOT EXISTS can be used for 
>> partitions, rows, and values. If any column references cannot be satisfied 
>> by the result of the reads, the condition implicitly fails. This prevents 
>> having to include a bunch of exists statements.
>
> Is there a new keyword for “partition (not) exists” or is it inferred by the 
> select?
>
>>
>> On completion, an operation would return a boolean value indicating the 
>> operation had been applied, and a result set for each named select (but not 
>> named update). We could also support an optional RETURN keyword, which would 
>> allow the user to only return specific named selects (ie: RETURN select1, 
>> select2).
>>
>
> The returned result set is after the updates are applied?
>
>
>> Let me know what you think!
>>
>> Blake
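The condition semantics proposed above (an unsatisfiable column reference makes the condition false rather than raising an error, so users need not spell out EXISTS guards) can be sketched with a toy Java evaluator. This is purely illustrative, not the proposed implementation:

```java
import java.util.Map;
import java.util.Optional;

// Toy evaluator for one comparison, e.g. "benedict.balance >= 100".
public class ConditionSketch {
    // readResults maps "<name>.<column>" to a value; absence means the
    // transaction's reads returned no such row or column.
    public static boolean evaluateGte(Map<String, Integer> readResults,
                                      String columnRef, int bound) {
        Optional<Integer> v = Optional.ofNullable(readResults.get(columnRef));
        // An unsatisfiable reference implicitly fails the condition,
        // rather than surfacing as an error to the client.
        return v.map(x -> x >= bound).orElse(false);
    }
}
```

So the bank-transfer condition above is false (and the transaction not applied) both when benedict's balance is under 100 and when benedict's row does not exist at all.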


Re: [DISCUSS] Change code style guide WRT to @Override in subclasses / interface implementations

2022-06-04 Thread bened...@apache.org
I think lazy consensus is good enough here, since there has been no dissent so 
far as I can tell. It’s easier to modify if we assume lazy consensus until a 
dispute arises. If anyone wants to escalate to a formal vote, feel free to say 
so.

I’ll update the wiki in a couple of days; we can always roll back if a 
dissenting voice appears.


From: Dinesh Joshi 
Date: Friday, 3 June 2022 at 18:34
To: dev@cassandra.apache.org 
Subject: Re: [DISCUSS] Change code style guide WRT to @Override in subclasses / 
interface implementations
Let’s bring it to vote? We can update the docs as we evolve the guidance but I 
think it’s in a good enough shape to publish.

On Jun 3, 2022, at 9:07 AM, bened...@apache.org wrote:

I always ask if we’re ready, get a few acks, then one or two new queries come 
out of the woodwork.

Perhaps I will just publish, and we can start addressing these queries in a 
follow-up process.

From: Dinesh Joshi 
Date: Friday, 3 June 2022 at 16:57
To: dev@cassandra.apache.org 
Subject: Re: [DISCUSS] Change code style guide WRT to @Override in subclasses / 
interface implementations
I don’t think the guide has yet been published to the official website, has it? 
Maybe we should just get it out there.
On Jun 3, 2022, at 8:54 AM, bened...@apache.org wrote:

Somebody hasn’t looked at the new style guide*, the conversation for which 
keeps rolling on and so it never quite gets promoted to the wiki. It says:

Always use @Override annotations when implementing abstract or interface 
methods or overriding a parent method.

* 
https://docs.google.com/document/d/1sjw0crb0clQin2tMgZLt_ob4hYfLJYaU4lRX722htTo


From: Josh McKenzie 
Date: Friday, 3 June 2022 at 16:14
To: dev@cassandra.apache.org 
Subject: Re: [DISCUSS] Change code style guide WRT to @Override in subclasses / 
interface implementations
> Avoid redundant @Override annotations when implementing abstract or interface 
> methods.
I'd argue they're not redundant. We're humans and infinitely fallible. :)

+1 to changing this to just always annotate for all the reasons you enumerate.

On Fri, Jun 3, 2022, at 10:16 AM, Alex Petrov wrote:
Right, my thinking matches what David has mentioned:

https://issues.apache.org/jira/browse/CASSANDRA-16096
https://lists.apache.org/thread/mkskwxn921t5bkfmnog032qvnyjk82t7

I'll make sure to update the style guide itself, too, since it looks like there 
was a vote, and intellij file is updated, just need to fixup the website.


On Fri, Jun 3, 2022, at 4:02 PM, Dinesh Joshi wrote:
So your proposal is to always add override annotation? Or are there situations 
where you don’t want to add them?


On Jun 3, 2022, at 6:53 AM, Alex Petrov  wrote:

Hi everyone,

In our style guide [1], we have the following statement:

> Avoid redundant @Override annotations when implementing abstract or interface 
> methods.

I'd like to suggest we change this.

@Override annotation in subclasses might be annoying when you're writing the 
code for the first time, or reading already familiar code, but when you're 
working on large changes and have complex class hierarchies, or multiple 
overloads for the method, it's easy to overlook methods that were not marked as 
overrides, and leave a wrong method in the code, or misinterpret the call chain.

I think @Override annotations are extremely useful and serve their purpose, 
especially when refactoring: I can change the interface, and will not only be 
pointed to all classes that do not implement the new version (which compiler 
will do anyways), but also will be pointed to the classes that, to the human 
eye, may look like they're overriding the method, but in fact they do not.

More concrete example: there is an abstract class between the interface and a 
concrete implementation. You change the interface and modify the method in the 
abstract class, but then forget to change the signature in the overriding 
implementation in the concrete class, and get the behaviour of the abstract 
class rather than the concrete implementation.
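That failure mode can be reproduced in a few lines of Java (hypothetical classes, for illustration only): after an interface method gains a parameter and the abstract class is updated, a concrete class that kept the old signature silently becomes an overload instead of an override, and calls resolve to the abstract class. An @Override annotation on the concrete method would have turned this into a compile-time error.

```java
public class OverrideHazard {
    // The interface method gained a "verbose" parameter in a refactor.
    interface Codec { String encode(String s, boolean verbose); }

    static abstract class AbstractCodec implements Codec {
        @Override
        public String encode(String s, boolean verbose) { return "abstract:" + s; }
    }

    static class JsonCodec extends AbstractCodec {
        // Old signature: this is now an overload, NOT an override. Adding
        // @Override here would make the compiler reject the mistake.
        public String encode(String s) { return "json:" + s; }
    }

    public static void main(String[] args) {
        Codec c = new JsonCodec();
        // Dispatches to AbstractCodec.encode: the "wrong method" bug.
        System.out.println(c.encode("x", false)); // prints abstract:x
    }
}
```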

The question is not about taste or code aesthetics, but about making 
maintaining a large codebase that has a lot of complexity and that was evolving 
over many years simpler. If you could provide an example where @Override would 
be counter-productive or overly burdensome, we could compare this cost of 
maintenance with the cost of potential errors.

Thank you,
--Alex

[1] https://cassandra.apache.org/_/development/code_style.html




Re: [DISCUSS] Change code style guide WRT to @Override in subclasses / interface implementations

2022-06-03 Thread bened...@apache.org
I always ask if we’re ready, get a few acks, then one or two new queries come 
out of the woodwork.

Perhaps I will just publish, and we can start addressing these queries in a 
follow-up process.

From: Dinesh Joshi 
Date: Friday, 3 June 2022 at 16:57
To: dev@cassandra.apache.org 
Subject: Re: [DISCUSS] Change code style guide WRT to @Override in subclasses / 
interface implementations
I don’t think the guide has yet been published to the official website, has it? 
Maybe we should just get it out there.
On Jun 3, 2022, at 8:54 AM, bened...@apache.org wrote:

Somebody hasn’t looked at the new style guide*, the conversation for which 
keeps rolling on and so it never quite gets promoted to the wiki. It says:

Always use @Override annotations when implementing abstract or interface 
methods or overriding a parent method.

* 
https://docs.google.com/document/d/1sjw0crb0clQin2tMgZLt_ob4hYfLJYaU4lRX722htTo


From: Josh McKenzie 
Date: Friday, 3 June 2022 at 16:14
To: dev@cassandra.apache.org 
Subject: Re: [DISCUSS] Change code style guide WRT to @Override in subclasses / 
interface implementations
> Avoid redundant @Override annotations when implementing abstract or interface 
> methods.
I'd argue they're not redundant. We're humans and infinitely fallible. :)

+1 to changing this to just always annotate for all the reasons you enumerate.

On Fri, Jun 3, 2022, at 10:16 AM, Alex Petrov wrote:
Right, my thinking matches what David has mentioned:

https://issues.apache.org/jira/browse/CASSANDRA-16096
https://lists.apache.org/thread/mkskwxn921t5bkfmnog032qvnyjk82t7

I'll make sure to update the style guide itself, too, since it looks like there 
was a vote, and intellij file is updated, just need to fixup the website.


On Fri, Jun 3, 2022, at 4:02 PM, Dinesh Joshi wrote:
So your proposal is to always add override annotation? Or are there situations 
where you don’t want to add them?


On Jun 3, 2022, at 6:53 AM, Alex Petrov  wrote:

Hi everyone,

In our style guide [1], we have the following statement:

> Avoid redundant @Override annotations when implementing abstract or interface 
> methods.

I'd like to suggest we change this.

@Override annotation in subclasses might be annoying when you're writing the 
code for the first time, or reading already familiar code, but when you're 
working on large changes and have complex class hierarchies, or multiple 
overloads for the method, it's easy to overlook methods that were not marked as 
overrides, and leave a wrong method in the code, or misinterpret the call chain.

I think @Override annotations are extremely useful and serve their purpose, 
especially when refactoring: I can change the interface, and will not only be 
pointed to all classes that do not implement the new version (which compiler 
will do anyways), but also will be pointed to the classes that, to the human 
eye, may look like they're overriding the method, but in fact they do not.

More concrete example: there is an abstract class between the interface and a 
concrete implementation. You change the interface and modify the method in the 
abstract class, but then forget to change the signature in the overriding 
implementation in the concrete class, and get the behaviour of the abstract 
class rather than the concrete implementation.

The question is not about taste or code aesthetics, but about making 
maintaining a large codebase that has a lot of complexity and that was evolving 
over many years simpler. If you could provide an example where @Override would 
be counter-productive or overly burdensome, we could compare this cost of 
maintenance with the cost of potential errors.

Thank you,
--Alex

[1] https://cassandra.apache.org/_/development/code_style.html




Re: [DISCUSS] Change code style guide WRT to @Override in subclasses / interface implementations

2022-06-03 Thread bened...@apache.org
Somebody hasn’t looked at the new style guide*, the conversation for which 
keeps rolling on and so it never quite gets promoted to the wiki. It says:

Always use @Override annotations when implementing abstract or interface 
methods or overriding a parent method.

* 
https://docs.google.com/document/d/1sjw0crb0clQin2tMgZLt_ob4hYfLJYaU4lRX722htTo


From: Josh McKenzie 
Date: Friday, 3 June 2022 at 16:14
To: dev@cassandra.apache.org 
Subject: Re: [DISCUSS] Change code style guide WRT to @Override in subclasses / 
interface implementations
> Avoid redundant @Override annotations when implementing abstract or interface 
> methods.
I'd argue they're not redundant. We're humans and infinitely fallible. :)

+1 to changing this to just always annotate for all the reasons you enumerate.

On Fri, Jun 3, 2022, at 10:16 AM, Alex Petrov wrote:
Right, my thinking matches what David has mentioned:

https://issues.apache.org/jira/browse/CASSANDRA-16096
https://lists.apache.org/thread/mkskwxn921t5bkfmnog032qvnyjk82t7

I'll make sure to update the style guide itself, too, since it looks like there 
was a vote, and intellij file is updated, just need to fixup the website.


On Fri, Jun 3, 2022, at 4:02 PM, Dinesh Joshi wrote:
So your proposal is to always add override annotation? Or are there situations 
where you don’t want to add them?


On Jun 3, 2022, at 6:53 AM, Alex Petrov  wrote:

Hi everyone,

In our style guide [1], we have the following statement:

> Avoid redundant @Override annotations when implementing abstract or interface 
> methods.

I'd like to suggest we change this.

@Override annotation in subclasses might be annoying when you're writing the 
code for the first time, or reading already familiar code, but when you're 
working on large changes and have complex class hierarchies, or multiple 
overloads for the method, it's easy to overlook methods that were not marked as 
overrides, and leave a wrong method in the code, or misinterpret the call chain.

I think @Override annotations are extremely useful and serve their purpose, 
especially when refactoring: I can change the interface, and will not only be 
pointed to all classes that do not implement the new version (which compiler 
will do anyways), but also will be pointed to the classes that, to the human 
eye, may look like they're overriding the method, but in fact they do not.

More concrete example: there is an abstract class between the interface and a 
concrete implementation. You change the interface and modify the method in the 
abstract class, but then forget to change the signature in the overriding 
implementation in the concrete class, and get the behaviour of the abstract 
class rather than the concrete implementation.

The question is not about taste or code aesthetics, but about making 
maintaining a large codebase that has a lot of complexity and that was evolving 
over many years simpler. If you could provide an example where @Override would 
be counter-productive or overly burdensome, we could compare this cost of 
maintenance with the cost of potential errors.

Thank you,
--Alex

[1] https://cassandra.apache.org/_/development/code_style.html




Re: Updating our Code Contribution/Style Guide

2022-06-01 Thread bened...@apache.org
I’ve modified just the first sentence, to:

Dependencies expose the project to ongoing audit and maintenance burdens, and 
security risks. We wish to minimise our declared and transitive dependencies 
and to standardise mechanisms and solutions in the codebase. Adding new 
dependencies requires community consensus via a [DISCUSS] thread on the 
dev@cassandra.apache.org mailing list.

Since it’s not only security risks we care about. But really this is all 
nitpicking.


From: Mick Semb Wever 
Date: Wednesday, 1 June 2022 at 10:51
To: dev@cassandra.apache.org 
Subject: Re: Updating our Code Contribution/Style Guide




On Mon, 30 May 2022 at 22:37, Ekaterina Dimitrova <e.dimitr...@gmail.com> wrote:
I also like it, thank you for putting it together. We can always add more and 
more, but I think the current one is already quite extensive. I like the 
dependency management point.



The dependency management paragraph, no objections, but the wording can be 
shortened…

For example,

Dependencies to the project are difficult to maintain over time and expose 
security flaws that are difficult for us to continuously audit. We wish to 
minimise our declared and transitive dependencies and to standardise mechanisms 
and solutions in the codebase. Adding new dependencies requires community 
consensus via a [DISCUSS] thread on the 
dev@cassandra.apache.org mailing list.






Re: Updating our Code Contribution/Style Guide

2022-05-31 Thread bened...@apache.org
I would be OK with failing the build for Javadoc warnings, and having a single 
cleanup pass to fix this. I think the kinds of issues we have (mismatching 
names/parameters) are the least of our documentation problems though.

I think it would be great to introduce guidance for authoring Javadoc, but I 
haven’t given much thought how to express best practices here, or even what 
best practice looks like.

I suspect our biggest documentation problem is a matter of incentives and 
priorities more than guidance though.

There are some things we could perhaps do to improve matters in that respect, 
such as requiring top level interfaces to have Javadoc on every method and 
failing the build if they do not (with some easy way to disable it, to avoid 
low quality boilerplate Javadoc to silence the warning), but I would suggest we 
open a separate thread to consider this topic.


From: Stefan Miklosovic 
Date: Tuesday, 31 May 2022 at 14:20
To: dev@cassandra.apache.org 
Subject: Re: Updating our Code Contribution/Style Guide
Hi Benedict, all,

I do not want to hijack this thread, if we want to have separate
discussion about that I can open one but anyway ...

What do you think about Javadocs? I do not see them mentioned, but Javadocs
are technically code as well (well, they are written in the source code,
right?).

We have a lot of Javadoc warnings / errors when one builds the project.
What I noticed is that the build reports at most 100 Javadoc issues, but
there are _way more_ of them. I already started to clean it up, but after
a while I realized it is too big a change to do during a "rainy afternoon",
and I was not even sure which branch I should target.

So, do we want to target a cleanup of Javadocs? I think we should target
just trunk. Do you think we need any formal documentation / guidance on
how to write Javadocs? Anything specific? I think we should strive for
having zero Javadoc issues and fail the build outright if any are found.

If we do not seek any significant improvement in this area (as a
community), I would like to have it explicitly stated.

Regards

On Tue, 31 May 2022 at 14:46, bened...@apache.org  wrote:
>
> I think that it is hard to define what the right extent of a patch is, but it 
> should be the minimal scope that the author feels sufficient to safely 
> address the concerns of the patch. I have added a sentence to this effect in 
> the top section of the proposal.
>
>
>
> My view (not propagated to the document) is that we should generally as a 
> rule avoid pure mechanistic clean-up work, unless it is associated with an 
> important refactor (and hence, likely to be trunk only). I would normally 
> give the cleanup its own commits for review, but not at merge.
>
>
>
> We don’t currently have any project norms around linter warnings, only errors 
> that we enforce with checkstyle and ecj. So I think right now it’s down to 
> personal taste at commit time, as part of any patch-related cleanup.
>
>
>
> Do we want to try pursuing zero warnings for commits by some linters? This 
> might be a good thing, if we are willing to be liberal with @SuppressWarnings. 
> I’m not sure how we would transition, though, with so many existing warnings.
>
>
>
>
>
>
>
> From: Ekaterina Dimitrova 
> Date: Monday, 30 May 2022 at 21:37
> To: dev@cassandra.apache.org 
> Subject: Re: Updating our Code Contribution/Style Guide
>
> I also like it, thank you for putting it together. We can always add more and 
> more, but I think the current one is already quite extensive. I like the 
> dependency management point.
>
>
>
> I want to clarify a bit only one point. Any kind of old warnings and code 
> cleaning. If it is not immediately related to the patch, we should do those 
> in trunk and if it requires a lot of noise - probably in a separate 
> commit/ticket, no? Is this a valid statement? I've seen different opinions 
> but I feel it is good to have a consensus and this feels like a good time to 
> mention it. I mean cases where there are classes with 20 warnings, etc and 
> they may exist since early versions for example.
>
>
>
> Best regards,
>
> Ekaterina
>
>
>
> On Mon, 30 May 2022 at 14:10, Derek Chen-Becker  wrote:
>
> Looks great!
>
>
>
> On Mon, May 30, 2022, 5:37 AM bened...@apache.org  wrote:
>
> Any more feedback around this? Everyone happy with the latest proposal?
>
>
>
> From: bened...@apache.org 
> Date: Sunday, 15 May 2022 at 15:08
> To: dev@cassandra.apache.org 
> Subject: Re: Updating our Code Contribution/Style Guide
>
> I agree with this sentiment, but I think it will require a bit of time to 
> figure out where that balance is.
>
>
>
> I’ve inserted a mention of @Nullable, @ThreadSafe, @NotThreadSafe and 
>

Re: Updating our Code Contribution/Style Guide

2022-05-31 Thread bened...@apache.org
I think that it is hard to define what the right extent of a patch is, but it 
should be the minimal scope that the author feels sufficient to safely address 
the concerns of the patch. I have added a sentence to this effect in the top 
section of the proposal.

My view (not propagated to the document) is that we should generally as a rule 
avoid pure mechanistic clean-up work, unless it is associated with an important 
refactor (and hence, likely to be trunk only). I would normally give the 
cleanup its own commits for review, but not at merge.

We don’t currently have any project norms around linter warnings, only errors 
that we enforce with checkstyle and ecj. So I think right now it’s down to 
personal taste at commit time, as part of any patch-related cleanup.

Do we want to try pursuing zero warnings for commits by some linters? This 
might be a good thing, if we are willing to be liberal with @SuppressWarnings. 
I’m not sure how we would transition, though, with so many existing warnings.
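As an illustration of what being "liberal with @SuppressWarnings" might look like in practice, here is a hypothetical helper (invented for this sketch) that confines an unavoidable unchecked-cast warning to the smallest possible scope instead of suppressing it class-wide:

```java
// Hypothetical helper: the unavoidable unchecked cast is confined to one
// tiny, clearly named method, so the suppression covers nothing else.
final class Casts
{
    @SuppressWarnings("unchecked")
    static <T> T uncheckedCast(Object o)
    {
        return (T) o;
    }
}
```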



From: Ekaterina Dimitrova 
Date: Monday, 30 May 2022 at 21:37
To: dev@cassandra.apache.org 
Subject: Re: Updating our Code Contribution/Style Guide
I also like it, thank you for putting it together. We can always add more and 
more, but I think the current one is already quite extensive. I like the 
dependency management point.

I want to clarify a bit only one point. Any kind of old warnings and code 
cleaning. If it is not immediately related to the patch, we should do those in 
trunk and if it requires a lot of noise - probably in a separate commit/ticket, 
no? Is this a valid statement? I've seen different opinions but I feel it is 
good to have a consensus and this feels like a good time to mention it. I mean 
cases where there are classes with 20 warnings, etc and they may exist since 
early versions for example.

Best regards,
Ekaterina

On Mon, 30 May 2022 at 14:10, Derek Chen-Becker <de...@chen-becker.org> wrote:
Looks great!

On Mon, May 30, 2022, 5:37 AM bened...@apache.org wrote:
Any more feedback around this? Everyone happy with the latest proposal?

From: bened...@apache.org
Date: Sunday, 15 May 2022 at 15:08
To: dev@cassandra.apache.org
Subject: Re: Updating our Code Contribution/Style Guide
I agree with this sentiment, but I think it will require a bit of time to 
figure out where that balance is.

I’ve inserted a mention of @Nullable, @ThreadSafe, @NotThreadSafe and 
@Immutable.

> If we only use one of the two - for example @Nullable - that leaves us with 
> "We know the original author expected this to be null at some point in its 
> lifecycle and it means something" and "We have no idea if this is legacy and 
> nullable or not"

My inclination is to start building some norms around this, carefully as we 
don’t have enough experience and understanding of the pitfalls and long term 
usage. But, my preferred norms would be that properties should be assumed to be 
@Nonnull and that all nullable parameters and properties should be marked as 
@Nullable. This is how I use these properties today; Nonnull always seems 
superfluous, as it is rare to have a set of properties where null is the 
default, or where it is particularly important that the reader or compiler 
realise this.

There will be an interim period, in particular for legacy code, where this may 
lead to less clarity. But in the long term this is probably preferable to 
inconsistent usage where some areas of the codebase indicate @Nonnull without 
indicating @Nullable, and vice-versa, or where every variable and method ends 
up marked with one or the other.

This is probably also most consistent with a future world of cheap Optional 
types (i.e. Valhalla), where Nullable may begin to be replaced with Optional, 
and Nonnull may become very much the default.
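A minimal sketch of the proposed norm (the class and the local stand-in annotation are invented for illustration; real code would use javax.annotation.Nullable): unannotated properties are assumed non-null, and only genuinely nullable ones are marked.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Target;

// Local stand-in for javax.annotation.Nullable (JSR-305), so this sketch
// is self-contained; the codebase would use the library annotation.
@Target({ ElementType.FIELD, ElementType.METHOD, ElementType.PARAMETER })
@interface Nullable {}

// Invented class illustrating the norm: unannotated fields are assumed
// non-null; only genuinely nullable ones carry @Nullable.
class HostInfo
{
    private final String address;   // assumed non-null, so left unannotated

    @Nullable
    private final String rack;      // may legitimately be absent

    HostInfo(String address, @Nullable String rack)
    {
        this.address = address;
        this.rack = rack;
    }

    String address()
    {
        return address;
    }

    @Nullable
    String rack()
    {
        return rack;
    }
}
```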

That said, as stated multiple times, the author and reviewer’s determinations 
are final. This document just sets up some basic parameters/expectations.

From: Derek Chen-Becker <de...@chen-becker.org>
Date: Saturday, 14 May 2022 at 20:56
To: dev@cassandra.apache.org
Subject: Re: Updating our Code Contribution/Style Guide
On Sat, May 14, 2022 at 11:00 AM Josh McKenzie <jmcken...@apache.org> wrote:

Incidentally, I've found similar value in @ThreadSafe, const, readonly, etc - 
communications of author's intent; being able to signal to future maintainers 
helps them make modifications that are more consistent with and safer with 
regards to the original intention and guarantees of the author.

Assuming you trust those guarantees that is. :)

I think author's intent is important, which is why I also think that

Re: Updating our Code Contribution/Style Guide

2022-05-30 Thread bened...@apache.org
Any more feedback around this? Everyone happy with the latest proposal?

From: bened...@apache.org 
Date: Sunday, 15 May 2022 at 15:08
To: dev@cassandra.apache.org 
Subject: Re: Updating our Code Contribution/Style Guide
I agree with this sentiment, but I think it will require a bit of time to 
figure out where that balance is.

I’ve inserted a mention of @Nullable, @ThreadSafe, @NotThreadSafe and 
@Immutable.

> If we only use one of the two - for example @Nullable - that leaves us with 
> "We know the original author expected this to be null at some point in its 
> lifecycle and it means something" and "We have no idea if this is legacy and 
> nullable or not"

My inclination is to start building some norms around this, carefully as we 
don’t have enough experience and understanding of the pitfalls and long term 
usage. But, my preferred norms would be that properties should be assumed to be 
@Nonnull and that all nullable parameters and properties should be marked as 
@Nullable. This is how I use these properties today; Nonnull always seems 
superfluous, as it is rare to have a set of properties where null is the 
default, or where it is particularly important that the reader or compiler 
realise this.

There will be an interim period, in particular for legacy code, where this may 
lead to less clarity. But in the long term this is probably preferable to 
inconsistent usage where some areas of the codebase indicate @Nonnull without 
indicating @Nullable, and vice-versa, or where every variable and method ends 
up marked with one or the other.

This is probably also most consistent with a future world of cheap Optional 
types (i.e. Valhalla), where Nullable may begin to be replaced with Optional, 
and Nonnull may become very much the default.

That said, as stated multiple times, the author and reviewer’s determinations 
are final. This document just sets up some basic parameters/expectations.

From: Derek Chen-Becker 
Date: Saturday, 14 May 2022 at 20:56
To: dev@cassandra.apache.org 
Subject: Re: Updating our Code Contribution/Style Guide
On Sat, May 14, 2022 at 11:00 AM Josh McKenzie <jmcken...@apache.org> wrote:

Incidentally, I've found similar value in @ThreadSafe, const, readonly, etc - 
communications of author's intent; being able to signal to future maintainers 
helps them make modifications that are more consistent with and safer with 
regards to the original intention and guarantees of the author.

Assuming you trust those guarantees that is. :)

I think author's intent is important, which is why I also think that 
judicious/effective commenting and naming are important (and I'm glad that 
naming is called out in the guidelines explicitly). However, I also think that 
these are also opportunities to help the compiler and tooling help us, similar 
to how Benedict's draft calls out effective use of the type system as a way to 
encode semantics and constraints in the code. These annotations, while clunky 
and verbose, do open the door in some cases to static analysis that the Java 
compiler is incapable of doing. I don't know exactly where it is, but I think 
there's a balance between use of annotations to help tooling identify problems 
while not becoming onerous for current and future contributors. I know this is 
more difficult in Java than, say, Rust, but I'm an eternal optimist and I think 
we can find that balance :)

Cheers,

Derek

--
+---+
| Derek Chen-Becker |
| GPG Key available at https://keybase.io/dchenbecker and   |
| https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
| Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
+---+



Re: Updating our Code Contribution/Style Guide

2022-05-15 Thread bened...@apache.org
I agree with this sentiment, but I think it will require a bit of time to 
figure out where that balance is.

I’ve inserted a mention of @Nullable, @ThreadSafe, @NotThreadSafe and 
@Immutable.

> If we only use one of the two - for example @Nullable - that leaves us with 
> "We know the original author expected this to be null at some point in its 
> lifecycle and it means something" and "We have no idea if this is legacy and 
> nullable or not"

My inclination is to start building some norms around this, carefully as we 
don’t have enough experience and understanding of the pitfalls and long term 
usage. But, my preferred norms would be that properties should be assumed to be 
@Nonnull and that all nullable parameters and properties should be marked as 
@Nullable. This is how I use these properties today; Nonnull always seems 
superfluous, as it is rare to have a set of properties where null is the 
default, or where it is particularly important that the reader or compiler 
realise this.

There will be an interim period, in particular for legacy code, where this may 
lead to less clarity. But in the long term this is probably preferable to 
inconsistent usage where some areas of the codebase indicate @Nonnull without 
indicating @Nullable, and vice-versa, or where every variable and method ends 
up marked with one or the other.

This is probably also most consistent with a future world of cheap Optional 
types (i.e. Valhalla), where Nullable may begin to be replaced with Optional, 
and Nonnull may become very much the default.

That said, as stated multiple times, the author and reviewer’s determinations 
are final. This document just sets up some basic parameters/expectations.

From: Derek Chen-Becker 
Date: Saturday, 14 May 2022 at 20:56
To: dev@cassandra.apache.org 
Subject: Re: Updating our Code Contribution/Style Guide
On Sat, May 14, 2022 at 11:00 AM Josh McKenzie <jmcken...@apache.org> wrote:

Incidentally, I've found similar value in @ThreadSafe, const, readonly, etc - 
communications of author's intent; being able to signal to future maintainers 
helps them make modifications that are more consistent with and safer with 
regards to the original intention and guarantees of the author.

Assuming you trust those guarantees that is. :)

I think author's intent is important, which is why I also think that 
judicious/effective commenting and naming are important (and I'm glad that 
naming is called out in the guidelines explicitly). However, I also think that 
these are also opportunities to help the compiler and tooling help us, similar 
to how Benedict's draft calls out effective use of the type system as a way to 
encode semantics and constraints in the code. These annotations, while clunky 
and verbose, do open the door in some cases to static analysis that the Java 
compiler is incapable of doing. I don't know exactly where it is, but I think 
there's a balance between use of annotations to help tooling identify problems 
while not becoming onerous for current and future contributors. I know this is 
more difficult in Java than, say, Rust, but I'm an eternal optimist and I think 
we can find that balance :)

Cheers,

Derek

--
+---+
| Derek Chen-Becker |
| GPG Key available at https://keybase.io/dchenbecker and   |
| https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
| Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
+---+



Re: Updating our Code Contribution/Style Guide

2022-05-14 Thread bened...@apache.org
> having the policy be enums by default as opposed to just recommending them

This might be a stylistic issue. “Prefer an enum to Boolean properties” is 
imperative voice, and is meant to be read as “you should use enums, not 
booleans, unless you have overriding reasons not to” – perhaps the example 
scenarios that follow, in which they are most strongly indicated, weaken the 
effect.

I’m sure I can tweak the language, but overall I have tried to avoid making 
anything an explicit diktat in this style guide. It’s somewhere between a 
policy and a set of recommendations, as I think it is preferable to leave the 
author and reviewer to make final determinations, and also to avoid imbuing 
documents like this with too much power (and making them too contentious).

I’ll see about tweaking it along with your other suggestions.

From: Derek Chen-Becker 
Date: Saturday, 14 May 2022 at 20:45
To: dev@cassandra.apache.org 
Subject: Re: Updating our Code Contribution/Style Guide


On Sat, May 14, 2022 at 8:24 AM bened...@apache.org wrote:
> I'm in favor of codifying the usage of @NotNull and @Nullable stylistically. 
> +1

I’m in favour of the use of _one_ of @Nullable and @NotNull, preferably the 
former since we already use it and it’s more reasonable to have a default of 
non-null variables, parameters and properties.

However, I’m not confident in how to craft guidance for these annotations. I 
don’t think they should be used in every place a variable or property might be 
null, only in places where it is surprising or otherwise informative to a 
reader that they might be null. Annotating every property and variable with 
@NonNull or @Nullable would seriously pollute the screen, and probably harm 
legibility more than help.

At the very least we should mention @Nullable and invite authors to use it 
where it aids clarity, but if somebody has a good proposal for better guidance 
I’m all ears.

Yes, unfortunately there's a whole menagerie of these types of annotations, and 
I didn't mean both. If we're already using Nullable (from Findbugs) that's the 
better one anyway because you can specify the when parameter. It's also 
supported by languages like Kotlin for nullable types if we were ever 
considering a language that wouldn't require polluting the screen for a bit 
more safety ;)

Overall I think that an assumption that all variables are null unless 
explicitly marked is probably a reasonable first step if it's not already in 
place, but it's also a good intention more than a mechanism and I'll put some 
thought into other ways we can improve the situation without impacting 
legibility.



> I think extra clarity and social pressure around "Never catch Exception or 
> Throwable unless you explicitly rethrow them" sounds valuable

We already stipulate that you should always rethrow exceptions, but this is 
very vague. I will try to tidy this up. On the whole, though, we have a 
fail-fast approach to processing commands, so we mostly just propagate, with 
exception handlers existing only for clean-up purposes (except in particular 
circumstances, usually involving checked exceptions like InterruptedException). 
So we mostly do catch Throwable (and rethrow), I think, which is what informed 
the current vague formulation.

Sure, rethrow after cleanup seems reasonable, but I think that should be the 
explicit exception rather than an assumption of our approach to error handling.
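A minimal sketch of the "clean up, then always rethrow" pattern described above (the class and counter are invented for illustration, not Cassandra API):

```java
// Sketch of the fail-fast approach: a handler exists only to clean up,
// and the caught Throwable is always rethrown, never swallowed.
final class Cleanup
{
    static int cleanups = 0;

    static void runWithCleanup(Runnable task)
    {
        try
        {
            task.run();
        }
        catch (Throwable t)
        {
            cleanups++;   // release resources / roll back partial state
            throw t;      // never swallow: propagate so we fail fast
        }
    }
}
```

Note that since Java 7's precise-rethrow analysis, `throw t;` compiles here without declaring `throws Throwable`, because the try body can only throw unchecked exceptions.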

> I would recommend that we strengthen the recommendation for using enums for 
> Boolean properties for any type that is used in method parameters

I’m unsure about this. I am not against it per se, but the more enums we have 
the more clashes of enum identifiers we have, and this can cause confusion 
particularly with static imports, and in some cases the Boolean property will 
have a very obvious effect. I prefer to leave some decisions to the author, 
since we have expressed a strong preference here for the author to consider. 
But perhaps a blanket policy would do more good than harm. I could endorse it, 
and am relatively neutral.


To be clear, I think there should always be room for (clearly documented) 
exceptions, so I was thinking more of having the policy be enums by default as 
opposed to just recommending them. I've been thinking that as part of the 
guidelines it might be good to have some examples of both (here's how you can 
use an enum, but here's a case where a boolean was simple and clear), so let me 
dig around and see if I can find some code to point to.
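As a hypothetical example of the kind of before/after comparison suggested here (invented names, not from the codebase), an enum can make an otherwise opaque boolean call site self-documenting:

```java
// Invented example contrasting a boolean parameter with an enum; only the
// enum version is self-documenting at the call site.
final class Flushes
{
    enum Durability { SYNC, ASYNC }

    // Opaque at the call site: flush(true) -- true meaning what, exactly?
    static String flush(boolean sync)
    {
        return sync ? "synced" : "queued";
    }

    // Clear at the call site: flush(Durability.SYNC)
    static String flush(Durability durability)
    {
        return durability == Durability.SYNC ? "synced" : "queued";
    }
}
```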

Cheers,

Derek






Re: Updating our Code Contribution/Style Guide

2022-05-14 Thread bened...@apache.org
…clarity at the call site outweighs the 
(modest, IMHO) cost of introducing a new enum, and the enum also provides a 
useful "handle" for providing documentation on the semantics of the flag. There 
are already a lot of Boolean parameters in use in the codebase and I can take a 
look at what it would take to clean these up
  *   I like the section on Method clarity, but I would also call out 
non-trivial predicate logic as a candidate for encapsulation in its own method
  *   Should we consider @NotNull/@Nullable or other annotations besides 
@Override?
  *   In the exception handling section should we discuss using the most 
applicable exception type for the handler? I.e. don't catch Exception or 
Throwable? This probably falls under the don't silently swallow or log 
exceptions paragraph
  *   The guidance on brace placement seems to contradict the Java coding 
conventions if we place the opening brace on a new line. Is that intentional or 
am I misreading the statement? Would it be clearer to link to a specific style 
as defined somewhere (e.g. 
https://en.wikipedia.org/wiki/Indentation_style#Variant:_Java)
  *   The doc doesn't seem to cover a recommendation for braces with 
single-line bodies of conditional/loop statements. In my own experience it 
makes it easier to read if we uniformly used braces everywhere, but it does 
look like there are quite a few places in the code where we have unbraced ifs
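For illustration, a small invented snippet showing the brace conventions under discussion, assuming the new-line (Allman) placement and braces retained even for single-line bodies:

```java
// Invented snippet showing the two points raised: opening braces on their
// own line, and braces kept even when the body is a single statement.
final class BraceStyle
{
    static int clamp(int value, int max)
    {
        if (value > max)
        {
            return max;
        }
        return value;
    }
}
```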



Overall the doc is well written and carefully considered, and I appreciate all 
of the work that went into it!



Cheers,



Derek



From: "bened...@apache.org" 
Reply-To: "dev@cassandra.apache.org" 
Date: Friday, May 13, 2022 at 6:41 AM
To: "dev@cassandra.apache.org" 
Subject: RE: [EXTERNAL]Updating our Code Contribution/Style Guide






It’s been a couple of months since I opened this discussion. I think I have 
integrated the feedback into the google doc. Are there any elements anyone 
wants to continue discussing, or things I have not fully addressed? I’ll take 
an absence of response as lazy consensus to commit the changes to the wiki.







From: bened...@apache.org 
Date: Monday, 14 March 2022 at 09:41
To: dev@cassandra.apache.org 
Subject: Updating our Code Contribution/Style Guide

Our style guide hasn’t been updated in about a decade, and I think it is 
overdue some improvements that address some shortcomings as well as modern 
facilities such as streams and lambdas.



Most of this was put together for an effort Dinesh started a few years ago, but 
has languished since, in part because the project has always seemed to have 
other priorities. I figure there’s never a good time to raise a contended 
topic, so here is my suggested update to contributor guidelines:



https://docs.google.com/document/d/1sjw0crb0clQin2tMgZLt_ob4hYfLJYaU4lRX722htTo



Many of these suggestions codify norms already widely employed, sometimes in 
spite of the style guide, but some likely remain contentious. Some potentially 
contentious things to draw your attention to:



  *   Deemphasis of getX() nomenclature, in favour of richer set of prefixes 
and more succinct simple x() to retrieve where clear
  *   Avoid implementing methods, incl. equals(), hashCode() and toString(), 
unless actually used
  *   Modified new-line rules for multi-line function calls
  *   External dependency rules (require DISCUSS thread before introducing)







Re: Updating our Code Contribution/Style Guide

2022-05-13 Thread bened...@apache.org
It’s been a couple of months since I opened this discussion. I think I have 
integrated the feedback into the google doc. Are there any elements anyone 
wants to continue discussing, or things I have not fully addressed? I’ll take 
an absence of response as lazy consensus to commit the changes to the wiki.



From: bened...@apache.org 
Date: Monday, 14 March 2022 at 09:41
To: dev@cassandra.apache.org 
Subject: Updating our Code Contribution/Style Guide
Our style guide hasn’t been updated in about a decade, and I think it is 
overdue some improvements that address some shortcomings as well as modern 
facilities such as streams and lambdas.

Most of this was put together for an effort Dinesh started a few years ago, but 
has languished since, in part because the project has always seemed to have 
other priorities. I figure there’s never a good time to raise a contended 
topic, so here is my suggested update to contributor guidelines:

https://docs.google.com/document/d/1sjw0crb0clQin2tMgZLt_ob4hYfLJYaU4lRX722htTo

Many of these suggestions codify norms already widely employed, sometimes in 
spite of the style guide, but some likely remain contentious. Some potentially 
contentious things to draw your attention to:


  *   Deemphasis of getX() nomenclature, in favour of richer set of prefixes 
and more succinct simple x() to retrieve where clear
  *   Avoid implementing methods, incl. equals(), hashCode() and toString(), 
unless actually used
  *   Modified new-line rules for multi-line function calls
  *   External dependency rules (require DISCUSS thread before introducing)
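As a hypothetical illustration of the first bullet (invented class, not from the codebase), a succinct x() accessor replaces getX() where the meaning is already clear from context:

```java
// Invented example of the succinct accessor style: value() rather than
// getValue() where the name alone makes the intent obvious.
final class Token
{
    private final long value;

    Token(long value)
    {
        this.value = value;
    }

    // instead of getValue()
    long value()
    {
        return value;
    }
}
```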





Re: How we flag tickets as blockers during freeze

2022-05-09 Thread bened...@apache.org
I think this is close to what we settled on last we hashed this out.

From: Josh McKenzie 
Date: Monday, 9 May 2022 at 22:47
To: dev@cassandra.apache.org 
Subject: Re: How we flag tickets as blockers during freeze
As you mentioned on slack, we can introduce FixVersions for the unreleased 
interim versions specified in the lifecycle wiki 
(https://cwiki.apache.org/confluence/display/CASSANDRA/Release+Lifecycle), so 
add the following specific unresolved placeholder FixVersions:

4.1-alpha
4.1-beta
4.1-rc

Thus anything unresolved flagged 4.1-alpha would be a blocker for that, same 
for beta and rc. When the tickets are closed, we switch them to FixVersion 4.1; 
I don't see there being much value in knowing in the future if a ticket is 
fixed during the alpha, beta, or rc phases by using the above as resolved 
FixVersions.

This approach potentially breaks down if we have any final blockers on 4.1 ga, 
but could just cycle through 4.1-rc until it's all clear.

On Mon, May 9, 2022, at 5:07 PM, Mick Semb Wever wrote:
Any other opinions or ideas out there? Would like to tidy our tickets up as 
build lead and scope out remaining work for 4.1.


My request is that we don't overload fixVersions. That is, a fixVersion is 
either for resolved tickets, or a placeholder for unresolved, but never both.
This makes it easier with jira hygiene post release, ensuring issues do get 
properly assigned their correct fixVersion. (This work can be many tickets and 
already quite cumbersome, but it is valued by users.)

It would also be nice to try to keep what counts as a placeholder fixVersion 
as intuitive as possible. The easiest way I see us doing this is to avoid using 
patch numbers. This rules out Option 1.

The use of 4.0 and 4.1 as resolved fixVersions kinda breaks the above notion 
of "if it doesn't have a patch version then it's a placeholder". The precedent 
here is that all resolved tickets before the first .0 of a major get this 
short-hand version (often in addition to the alpha1, beta1, rc1 fixVersions).





Re: Code freeze starts 1st May. Anything to be addressed?

2022-04-27 Thread bened...@apache.org
> The same backward compatibility mechanism needed for system-provided UUIDs 
> will work for user-provided UUIDs.

By ignoring them, and assigning a different one? That seems confusing, and like 
the feature will in effect be short lived.

It’s a very different problem to upgrade a set of IDs just once that we control 
unilaterally, and another to sensible handle some user input.

I should also note that collision detection is harder than you think. It needs 
to be reliable, which means we need to use distributed consensus to allocate 
these ids; it can’t just involve our usual “look in gossip” approach. So 
collision detection by itself is not a small thing to deliver in a few days, IMO.

From: Paulo Motta 
Date: Wednesday, 27 April 2022 at 19:09
To: Cassandra DEV 
Subject: Re: Code freeze starts 1st May. Anything to be addressed?
> One reason might be compatibility – this may (I hope _will_) migrate to a 
> simple integer of low cardinality in future, which would be a breaking change.

I look forward to this change, but won't we need to implement some backward 
compatibility handling for legacy UUIDs anyway? The same backward compatibility 
mechanism needed for system-provided UUIDs will work for user-provided UUIDs.

> This identifier will likely be used by Accord for correctness, too, and doing 
> something wrong with it could have severe consequences, so at the very least 
> it should be hard to access.

The only potential issue I see is a host_id collision, which is easily 
fixable by a simple collision check.

> We could of course have two different host ids, one for the user to set to 
> identify the host in some way for them, and another one for internal usage, 
> but I’m not sure that’s a great idea.

I don't think we need to keep the ability to set a host ID if we change the ID 
representation, since it will be incompatible with externally-provided UUIDs. 
We can just remove the feature and call it a day since the new system will 
warrant a major version update anyway.
To be clear, I don't oppose reverting this if there are concerns about it.

On Wed, 27 Apr 2022 at 14:51, bened...@apache.org wrote:
One reason might be compatibility – this may (I hope _will_) migrate to a 
simple integer of low cardinality in future, which would be a breaking change. 
This identifier will likely be used by Accord for correctness, too, and doing 
something wrong with it could have severe consequences, so at the very least it 
should be hard to access.

We could of course have two different host ids, one for the user to set to 
identify the host in some way for them, and another one for internal usage, but 
I’m not sure that’s a great idea.

From: Paulo Motta <pauloricard...@gmail.com>
Date: Wednesday, 27 April 2022 at 18:20
To: Cassandra DEV <dev@cassandra.apache.org>
Subject: Re: Code freeze starts 1st May. Anything to be addressed?
Fully agree we should add a collision check but I don't understand why this 
optional feature is bad/dangerous after we add this ability? Can you provide an 
example of a potential issue?
I don't expect this property to be used by most users, except power users which 
normally know what they're doing. We have tons of potentially dangerous knobs 
and I don't get why this particular one is any different.

On Wed, 27 Apr 2022 at 14:05, Sam Tunnicliffe <s...@beobal.com> wrote:
CASSANDRA-14582 added support for users to supply an arbitrary value for 
HOST_ID when booting a new node. IMO it's a pretty bad and potentially 
dangerous idea for the unique identifier to be settable in this way. Hint 
delivery is already routed by host id and there have been several JIRAs which 
have called for more fundamental reworking of cluster metadata using permanent 
opaque identifiers rather than IPs to address members (CASSANDRA-11559, 
CASSANDRA-15823, etc). Using host id for anything like that in future would be 
made much more difficult with this capability.

Aside from the longer term implications, it seems that the feature as currently 
implemented has some issues. There doesn't appear to be any validation that a 
supplied host id isn't already in use by a live node, so it's trivial to 
trigger a collision which can lead to divergent ring views between nodes and 
ultimately in data loss.

Although this landed in trunk almost 11 months ago it hasn't been included in a 
release yet, so I propose we revert it before cutting 4.1 (although, as the 
revert isn't a feature, I guess technically we could do that during the 
freeze). I'm not completely convinced about encoding metadata into host ids, 
but even if that is something we want to do, I don't think it's wise to 
completely remove control over the identifiers from Cassandra itself.

Thanks,
Sam

On 25 Apr 2022, at 16:17, Ekaterina Dimitrova <e.dimitr...@gmail.com> wrote:

Re: Code freeze starts 1st May. Anything to be addressed?

2022-04-27 Thread bened...@apache.org
One reason might be compatibility – this may (I hope _will_) migrate to a 
simple integer of low cardinality in future, which would be a breaking change. 
This identifier will likely be used by Accord for correctness, too, and doing 
something wrong with it could have severe consequences, so at the very least it 
should be hard to access.

We could of course have two different host ids, one for the user to set to 
identify the host in some way for them, and another one for internal usage, but 
I’m not sure that’s a great idea.

From: Paulo Motta 
Date: Wednesday, 27 April 2022 at 18:20
To: Cassandra DEV 
Subject: Re: Code freeze starts 1st May. Anything to be addressed?
Fully agree we should add a collision check, but I don't understand why this 
optional feature is bad/dangerous after we add this ability? Can you provide an 
example of a potential issue?
I don't expect this property to be used by most users, except power users, who 
normally know what they're doing. We have tons of potentially dangerous knobs 
and I don't get why this particular one is any different.

On Wed, 27 Apr 2022 at 14:05, Sam Tunnicliffe <s...@beobal.com> wrote:
CASSANDRA-14582 added support for users to supply an arbitrary value for 
HOST_ID when booting a new node. IMO it's a pretty bad and potentially 
dangerous idea for the unique identifier to be settable in this way. Hint 
delivery is already routed by host id and there have been several JIRAs which 
have called for more fundamental reworking of cluster metadata using permanent 
opaque identifiers rather than IPs to address members (CASSANDRA-11559, 
CASSANDRA-15823, etc). Using host id for anything like that in future would be 
made much more difficult with this capability.

Aside from the longer term implications, it seems that the feature as currently 
implemented has some issues. There doesn't appear to be any validation that a 
supplied host id isn't already in use by a live node, so it's trivial to 
trigger a collision which can lead to divergent ring views between nodes and 
ultimately in data loss.

Although this landed in trunk almost 11 months ago it hasn't been included in a 
release yet, so I propose we revert it before cutting 4.1 (although, as the 
revert isn't a feature, I guess technically we could do that during the 
freeze). I'm not completely convinced about encoding metadata into host ids, 
but even if that is something we want to do, I don't think it's wise to 
completely remove control over the identifiers from Cassandra itself.

Thanks,
Sam


On 25 Apr 2022, at 16:17, Ekaterina Dimitrova <e.dimitr...@gmail.com> wrote:

Hi everyone,

Kind reminder that 1st May is around the corner. What does this mean? Our code 
freeze starts on 1st May and my understanding is that only bug fixing can go 
into the 4.1 branch.
If anyone has anything to raise, now is a good time. On my end I saw a few 
things for this week that we should probably put to completion:
- CASSANDRA-17571 - I 
have to close this one, it is in progress; the new types in Config would be good 
to have in before the freeze, I guess, even if it is not a yaml change
- CASSANDRA-17557 - we 
need to take care of the parameters so we don't have to deprecate and support 
anything not actually needed; I think it is probably more or less done
- CASSANDRA-17379 - adds 
a new flag around config; I think it is more or less done, depends on final CI, 
and a second reviewer may be needed?
- JMX intercept Cassandra exceptions, I think David mentioned a rebase was 
needed
- CASSANDRA-17212 - The config property minimum_keyspace_rf and its nodetool 
getter and setter commands are new to 4.1. They are suitable to be ported to 
guardrails, and if we do this port in 4.1 we won't need to deprecate that 
property and its nodetool commands in the next release, just one release after 
their introduction.

I guess the failing tests we see could be fixed after the freeze, but no API 
changes.

Thanks everyone for all the hard work. Please don’t hesitate to raise the flag 
with questions, concerns or any help needed.

Best regards,
Ekaterina



Re: UDF: adding custom jar to classpath

2022-04-06 Thread bened...@apache.org
The property you are setting permits some kinds of privilege escalation, but by 
default classes outside of those pre-defined by the whitelist are not 
permitted. This is imposed here: 
https://github.com/apache/cassandra/blob/210793f943dc522161fd26b6192f38a5c83fa131/src/java/org/apache/cassandra/cql3/functions/UDFunction.java#L168

You will need to modify the source code to e.g. add additional allowedPatterns, 
or perhaps to permit additional patterns to be configured at startup.
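As a rough illustration of the kind of whitelist check being described (this is a hypothetical sketch, not the actual code in UDFunction.java; the real pattern lists and matching logic live at the link above), the idea is: a class name is rejected if it matches any disallowed pattern, and otherwise accepted only if it matches an allowed pattern:

```java
// Illustrative sketch of a UDF class whitelist: deny-patterns are checked
// first, then the name must match an allow-pattern to be usable at all.
class UdfClassFilter
{
    // hypothetical pattern lists; Cassandra's real ones are defined in
    // UDFunction.java and use JVM-internal (slash-separated) names
    private static final String[] ALLOWED =
        { "java/lang/", "java/util/", "org/apache/cassandra/cql3/functions/types/" };
    private static final String[] DISALLOWED =
        { "java/lang/reflect/", "java/lang/Runtime" };

    static boolean isSecure(String internalName)
    {
        for (String deny : DISALLOWED)
            if (internalName.startsWith(deny))
                return false;              // explicitly blocked
        for (String allow : ALLOWED)
            if (internalName.startsWith(allow))
                return true;               // explicitly permitted
        return false;                      // everything else is rejected
    }
}
```

Under this scheme a custom jar on the classpath is visible to the JVM but still unusable from a UDF, which is why modifying the allowed patterns in source (or making them configurable) is the only way through.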

From: Sébastien Rebecchi 
Date: Wednesday, 6 April 2022 at 15:15
To: dev@cassandra.apache.org , e.dimitr...@gmail.com 

Cc: ble...@apache.org 
Subject: Re: UDF: adding custom jar to classpath
Hi Ekaterina,

I use 4.0.1.
But as I said I added a jar in classpath (/usr/share/cassandra/lib/ folder on 
every node) and I see that the jar is loaded in the classpath from the 
Cassandra command line. And I have "enable_user_defined_functions: true" and 
"enable_user_defined_functions_threads: false" in cassandra.yaml.
So I don't see what is missing or not done properly.

Best regards,
Sébastien.

On Wed, 6 Apr 2022 at 16:03, Ekaterina Dimitrova <e.dimitr...@gmail.com> wrote:
Hi Sebastian,
Do you use the latest 4.0.3 version? Those options were added in 4.0.2 I 
believe, so if you try them with an earlier version, the message below is what 
you would get, as they didn't exist.

Best regards,
Ekaterina

On Wed, 6 Apr 2022 at 9:53, Sébastien Rebecchi <srebec...@kameleoon.com> wrote:
Hi Benjamin, Hi everybody,

I found in the documentation that we should add "allow_insecure_udfs: true" and 
optionally "allow_extra_insecure_udfs: true" so that 
"enable_user_defined_functions_threads: false" is really taken into account (I 
understood like that). That would explain why my UDF still does not run even 
with "enable_user_defined_functions_threads: false". Found in 
https://github.com/apache/cassandra/blob/cassandra-4.0/NEWS.txt

So I tried to add "allow_insecure_udfs: true" and "allow_extra_insecure_udfs: 
true" in cassandra.yaml, but then Cassandra failed to restart and I got that 
error in logs "Exception 
(org.apache.cassandra.exceptions.ConfigurationException) encountered during 
startup: Invalid yaml. Please remove properties [allow_insecure_udfs, 
allow_extra_insecure_udfs] from your cassandra.yaml".

Should I understand that we can activate those 2 extra configs only by changing 
the source code? That would be really disappointing :( And if not, then how can 
we activate all UDF possibilities from cassandra.yaml, please?

Thanks in advance,

Sébastien.


On Tue, 5 Apr 2022 at 10:36, Benjamin Lerer <ble...@apache.org> wrote:
Unfortunately, I do not have much time for doing some digging. Sorry for that 
:-(
You should look at JavaBasedUDFunction and UDFExecutorService.

On Mon, 4 Apr 2022 at 17:25, Sébastien Rebecchi <srebec...@kameleoon.com> wrote:
Hi!
Do you have any more ideas for me?
Cordially,
Sébastien.

On Mon, 28 Mar 2022 at 16:39, Sébastien Rebecchi <srebec...@kameleoon.com> wrote:
Unfortunately, it is not working even with 
"enable_user_defined_functions_threads: false" in cassandra.yaml :/
Is there any way to check the running configuration?

On Mon, 28 Mar 2022 at 15:35, Benjamin Lerer <ble...@apache.org> wrote:
I do not think that allowing customization of the UDF class whitelist has been 
discussed before. Feel free to open a JIRA ticket :-)
I have some plans to revisit how we secure UDFs, as the current threading 
approach has some impact in terms of latency. That could be a good opportunity 
to look into providing more flexibility.

On Mon, 28 Mar 2022 at 15:00, Sébastien Rebecchi <srebec...@kameleoon.com> wrote:
Thank you very much! I will try that.
As far as you know, would it be a long-term solution? Or is there any plan to 
add the possibility to customize the UDF class whitelist?

On Mon, 28 Mar 2022 at 14:31, Benjamin Lerer <ble...@apache.org> wrote:
Is there a way to customize that default behaviour?

Looking at JavaBasedUDFunction quickly it seems that the ClassLoader is only 
used when you use the UDFExecutorService to execute your UDFs. You can try to 
disable it using "enable_user_defined_functions_threads: false" and see if it 
works.
Now that also means that you have to ensure that only trusted persons can 
create UDFs or UDAs, as it removes all safety mechanisms.

On Mon, 28 Mar 2022 at 13:23, Sébastien Rebecchi <srebec...@kameleoon.com> wrote:
Hi Benjamin,

Thanks for the answer.
Is there a way to customize that default behaviour? If not, could you indicate 
where to find this class loader in the Cassandra GitHub repository, please?

On Mon, 28 Mar 2022 at 12:40, Benjamin Lerer <ble...@apache.org> wrote:
Hi Sébastien,

Cassandra uses a special classloader for UDFs that limits which classes can be 
used.
You cannot rely on non-JDK classes for UDFs, and some of the JDK packages, like 
the IO package for example, cannot be used.
The goal is simply to 

Re: [GitHub] [cassandra-website] ossarga commented on a diff in pull request #121: move and prepare files for content folder

2022-04-04 Thread bened...@apache.org
I just did the same for cassandra-accord, I guess some config was lost in the 
upgrade

https://issues.apache.org/jira/projects/INFRA/issues/INFRA-23074


From: Mick Semb Wever 
Date: Monday, 4 April 2022 at 08:55
To: dev@cassandra.apache.org 
Subject: Re: [GitHub] [cassandra-website] ossarga commented on a diff in pull 
request #121: move and prepare files for content folder

these notifications should be going to pr@

have created https://issues.apache.org/jira/browse/INFRA-23073

On Mon, 4 Apr 2022 at 03:09, GitBox <g...@apache.org> wrote:

ossarga commented on code in PR #121:
URL: https://github.com/apache/cassandra-website/pull/121#discussion_r841307662


##
site-content/docker-entrypoint.sh:
##
@@ -196,14 +196,87 @@ generate_site_yaml() {

 render_site_content_to_html() {
   pushd "${CASSANDRA_WEBSITE_DIR}/site-content" > /dev/null
-  echo "Building the site HTML content."
+  log_message "INFO" "Building the site HTML content."
   antora --generator antora-site-generator-lunr site.yaml
-  echo "Rendering complete!"
+  log_message "INFO" "Rendering complete!"
   popd > /dev/null
 }

+
+prepare_site_html_for_publication() {
+  pushd "${CASSANDRA_WEBSITE_DIR}" > /dev/null
+
+  # copy everything to content/ directory
+  log_message "INFO" "Moving site HTML to content/"
+  mkdir -p content/doc
+  cp -r site-content/build/html/* content/
+
+  # remove hardcoded domain name, and empty domain names first before we 
duplicate and documentation
+  content_files_to_change=($(grep -rl 'https://cassandra.apache.org/' 
content/))
+  log_message "INFO" "Removing hardcoded domain names in 
${#content_files_to_change[*]} files"
+  for content_file in ${content_files_to_change[*]}
+  do
+log_message "DEBUG" "Processing file ${content_file}"
+# sed automatically uses the character following the 's' as a delimiter.
+# In this case we will use the ',' so we can avoid the need to escape the 
'/' characters
+sed -i 's,https://cassandra.apache.org/,/,g' ${content_file}
+  done
+
+  content_files_to_change=($(grep -rl 'href="//' content/))
+  log_message "INFO" "Removing empty domain names in 
${#content_files_to_change[*]} files"
+  for content_file in ${content_files_to_change[*]}
+  do
+log_message "DEBUG" "Processing file ${content_file}"
+sed -i 's,href="//,href="/,g' ${content_file}
+  done
+
+  # move around the in-tree docs if generated
+  if [ "${COMMAND_GENERATE_DOCS}" = "run" ]
+  then
+log_message "INFO" "Moving versioned documentation HTML to content/doc"
+move_intree_document_directories "3.11" "3.11.11" "3.11.12"
+move_intree_document_directories "4.0" "4.0.0" "4.0.1" "4.0.2" "4.0.3" 
"stable"
+move_intree_document_directories "trunk" "4.1" "latest"

Review Comment:
   Agreed. I have been thinking about this for some time. It will be addressed 
in [CASSANDRA-17517](https://issues.apache.org/jira/browse/CASSANDRA-17517).



--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: 
dev-unsubscr...@cassandra.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Re: [DISCUSS] Should we deprecate / freeze python dtests

2022-03-29 Thread bened...@apache.org
> a well-defined path to reduce/eliminate code duplication and basic 
> documentation for newcomers to get up to speed with writing in-jvm dtests and 
> extending the framework

Are python tests much better here? If not, I do not see why these should be 
blockers for their deprecation.

Perfect feature parity also seems unnecessary - unless a missing feature is an 
active impediment. But as far as I know every missing feature is actively under 
development and can be expected very soon.

Let’s get this decision over and done with.


From: Paulo Motta 
Date: Wednesday, 30 March 2022 at 00:46
To: Cassandra DEV 
Subject: Re: [DISCUSS] Should we deprecate / freeze python dtests
I support deprecating python dtests, as long as in-jvm dtests have feature 
parity with python dtests, a well-defined path to reduce/eliminate code 
duplication and basic documentation for newcomers to get up to speed with 
writing in-jvm dtests and extending the framework.

On Tue, 29 Mar 2022 at 20:09, bened...@apache.org <bened...@apache.org> wrote:
It often does not work. I can attest to many wasted weeks, on some environments 
never getting them to work.

They happen to work right now for me, though.

I think the learning curve thing is a bit of a distraction, personally. I have 
always found python dtests hard to work with, both developing against and 
running, so their learning curve for me is going on 10 years. Some folk may be 
more comfortable with python dtests due to their familiarity with python, ccm 
or other tooling, but that is a different matter.

Looking at git, most contributors to python dtests are contributors to in-jvm 
dtests, and the latter have received 20x as many net code contributions over 
the past year.

I think it’s quite justified to just say in-jvm dtests are simply better to 
work with, and already better and more widely used despite their youth, 
whatever their remaining teething problems.

I vote we immediately discontinue python dtest development, and discontinue 
running python dtests pre-commit, retaining them for releases only. This will 
provide the necessary impetus to polish off any last remaining gaps, without 
reducing coverage.

From: Brandon Williams <dri...@gmail.com>
Date: Tuesday, 29 March 2022 at 23:42
To: dev <dev@cassandra.apache.org>
Subject: Re: [DISCUSS] Should we deprecate / freeze python dtests
> In fact there is a high learning curve to setup cassandra-dtest environment

I think this is fairly well documented:
https://github.com/apache/cassandra-dtest/blob/trunk/README.md

On Tue, Mar 29, 2022 at 5:27 PM Paulo Motta <pauloricard...@gmail.com> wrote:
>
> > I am curious about this comment.  When I first joined I learned jvm-dtest 
> > within an hour and started walking Repair code in a debugger (and this was 
> > way before the improvements that let us do things like nodetool)… python 
> > dtest took weeks to get working correctly (still having issues with the 
> > MBean library we use… so have to comment out error handling to get some 
> > tests to pass)….
>
> Thanks for sharing your perspective. In fact there is a high learning curve 
> to set up the cassandra-dtest environment, but once it's working it's pretty 
> straightforward to test any existing or new functionality.
>
> I think with in-jvm dtests you don't have the hassle of setting up a 
> different environment and this is a great motivator to standardize on this 
> solution. The main difficulty I had was testing features not supported by the 
> framework, which require you to extend the framework. I don't recall having 
> to extend ccm/cassandra-dtest many times when working on new features.
>
> Perhaps this has improved recently and we no longer need to worry about 
> extending the framework or duplicating code when testing new functionality.
>
> On Tue, 29 Mar 2022 at 15:12, Ekaterina Dimitrova <e.dimitr...@gmail.com> wrote:
>>
>> One thing that we can add to docs is for people how to update the in-jvm 
>> framework and test their patches before asking for in-jvm api release. The 
>> assumption is those won’t be many updates needed I think, but it is good to 
>> be documented.
>>
>> On Tue, 29 Mar 2022 at 13:51, David Capwell <dcapw...@apple.com> wrote:
>>>
>>> They use a separate implementation of instance initialization and thus they 
>>> test the test server rather than the real node.
>>>
>>>
>>> I think we can get rid of this by extending CassandraDaemon, just need to 
>>> add a few hooks to mock out gossip/internode/client (for cases where the 
>>> mocks are desired), and when mocks are not desired just run the real logic.
>>>
>>> Too many t

Re: [DISCUSS] Should we deprecate / freeze python dtests

2022-03-29 Thread bened...@apache.org
>>> we should at least write extensive documentation on how to use/modify 
>>> in-jvm dtest framework before deprecating python dtests.
>>>
>>> We should have this for all our testing frameworks period, in-jvm dtest, 
>>> python dtest, and ccm. They're woefully under-documented IMO.
>>>
>>> On Tue, Mar 29, 2022, at 6:11 AM, Paulo Motta wrote:
>>>
>>> To elaborate a bit on the steep learning curve point, when mentoring new 
>>> contributors on a couple of occasions I told them to "just write a python 
>>> dtest" because we had no idea on how to test that functionality on in-jvm 
>>> tests while the python dtest was fairly straightforward to implement (I 
>>> can't recall exactly what feature was it but I can dig if necessary).
>>>
>>> While we might be already familiar with the in-jvm dtest framework due to 
>>> our exposure to it, we shouldn't neglect that there is a significant 
>>> learning curve associated with it for new contributors which IMO is much 
>>> lower for pyhton dtests. So we should at least write extensive 
>>> documentation on how to use/modify in-jvm dtest framework before 
>>> deprecating python dtests.
>>>
>>> On Tue, 29 Mar 2022 at 06:58, Paulo Motta wrote:
>>>
>>> > They use a separate implementation of instance initialization and thus 
>>> > they test the test server rather than the real node.
>>>
>>> I also have this concern. When adding a new service on CASSANDRA-16789 we 
>>> had to explicitly modify the in-jvm dtest server to match the behavior from 
>>> the actual server [1] (this is just a minor example but I remember having 
>>> to do something similar on other tickets).
>>>
>>> Besides having a steep learning curve since users need to be familiar with 
>>> the in-jvm dtest framework in order to add new functionality not supported 
>>> by it, this is potentially unsafe, since the implementations can diverge 
>>> without being caught by tests.
>>>
>>> Is there any way we could avoid duplicating functionality on the test 
>>> server and use the same initialization code on in-jvm dtests?
>>>
>>> [1] - 
>>> https://github.com/apache/cassandra/commit/ad249424814836bd00f47931258ad58bfefb24fd#diff-321b52220c5bd0aaadf275a845143eb208c889c2696ba0d48a5fc880551131d8R735
>>>
>>> On Tue, 29 Mar 2022 at 04:22, Benjamin Lerer wrote:
>>>
>>> They use a separate implementation of instance initialization and thus they 
>>> test the test server rather than the real node.
>>>
>>>
>>> This is actually my main concern. What is the real gap between the in-JVM 
>>> tests server instance and a server as run by python DTests?
>>>
>>> On Tue, 29 Mar 2022 at 00:08, bened...@apache.org wrote:
>>>
>>> > Other than that, it can be problematic to test upgrades when the starting 
>>> > version must run with a different Java version than the end release
>>>
>>>
>>>
>>> python upgrade tests seem to be particularly limited (from a quick skim, 
>>> primarily testing major upgrade points that are now long in the past), so 
>>> I’m not sure how much of a penalty this is today in practice - but it might 
>>> well become a problem.
>>>
>>>
>>>
>>> There’s several questions to answer, namely how many versions we want to:
>>>
>>>
>>>
>>> - test upgrades across
>>>
>>> - maintain backwards compatibility of the in-jvm dtest api across
>>>
>>> - support a given JVM for
>>>
>>>
>>>
>>> However, if we need to, we can probably use RMI to transparently support 
>>> multiple JVMs for tests that require it. Since we already use serialization 
>>> to cross the ClassLoader boundary it might not even be very difficult.
>>>
>>>
>>>
>>>
>>>
>>> From: Jacek Lewandowski 
>>> Date: Monday, 28 March 2022 at 22:30
>>> To: dev@cassandra.apache.org 
>>> Subject: Re: [DISCUSS] Should we deprecate / freeze python dtests
>>>
>>> Although I like in-jvm DTests for many scenarios, I can see that they do 
>>> not test the production code as it is. They use a separate implementation 
>>> of instance initialization and thus they test the test server rather than 
>>> the real node. Other than that, it can be problematic to test upgrades wh

Re: [DISCUSS] Should we deprecate / freeze python dtests

2022-03-28 Thread bened...@apache.org
> Other than that, it can be problematic to test upgrades when the starting 
> version must run with a different Java version than the end release

python upgrade tests seem to be particularly limited (from a quick skim, 
primarily testing major upgrade points that are now long in the past), so I’m 
not sure how much of a penalty this is today in practice - but it might well 
become a problem.

There’s several questions to answer, namely how many versions we want to:

- test upgrades across
- maintain backwards compatibility of the in-jvm dtest api across
- support a given JVM for

However, if we need to, we can probably use RMI to transparently support 
multiple JVMs for tests that require it. Since we already use serialization to 
cross the ClassLoader boundary it might not even be very difficult.


From: Jacek Lewandowski 
Date: Monday, 28 March 2022 at 22:30
To: dev@cassandra.apache.org 
Subject: Re: [DISCUSS] Should we deprecate / freeze python dtests
Although I like in-jvm DTests for many scenarios, I can see that they do not 
test the production code as it is. They use a separate implementation of 
instance initialization and thus they test the test server rather than the real 
node. Other than that, it can be problematic to test upgrades when the starting 
version must run with a different Java version than the end release. One more 
thing I've been observing sometimes is high consumption of metaspace, which 
does not seem to be cleaned up after individual test cases. Given each started 
instance uses a dedicated class loader, there is some amount of trash left, and 
when there are a couple of multi-node test cases in a single test class, it 
sometimes happens that the tests fail with an out-of-memory-in-metaspace error.

Thanks,
Jacek

On Mon, Mar 28, 2022 at 10:06 PM David Capwell <dcapw...@apple.com> wrote:
I am back and the work for trunk to support vnode is at the last stage of 
review; I had not planned to backport the changes to other branches (aka, older 
branches would only support single token), so if someone would like to pick up 
this work it is rather LHF after 17332 goes in (see the trunk patch GH PR: 
https://github.com/apache/cassandra/pull/1432).

I am in favor of deprecating python dtests, and agree we should figure out what 
the gaps are (once vnode support is merged) so we can either shrink them or 
special case to unfreeze (such as startup changes being allowed).


On Mar 14, 2022, at 6:13 AM, Josh McKenzie <jmcken...@apache.org> wrote:

vnode support for in-jvm dtests is in flight and fairly straightforward:

https://issues.apache.org/jira/browse/CASSANDRA-17332

David's OOO right now but I suspect we can get this in in April some time.

On Mon, Mar 14, 2022, at 8:36 AM, bened...@apache.org wrote:
This is the limitation I mentioned. I think this is solely a question of 
supplying an initial config that uses vnodes, i.e. that specifies multiple 
tokens for each node. It is not really a limitation – I believe a dtest could 
be written today using vnodes, by overriding the config’s tokens. It does look 
like the token handling has been refactored since the initial implementation to 
make this a little uglier than should be necessary.

We should make this trivial, anyway, and perhaps offer a way to run all of the 
dtests with vnodes (and suitably annotating those that cannot be run with 
vnodes). This should be quite easy.



From: Andrés de la Peña <adelap...@apache.org>
Date: Monday, 14 March 2022 at 12:28
To: dev@cassandra.apache.org <dev@cassandra.apache.org>
Subject: Re: [DISCUSS] Should we deprecate / freeze python dtests
Last time I checked there wasn't support for vnodes on in-jvm dtests, which 
seems an important limitation.

On Mon, 14 Mar 2022 at 12:24, bened...@apache.org <bened...@apache.org> wrote:
I am strongly in favour of deprecating python dtests in all cases where they 
are currently superseded by in-jvm dtests. They are environmentally more 
challenging to work with, causing many problems on local and remote machines. 
They are harder to debug, slower, flakier, and mostly less sophisticated.

> all focus on getting the in-jvm framework robust enough to cover edge-cases

Would be great to collect gaps. I think it’s just vnodes, which is by no means 
a fundamental limitation? There may also be some stuff to do with 
startup/shutdown and environmental scripts; that may be a niche we retain 
something like python dtests for.

> people aren’t familiar

I would be interested to hear from these folk to understand their concerns or 
problems using in-jvm dtests, if there is a cohort holding off for this reason

> This is going to require documentation work from some of the original authors

I think a collection of template-like tests we can point people to would be a 
cheap i

Re: Updating our Code Contribution/Style Guide

2022-03-21 Thread bened...@apache.org
It looks like the doc already specified this behaviour for ternary operator 
line wrapping. For your proposal I’ve also added the following:

It is usually preferable to carry the operator for multiline expressions, with 
the exception of some multiline string literals.

Does that work for you? The “usually” at least leaves some wiggle room, as 
ultimately I would prefer this decision to be made by an author (even if a 
general norm of carrying the operator is preferable).

I am concerned that this starts leaning towards being too specific, though. It 
opens up questions like whether we should also be specifying spacing for loop 
guards, conditions, casts, etc?
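To make the carried-operator convention concrete, here is a made-up example (not from the codebase) of a wrapped ternary where each operator leads its line, per the wording added to the doc:

```java
// Illustration of carrying the operator on line wraps: the '?' and ':'
// of a wrapped ternary start each continuation line.
class WrapStyle
{
    static String describe(int count)
    {
        return count == 0
               ? "empty"
               : count == 1
                 ? "single"
                 : "many";
    }
}
```

The contrast is with trailing the operator (`count == 0 ?` at end of line), which buries the branch structure at the right margin.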


From: bened...@apache.org 
Date: Sunday, 20 March 2022 at 21:37
To: dev@cassandra.apache.org 
Subject: Re: Updating our Code Contribution/Style Guide
> We are talking about one extra line, not a dozen or more.

I think you are confused about the context. The case I was discussing often 
means 10+ additional lines at each call-site.

> Once the code gets more real, it is faster to read the difference between (a) 
> and (c)

This isn’t a great example, as if you are line-wrapping with multiple 
parameters you should be assigning any computation to a clearly named local 
variable before passing the result to the constructor (amongst other things). 
We can perhaps highlight this in the style guide.

We also do not only produce multi-line computations in this context. If 
constructing into a variable (where there is no such ambiguity), it is much 
easier to parse parameters first without the concatenation operator preceding 
them.

My point is simply that legislating on this kind of detail is a waste of our 
time, and probably counter-productive. I don’t want to enumerate all the 
possible ways we might construct multi-line computations.

Ternary operators are pretty clear, so maybe we can just agree to define those, 
and leave the rest to the judgement of the authors?


From: Mick Semb Wever 
Date: Sunday, 20 March 2022 at 20:56
To: dev@cassandra.apache.org 
Subject: Re: Updating our Code Contribution/Style Guide


> I support this too… leads to more noise in, and less readability of, the 
> patch.

Readability of the patch is not harmed with modern tooling (with whitespace 
being highlighted differently to content changes).

Legibility of the code (not patch) should always be preferred IMO. To aid code 
comprehension, we should aim for density of useful information for the reader; 
wasting a dozen or more lines on zero information density, solely to solve a 
problem already handled by modern diff tools, is a false economy.


We are talking about one extra line, not a dozen or more.
It also improves the readability of the code IMHO.


> I would also like to suggest that an operator should always carry on line 
> wraps

For the ternary operator I agree, however I am less convinced in other cases. 
String concatenation is probably cleaner with the opposite norm, so that string 
literals are aligned.


IMHO it works for string concatenation too.
The example that comes to mind is
a)

method(
    "a",
    "b",
    "c"
)

b)

method(
    "a" +
    "b" +
    "c"
)

c)

method(
    "a"
    + "b"
    + "c"
)

Once the code gets more real, it is faster to read the difference between (a) 
and (c) than between (a) and (b).





Re: Updating our Code Contribution/Style Guide

2022-03-20 Thread bened...@apache.org
> We are talking about one extra line, not a dozen or more.

I think you are confused about the context. The case I was discussing often 
means 10+ additional lines at each call-site.

> Once the code gets more real, it is faster to read the difference between (a) 
> and (c)

This isn’t a great example, as if you are line-wrapping with multiple 
parameters you should be assigning any computation to a clearly named local 
variable before passing the result to the constructor (amongst other things). 
We can perhaps highlight this in the style guide.

We also do not only produce multi-line computations in this context. If 
constructing into a variable (where there is no such ambiguity), it is much 
easier to parse parameters first without the concatenation operator preceding 
them.

My point is simply that legislating on this kind of detail is a waste of our 
time, and probably counter-productive. I don’t want to enumerate all the 
possible ways we might construct multi-line computations.

Ternary operators are pretty clear, so maybe we can just agree to define those, 
and leave the rest to the judgement of the authors?


From: Mick Semb Wever 
Date: Sunday, 20 March 2022 at 20:56
To: dev@cassandra.apache.org 
Subject: Re: Updating our Code Contribution/Style Guide


> I support this too… leads to more noise in, and less readability of, the 
> patch.

Readability of the patch is not harmed with modern tooling (with whitespace 
being highlighted differently to content changes).

Legibility of the code (not patch) should always be preferred IMO. To aid code 
comprehension, we should aim for density of useful information for the reader; 
wasting a dozen or more lines on zero information density, solely to solve a 
problem already handled by modern diff tools, is a false economy.


We are talking about one extra line, not a dozen or more.
It also improves the readability of the code IMHO.


> I would also like to suggest that an operator should always carry on line 
> wraps

For the ternary operator I agree, however I am less convinced in other cases. 
String concatenation is probably cleaner with the opposite norm, so that string 
literals are aligned.


IMHO it works for string concatenation too.
The example that comes to mind is
a)

method(
    "a",
    "b",
    "c"
)

b)

method(
    "a" +
    "b" +
    "c"
)

c)

method(
    "a"
    + "b"
    + "c"
)

Once the code gets more real, it is faster to read the difference between (a) 
and (c) than between (a) and (b).





Re: Updating our Code Contribution/Style Guide

2022-03-20 Thread bened...@apache.org
> I support this too… leads to more noise in, and less readability of, the 
> patch.

Readability of the patch is not harmed with modern tooling (with whitespace 
being highlighted differently to content changes).

Legibility of the code (not patch) should always be preferred IMO. To aid code 
comprehension, we should aim for density of useful information for the reader; 
wasting a dozen or more lines on zero information density, solely to solve a 
problem already handled by modern diff tools, is a false economy.


> I also agree that several arguments on the one line should be avoided, that 
> too many method parameters is the problem here.

Method parameters aren’t a problem if they are strongly typed, parameters are 
cleanly grouped, and all call-sites utilise approximately the same behaviour. 
The alternatives are builders or mutability, the latter of which we broadly 
avoid. Builders make code navigation clunkier (creating more indirection to 
navigate when searching callers), as well as potentially creating additional 
heap pressure and code pollution (both in the call sites and the builder 
itself).

Builders are helpful when there are a lot of different ways to configure an 
object, but in the more common case of simply propagating a relevant subset of 
existing parameters (plus perhaps a couple of new but required parameters) at 
just a handful of equivalent call-sites, they are IMO unhelpful.

Note also that builders have the exact same legibility concerns as 
parameter-per-line: a significant amount of screen real-estate is taken up by 
scaffolding/noise. This is only useful if the builder call-sites communicate 
some unique configuration details about the particular call-site.

> I would also like to suggest that an operator should always carry on line 
> wraps

For the ternary operator I agree, however I am less convinced in other cases. 
String concatenation is probably cleaner with the opposite norm, so that string 
literals are aligned.



From: Mick Semb Wever 
Date: Sunday, 20 March 2022 at 19:55
To: dev@cassandra.apache.org 
Subject: Re: Updating our Code Contribution/Style Guide


On Tue, 15 Mar 2022 at 11:46, Ruslan Fomkin <ruslan.fom...@gmail.com> wrote:
…
I support Jacek’s request to have each argument on a separate line when there 
are many and they need to be placed on multiple lines. For me it takes less 
effort to grasp arguments on separate lines than when several arguments are 
combined on the same line. IMHO the root cause is having too many arguments, 
which is a common issue for non-OOP languages.


I support this too. It has always bugged me that with the first parameter not 
on a newline, a change to the method (or assigned variable) name forces all 
subsequent wrapped parameter lines to be re-indented, which leads to more 
noise in, and less readability of, the patch.

For example,

method(
    "a",
    "b",
    "c"
)

is better than

method("a",
       "b",
       "c"
)

I also agree that several arguments on the one line should be avoided, and 
that too many method parameters is the real problem here.

I would also like to suggest that an operator should always carry on line 
wraps. This makes it faster to read the difference between arguments per line 
wrapping, and operations over multiple lines.
For example, ensuring it looks like

var = bar == null
      ? doFoo()
      : doBar();

and not

var = bar == null ?
      doFoo() :
      doBar();




Re: Using labels on pull requests in GitHub

2022-03-16 Thread bened...@apache.org
+1, let’s change our merge strategy 


From: Josh McKenzie 
Date: Wednesday, 16 March 2022 at 12:47
To: dev@cassandra.apache.org 
Subject: Re: Using labels on pull requests in GitHub
I think the fact that they pile up is because our merge strategy means we don't 
actually merge using the PR's we use for review so there's nothing codified in 
the workflow to close them out when a ticket's done.

An easy fix would be to change our merge strategy and use the merge button on 
PR's to merge things in so they auto-close. :)

(/grinding my axe)

On Wed, Mar 16, 2022, at 7:27 AM, Paulo Motta wrote:
Thanks for doing this Stefan.

The fact that PRs are abandoned and piling up on GitHub demonstrates a hygiene 
problem and creates a bad user experience for newcomers who are accustomed to 
the GitHub workflow. I'm supportive of any initiative to improve this.

I think starting to label PRs manually and then looking into ways to automate 
this would be a good improvement over the status quo.



Re: Using labels on pull requests in GitHub

2022-03-16 Thread bened...@apache.org
Since PRs are a second-class citizen to Jira, mostly used as a scratch pad for 
nits and questions with code context, I suspect any improvements here will need 
to be automated to have any hope of success.

From: Stefan Miklosovic 
Date: Wednesday, 16 March 2022 at 08:16
To: dev@cassandra.apache.org 
Subject: Re: Using labels on pull requests in GitHub
Yeah, what I see quite frequently is that people come over, they open
a PR without any related JIRA ticket, they just drop it there and never
return, hence these PRs are in constant limbo: not in JIRA, and more
often than not left behind completely. Creating categories would at
least provide some minimal visibility of where we are at.

On Wed, 16 Mar 2022 at 09:07, Erick Ramirez  wrote:
>
> +1 it's a great idea. I have to admit that I don't go through the PRs and I 
> only pay attention to tickets so if doc PRs are "orphans" (don't have 
> associated tickets), I don't ever work on them. I'll aim to do this when I 
> have bandwidth. Cheers! 
>
> On Wed, 16 Mar 2022 at 19:02, Stefan Miklosovic 
>  wrote:
>>
>> Hello,
>>
>> Is somebody fundamentally opposing the idea of applying labels to pull
>> requests when applicable? I went through the pull requests and it
>> would be nice to have some basic filters, like "show me all pull
>> requests related to documentation" would be labeled as "docs", then
>> PRs fixing some tests would be "tests" and so on. We may further
>> narrow it down for subsystems etc.
>>
>> I do not mind applying myself in this to tag the PRs as they come if
>> people do not tag it themselves in order to have at least some basic
>> "filterability". As I went through PRs closing already committed ones,
>> I noticed there are a lot of PRs related to documentation which just
>> tend to be completely forgotten in the long run.
>>
>> Does this make sense to people?
>>
>> Regards
>>
>> Stefan


Re: [FOR REVIEW] Blog post: An Interview with Project Contributor, Lorina Poland

2022-03-16 Thread bened...@apache.org
+1

From: Erick Ramirez 
Date: Tuesday, 15 March 2022 at 22:08
To: dev@cassandra.apache.org 
Subject: Re: [FOR REVIEW] Blog post: An Interview with Project Contributor, 
Lorina Poland
Looks good to me! 

On Wed, 16 Mar 2022 at 08:17, Chris Thornett <ch...@constantia.io> wrote:
As requested, I'm posting content contributions for community review on the ML 
for those that might not spot them on Slack.

We're currently mid-review for our first contributor Q&A, which is with Lorina 
Poland:
https://docs.google.com/document/d/1nnH4V1XvTcfTeeUdZ_mjSxlNlWTbSXFu_qKtJRQUFBk/edit.
Please add edits or suggestions as comments.

Thanks!
--

Chris Thornett
senior content strategist, Constantia.io
ch...@constantia.io


Re: Updating our Code Contribution/Style Guide

2022-03-15 Thread bened...@apache.org
I’d be fine with that, though I think if we want to start enforcing imports we 
probably want to mass correct them first. It’s not like other style 
requirements in that there should not be unintended consequences. A single 
(huge) commit to standardise the orders and introduce a build-time check would 
be fine IMO.

I also don’t really think it is that important.

From: Jacek Lewandowski 
Date: Tuesday, 15 March 2022 at 05:18
To: dev@cassandra.apache.org 
Subject: Re: Updating our Code Contribution/Style Guide
I do think that we should at least enforce the import order. What we have now 
is a complete mess and causes a lot of conflicts during rebasing / merging. 
Perhaps we could start enforcing such rules only on modified files; this way we 
could gradually move towards consistency... wdyt?

- - -- --- -  -
Jacek Lewandowski


On Tue, Mar 15, 2022 at 1:52 AM Dinesh Joshi <djo...@apache.org> wrote:
Benedict, I agree. We should not be rigid about applying any style. stylechecks 
are meant to bring uniformity in the codebase. I assure you what I am proposing 
is neither rigid nor curbs the ability to apply the rules flexibly.


On Mar 14, 2022, at 4:52 PM, bened...@apache.org wrote:

I’m a strong -1 on strictly enforcing any style guide. It is there to help 
shape contributions, review feedback and responding to said feedback. It can 
also be used to setup IntelliJ’s code formatter to configure default behaviours.

It is not meant to be turned into a linter. Plenty of the rules are stated in a 
flexible manner, so as to permit breaches where overall legibility and 
aesthetics are improved.


From: Dinesh Joshi 
Date: Monday, 14 March 2022 at 23:44
To: dev@cassandra.apache.org 
Subject: Re: Updating our Code Contribution/Style Guide
I am also in favor of updating the style guide. We should ideally have custom 
checkstyle configuration that can ensure adherence to the style guide.

I also don't think this is a contended topic. We want to explicitly codify our 
current practices so new contributors have an easier time writing code.

It is also important to note that the current codebase is not consistent since 
it was written over a long period of time so it tends to confuse folks who are 
working in different parts of the codebase. So this style guide would be very 
helpful.

On Mar 14, 2022, at 2:41 AM, bened...@apache.org wrote:

Our style guide hasn’t been updated in about a decade, and I think it is 
overdue some improvements that address some shortcomings as well as modern 
facilities such as streams and lambdas.

Most of this was put together for an effort Dinesh started a few years ago, but 
has languished since, in part because the project has always seemed to have 
other priorities. I figure there’s never a good time to raise a contended 
topic, so here is my suggested update to contributor guidelines:

https://docs.google.com/document/d/1sjw0crb0clQin2tMgZLt_ob4hYfLJYaU4lRX722htTo

Many of these suggestions codify norms already widely employed, sometimes in 
spite of the style guide, but some likely remain contentious. Some potentially 
contentious things to draw your attention to:


  *   Deemphasis of getX() nomenclature, in favour of a richer set of prefixes 
and the more succinct simple x() to retrieve where clear
  *   Avoid implementing methods, incl. equals(), hashCode() and toString(), 
unless actually used
  *   Modified new-line rules for multi-line function calls
  *   External dependency rules (require DISCUSS thread before introducing)



Re: Updating our Code Contribution/Style Guide

2022-03-14 Thread bened...@apache.org
I’m a strong -1 on strictly enforcing any style guide. It is there to help 
shape contributions, review feedback and responding to said feedback. It can 
also be used to setup IntelliJ’s code formatter to configure default behaviours.

It is not meant to be turned into a linter. Plenty of the rules are stated in a 
flexible manner, so as to permit breaches where overall legibility and 
aesthetics are improved.


From: Dinesh Joshi 
Date: Monday, 14 March 2022 at 23:44
To: dev@cassandra.apache.org 
Subject: Re: Updating our Code Contribution/Style Guide
I am also in favor of updating the style guide. We should ideally have custom 
checkstyle configuration that can ensure adherence to the style guide.

I also don't think this is a contended topic. We want to explicitly codify our 
current practices so new contributors have an easier time writing code.

It is also important to note that the current codebase is not consistent since 
it was written over a long period of time so it tends to confuse folks who are 
working in different parts of the codebase. So this style guide would be very 
helpful.


On Mar 14, 2022, at 2:41 AM, bened...@apache.org wrote:

Our style guide hasn’t been updated in about a decade, and I think it is 
overdue some improvements that address some shortcomings as well as modern 
facilities such as streams and lambdas.

Most of this was put together for an effort Dinesh started a few years ago, but 
has languished since, in part because the project has always seemed to have 
other priorities. I figure there’s never a good time to raise a contended 
topic, so here is my suggested update to contributor guidelines:

https://docs.google.com/document/d/1sjw0crb0clQin2tMgZLt_ob4hYfLJYaU4lRX722htTo

Many of these suggestions codify norms already widely employed, sometimes in 
spite of the style guide, but some likely remain contentious. Some potentially 
contentious things to draw your attention to:


  *   Deemphasis of getX() nomenclature, in favour of a richer set of prefixes 
and the more succinct simple x() to retrieve where clear
  *   Avoid implementing methods, incl. equals(), hashCode() and toString(), 
unless actually used
  *   Modified new-line rules for multi-line function calls
  *   External dependency rules (require DISCUSS thread before introducing)



Re: Updating our Code Contribution/Style Guide

2022-03-14 Thread bened...@apache.org
I think it is fine to count generated implementations of interfaces as 
interfaces, even if they are not defined. If you would like to explicitly 
mention this, that is fine. Though, if I’m perfectly honest, I do not find that 
mocking improves testing in many cases (instead making it more tightly coupled 
and brittle). But that is a separate discussion.

> Having interfaces encourages better unit tests IMHO.

Having unnecessary and unused interfaces encourages messier code, IMHO. 
Premature abstraction is bad. Introduce interfaces, methods or indeed any 
concept as and when you need them, for testing or otherwise.

> For the instance() / getInstance() methods - I know it is an additional 
> effort, but on the other hand it has many advantages because you can replace 
> the singleton for testing

Again, do this as necessary. I think for public instances this is a fine 
recommendation, but for private uses it should not be prescribed, only used if 
there is an explicit benefit.
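A minimal sketch of the pattern under discussion: a public singleton exposed via instance() (the proposed style drops the get prefix), with a test-only hook to substitute it. All names here (Scheduler, unsafeSetInstance) are invented for illustration and are not Cassandra code.

```java
public class Scheduler
{
    private static volatile Scheduler instance = new Scheduler();

    // Accessor named instance(), not getInstance(), per the proposed guide.
    public static Scheduler instance()
    {
        return instance;
    }

    // Test-only hook: swap in a stub and return the previous instance so the
    // test can restore it in a finally block.
    static Scheduler unsafeSetInstance(Scheduler replacement)
    {
        Scheduler previous = instance;
        instance = replacement;
        return previous;
    }

    public String name()
    {
        return "real";
    }
}
```

A test could then call unsafeSetInstance with an anonymous subclass (or mock) for the duration of one test case, restoring the previous instance afterwards.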

> And the continuation indent - currently, when I have IntelliJ configured with 
> provided formatting setup, I get something like this

Ah, I thought you meant for lambdas. I’m not sure how best to specify a 
continuation indent, or in which contexts it applies – only when there is no 
other indentation? Conversely, the following works quite nicely. Typically I 
try to ensure the start of the line is as succinct as possible to permit clean 
indentation follow-up.


method("a",
       "b",
       "c"
)

EndpointState removedState = endpointStateMap.stream(endpoint)
                                             .map()…



From: Jacek Lewandowski 
Date: Monday, 14 March 2022 at 16:45
To: dev@cassandra.apache.org 
Subject: Re: Updating our Code Contribution/Style Guide
Regarding interfaces, mocks created by Mockito are not really the 
implementations. We also cannot predict tests which will be written in the 
future. Having interfaces encourages better unit tests IMHO.

An addendum for exception handling guidelines sounds like a good idea.

For the instance() / getInstance() methods - I know it is an additional effort, 
but on the other hand it has many advantages because you can replace the 
singleton for testing - replace with a newly created instance for a certain 
test case

And the continuation indent - currently, when I have IntelliJ configured with 
provided formatting setup, I get something like this:

method(
    "a",
    "b",
    "c"
)

or

EndpointState removedState = endpointStateMap
        .remove(endpoint);


I know it is preferred to move to the previous line, but sometimes it makes the 
line much too long due to some nested calls or something else.



On Mon, Mar 14, 2022 at 4:02 PM bened...@apache.org wrote:
Hi Jacek,


> Sometimes, although it would be just a single implementation, interface can 
> make sense for testing purposes - for mocking in particular

This would surely mean there are two implementations, one of which is in the 
test tree? I think this is therefore already covered.


> For exception handling, perhaps we should explicitly mention in the guideline 
> that we should always handle Exception or Throwable (which is frequently 
> being caught in the code) by methods from Throwables, which would properly 
> deal with InterruptedException

I do not think this properly handles InterruptedException – an 
InterruptedException that is not to be handled directly should now really be 
handled by propagating UncheckedInterruptedException, which is very different 
from catching all Throwables. In many cases InterruptedException should be 
handled explicitly, however.
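A sketch of the propagation pattern being described. The UncheckedInterruptedException below is a minimal stand-in defined locally for illustration; Cassandra's real class of the same name lives in the C* tree and may differ.

```java
import java.util.concurrent.CountDownLatch;

public class Interrupts
{
    // Minimal stand-in for Cassandra's UncheckedInterruptedException.
    static final class UncheckedInterruptedException extends RuntimeException
    {
        UncheckedInterruptedException(InterruptedException cause)
        {
            super(cause);
        }
    }

    // When a method cannot meaningfully handle interruption itself: restore
    // the thread's interrupt status, then rethrow unchecked -- rather than
    // swallowing the exception inside a broad catch (Throwable t).
    static void await(CountDownLatch latch)
    {
        try
        {
            latch.await();
        }
        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt();
            throw new UncheckedInterruptedException(e);
        }
    }
}
```

Callers higher up the stack can then decide whether interruption is fatal, without every intermediate frame declaring the checked exception.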

I do not think catching Exception or Throwable is the correct solution in most 
cases either – we should ideally only do so at the top level, at which we want 
broad unforeseen problems to be handled, or where we need to take specific 
actions to handle the exception, in which case we should ideally always rethrow 
the Throwable unmolested. I can see some benefit from explicitly outlining 
these cases, as it is not trivial to handle exceptions cleanly and correctly. 
We could perhaps create an exception handling addendum, perhaps in a separate 
page, that goes into greater detail?


> I found it useful to access singletons by getInstance() method rather than 
> directly



This can be beneficial for public use cases, but for private use cases it is 
oftentimes unhelpful to pollute the code. Also note that the document 
explicitly proposes avoiding getX, so we would instead have e.g. a method 
called instance(). Happy to add a section for this.



>- "...If a line wraps inside a method call, try to group natural parameters 
>together on a single line..." while I'm generally ok with that approach, 
>putting

Re: Updating our Code Contribution/Style Guide

2022-03-14 Thread bened...@apache.org
ould run on CircleCI just for the modified files?

Thanks,
jacek

- - -- --- -  -
Jacek Lewandowski


On Mon, Mar 14, 2022 at 1:10 PM Josh McKenzie <jmcken...@apache.org> wrote:

we should add Python code style guides to it
Strongly agree. We're hurting ourselves by treating our python as a 2nd class 
citizen.

if we should avoid making method parameters and local variables `final` - this 
is inconsistent over the code base, but I'd prefer not having them. If the 
method is large enough that we might mistakenly reuse parameters/variables, we 
should probably refactor the method.
Why not both (i.e. use final where possible and refactor when at length / doing 
too much)? The benefits of immutability are generally well recognized as are 
the benefits of keeping methods to reasonable lengths and complexity.


On Mon, Mar 14, 2022, at 7:59 AM, Marcus Eriksson wrote:
Looks good

One thing that might be missing is direction on if we should avoid making 
method parameters and local variables `final` - this is inconsistent over the 
code base, but I'd prefer not having them. If the method is large enough that 
we might mistakenly reuse parameters/variables, we should probably refactor the 
method.

/Marcus

On Mon, Mar 14, 2022 at 09:41:35AM +, bened...@apache.org wrote:
> Our style guide hasn’t been updated in about a decade, and I think it is 
> overdue some improvements that address some shortcomings as well as modern 
> facilities such as streams and lambdas.
>
> Most of this was put together for an effort Dinesh started a few years ago, 
> but has languished since, in part because the project has always seemed to 
> have other priorities. I figure there’s never a good time to raise a 
> contended topic, so here is my suggested update to contributor guidelines:
>
> https://docs.google.com/document/d/1sjw0crb0clQin2tMgZLt_ob4hYfLJYaU4lRX722htTo
>
> Many of these suggestions codify norms already widely employed, sometimes in 
> spite of the style guide, but some likely remain contentious. Some 
> potentially contentious things to draw your attention to:
>
>
>   *   Deemphasis of getX() nomenclature, in favour of richer set of prefixes 
> and more succinct simple x() to retrieve where clear
>   *   Avoid implementing methods, incl. equals(), hashCode() and toString(), 
> unless actually used
>   *   Modified new-line rules for multi-line function calls
>   *   External dependency rules (require DISCUSS thread before introducing)
>
>
>




Re: [DISCUSS] Should we deprecate / freeze python dtests

2022-03-14 Thread bened...@apache.org
This is the limitation I mentioned. I think this is solely a question of 
supplying an initial config that uses vnodes, i.e. that specifies multiple 
tokens for each node. It is not really a limitation – I believe a dtest could 
be written today using vnodes, by overriding the config’s tokens. It does look 
like the token handling has been refactored since the initial implementation to 
make this a little uglier than should be necessary.

We should make this trivial, anyway, and perhaps offer a way to run all of the 
dtests with vnodes (and suitably annotating those that cannot be run with 
vnodes). This should be quite easy.


From: Andrés de la Peña 
Date: Monday, 14 March 2022 at 12:28
To: dev@cassandra.apache.org 
Subject: Re: [DISCUSS] Should we deprecate / freeze python dtests
Last time I checked there wasn't support for vnodes on in-jvm dtests, which 
seems an important limitation.

On Mon, 14 Mar 2022 at 12:24, bened...@apache.org wrote:
I am strongly in favour of deprecating python dtests in all cases where they 
are currently superseded by in-jvm dtests. They are environmentally more 
challenging to work with, causing many problems on local and remote machines. 
They are harder to debug, slower, flakier, and mostly less sophisticated.

> all focus on getting the in-jvm framework robust enough to cover edge-cases

Would be great to collect gaps. I think it’s just vnodes, which is by no means 
a fundamental limitation? There may also be some stuff to do with 
startup/shutdown and environmental scripts; that may be a niche where we retain 
something like python dtests.

> people aren’t familiar

I would be interested to hear from these folk to understand their concerns or 
problems using in-jvm dtests, if there is a cohort holding off for this reason.

> This is going to require documentation work from some of the original authors

I think a collection of template-like tests we can point people to would be a 
cheap initial effort. Cutting and pasting an existing test with the required 
functionality, then editing to suit, should get most people off to a quick 
start who aren’t familiar.

> Labor and process around revving new releases of the in-jvm dtest API

I think we need to revisit how we do this, as it is currently broken. We should 
consider either using ASF snapshots until we cut new releases of C* itself, or 
else using git subprojects. This will also become a problem for Accord’s 
integration over time, and perhaps other subprojects in future, so it is worth 
better solving this.

I think this has been made worse than necessary by moving too many 
implementation details to the shared API project – some should be retained 
within the C* tree, with the API primarily serving as the shared API itself to 
ensure cross-version compatibility. However, this is far from a complete 
explanation of (or solution to) the problem.



From: Josh McKenzie 
Date: Monday, 14 March 2022 at 12:11
To: dev@cassandra.apache.org 
Subject: [DISCUSS] Should we deprecate / freeze python dtests
Subject: [DISCUSS] Should we deprecate / freeze python dtests
I've been wrestling with the python dtests recently and that led to some 
discussions with other contributors about whether we as a project should be 
writing new tests in the python dtest framework or the in-jvm framework. This 
discussion has come up tangentially on some other topics, including the lack of 
documentation / expertise on the in-jvm framework dis-incentivizing some folks 
from authoring new tests there vs. the difficulty debugging and maintaining 
timer-based, sleep-based non-deterministic python dtests, etc.

I don't know of a place where we've formally discussed this and made a 
project-wide call on where we expect new distributed tests to be written; if 
I've missed an email about this someone please link on the thread here (and 
stop reading! ;))

At this time we don't specify a preference for where you write new multi-node 
distributed tests on our "development/testing" portion of the site and 
documentation: https://cassandra.apache.org/_/development/testing.html

The primary tradeoffs as I understand them for moving from python-based 
multi-node testing to jdk-based are:
Pros:

  1.  Better debugging functionality (breakpoints, IDE integration, etc)
  2.  Integration with simulator
  3.  More deterministic runtime (anecdotally; python dtests _should_ be 
deterministic but in practice they prove to be very prone to environmental 
disruption)
  4.  Test time visibility to internals of cassandra
Cons:

  1.  The framework is not as mature as the python dtest framework (some 
functionality missing)
  2.  Labor and process around revving new releases of the in-jvm dtest API
  3.  People aren't familiar with it yet and there's a learning curve

So my bid here: I personally think we as a project should freeze writ

Re: [DISCUSS] Should we deprecate / freeze python dtests

2022-03-14 Thread bened...@apache.org
I am strongly in favour of deprecating python dtests in all cases where they 
are currently superseded by in-jvm dtests. They are environmentally more 
challenging to work with, causing many problems on local and remote machines. 
They are harder to debug, slower, flakier, and mostly less sophisticated.

> all focus on getting the in-jvm framework robust enough to cover edge-cases

Would be great to collect gaps. I think it’s just vnodes, which is by no means 
a fundamental limitation? There may also be some stuff to do with 
startup/shutdown and environmental scripts; that may be a niche where we retain 
something like python dtests.

> people aren’t familiar

I would be interested to hear from these folk to understand their concerns or 
problems using in-jvm dtests, if there is a cohort holding off for this reason.

> This is going to require documentation work from some of the original authors

I think a collection of template-like tests we can point people to would be a 
cheap initial effort. Cutting and pasting an existing test with the required 
functionality, then editing to suit, should get most people off to a quick 
start who aren’t familiar.

> Labor and process around revving new releases of the in-jvm dtest API

I think we need to revisit how we do this, as it is currently broken. We should 
consider either using ASF snapshots until we cut new releases of C* itself, or 
else using git subprojects. This will also become a problem for Accord’s 
integration over time, and perhaps other subprojects in future, so it is worth 
better solving this.

I think this has been made worse than necessary by moving too many 
implementation details to the shared API project – some should be retained 
within the C* tree, with the API primarily serving as the shared API itself to 
ensure cross-version compatibility. However, this is far from a complete 
explanation of (or solution to) the problem.



From: Josh McKenzie 
Date: Monday, 14 March 2022 at 12:11
To: dev@cassandra.apache.org 
Subject: [DISCUSS] Should we deprecate / freeze python dtests
I've been wrestling with the python dtests recently and that led to some 
discussions with other contributors about whether we as a project should be 
writing new tests in the python dtest framework or the in-jvm framework. This 
discussion has come up tangentially on some other topics, including the lack of 
documentation / expertise on the in-jvm framework dis-incentivizing some folks 
from authoring new tests there vs. the difficulty debugging and maintaining 
timer-based, sleep-based non-deterministic python dtests, etc.

I don't know of a place where we've formally discussed this and made a 
project-wide call on where we expect new distributed tests to be written; if 
I've missed an email about this someone please link on the thread here (and 
stop reading! ;))

At this time we don't specify a preference for where you write new multi-node 
distributed tests on our "development/testing" portion of the site and 
documentation: https://cassandra.apache.org/_/development/testing.html

The primary tradeoffs as I understand them for moving from python-based 
multi-node testing to jdk-based are:
Pros:

  1.  Better debugging functionality (breakpoints, IDE integration, etc)
  2.  Integration with simulator
  3.  More deterministic runtime (anecdotally; python dtests _should_ be 
deterministic but in practice they prove to be very prone to environmental 
disruption)
  4.  Test time visibility to internals of cassandra
Cons:

  1.  The framework is not as mature as the python dtest framework (some 
functionality missing)
  2.  Labor and process around revving new releases of the in-jvm dtest API
  3.  People aren't familiar with it yet and there's a learning curve

So my bid here: I personally think we as a project should freeze writing new 
tests in the python dtest framework and all focus on getting the in-jvm 
framework robust enough to cover edge-cases that might still be causing new 
tests to be written in the python framework. This is going to require 
documentation work from some of the original authors of the in-jvm framework as 
well as folks currently familiar with it and effort from those of us not yet 
intimately familiar with the API to get to know it, however I believe the 
long-term benefits to the project will be well worth it.

We could institute a pre-commit check that warns on a commit increasing our raw 
count of python dtests to help provide process-based visibility to this change 
in direction for the project's testing.

So: what do we think?



Re: Updating our Code Contribution/Style Guide

2022-03-14 Thread bened...@apache.org
Agreed, how about a section like so:

Variable Mutability
As a general norm, parameters and variables should be treated as immutable and 
not be re-used. Where possible, variables that are mutated within a loop should 
be declared in the loop guard or body. Sometimes it is necessary for clarity to 
declare mutable variables outside of these contexts, but these should be scoped 
to the narrowest reasonable code block, with explicit code blocks utilised as 
necessary for clarity.

As a result of this norm, use of the final keyword within a method body is 
prohibited.

We could instead say “discouraged”, but I am not aware of any context where it 
is helpful today.

From: Marcus Eriksson 
Date: Monday, 14 March 2022 at 12:00
To: dev@cassandra.apache.org 
Subject: Re: Updating our Code Contribution/Style Guide
Looks good

One thing that might be missing is direction on whether we should avoid making 
method parameters and local variables `final` - this is inconsistent over the 
code base, but I'd prefer not having them. If a method is large enough that we 
might mistakenly reuse parameters/variables, we should probably refactor the 
method.

/Marcus

On Mon, Mar 14, 2022 at 09:41:35AM +, bened...@apache.org wrote:
> Our style guide hasn’t been updated in about a decade, and I think it is 
> overdue some improvements that address some shortcomings as well as modern 
> facilities such as streams and lambdas.
>
> Most of this was put together for an effort Dinesh started a few years ago, 
> but has languished since, in part because the project has always seemed to 
> have other priorities. I figure there’s never a good time to raise a 
> contended topic, so here is my suggested update to contributor guidelines:
>
> https://docs.google.com/document/d/1sjw0crb0clQin2tMgZLt_ob4hYfLJYaU4lRX722htTo
>
> Many of these suggestions codify norms already widely employed, sometimes in 
> spite of the style guide, but some likely remain contentious. Some 
> potentially contentious things to draw your attention to:
>
>
>   *   Deemphasis of getX() nomenclature, in favour of a richer set of prefixes 
> and the more succinct x() to retrieve where clear
>   *   Avoid implementing methods, incl. equals(), hashCode() and toString(), 
> unless actually used
>   *   Modified new-line rules for multi-line function calls
>   *   External dependency rules (require DISCUSS thread before introducing)
>
>
>


Re: Updating our Code Contribution/Style Guide

2022-03-14 Thread bened...@apache.org
I think the community would be happy to introduce a python style guide, but I 
am not well placed to do so, having chosen throughout my career to limit my 
exposure to python. Probably a parallel effort would be best - perhaps you 
could work with Stefan and others to produce such a proposal?


From: Bowen Song 
Date: Monday, 14 March 2022 at 10:53
To: dev@cassandra.apache.org 
Subject: Re: Updating our Code Contribution/Style Guide

I found there's no mention of Python code style at all. If we are going to 
update the style guide, can this be addressed too?

FYI, a quick "flake8" style check shows many existing issues in the Python 
code, including libraries imported but unused, redefinitions of unused imports, 
and invalid escape sequences in strings.


On 14/03/2022 09:41, bened...@apache.org wrote:
Our style guide hasn’t been updated in about a decade, and I think it is 
overdue some improvements that address some shortcomings as well as modern 
facilities such as streams and lambdas.

Most of this was put together for an effort Dinesh started a few years ago, but 
has languished since, in part because the project has always seemed to have 
other priorities. I figure there’s never a good time to raise a contended 
topic, so here is my suggested update to contributor guidelines:

https://docs.google.com/document/d/1sjw0crb0clQin2tMgZLt_ob4hYfLJYaU4lRX722htTo

Many of these suggestions codify norms already widely employed, sometimes in 
spite of the style guide, but some likely remain contentious. Some potentially 
contentious things to draw your attention to:


  1.  Deemphasis of getX() nomenclature, in favour of a richer set of prefixes 
and the more succinct x() to retrieve where clear
  2.  Avoid implementing methods, incl. equals(), hashCode() and toString(), 
unless actually used
  3.  Modified new-line rules for multi-line function calls
  4.  External dependency rules (require DISCUSS thread before introducing)





Updating our Code Contribution/Style Guide

2022-03-14 Thread bened...@apache.org
Our style guide hasn’t been updated in about a decade, and I think it is 
overdue some improvements that address some shortcomings as well as modern 
facilities such as streams and lambdas.

Most of this was put together for an effort Dinesh started a few years ago, but 
has languished since, in part because the project has always seemed to have 
other priorities. I figure there’s never a good time to raise a contended 
topic, so here is my suggested update to contributor guidelines:

https://docs.google.com/document/d/1sjw0crb0clQin2tMgZLt_ob4hYfLJYaU4lRX722htTo

Many of these suggestions codify norms already widely employed, sometimes in 
spite of the style guide, but some likely remain contentious. Some potentially 
contentious things to draw your attention to:


  *   Deemphasis of getX() nomenclature, in favour of a richer set of prefixes 
and the more succinct x() to retrieve where clear
  *   Avoid implementing methods, incl. equals(), hashCode() and toString(), 
unless actually used
  *   Modified new-line rules for multi-line function calls
  *   External dependency rules (require DISCUSS thread before introducing)





Re: [DISCUSS] Next release cut

2022-03-08 Thread bened...@apache.org
At the very least we should wait until the current issues with CI have 
resolved, so that pending work can merge, before declaring any freeze.

From: Mick Semb Wever 
Date: Tuesday, 8 March 2022 at 15:13
To: dev@cassandra.apache.org 
Subject: Re: [DISCUSS] Next release cut

Should we plan some soft freeze before that?


Good question! :-)

We do not want to encourage/enable the rush to commit stuff before the 1st May 
cut-off. IMHO we should be comfortable leaning towards saving any significant 
commits for the next dev cycle. How do we create the correct incentive?

If we don't feel that we have achieved stable trunk this dev cycle then a soft 
feature freeze during April makes sense to me. IMHO, needing to first cut an 
alpha1 instead of going straight to a beta1 is an indicator that we have not 
achieved stable trunk.

I'd rather not see a long testing run on the release branch before rc1. And I 
think this is necessary if we want GA by July.
So for me this boils down to… are we (and will we be) comfortable making the 
first cut 4.1-beta1 ?



Re: [DISCUSS] CASSANDRA-17292 Move cassandra.yaml toward a nested structure around major database concepts

2022-02-22 Thread bened...@apache.org
I agree that a new configuration layout should be introduced once only, not 
incrementally.

However, I disagree that we should immediately deprecate the old config file 
and refuse to parse it. We can maintain compatibility indefinitely at low cost, 
so we should do so.

Users of the old format, when using new configuration options, can simply use 
dot separators to specify them. Since most settings are not required, this is 
by far the least painful upgrade process.
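To make the compatibility idea concrete, here is a rough sketch of how 
dot-separated keys from a legacy-format flat file could be folded into the 
nested v2 structure before binding. This is not the actual implementation 
proposed in CASSANDRA-17292, and the example key is invented:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DotKeyNesting
{
    // Folds flat keys such as "a.b.c" into nested maps, so that a single
    // loader could accept either file layout after this normalisation step.
    @SuppressWarnings("unchecked")
    public static Map<String, Object> nest(Map<String, Object> flat)
    {
        Map<String, Object> root = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : flat.entrySet())
        {
            String[] path = entry.getKey().split("\\.");
            Map<String, Object> node = root;
            // walk/create intermediate maps for every path segment but the last
            for (int i = 0; i < path.length - 1; i++)
                node = (Map<String, Object>) node.computeIfAbsent(path[i], k -> new LinkedHashMap<String, Object>());
            node.put(path[path.length - 1], entry.getValue());
        }
        return root;
    }
}
```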


From: Berenguer Blasi 
Date: Wednesday, 23 February 2022 at 06:53
To: dev@cassandra.apache.org 
Subject: Re: [DISCUSS] CASSANDRA-17292 Move cassandra.yaml toward a nested 
structure around major database concepts
+1 to a non-incremental approach as well.

On 23/2/22 1:27, Caleb Rackliffe wrote:
> @Patrick I’m absolutely intending for this to be a 5.0 concern. The only 
> reason why it would have any bearing on 4.x is the case where we’re adding 
> new config that could fit into the v2 structure now and not require any later 
> changes.
>
>> On Feb 22, 2022, at 3:22 PM, Bernardo Sanchez 
>>  wrote:
>>
>> unsubscribe
>>
>> -Original Message-
>> From: Stefan Miklosovic 
>> Sent: Tuesday, February 22, 2022 3:53 PM
>> To: dev@cassandra.apache.org
>> Subject: Re: [DISCUSS] CASSANDRA-17292 Move cassandra.yaml toward a nested 
>> structure around major database concepts
>>
>> I want to add to that; however, on the other hand, we also do have dtests in 
>> Python and they need to run with old configs too. That is what Ekaterina was 
>> doing - supporting the old configuration while introducing the new one. If 
>> we make "a big cut" and the old way of doing things is no longer possible, 
>> how are we going to treat this in dtests when we have 3.11 and 4.x on old 
>> configs and 5.0 on new configs?
>>
>>> On Tue, 22 Feb 2022 at 21:48, Stefan Miklosovic 
>>>  wrote:
>>>
>>> +1 to what Patrick says.
>>>
 On Tue, 22 Feb 2022 at 21:40, Patrick McFadin  wrote:

 I'm going to put up a red flag of making config file changes of this scale 
 on a dot release. This should really be a 5.0 consideration.

 With that, I would propose a #5. 5.0 nodes will only read the new config 
 files and reject old config files. If any of you went through the config 
 file changes from Apache HTTPd 1.3 -> 2.0 you know how much of a lifesaver 
 that can be for ops. Make it a part of the total upgrade to a new major 
 version, not a radical change inside of a dot version, and make it a clean 
 break. No "legacy config" lying around. That's just a recipe for 
 surprises later if there are new required config values and somebody 
 doesn't even realize they have some old 4.x yaml files lying around.

 Patrick

 On Tue, Feb 22, 2022 at 11:51 AM Tibor Répási  
 wrote:
> Glad we agree on #4. That feature could be added anytime.
>
> If a version element is added to the YAML, then it is not necessary to 
> change the filename, thus we could end up with #3. The value of the 
> version element could default to 1 in the first phase, which does not 
> need any change for legacy format configuration. New config format must 
> include version: 2. When in some later version the support for legacy 
> configuration is removed, the default for the version element could be 
> changed to 2 or removed.
>
> On 22. Feb 2022, at 19:30, Caleb Rackliffe  
> wrote:
>
> My initial preference would be something like combining #1 and #4. We 
> could add something like a simple "version: <1|2>" element to the YAML 
> that would eliminate any possible confusion about back-compat within a 
> given file.
>
> Thanks for enumerating these!
>
> On Tue, Feb 22, 2022 at 10:42 AM Tibor Répási  
> wrote:
>> Hi,
>>
>> I like the idea of having cassandra.yaml better structured, as an 
>> operator, my primer concern is the transition. How would we change the 
>> config structure from legacy to the new one during a rolling upgrade? My 
>> thoughts on this:
>>
>> 1. Legacy and new configuration is stored in different files. Cassandra 
>> will read the legacy file on startup if it exists, the new one 
>> otherwise. May raise warning on startup when legacy was used.
>>pros:
>> - separate files for separate formats
>> - clean and operator controlled switch to new format
>> - already known procedure, e.g. change from PropertyFileSnitch to 
>> GossipingPropertyFileSnitch
>>cons:
>> - name of the config file would change from cassandra.yaml to 
>> something else (cassandra_v2.yaml, config.yaml ???)
>> - would need 

Re: Apache Cassandra fuzz testing

2022-02-18 Thread bened...@apache.org
> There are many tests that are currently purely manual, and some are just hard 
> to maintain… And whenever we add support for, say, UDTs, overnight you'll 
> just get UDTs for all existing tests

Yes, something worth really highlighting here is that many of our tests are 
flaky because we have so many tests, many of low quality, where 
determinism/reliability has been too costly to deliver. With fewer tests able 
to cover more functionality, investment in reliability and determinism more 
easily pays off. Also, by moving to frameworks that have done some of this heavy 
lifting, it is anyway easier to achieve.

I agree that some areas of the codebase might be quite ripe for this kind of 
work, particularly for more complex CQL features and ones being invested in 
today, or in the near future. MVs seem an obvious example, as part of work to 
move them out of experimental status. I’m uncertain if SAI is suitable for use 
with Harry, but it could be explored.

From: Alex Petrov 
Date: Friday, 18 February 2022 at 11:39
To: dev@cassandra.apache.org 
Subject: Re: Apache Cassandra fuzz testing
I did not intend to imply that we should migrate all tests. To be more specific 
than I was, we can pick only the ones where Harry just makes more sense than 
manual tests, where it can cover more ground. GROUP BY comes to mind as a 
perfect example: its current test suite is rather limited. Fuzzing it can yield 
a lot of useful things, with very little risk of flakiness. It can completely 
replace the existing test suite and test many more cases.

Another example is SelectTest and the many tests like it, which just manually 
walk through a bunch of cases while leaving out many other potential 
edge-cases. TTL tests would be the next example. Range tombstones - yet 
another. Read repair tests would also be good to expand. Many python dtests 
that use stress to load data are another potential candidate.
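As a toy illustration of the model-based style of testing Harry uses (this is 
not Harry's API; the "system under test" and model here are invented for the 
example), a fuzzer generates random operations from a fixed seed, applies them 
to both the system under test and a trivially-correct model, and compares:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

public class ModelBasedFuzz
{
    // Runs 'ops' random put/remove operations against both maps from a fixed
    // seed, so any divergence is deterministic and reproducible from the seed.
    public static boolean agree(Map<String, Integer> sut, long seed, int ops)
    {
        Map<String, Integer> model = new HashMap<>(); // trivially-correct oracle
        Random rnd = new Random(seed);
        for (int i = 0; i < ops; i++)
        {
            String key = "k" + rnd.nextInt(16);
            if (rnd.nextBoolean())
            {
                int value = rnd.nextInt(1000);
                sut.put(key, value);
                model.put(key, value);
            }
            else
            {
                sut.remove(key);
                model.remove(key);
            }
        }
        return model.equals(sut);
    }

    public static void main(String[] args)
    {
        // Here the "system under test" is just a TreeMap; in real fuzzing it
        // would be the database, and the model a much simpler oracle.
        System.out.println(agree(new TreeMap<>(), 42L, 10_000)); // prints "true"
    }
}
```

One seed exercises thousands of cases that would otherwise each be a manually 
written test, which is the point being made above about coverage per test.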

There are many tests that are currently purely manual, and some are just hard 
to maintain. Many of those can be good candidates for switching to 
property-based. But, as Benedict mentioned, we don't have much bandwidth to 
migrate the tests anyways.

It could be that you are skeptical since you haven't had much experience with 
Harry just yet. While many features are still missing, it still is more 
powerful than many existing manually written tests. And whenever we add support 
for, say, UDTs, overnight you'll just get UDTs for all existing tests, followed 
by collections, and other things. Moreover, we will be able to see if all our 
tests pass under failure conditions, and test them with different sets of 
parameters.

Maybe I can reframe it and say that we add fuzz tests for the mentioned areas of 
code and, if at some point in the future we decide that manually-written 
tests are redundant, we can consider deprecating them.



On Fri, Feb 18, 2022, at 9:41 AM, Benjamin Lerer wrote:
Thanks a lot for raising that topic Alex.

I did not have the chance to use Harry yet and I guess it is the case for most 
of us.
Starting to use it in our new tests makes total sense to me.
I am more concerned about starting to migrate/update existing tests. It took us 
time to build some reliable and non flaky tests to guarantee the correctness of 
the codebase. As far as I can see from Harry's documentation some features are 
still missing. The people lack experience with this tool and it will take a bit 
of time for them to build that knowledge. Along the way we might also discover 
some issues with Harry that need to be addressed.

So I am +1 for starting to use it in our new tests and build our knowledge of 
Harry. Regarding a migration of existing tests to it, I would wait a bit before 
choosing to go down that path.



On Wed, 16 Feb 2022 at 16:30, bened...@apache.org wrote:

+1



The Simulator is hopefully going to be another powerful tool for this kind of 
work, and we should be encouraging the use of both for large or complex pieces 
of work.





From: Alex Petrov 
Date: Wednesday, 16 February 2022 at 11:56
To: dev@cassandra.apache.org 
Subject: Re: Apache Cassandra fuzz testing

(apologies for sending an incomplete email)



Hi everyone,



As you may know, we’ve been actively working on fuzz testing Apache Cassandra 
for the past several years and made quite a large progress on that front.



We’ve cut a 0.0.1 release of Harry [1], a fuzz testing tool for Apache 
Cassandra, and merged CASSANDRA-16262 [2].



I’d recommend that we as a community take the next logical step and demand fuzz / 
property-based tests for all major patches, and start migrating/updating 
existing tests to be property-based rather than using hardcoded values.



Harry can be used to generate data, and then check that a sequence of events 
corresponds to Cassandra resolution rules.

Re: Apache Cassandra fuzz testing

2022-02-18 Thread bened...@apache.org
I’m not sure we have lots of bandwidth for upgrading existing tests anyway. 
However, the source of flakiness in existing tests is primarily either 
environmental or poor test design (relying on timings being a major culprit). 
If Harry were to produce flakiness it would have a higher likelihood of being 
real problems, and they would be reproducible if the tests were deterministic.

The Simulator on the other hand might be helpful for flaky tests, by being 
deterministic. We might want to develop a JUnitRunner that is backed by the 
simulator so we can easily switch it on to help diagnose flaky tests, and also 
for improved testing of concurrent code unit tests. We probably would not want 
to use it for all tests, however, as it might well slow down execution.

From: Stefan Miklosovic 
Date: Friday, 18 February 2022 at 08:57
To: dev@cassandra.apache.org 
Subject: Re: Apache Cassandra fuzz testing
Benjamin's email could be written by myself :) Fully agree.

On Fri, 18 Feb 2022 at 09:42, Benjamin Lerer  wrote:
>
> Thanks a lot for raising that topic Alex.
>
> I did not have the chance to use Harry yet and I guess it is the case for 
> most of us.
> Starting to use it in our new tests makes total sense to me.
> I am more concerned about starting to migrate/update existing tests. It took 
> us time to build some reliable and non flaky tests to guarantee the 
> correctness of the codebase. As far as I can see from Harry's documentation 
> some features are still missing. The people lack experience with this tool 
> and it will take a bit of time for them to build that knowledge. Along the 
> way we might also discover some issues with Harry that need to be addressed.
>
> So I am +1 for starting to use it in our new tests and build our knowledge of 
> Harry. Regarding a migration of existing tests to it, I would wait a bit 
> before choosing to go down that path.
>
>
>
> On Wed, 16 Feb 2022 at 16:30, bened...@apache.org wrote:
>>
>> +1
>>
>>
>>
>> The Simulator is hopefully going to be another powerful tool for this kind 
>> of work, and we should be encouraging the use of both for large or complex 
>> pieces of work.
>>
>>
>>
>>
>>
>> From: Alex Petrov 
>> Date: Wednesday, 16 February 2022 at 11:56
>> To: dev@cassandra.apache.org 
>> Subject: Re: Apache Cassandra fuzz testing
>>
>> (apologies for sending an incomplete email)
>>
>>
>>
>> Hi everyone,
>>
>>
>>
>> As you may know, we’ve been actively working on fuzz testing Apache 
>> Cassandra for the past several years and made quite a large progress on that 
>> front.
>>
>>
>>
>> We’ve cut a 0.0.1 release of Harry [1], a fuzz testing tool for Apache 
>> Cassandra, and merged CASSANDRA-16262 [2].
>>
>>
>>
>> I’d recommend that we as a community take the next logical step and demand 
>> fuzz / property-based tests for all major patches, and start 
>> migrating/updating existing tests to be property-based rather than using 
>> hardcoded values.
>>
>>
>>
>> Harry can be used to generate data, and then check that a sequence of events 
>> corresponds to Cassandra resolution rules. We will continue expanding Harry 
>> coverage and writing new models and checkers, too.
>>
>>
>>
>> If you would like to learn more about Harry, you can refer to a recent blog 
>> post [3]. I will also be happy to answer any questions you may have about 
>> Harry and assist you in writing your tests, and helping to extend Harry in 
>> case there’s a feature you may need to accomplish it.
>>
>>
>>
>> Thank you,
>>
>> —Alex
>>
>>
>>
>> [1] [GitHub - apache/cassandra-harry: Apache Cassandra - 
>> Harry](https://github.com/apache/cassandra-harry)
>>
>> [2] [CASSANDRA-16262 4.0 Quality: Coordination & Replication Fuzz Testing - 
>> ASF JIRA](https://issues.apache.org/jira/browse/CASSANDRA-16262)
>>
>> [3] [Apache Cassandra | Apache Cassandra 
>> Documentation](https://cassandra.apache.org/_/blog/Harry-an-Open-Source-Fuzz-Testing-and-Verification-Tool-for-Apache-Cassandra.html)


Re: Apache Cassandra fuzz testing

2022-02-16 Thread bened...@apache.org
+1

The Simulator is hopefully going to be another powerful tool for this kind of 
work, and we should be encouraging the use of both for large or complex pieces 
of work.


From: Alex Petrov 
Date: Wednesday, 16 February 2022 at 11:56
To: dev@cassandra.apache.org 
Subject: Re: Apache Cassandra fuzz testing
(apologies for sending an incomplete email)

Hi everyone,

As you may know, we’ve been actively working on fuzz testing Apache Cassandra 
for the past several years and made quite a large progress on that front.

We’ve cut a 0.0.1 release of Harry [1], a fuzz testing tool for Apache 
Cassandra, and merged CASSANDRA-16262 [2].

I’d recommend that we as a community take the next logical step and demand fuzz / 
property-based tests for all major patches, and start migrating/updating 
existing tests to be property-based rather than using hardcoded values.

Harry can be used to generate data, and then check that a sequence of events 
corresponds to Cassandra resolution rules. We will continue expanding Harry 
coverage and writing new models and checkers, too.

If you would like to learn more about Harry, you can refer to a recent blog 
post [3]. I will also be happy to answer any questions you may have about Harry 
and assist you in writing your tests, and helping to extend Harry in case 
there’s a feature you may need to accomplish it.

Thank you,
—Alex

[1] [GitHub - apache/cassandra-harry: Apache Cassandra - 
Harry](https://github.com/apache/cassandra-harry)
[2] [CASSANDRA-16262 4.0 Quality: Coordination & Replication Fuzz Testing - ASF 
JIRA](https://issues.apache.org/jira/browse/CASSANDRA-16262)
[3] [Apache Cassandra | Apache Cassandra 
Documentation](https://cassandra.apache.org/_/blog/Harry-an-Open-Source-Fuzz-Testing-and-Verification-Tool-for-Apache-Cassandra.html)


Re: [VOTE] CEP-19: Trie memtable implementation

2022-02-16 Thread bened...@apache.org
+1

From: Branimir Lambov 
Date: Wednesday, 16 February 2022 at 08:58
To: dev@cassandra.apache.org 
Subject: [VOTE] CEP-19: Trie memtable implementation
Hi everyone,

I'd like to propose CEP-19 for approval.

Proposal: 
https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-19%3A+Trie+memtable+implementation
Discussion: https://lists.apache.org/thread/fdvf1wmxwnv5jod59jznbnql23nqosty

The vote will be open for 72 hours.
Votes by committers are considered binding.
A vote passes if there are at least three binding +1s and no binding vetoes.

Thank you,
Branimir


Re: [DISCUSS] Hotfix release procedure

2022-02-15 Thread bened...@apache.org
One issue with this approach is that we are advertising that we are preparing a 
security release by preparing such a release candidate.

I wonder if we need to find a way to produce binaries without leaving an 
obvious public mark (i.e. private CI, private branch)


From: Josh McKenzie 
Date: Tuesday, 15 February 2022 at 14:09
To: dev@cassandra.apache.org 
Subject: [DISCUSS] Hotfix release procedure
On the release thread for 4.0.2 Jeremiah brought up a point about hotfix 
releases and CI:
https://lists.apache.org/thread/7zc22z5vw5b58hdzpx2nypwfzjzo3qbr

If we are making this release for a security incident/data loss/hot fix reason, 
then I would expect to see the related change set only containing those 
patches. But the change set in the tag here contains the latest 4.0-dev commits.

I'd like to propose that in the future, regardless of the state of CI, if we 
need to cut a hotfix release we do so from the previous released SHA + only the 
changes required to address the hotfix to minimally impact our end users and 
provide them with as minimally disruptive a fix as possible.


Re: [RELEASE] Apache Cassandra 4.0.2 released

2022-02-12 Thread bened...@apache.org
As discussed on 15234, there is never a rush to remove Config parameters, and 
it should only be done when there’s some clear value. Since the overhead of 
having an unused parameter is ~zero, in my opinion this occurs only when we 
really need the operator to consider the semantic impact of its deprecation.

We should never break minor upgrades, but we shouldn’t break any upgrade 
unnecessarily.

I wonder if we should introduce a compile time check that validates config 
compatibility with prior versions, to avoid this in future.
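One hedged sketch of such a check (the Config class and field names here are 
stand-ins for illustration, not the real org.apache.cassandra.config.Config): 
at build or startup time, verify that every setting name known to the previous 
release still resolves to a field, so accidental removals fail fast:

```java
import java.util.ArrayList;
import java.util.List;

public class ConfigCompatCheck
{
    // Stand-in for the real Config class; field names are examples only.
    public static class Config
    {
        public int concurrent_reads = 32;
        public int concurrent_writes = 32;
    }

    // Returns every setting name from the previous release that no longer
    // resolves to a field, so the build (or startup) can fail fast on
    // accidental removals rather than silently breaking minor upgrades.
    public static List<String> missingSettings(Class<?> config, List<String> previousReleaseSettings)
    {
        List<String> missing = new ArrayList<>();
        for (String name : previousReleaseSettings)
        {
            try
            {
                config.getDeclaredField(name);
            }
            catch (NoSuchFieldException e)
            {
                missing.add(name);
            }
        }
        return missing;
    }
}
```

The previous release's setting names could be a checked-in list regenerated at 
release time; a non-empty result would then fail the build.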

From: Dinesh Joshi 
Date: Saturday, 12 February 2022 at 00:09
To: dev@cassandra.apache.org 
Subject: Re: [RELEASE] Apache Cassandra 4.0.2 released
We should also have deprecation guidance in Config.java. This will help when 
anybody is making changes in the future.


On Feb 11, 2022, at 3:07 PM, Ekaterina Dimitrova  wrote:

Note taken, I had to document only in 4.0.x that those are placeholders. I just 
opened a ticket to fix that - CASSANDRA-17377. I am going to submit a patch soon.

On Fri, 11 Feb 2022 at 17:44, Jeff Jirsa  wrote:
We don't HAVE TO remove the Config.java entry - we can mark it as deprecated 
and ignored and remove it in a future version (and you could update Config.java 
to log a message about having a deprecated config option). It's a much better 
operator experience: log for a major version, then remove in the next.
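As a sketch of that deprecate-and-warn approach (this is a stand-in class, not 
the real Config; the field name echoes the otc_coalescing_strategy option 
discussed in this thread), startup could scan for deprecated options the 
operator has set and log them instead of silently ignoring them:

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

public class DeprecationWarnings
{
    // Stand-in config object; field names are invented for the example.
    public static class Config
    {
        @Deprecated
        public String otc_coalescing_strategy; // deprecated and ignored
        public int concurrent_writes = 32;
    }

    // Collects a warning for every @Deprecated field the operator has set,
    // so startup can log them rather than silently ignoring the options.
    public static List<String> warnings(Object config)
    {
        List<String> out = new ArrayList<>();
        for (Field f : config.getClass().getFields())
        {
            try
            {
                if (f.isAnnotationPresent(Deprecated.class) && f.get(config) != null)
                    out.add(f.getName() + " is deprecated and ignored; remove it from cassandra.yaml");
            }
            catch (IllegalAccessException ignored)
            {
                // only public fields are scanned, so this should not happen
            }
        }
        return out;
    }
}
```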

On Fri, Feb 11, 2022 at 2:41 PM Ekaterina Dimitrova  wrote:
This had to be removed in 4.0 but it wasn't. The mentioned patch did it to fix 
a bug that gave the impression those options still work. Confirmed with 
Benedict on the ticket.

I agree I absolutely had to document it better, a ticket for documentation was 
opened but it slipped from my mind with this emergency release this week. It is 
unfortunate it is still in our backlog after the ADOC migration.

Note taken. I truly apologize and I am going to prioritize CASSANDRA-17135. Let 
me know if there is anything else I can/should do at this point.

On Fri, 11 Feb 2022 at 17:26, Erick Ramirez  wrote:
(moved dev@ to BCC)

It looks like the otc_coalescing_strategy config key is no longer supported in 
cassandra.yaml in 4.0.2, despite this not being mentioned anywhere in 
CHANGES.txt or NEWS.txt.

James, you're right -- it was removed by CASSANDRA-17132 in 4.0.2 and 4.1.

I agree that the CHANGES.txt entry should be clearer and we'll improve it, plus 
add detailed info in NEWS.txt. I'll get this done soon in CASSANDRA-17135. 
Thanks for the feedback. Cheers!



Re: [DISCUSS] CEP-7 Storage Attached Index

2022-02-09 Thread bened...@apache.org
> Is there some mechanism such as experimental flags, which would allow the 
> SAI-only OR support to be merged into trunk

FWIW, I’m OK with this merging to trunk, either hidden behind a CI-only flag or 
exposed to the user via some experimental flag (and a suitable NEWS.txt). We’ve 
discussed the need to periodically merge feature branches with trunk before 
they are complete. If the work is logically complete for SAI, and we’re only 
pending work to make OR consistent between SAI and non-SAI queries, I think 
that more than meets this criterion.


From: Henrik Ingo 
Date: Monday, 7 February 2022 at 12:03
To: dev@cassandra.apache.org 
Subject: Re: [DISCUSS] CEP-7 Storage Attached Index
Thanks Benjamin for reviewing and raising this.

While I don't speak for the CEP authors, just some thoughts from me:

On Mon, Feb 7, 2022 at 11:18 AM Benjamin Lerer  wrote:
I would like to raise 2 points regarding the current CEP proposal:

1. There are mention of some target versions and of the removal of SASI

At this point, we have not agreed on any version numbers and I do not feel that 
removing SASI should be part of the proposal for now.
It seems to me that we should first see the adoption surrounding SAI before 
talking about deprecating other solutions.


This seems rather uncontroversial. I think the CEP template and previous CEPs 
invite  the discussion on whether the new feature will or may replace an 
existing feature. But at the same time that's of course out of scope for the 
work at hand. I have no opinion one way or the other myself.


2. OR queries

It is unclear to me if the proposal is about adding OR support only for SAI 
index or for other types of queries too.
In the past, we had the nasty habit in CQL of providing only partially 
implemented features, which resulted in a bad user experience. 
Some examples are:
* LIKE restrictions, which were introduced for the needs of SASI and were 
never supported for other types of queries
* IS NOT NULL restrictions for MATERIALIZED VIEWS that are not supported 
elsewhere
* != operator only supported for conditional inserts or updates
And there are unfortunately many more.

We are currently, slowly, trying to fix those issues and make CQL a more mature 
language. As a consequence, I would like us to change our way of doing things. 
If we introduce support for OR, it should also cover all the other types of 
queries and be fully tested.
I also believe that it is a feature that due to its complexity fully deserves 
its own CEP.


The current code that would be submitted for review after the CEP is adopted 
contains OR support beyond just SAI indexes. An initial implementation first 
targeted only queries where every column in a WHERE clause using OR needed to 
be backed by an SAI index. It has since been extended to also support ALLOW 
FILTERING mode as well as OR with clustering key columns. The current 
implementation is by no means perfect as general-purpose OR support; the focus 
throughout was on implementing OR support in SAI. I'll leave it to others to 
enumerate exactly the limitations of the current implementation.

Seeing that also Benedict supports your point of view, I would steer the 
conversation more into a project management perspective:
* How can we advance CEP-7 so that the bulk of the SAI code can still be added 
to Cassandra, so that  users can benefit from this new index type, albeit 
without OR?
* This is also an important question from the point of view that this is a 
large block of code that will inevitably diverge if it's not in trunk. Also, 
merging it to trunk will allow future enhancements, including the OR syntax 
btw, to happen against trunk (aka upstream first).
* Since OR support nevertheless is a feature of SAI, it needs to be at least 
unit tested, but ideally would even be exposed so that it is possible to test 
on the CQL level. Is there some mechanism such as experimental flags, which 
would allow the SAI-only OR support to be merged into trunk, while a separate 
CEP is focused on implementing "proper" general purpose OR support? I should 
note that there is no guarantee that the OR CEP would be implemented in time 
for the next release. So the answer to this point needs to be something that 
doesn't violate the desire for good user experience.

henrik




Re: [DISCUSS] CEP-19: Trie memtable implementation

2022-02-09 Thread bened...@apache.org
Why not have some default templates that can be specified by the schema without 
touching the yaml, but overridden in the yaml as necessary?

From: Branimir Lambov 
Date: Wednesday, 9 February 2022 at 09:35
To: dev@cassandra.apache.org 
Subject: Re: [DISCUSS] CEP-19: Trie memtable implementation
If I understand this correctly, you prefer _not_ to have an option to give the 
configuration explicitly in the schema. I.e. force the configurations 
("templates" in current terms) to be specified in the yaml, and only allow 
tables to specify which one to use among them?

This does sound at least as good to me, and I'll happily change the API.

Regards,
Branimir

On Tue, Feb 8, 2022 at 10:40 PM Dinesh Joshi 
mailto:djo...@apache.org>> wrote:
My quick reading of the code suggests that the schema will override the operator's 
default preference in the YAML. In the event of a bug in the new 
implementation, there could be a situation where the operator might need to 
override this via the YAML.


On Feb 8, 2022, at 12:29 PM, Jeremiah D Jordan <jeremiah.jor...@gmail.com> wrote:

I don’t really see most users touching the default implementation.  I would 
expect the main reason someone would change would be
1. They run into some bug that is only in one of the implementations.
2. They have persistent memory and so want to use 
https://issues.apache.org/jira/browse/CASSANDRA-13981

Given that I doubt most people will touch it, I think it is good to give 
advanced operators more control over switching to things that have new 
performance characteristics. So I like that the proposed configuration 
approach allows someone to change to a new implementation one node at a time 
and only for specific tables.


On Feb 8, 2022, at 2:21 PM, Dinesh Joshi <djo...@apache.org> wrote:

Thank you for sharing the perf test results.

Going back to the schema vs yaml configuration. I am concerned users may pick 
the wrong implementation for their use-case. Is there any chance for us to 
automatically pick a MemTable implementation based on heuristics? Do we foresee 
users ever picking the existing SkipList implementation over the Trie? Given 
the performance tests, the Trie implementation seems the clear winner.

To be clear, I am not suggesting we remove the existing implementation. I am 
for maintaining a pluggable API for various components.

Dinesh


On Feb 7, 2022, at 8:39 AM, Branimir Lambov <blam...@apache.org> wrote:

Added some performance results to the ticket: 
https://issues.apache.org/jira/browse/CASSANDRA-17240

Regards,
Branimir

On Sat, Feb 5, 2022 at 10:59 PM Dinesh Joshi <djo...@apache.org> wrote:
This is excellent. Thanks for opening up this CEP. It would be great to get 
some stats around GC allocation rate / memory pressure, read & write latencies, 
etc. compared to existing implementation.

Dinesh


On Jan 18, 2022, at 2:13 AM, Branimir Lambov <blam...@apache.org> wrote:

The memtable pluggability API (CEP-11) is per-table to enable memtable 
selection that suits specific workflows. It also makes full sense to permit 
per-node configuration, both to be able to modify the configuration to suit 
heterogeneous deployments better, as well as to test changes for improvements 
such as this one.
Recognizing this, the patch comes with a modification to the API that defines 
memtable templates in cassandra.yaml (i.e. per node) and allows 
the schema to select a template (in addition to being able to specify the full 
memtable configuration). One could use this e.g. by adding:

memtable_templates:
  trie:
    class: TrieMemtable
    shards: 16
  skiplist:
    class: SkipListMemtable
memtable:
  template: skiplist

(which defines two templates and specifies the default memtable implementation 
to use) to cassandra.yaml, and specifying WITH memtable = {'template' : 'trie'} 
in the table schema.

I intend to commit this modification with the memtable API 
(CASSANDRA-17034/CEP-11).

Performance comparisons will be published soon.

Regards,
Branimir

On Fri, Jan 14, 2022 at 4:15 PM Jeff Jirsa <jji...@gmail.com> wrote:
Sounds like a great addition

Can you share some of the details around gc and latency improvements you’ve 
observed with the list?

Any specific reason the configuration is through schema vs yaml? Presumably it’s 
so a user can test per table, but this changes every host in a cluster, so the 
impact of a bug/regression is much higher.



On Jan 10, 2022, at 1:30 AM, Branimir Lambov <blam...@apache.org> wrote:

We would like to contribute our TrieMemtable to Cassandra.

https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-19%3A+Trie+memtable+implementation

This is a new memtable solution aimed at replacing the legacy implementation, 
developed with the following objectives:
- lowering the on-heap complexity and the ability 

Re: [DISCUSS] CEP-19: Trie memtable implementation

2022-02-08 Thread bened...@apache.org
FWIW, I think the proposed approach to configuration is fine.

I think selecting a choice for the user should be done simply and 
deterministically. We should probably default to Trie based memtables for users 
with a fresh config file, and we can consider changing the default in a later 
release for those with an old config file that does not specify an 
implementation.


From: Dinesh Joshi 
Date: Tuesday, 8 February 2022 at 20:21
To: dev@cassandra.apache.org 
Subject: Re: [DISCUSS] CEP-19: Trie memtable implementation
Thank you for sharing the perf test results.

Going back to the schema vs yaml configuration. I am concerned users may pick 
the wrong implementation for their use-case. Is there any chance for us to 
automatically pick a MemTable implementation based on heuristics? Do we foresee 
users ever picking the existing SkipList implementation over the Trie? Given 
the performance tests, the Trie implementation seems the clear winner.

To be clear, I am not suggesting we remove the existing implementation. I am 
for maintaining a pluggable API for various components.

Dinesh


On Feb 7, 2022, at 8:39 AM, Branimir Lambov <blam...@apache.org> wrote:

Added some performance results to the ticket: 
https://issues.apache.org/jira/browse/CASSANDRA-17240

Regards,
Branimir

On Sat, Feb 5, 2022 at 10:59 PM Dinesh Joshi <djo...@apache.org> wrote:
This is excellent. Thanks for opening up this CEP. It would be great to get 
some stats around GC allocation rate / memory pressure, read & write latencies, 
etc. compared to existing implementation.

Dinesh


On Jan 18, 2022, at 2:13 AM, Branimir Lambov 
mailto:blam...@apache.org>> wrote:

The memtable pluggability API (CEP-11) is per-table to enable memtable 
selection that suits specific workflows. It also makes full sense to permit 
per-node configuration, both to be able to modify the configuration to suit 
heterogeneous deployments better, as well as to test changes for improvements 
such as this one.
Recognizing this, the patch comes with a modification to the API that defines 
memtable templates in cassandra.yaml (i.e. per node) and allows 
the schema to select a template (in addition to being able to specify the full 
memtable configuration). One could use this e.g. by adding:

memtable_templates:
  trie:
    class: TrieMemtable
    shards: 16
  skiplist:
    class: SkipListMemtable
memtable:
  template: skiplist

(which defines two templates and specifies the default memtable implementation 
to use) to cassandra.yaml, and specifying WITH memtable = {'template' : 'trie'} 
in the table schema.

I intend to commit this modification with the memtable API 
(CASSANDRA-17034/CEP-11).

Performance comparisons will be published soon.

Regards,
Branimir

On Fri, Jan 14, 2022 at 4:15 PM Jeff Jirsa <jji...@gmail.com> wrote:
Sounds like a great addition

Can you share some of the details around gc and latency improvements you’ve 
observed with the list?

Any specific reason the configuration is through schema vs yaml? Presumably it’s 
so a user can test per table, but this changes every host in a cluster, so the 
impact of a bug/regression is much higher.



On Jan 10, 2022, at 1:30 AM, Branimir Lambov <blam...@apache.org> wrote:

We would like to contribute our TrieMemtable to Cassandra.

https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-19%3A+Trie+memtable+implementation

This is a new memtable solution aimed at replacing the legacy implementation, 
developed with the following objectives:
- lowering the on-heap complexity and the ability to store memtable indexing 
structures off-heap,
- leveraging byte order and a trie structure to lower the memory footprint and 
improve mutation and lookup performance.
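To make the second objective concrete, here is a small illustrative sketch (in Python, and purely hypothetical — the names and structure are mine, not Cassandra's actual TrieMemtable): once keys are translated to a byte-comparable form (the CASSANDRA-6936 piece), a trie keyed on individual bytes stores shared prefixes only once, and a plain depth-first walk yields the keys already in sorted order.

```python
# Illustrative byte-keyed trie (NOT Cassandra's TrieMemtable): shared key
# prefixes are stored once, and an in-order walk returns keys in byte order.

class ByteTrie:
    def __init__(self):
        self.children = {}   # byte value -> child ByteTrie
        self.value = None    # payload stored at this node, if any

    def put(self, key: bytes, value):
        node = self
        for b in key:
            node = node.children.setdefault(b, ByteTrie())
        node.value = value

    def get(self, key: bytes):
        node = self
        for b in key:
            node = node.children.get(b)
            if node is None:
                return None
        return node.value

    def items(self):
        """Depth-first walk; yields (key, value) pairs in byte order."""
        stack = [(b"", self)]
        while stack:
            prefix, node = stack.pop()
            if node.value is not None:
                yield prefix, node.value
            # push children largest-byte first so the smallest is popped first
            for b in sorted(node.children, reverse=True):
                stack.append((prefix + bytes([b]), node.children[b]))

trie = ByteTrie()
trie.put(b"apple", 1)
trie.put(b"app", 2)
trie.put(b"banana", 3)
print([k for k, _ in trie.items()])  # [b'app', b'apple', b'banana']
```

The sorted-iteration property matters because a memtable must be flushed to SSTables in key order; with byte-comparable keys the trie gives that ordering for free, without per-row comparator calls.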

The new memtable relies on CASSANDRA-6936 to translate to and from byte-ordered 
representations of types, and CASSANDRA-17034 / CEP-11 to plug into Cassandra. 
The memtable is built on multiple shards of custom in-memory single-writer 
multiple-reader tries, whose implementation uses a combination of 
state-of-the-art and novel features for greater efficiency.
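The sharding idea above can be sketched as follows (again an assumed illustration, not the actual code): each key maps to exactly one shard, writes to a shard are serialized so each per-shard structure only ever sees a single writer, and writers on different shards never contend with each other.

```python
# Hypothetical sketch of a sharded single-writer memtable. Dicts stand in
# for the per-shard tries; the real structure uses lock-free single-writer
# multiple-reader tries so that reads need no lock at all.
import threading

NUM_SHARDS = 4

class ShardedMemtable:
    def __init__(self):
        self.shards = [dict() for _ in range(NUM_SHARDS)]
        self.locks = [threading.Lock() for _ in range(NUM_SHARDS)]

    def _shard(self, key: bytes) -> int:
        # every key deterministically owns one shard
        return hash(key) % NUM_SHARDS

    def put(self, key: bytes, value):
        i = self._shard(key)
        with self.locks[i]:          # at most one writer per shard
            self.shards[i][key] = value

    def get(self, key: bytes):
        return self.shards[self._shard(key)].get(key)

mt = ShardedMemtable()
mt.put(b"k1", "v1")
mt.put(b"k2", "v2")
print(mt.get(b"k1"))  # v1
```

Sharding bounds writer contention to keys that hash to the same shard, which is one way write throughput can scale with core count.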

The CEP's JIRA ticket (https://issues.apache.org/jira/browse/CASSANDRA-17240) 
contains the initial version of the implementation. In its current form it 
achieves much lower garbage-collection latency, significantly larger data 
sizes between flushes for the same memory allocation, and drastically 
increased write throughput, and we expect the memory and garbage-collection 
improvements to go much further with upcoming improvements to the solution.

I am interested in hearing your thoughts on the proposal.

Regards,
Branimir





Re: [DISCUSS] CEP-7 Storage Attached Index

2022-02-07 Thread bened...@apache.org
I don’t have a strong opinion about CEP-7 taking a hard dependency on any new 
CQL CEP, particularly from a point of view of first landing in the codebase.


From: Henrik Ingo 
Date: Monday, 7 February 2022 at 12:03
To: dev@cassandra.apache.org 
Subject: Re: [DISCUSS] CEP-7 Storage Attached Index
Thanks Benjamin for reviewing and raising this.

While I don't speak for the CEP authors, just some thoughts from me:

On Mon, Feb 7, 2022 at 11:18 AM Benjamin Lerer <ble...@apache.org> wrote:
I would like to raise 2 points regarding the current CEP proposal:

1. There are mention of some target versions and of the removal of SASI

At this point, we have not agreed on any version numbers, and I do not feel that 
removing SASI should be part of the proposal for now.
It seems to me that we should first see the adoption of SAI before 
talking about deprecating other solutions.


This seems rather uncontroversial. I think the CEP template and previous CEPs 
invite discussion on whether the new feature will or may replace an 
existing feature. But at the same time that's of course out of scope for the 
work at hand. I have no opinion one way or the other myself.


2. OR queries

It is unclear to me if the proposal is about adding OR support only for SAI 
indexes or for other types of queries too.
In the past, we had the nasty habit in CQL of providing only partially 
implemented features, which resulted in a bad user experience.
Some examples are:
* LIKE restrictions, which were introduced for the needs of SASI and were 
never supported for other types of queries
* IS NOT NULL restrictions for MATERIALIZED VIEWS that are not supported 
elsewhere
* != operator only supported for conditional inserts or updates
And there are unfortunately many more.

We are currently slowly trying to fix those issues and make CQL a more mature 
language. Consequently, I would like us to change our way of doing things. 
If we introduce support for OR, it should also cover all the other types of 
queries and be fully tested.
I also believe that this feature, due to its complexity, fully deserves 
its own CEP.


The current code that would be submitted for review after the CEP is adopted 
contains OR support beyond just SAI indexes. An initial implementation first 
targeted only queries where all columns in a WHERE clause using OR needed 
to be backed by an SAI index. This was since extended to also support ALLOW 
FILTERING mode as well as OR with clustering key columns. The current 
implementation is by no means a perfect general-purpose OR; the 
focus throughout was on implementing OR support in SAI. I'll leave it to 
others to enumerate exactly the limitations of the current implementation.
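As a rough conceptual model (my illustration, not SAI's actual code) of why index-backed OR was the natural first target: with an index per column, OR is answerable as a union of each index's matching keys, just as AND is an intersection, whereas OR over unindexed columns forces a filtering scan of candidate rows.

```python
# Toy model: each "index" maps a column value to the set of matching
# primary keys, so boolean predicates become set operations on postings.
rows = [
    {"pk": 1, "a": 10, "b": "x"},
    {"pk": 2, "a": 20, "b": "y"},
    {"pk": 3, "a": 10, "b": "y"},
]

index_a, index_b = {}, {}
for r in rows:
    index_a.setdefault(r["a"], set()).add(r["pk"])
    index_b.setdefault(r["b"], set()).add(r["pk"])

# WHERE a = 10 OR b = 'y'   ->  union of postings
matches_or = index_a.get(10, set()) | index_b.get("y", set())

# WHERE a = 10 AND b = 'y'  ->  intersection of postings
matches_and = index_a.get(10, set()) & index_b.get("y", set())

print(sorted(matches_or))   # [1, 2, 3]
print(sorted(matches_and))  # [3]
```

This is also why the general-purpose case is harder: without an index on every disjunct, the union cannot be computed from postings alone and the engine must fall back to filtering.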

Seeing that Benedict also supports your point of view, I would steer the 
conversation toward a project-management perspective:
* How can we advance CEP-7 so that the bulk of the SAI code can still be added 
to Cassandra, so that  users can benefit from this new index type, albeit 
without OR?
* This is also an important question from the point of view that this is a 
large block of code that will inevitably diverge if it's not in trunk. Also, 
merging it to trunk will allow future enhancements, including the OR syntax, 
to happen against trunk (aka upstream first).
* Since OR support is nevertheless a feature of SAI, it needs to be at least 
unit tested, but ideally also exposed so that it is possible to test it 
at the CQL level. Is there some mechanism such as experimental flags, which 
would allow the SAI-only OR support to be merged into trunk, while a separate 
CEP is focused on implementing "proper" general purpose OR support? I should 
note that there is no guarantee that the OR CEP would be implemented in time 
for the next release. So the answer to this point needs to be something that 
doesn't violate the desire for good user experience.

henrik




Re: Build tool

2022-02-03 Thread bened...@apache.org
+1

If we can get a pros/cons list we can have a ranked choice vote, move forwards, 
and maybe agree not to revisit this for a few years at least?


From: Joshua McKenzie 
Date: Thursday, 3 February 2022 at 13:59
To: dev 
Subject: Re: Build tool
Could someone take on clearly enumerating the pros and cons of ant vs. maven?

Without clarity this is going to keep stagnating as a war of unsubstantiated 
opinions and fizzle out like it has so many times in the past.

I'd like to see it either change or the topic be put to rest. :)


On Thu, Feb 3, 2022 at 8:38 AM Brandon Williams <dri...@gmail.com> wrote:
On Thu, Feb 3, 2022 at 7:19 AM bened...@apache.org <bened...@apache.org> wrote:
> It pretends to be Maven for dependency management, but this is a small part 
> of the job of a build file.

It doesn't pretend, it actually uses part of the Maven project to
accomplish its goals.  It's half the Maven it could be.


Re: Build tool

2022-02-03 Thread bened...@apache.org
I have been in the guts of build.xml probably a lot more than you realise.

It pretends to be Maven for dependency management, but this is a small part of 
the job of a build file.


From: Brandon Williams 
Date: Thursday, 3 February 2022 at 13:07
To: dev 
Subject: Re: Build tool
On Thu, Feb 3, 2022 at 3:23 AM bened...@apache.org  wrote:

> If we’re struggling to actually use ant how we want that’s another matter, 
> but it’s easy to forget how much just works for us with ant

If you don't regularly work on the build system, it may be easy to
forget that ant works by actually trying to be maven.  First we had
maven-ant-tasks, and then when that was deprecated we had to switch to
resolver-ant-tasks (in quite a rush at the time) so that we could do
needed releases.

Given that our ant setup is effectively a poor maven implementation, I am +1 on maven.


Re: Build tool

2022-02-03 Thread bened...@apache.org
> I took a massive productivity hit when In-JVM dtests landed the codebase

These introduced new capabilities that were well understood as outcomes at the 
time. Here we are proposing replacing something that works fine, with something 
that works equivalently.

The build file is not only consumed as a user, but also edited by everyone on 
the project. Most are familiar with it, can navigate it and edit it. This will 
be lost in any migration.

Sure, like any migration it can be justified by its benefits. I just haven’t 
seen anything specific besides that ant is “old” and people are “surprised” we 
use it. So what? It works. Some things might be painful for some folk to 
do, but that will be true in any new build system, no?



From: Paulo Motta 
Date: Thursday, 3 February 2022 at 11:53
To: Cassandra DEV 
Subject: Re: Build tool
> I am productive today with ant. I will almost certainly take a productivity 
> hit sometime during and after the migration to maven, as will others.

I took a massive productivity hit when In-JVM dtests landed the codebase, but 
this was not a consideration when that framework was added because it was 
perceived as a net gain for the project in the long term. In that case the 
productivity hit was definitely worth it, and I think in this case it will be.
I personally don't think the productivity hit of adopting a new build tool will 
be very noticeable (nothing that you can't catch up in a couple of weeks), but 
in order to not block this effort on this feeling perhaps we can make reducing 
the productivity hit an explicit goal of this undertaking (ie. enumerate 
potential productivity hits and/or make the UX look as close as possible to the 
current).

It could be helpful to come up with a concrete list of benefits and downsides 
of adopting a new build tool, to allow the community to decide whether the 
change is worth pursuing.

On Thu, 3 Feb 2022 at 07:28, bened...@apache.org <bened...@apache.org> wrote:
> Aleksei has proven that he was able to deliver work of quality and to push 
> things forward. He is willing to try to tackle that work.

I am not questioning his ability to deliver, I am questioning the value of 
burdening the project with this migration.

I am productive today with ant. I will almost certainly take a productivity hit 
sometime during and after the migration to maven, as will others.

From: Benjamin Lerer <ble...@apache.org>
Date: Thursday, 3 February 2022 at 10:13
To: dev@cassandra.apache.org
Subject: Re: Build tool
I don’t have a super strong desire to stay with ant, I just have a desire not 
to unduly burden the project with unnecessary churn. Tooling changes can be 
quite painful.

Aleksei has proven that he was able to deliver work of quality and to push 
things forward. He is willing to try to tackle that work. I am in favor of 
giving him the chance to do it.
I fully agree that it is not a simple task but I am also convinced that other 
people might be interested in joining the effort.

With regards to contributions, this is often brought up but the reality is the 
project has always struggled to bring in new ongoing contributors, in large 
part due to the barrier to entry of such a complex project (which has only 
grown as our expectations on patch quality have gone up). I struggle to believe 
that ANT is more than a rounding error on our efficacy here, since we have 
always struggled.

I think we found some way around the barrier to entry :-). I will trigger 
another discussion about that.

On Thu, 3 Feb 2022 at 10:17, bened...@apache.org <bened...@apache.org> wrote:
I don’t have a super strong desire to stay with ant, I just have a desire not 
to unduly burden the project with unnecessary churn. Tooling changes can be 
quite painful.

With regards to contributions, this is often brought up but the reality is the 
project has always struggled to bring in new ongoing contributors, in large 
part due to the barrier to entry of such a complex project (which has only 
grown as our expectations on patch quality have gone up). I struggle to believe 
that ANT is more than a rounding error on our efficacy here, since we have 
always struggled.

If we’re struggling to actually use ant how we want that’s another matter, but 
it’s easy to forget how much just works for us with ant, and forget the things 
we will have pain with adopting a new build system. I have had more frustration 
with Gradle in a few months than I have with ant in a decade. I’m sure Maven is 
better, but I doubt it will be without issue.


From: Benjamin Lerer <b.le...@gmail.com>
Date: Thursday, 3 February 2022 at 09:03
To: dev@cassandra.apache.org
Subject: Re: Build tool

Re: Build tool

2022-02-03 Thread bened...@apache.org
> Aleksei has proven that he was able to deliver work of quality and to push 
> things forward. He is willing to try to tackle that work.

I am not questioning his ability to deliver, I am questioning the value of 
burdening the project with this migration.

I am productive today with ant. I will almost certainly take a productivity hit 
sometime during and after the migration to maven, as will others.

From: Benjamin Lerer 
Date: Thursday, 3 February 2022 at 10:13
To: dev@cassandra.apache.org 
Subject: Re: Build tool
I don’t have a super strong desire to stay with ant, I just have a desire not 
to unduly burden the project with unnecessary churn. Tooling changes can be 
quite painful.

Aleksei has proven that he was able to deliver work of quality and to push 
things forward. He is willing to try to tackle that work. I am in favor of 
giving him the chance to do it.
I fully agree that it is not a simple task but I am also convinced that other 
people might be interested in joining the effort.

With regards to contributions, this is often brought up but the reality is the 
project has always struggled to bring in new ongoing contributors, in large 
part due to the barrier to entry of such a complex project (which has only 
grown as our expectations on patch quality have gone up). I struggle to believe 
that ANT is more than a rounding error on our efficacy here, since we have 
always struggled.

I think we found some way around the barrier to entry :-). I will trigger 
another discussion about that.

On Thu, 3 Feb 2022 at 10:17, bened...@apache.org <bened...@apache.org> wrote:
I don’t have a super strong desire to stay with ant, I just have a desire not 
to unduly burden the project with unnecessary churn. Tooling changes can be 
quite painful.

With regards to contributions, this is often brought up but the reality is the 
project has always struggled to bring in new ongoing contributors, in large 
part due to the barrier to entry of such a complex project (which has only 
grown as our expectations on patch quality have gone up). I struggle to believe 
that ANT is more than a rounding error on our efficacy here, since we have 
always struggled.

If we’re struggling to actually use ant how we want that’s another matter, but 
it’s easy to forget how much just works for us with ant, and forget the things 
we will have pain with adopting a new build system. I have had more frustration 
with Gradle in a few months than I have with ant in a decade. I’m sure Maven is 
better, but I doubt it will be without issue.


From: Benjamin Lerer <b.le...@gmail.com>
Date: Thursday, 3 February 2022 at 09:03
To: dev@cassandra.apache.org
Subject: Re: Build tool
I think that there are 2 main issues (Aleksei can correct me):
* ANT is pretty old, and a lot of newcomers are unfamiliar with it and surprised 
by it. As a consequence, it might slow down the onboarding of newcomers, which 
we want to make as smooth as possible.
* Aleksei has been working on migrating our tests to JUnit 5 and faced multiple 
issues with ANT. He contributed five new features to the ANT project to fix the 
problems he encountered, and some were rejected.

I totally agree with your feeling that the current solution works for now and 
that staying with it is also a valid choice. I do like ANT. The question for me 
is really whether ANT makes sense for the future of Cassandra. From the feedback 
I got, I am starting to doubt that it does.

On Thu, 3 Feb 2022 at 09:32, bened...@apache.org <bened...@apache.org> wrote:
I’m going to be a killjoy and once again query what value changing build system 
brings, that outweighs the disruption to current long-term contributors that 
can easily get things done today?

At the very least there should be a ranked choice vote that includes today’s 
build system.

From: Maulin Vasavada <maulin.vasav...@gmail.com>
Date: Thursday, 3 February 2022 at 05:52
To: dev@cassandra.apache.org
Subject: Re: Build tool
Hi Aleksei

I was thinking about the same - build tool. I have used both Maven and 
Gradle. In my experience, while Gradle has a rich DSL and the corresponding 
power, with constant changes in Gradle across versions it is difficult to focus 
on developing the actual product (like Cassandra in this case). With Maven the 
learning happens once, it doesn't change much, and one can focus on the 
actual product better.

Of course, this is IMHO. +1 for using Maven. I would like to participate in the 
migration of the build tool if it needs more hands.

Thanks
Maulin

On Wed, Feb 2, 2022 at 2:35 PM Aleksei Zotov <azotc...@apache.org> wrote:
Hi All,

Some time ago I created https://issues.apache.org/jira/browse/CASSANDRA-17015 

Re: Build tool

2022-02-03 Thread bened...@apache.org
I don’t have a super strong desire to stay with ant, I just have a desire not 
to unduly burden the project with unnecessary churn. Tooling changes can be 
quite painful.

With regards to contributions, this is often brought up but the reality is the 
project has always struggled to bring in new ongoing contributors, in large 
part due to the barrier to entry of such a complex project (which has only 
grown as our expectations on patch quality have gone up). I struggle to believe 
that ANT is more than a rounding error on our efficacy here, since we have 
always struggled.

If we’re struggling to actually use ant how we want that’s another matter, but 
it’s easy to forget how much just works for us with ant, and forget the things 
we will have pain with adopting a new build system. I have had more frustration 
with Gradle in a few months than I have with ant in a decade. I’m sure Maven is 
better, but I doubt it will be without issue.


From: Benjamin Lerer 
Date: Thursday, 3 February 2022 at 09:03
To: dev@cassandra.apache.org 
Subject: Re: Build tool
I think that there are 2 main issues (Aleksei can correct me):
* ANT is pretty old, and a lot of newcomers are unfamiliar with it and surprised 
by it. As a consequence, it might slow down the onboarding of newcomers, which 
we want to make as smooth as possible.
* Aleksei has been working on migrating our tests to JUnit 5 and faced multiple 
issues with ANT. He contributed five new features to the ANT project to fix the 
problems he encountered, and some were rejected.

I totally agree with your feeling that the current solution works for now and 
that staying with it is also a valid choice. I do like ANT. The question for me 
is really whether ANT makes sense for the future of Cassandra. From the feedback 
I got, I am starting to doubt that it does.

On Thu, 3 Feb 2022 at 09:32, bened...@apache.org <bened...@apache.org> wrote:
I’m going to be a killjoy and once again query what value changing build system 
brings, that outweighs the disruption to current long-term contributors that 
can easily get things done today?

At the very least there should be a ranked choice vote that includes today’s 
build system.

From: Maulin Vasavada <maulin.vasav...@gmail.com>
Date: Thursday, 3 February 2022 at 05:52
To: dev@cassandra.apache.org
Subject: Re: Build tool
Hi Aleksei

I was thinking about the same - build tool. I have used both Maven and 
Gradle. In my experience, while Gradle has a rich DSL and the corresponding 
power, with constant changes in Gradle across versions it is difficult to focus 
on developing the actual product (like Cassandra in this case). With Maven the 
learning happens once, it doesn't change much, and one can focus on the 
actual product better.

Of course, this is IMHO. +1 for using Maven. I would like to participate in the 
migration of the build tool if it needs more hands.

Thanks
Maulin

On Wed, Feb 2, 2022 at 2:35 PM Aleksei Zotov <azotc...@apache.org> wrote:
Hi All,

Some time ago I created https://issues.apache.org/jira/browse/CASSANDRA-17015 
to migrate from ant to maven/gradle. Originally I was going to implement both, 
compare and pick the best in terms of project needs. However, now I feel it 
would be a significant overhead to try out both. Therefore, I'd like to make a 
collective decision on the build tool before starting any actual work.

I saw on Slack 
(https://app.slack.com/client/T4S1WH2J3/CK23JSY2K/thread/CK23JSY2K-1643748908.929809)
 that many people prefer maven. I'm leaning towards maven as well.

I guess we need to have a formal poll on the build tool since it is a 
significant part of the project. Please suggest what the best way to proceed 
is. Should I just raise a vote for maven and see if someone votes -1 in favor 
of gradle?

PS:
Please, bear in mind that Robert has already made some progress on gradle 
migration. I do not know how much is done there and whether he is willing to 
get it completed.

On 2020/06/02 13:39:34 Robert Stupp wrote:
> Yea - it's already in a pretty good state.
>
> Some work-in-progress-state is already available in either
> https://github.com/snazy/cassandra/tree/tryout-gradle (or
> https://github.com/snazy/cassandra/tree/tryout-gradle-dist-test with an
> additional commit).
>
> I already use it on my machine for a bunch of things and it already
> "feels bad" to go back to a branch without Gradle.
>
> I'll start a separate dev-ML thread with some more information in the
> next days, because getting C* 4.0-beta released is a higher priority atm.
>
> On 6/1/20 2:41 AM, Joshua McKenzie wrote:
> > Build tools are like religions, that's why. Or maybe cults. Or all
> > Stockholm Syndrome creators? :)
> >
> > Robert Stupp has been noodling around with a gradle based 

Re: Build tool

2022-02-03 Thread bened...@apache.org
I’m going to be a killjoy and once again query what value changing build system 
brings, that outweighs the disruption to current long-term contributors that 
can easily get things done today?

At the very least there should be a ranked choice vote that includes today’s 
build system.

From: Maulin Vasavada 
Date: Thursday, 3 February 2022 at 05:52
To: dev@cassandra.apache.org 
Subject: Re: Build tool
Hi Aleksei

I was thinking about the same - build tool. I have used both Maven and 
Gradle. In my experience, while Gradle has a rich DSL and the corresponding 
power, with constant changes in Gradle across versions it is difficult to focus 
on developing the actual product (like Cassandra in this case). With Maven the 
learning happens once, it doesn't change much, and one can focus on the 
actual product better.

Of course, this is IMHO. +1 for using Maven. I would like to participate in the 
migration of the build tool if it needs more hands.

Thanks
Maulin

On Wed, Feb 2, 2022 at 2:35 PM Aleksei Zotov <azotc...@apache.org> wrote:
Hi All,

Some time ago I created https://issues.apache.org/jira/browse/CASSANDRA-17015 
to migrate from ant to maven/gradle. Originally I was going to implement both, 
compare and pick the best in terms of project needs. However, now I feel it 
would be a significant overhead to try out both. Therefore, I'd like to make a 
collective decision on the build tool before starting any actual work.

I saw on Slack 
(https://app.slack.com/client/T4S1WH2J3/CK23JSY2K/thread/CK23JSY2K-1643748908.929809)
 that many people prefer maven. I'm leaning towards maven as well.

I guess we need to have a formal poll on the build tool since it is a 
significant part of the project. Please suggest what the best way to proceed 
is. Should I just raise a vote for maven and see if someone votes -1 in favor 
of gradle?

PS:
Please, bear in mind that Robert has already made some progress on gradle 
migration. I do not know how much is done there and whether he is willing to 
get it completed.

On 2020/06/02 13:39:34 Robert Stupp wrote:
> Yea - it's already in a pretty good state.
>
> Some work-in-progress-state is already available in either
> https://github.com/snazy/cassandra/tree/tryout-gradle (or
> https://github.com/snazy/cassandra/tree/tryout-gradle-dist-test with an
> additional commit).
>
> I already use it on my machine for a bunch of things and it already
> "feels bad" to go back to a branch without Gradle.
>
> I'll start a separate dev-ML thread with some more information in the
> next days, because getting C* 4.0-beta released is a higher priority atm.
>
> On 6/1/20 2:41 AM, Joshua McKenzie wrote:
> > Build tools are like religions, that's why. Or maybe cults. Or all
> > Stockholm Syndrome creators? :)
> >
> > Robert Stupp has been noodling around with a gradle based build env for C*
> > that'll live alongside ant. Not sure what the status is on that atm though.
> >
> > On Sun, May 31, 2020 at 3:16 PM Abhishek Singh 
> > <abh23...@gmail.com> wrote:
> >
> >> Hi All,
> >>Hope you are doing well and are safe.
> >>   I just wanted to know why is the build still on ant and is there any plan
> >> to migrate to a modern build tool?
> >>
> >> Regards,
> >> Abhishek Singh
> >>
> --
> Robert Stupp
> @snazy
>
>
> -
> To unsubscribe, e-mail: 
> dev-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: 
> dev-h...@cassandra.apache.org
>
>


Re: Have we considered static type checking for our python libs?

2022-01-26 Thread bened...@apache.org
Python execution is obviously not a bottleneck, but cluster startup/shutdown 
and lengthy waits due to lack of visibility to state changes in the tests quite 
probably are. The python tests are also IME quite rudimentary compared to their 
java equivalents, and more brittle, due to worse tooling for e.g. data 
generation, interfering with message delivery etc.

They are also comparatively poorly maintained, due to most on the project not 
wanting to touch them beyond what is necessary. It would be hugely beneficial 
to migrate to a platform the majority uses, and that can produce more powerful 
tests that are also able to execute faster (for reasons of integration, not 
language).


From: Brandon Williams 
Date: Wednesday, 26 January 2022 at 16:09
To: dev 
Subject: Re: Have we considered static type checking for our python libs?
On Wed, Jan 26, 2022 at 7:43 AM bened...@apache.org wrote:
> I might even venture to predict that it might pay off with lower development 
> overhead, as we can run our tests much more quickly, and debug failures much 
> more easily.

I don't think in practice these will happen at all, let alone 'much
more.'  Python execution is nowhere near the bottleneck, not that
either of these would speed it up significantly.  I'm unable to think
of an instance where typing in python could have helped me, at least
in the dtest tickets I've worked on.  Maybe someone with more
experience has a different estimation?


Re: Have we considered static type checking for our python libs?

2022-01-26 Thread bened...@apache.org
I don’t think this support would be very hard to add, if this is the only 
stumbling block.

From: Andrés de la Peña 
Date: Wednesday, 26 January 2022 at 14:10
To: dev@cassandra.apache.org 
Subject: Re: Have we considered static type checking for our python libs?
Last time I ported dtests during the 4.0 quality test epic there wasn't support 
for virtual nodes in in-jvm dtests. We have many Python dtests depending on 
vnodes that can't be totally ported if we still don't have support for vnodes, 
I don't know if it's still the case.

On Wed, 26 Jan 2022 at 14:02, Joshua McKenzie 
<jmcken...@apache.org> wrote:
Could be a very fruitful source of LHF tickets to highlight in the biweekly 
email and would be pretty trivial to integrate this into the build lead role 
(getting an epic and jira tickets created to port tests over, etc).

we can run our tests much more quickly, and debug failures much more easily.
Please Yes. If we can get away from python upgrade tests I think all our lives 
would be improved.

I like it.


On Wed, Jan 26, 2022 at 8:42 AM bened...@apache.org wrote:
We could set this as a broad goal of the project, and like the build lead role 
could each volunteer to adopt a test every X weeks. We would have migrated in 
no time, I expect, with this kind of concerted effort, and might not even 
notice a significant penalty to other ongoing work.

Last time I ported a dtest it was a very easy thing to do.

I might even venture to predict that it might pay off with lower development 
overhead, as we can run our tests much more quickly, and debug failures much 
more easily.

From: Joshua McKenzie <jmcken...@apache.org>
Date: Wednesday, 26 January 2022 at 13:40
To: dev <dev@cassandra.apache.org>
Subject: Re: Have we considered static type checking for our python libs?
I have yet to encounter this class of problem in the dtests.
It's more about development velocity and convenience than about preventing 
defects in our case, since we're not abusing duck-typing everywhere. Every time 
I have to work on python dtests (for instance, when doing build lead work and 
looking at flaky tests) it's a little irritating and I think of this.

 I would hate to expend loads of effort modernising them when the same effort 
could see them superseded by much better versions of the same test.
I completely agree, however this is something someone would have to take on as 
an effort and I don't believe I've seen anybody step up yet. At the current 
rate we're going to be dragging along the python dtests into perpetuity.


On Wed, Jan 26, 2022 at 8:16 AM bened...@apache.org wrote:
I was sort of hoping we would retire the python dtests before long, at least in 
large part (probably not ever entirely, but 99%).

I think many of them could be migrated to in-jvm dtests without much effort. I 
would hate to expend loads of effort modernising them when the same effort 
could see them superseded by much better versions of the same test.


From: Joshua McKenzie <jmcken...@apache.org>
Date: Wednesday, 26 January 2022 at 12:59
To: dev <dev@cassandra.apache.org>
Subject: Have we considered static type checking for our python libs?
Relevant links:
1) Optional static typing for python: 
https://docs.python.org/3/library/typing.html
2) Mypy static type checker for python: https://github.com/python/mypy

So the question - has anyone given any serious thought to introducing type 
hints and a static type checker in ccm and python dtests? A search on dev 
ponymail doesn't turn up anything.

I've used it pretty extensively in the past and found it incredibly helpful 
combined with other linters in surfacing troublesome edge cases, and also found 
it accelerated development quite a bit.

Any thoughts on the topic for or against?

~Josh


Re: Have we considered static type checking for our python libs?

2022-01-26 Thread bened...@apache.org
We could set this as a broad goal of the project, and like the build lead role 
could each volunteer to adopt a test every X weeks. We would have migrated in 
no time, I expect, with this kind of concerted effort, and might not even 
notice a significant penalty to other ongoing work.

Last time I ported a dtest it was a very easy thing to do.

I might even venture to predict that it might pay off with lower development 
overhead, as we can run our tests much more quickly, and debug failures much 
more easily.

From: Joshua McKenzie 
Date: Wednesday, 26 January 2022 at 13:40
To: dev 
Subject: Re: Have we considered static type checking for our python libs?
I have yet to encounter this class of problem in the dtests.
It's more about development velocity and convenience than about preventing 
defects in our case, since we're not abusing duck-typing everywhere. Every time 
I have to work on python dtests (for instance, when doing build lead work and 
looking at flaky tests) it's a little irritating and I think of this.

 I would hate to expend loads of effort modernising them when the same effort 
could see them superseded by much better versions of the same test.
I completely agree, however this is something someone would have to take on as 
an effort and I don't believe I've seen anybody step up yet. At the current 
rate we're going to be dragging along the python dtests into perpetuity.


On Wed, Jan 26, 2022 at 8:16 AM bened...@apache.org wrote:
I was sort of hoping we would retire the python dtests before long, at least in 
large part (probably not ever entirely, but 99%).

I think many of them could be migrated to in-jvm dtests without much effort. I 
would hate to expend loads of effort modernising them when the same effort 
could see them superseded by much better versions of the same test.


From: Joshua McKenzie <jmcken...@apache.org>
Date: Wednesday, 26 January 2022 at 12:59
To: dev <dev@cassandra.apache.org>
Subject: Have we considered static type checking for our python libs?
Relevant links:
1) Optional static typing for python: 
https://docs.python.org/3/library/typing.html
2) Mypy static type checker for python: https://github.com/python/mypy

So the question - has anyone given any serious thought to introducing type 
hints and a static type checker in ccm and python dtests? A search on dev 
ponymail doesn't turn up anything.

I've used it pretty extensively in the past and found it incredibly helpful 
combined with other linters in surfacing troublesome edge cases, and also found 
it accelerated development quite a bit.

Any thoughts on the topic for or against?

~Josh


Re: Have we considered static type checking for our python libs?

2022-01-26 Thread bened...@apache.org
I was sort of hoping we would retire the python dtests before long, at least in 
large part (probably not ever entirely, but 99%).

I think many of them could be migrated to in-jvm dtests without much effort. I 
would hate to expend loads of effort modernising them when the same effort 
could see them superseded by much better versions of the same test.


From: Joshua McKenzie 
Date: Wednesday, 26 January 2022 at 12:59
To: dev 
Subject: Have we considered static type checking for our python libs?
Relevant links:
1) Optional static typing for python: 
https://docs.python.org/3/library/typing.html
2) Mypy static type checker for python: https://github.com/python/mypy

So the question - has anyone given any serious thought to introducing type 
hints and a static type checker in ccm and python dtests? A search on dev 
ponymail doesn't turn up anything.

I've used it pretty extensively in the past and found it incredibly helpful 
combined with other linters in surfacing troublesome edge cases, and also found 
it accelerated development quite a bit.

Any thoughts on the topic for or against?

~Josh
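To make the suggestion concrete, here is a minimal sketch of the kind of bug mypy surfaces once type hints are in place. The helper names below are invented for illustration; they are not real ccm or dtest APIs.

```python
from typing import Dict, Optional

def get_node_pid(node_name: str, pids: Dict[str, int]) -> Optional[int]:
    """Return the pid recorded for a node, or None if the node is unknown."""
    return pids.get(node_name)

def stop_node(pid: int) -> str:
    """Build the command used to hard-stop a node."""
    return f"kill -9 {pid}"

pids = {"node1": 1234}
maybe_pid = get_node_pid("node2", pids)

# Passing maybe_pid straight to stop_node() would only fail at runtime on the
# occasions it happens to be None; with the annotations above, mypy rejects it
# statically (incompatible type "Optional[int]"; expected "int"), forcing the
# None-check below.
if maybe_pid is not None:
    print(stop_node(maybe_pid))
```

This is the "surfacing troublesome edge cases" effect: duck-typed code defers the None crash to whichever flaky run hits it, while the checker flags it on every run of `mypy`.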


Re: [DISCUSS] Releasable trunk and quality

2022-01-06 Thread bened...@apache.org
So, one advantage of merge commits is that review of each branch is potentially 
easier, as the merge commit helps direct the reviewer’s attention. However, in 
my experience most of the focus during review is to the main branch anyway. 
Having separate tickets to track backports, and permitting them to occur out of 
band, could improve the quality of review. We can also likely synthesise the 
merge commits for the purpose of review using something like

git checkout patch-4.0~1          # start from the parent of the 4.0 patch
git checkout -b patch-4.0-review  # throwaway branch for the review
git merge -s ours patch-4.1~1     # record 4.1's prior history, keeping our tree
git merge --no-commit patch-4.1   # begin merging the 4.1 patch, without committing
git checkout patch-4.0 .          # take the file contents from the 4.0 patch
git commit                        # the synthesised merge commit to review


From: bened...@apache.org 
Date: Wednesday, 5 January 2022 at 21:07
To: Mick Semb Wever 
Cc: dev 
Subject: Re: [DISCUSS] Releasable trunk and quality

> If you see a merge commit in the history, isn't it normal to presume that it 
> will contain the additional change for that branch for the parent commit 
> getting merged in?

Sure, but it is exceptionally non-trivial to treat the work as a single diff in 
any standard UX. In practice it becomes 3 or 4 diffs, none of which tell the 
whole story (and all of which bleed legacy details of earlier branches). This 
is a genuine significant pain point when doing archaeology, something I and 
others do quite frequently, and a large part of why I want to see them gone.

> Folk forget to pull, rebase, then go to push and realise one of their patches 
> on a branch needs rebasing and rework. That rework may make them reconsider 
> the patch going to the other branches too.

Conversely, this is exceptionally painful when maintaining branches and forks, 
and I can attest to the pain of maintaining these branches so they may be 
committed atomically to having wasted literal person-weeks in my time on the 
project. I do not recall experiencing a significant benefit in return.

> do i have to start text searching the git history

Yes. This is so simple as to be a non-issue - surely you must search git log 
semi-regularly? It is a frequent part of the job of developing against the 
project in my experience.

> Developing patch on hardest branch first, then working on each softer branch. 
> I don't know how important this is, but I have found it a useful practice 
> that encourages smaller, more precise patches overall.

I don’t think this strategy determines which branch is first developed against. 
However, if it did, it would seem to me to be a clear mark against the current 
system, which incentivises fully developing against the oldest version before 
forward-porting its entirety. Developing primarily against the most recent 
branch incentivises back-porting more minimal versions of the work, once the 
scope of the work is fully understood.





Re: [DISCUSS] Releasable trunk and quality

2022-01-05 Thread bened...@apache.org
I think simple, consistent, reliable and unavoidable are *the* killer features 
for QA. All features (give or take) of the industry standard approach of using 
CI hooks to gate PR merge.

From: Joshua McKenzie 
Date: Wednesday, 5 January 2022 at 14:53
To: dev 
Subject: Re: [DISCUSS] Releasable trunk and quality
A wise man once said "Simple is a feature" ;)

Our current process (commit oldest, merge up or merge -s ours w/ --amend):
- is confusing for new contributors to understand
- hides changes inside the merge commit
- masks future ability to see things with git attribute on commits
- is exposed to race w/other committers across multiple branches requiring 
--atomic
- is non-automatable requiring human intervention and prone to error
- prevents us from using industry standard tooling and workflows around CI thus 
contributing to CI degrading over time
+ Helps enforce that we don't forget to apply something to all branches
+(?) Is the devil we know

That's a lot of negatives for a very fixable single positive and some FUD.

On Tue, Jan 4, 2022 at 7:01 PM bened...@apache.org wrote:
To answer your point, I don’t have anything ideologically against a temporary 
divergence in treatment, but we should have a clear unified endpoint we are 
aiming for.

I would hate for this discussion to end without a clear answer about what that 
endpoint should be, though - even if we don’t get there immediately.

I personally dislike the idea of relying on scripts to enforce this, at least 
in the long run, as there is no uniformity of environment, so no uniformity of 
process, and when things go wrong due to diverging systems we’re creating 
additional work for people (and CI is headache enough when it goes wrong).


From: bened...@apache.org
Date: Tuesday, 4 January 2022 at 23:52
To: David Capwell <dcapw...@apple.com>, Joshua McKenzie <jmcken...@apache.org>
Cc: Henrik Ingo <henrik.i...@datastax.com>, dev <dev@cassandra.apache.org>
Subject: Re: [DISCUSS] Releasable trunk and quality
That all sounds terribly complicated to me.

My view is that we should switch to the branch strategy outlined by Henrik (I 
happen to prefer it anyway) and move to GitHub integrations to control merge 
for each branch independently. Simples.


From: David Capwell <dcapw...@apple.com>
Date: Tuesday, 4 January 2022 at 23:33
To: Joshua McKenzie <jmcken...@apache.org>
Cc: Henrik Ingo <henrik.i...@datastax.com>, dev <dev@cassandra.apache.org>
Subject: Re: [DISCUSS] Releasable trunk and quality
The more I think on it, the more I am anyway strongly -1 on having some 
bifurcated commit process. We should decide on a uniform commit process for the 
whole project, for all patches, whatever that may be.

Making the process stable and handle all the random things we need to handle 
takes a lot of time, for that reason I strongly feel we should start with trunk 
only and look to expand to other branches and/or handle multi-branch commits.  
I agree that each branch should NOT have a different process, but feel it's ok 
if we are evolving what the process should be.

About the merge commit thing, we can automate (think Josh wants to OSS my 
script) the current process so this isn’t a blocker for automation; the thing I 
hate about it is that I have not found any tool able to understand our history, 
so it forces me to go to CLI to figure out how the merge actually changed 
things (only the smallest version can be displayed properly), I am 100% in 
favor of removing, but don’t think it's a dependency on automating our merge 
process.



On Jan 4, 2022, at 11:58 AM, Joshua McKenzie 
<jmcken...@apache.org> wrote:

I put together a draft confluence wiki page (login required) for the Build Lead 
role covering what we discussed in the thread here. Link: 
https://cwiki.apache.org/confluence/pages/resumedraft.action?draftId=199527692=96dfa1ef-d927-427a-bff8-0cf711c790c9;

The only potentially controversial thing in there is text under what to do with 
a consistent test failure introduced by a diff to trunk: "If consistent, git 
revert the SHA that introduced the failure, re-open the original JIRA ticket, 
and leave a note for the original assignee about the breakage they introduced".

This would apply only to patches to trunk that introduce consistent failures to 
a test clearly attributable to that patch.

I am deferring on the topic of merge strategy as there's a lot of progress we 
can make without considering that more controversial topic yet.

On Tue, Dec 21, 2021 at 9:02 AM Henrik Ingo 
<henrik.i...@datastax.com> wrote:
FWIW, I thought I could link to an example MongoDB commit:

https://github.com/mongodb/mongo/commit/dec388494b652488259072cf61fd987af3fa8470

* Fixes start

Re: [DISCUSS] Releasable trunk and quality

2022-01-05 Thread bened...@apache.org

> If you see a merge commit in the history, isn't it normal to presume that it 
> will contain the additional change for that branch for the parent commit 
> getting merged in?

Sure, but it is exceptionally non-trivial to treat the work as a single diff in 
any standard UX. In practice it becomes 3 or 4 diffs, none of which tell the 
whole story (and all of which bleed legacy details of earlier branches). This 
is a genuine significant pain point when doing archaeology, something I and 
others do quite frequently, and a large part of why I want to see them gone.

> Folk forget to pull, rebase, then go to push and realise one of their patches 
> on a branch needs rebasing and rework. That rework may make them reconsider 
> the patch going to the other branches too.

Conversely, this is exceptionally painful when maintaining branches and forks, 
and I can attest to the pain of maintaining these branches so they may be 
committed atomically to having wasted literal person-weeks in my time on the 
project. I do not recall experiencing a significant benefit in return.

> do i have to start text searching the git history

Yes. This is so simple as to be a non-issue - surely you must search git log 
semi-regularly? It is a frequent part of the job of developing against the 
project in my experience.

> Developing patch on hardest branch first, then working on each softer branch. 
> I don't know how important this is, but I have found it a useful practice 
> that encourages smaller, more precise patches overall.

I don’t think this strategy determines which branch is first developed against. 
However, if it did, it would seem to me to be a clear mark against the current 
system, which incentivises fully developing against the oldest version before 
forward-porting its entirety. Developing primarily against the most recent 
branch incentivises back-porting more minimal versions of the work, once the 
scope of the work is fully understood.





Re: [DISCUSS] Releasable trunk and quality

2022-01-04 Thread bened...@apache.org
To answer your point, I don’t have anything ideologically against a temporary 
divergence in treatment, but we should have a clear unified endpoint we are 
aiming for.

I would hate for this discussion to end without a clear answer about what that 
endpoint should be, though - even if we don’t get there immediately.

I personally dislike the idea of relying on scripts to enforce this, at least 
in the long run, as there is no uniformity of environment, so no uniformity of 
process, and when things go wrong due to diverging systems we’re creating 
additional work for people (and CI is headache enough when it goes wrong).


From: bened...@apache.org 
Date: Tuesday, 4 January 2022 at 23:52
To: David Capwell , Joshua McKenzie 
Cc: Henrik Ingo , dev 
Subject: Re: [DISCUSS] Releasable trunk and quality
That all sounds terribly complicated to me.

My view is that we should switch to the branch strategy outlined by Henrik (I 
happen to prefer it anyway) and move to GitHub integrations to control merge 
for each branch independently. Simples.


From: David Capwell 
Date: Tuesday, 4 January 2022 at 23:33
To: Joshua McKenzie 
Cc: Henrik Ingo , dev 
Subject: Re: [DISCUSS] Releasable trunk and quality
The more I think on it, the more I am anyway strongly -1 on having some 
bifurcated commit process. We should decide on a uniform commit process for the 
whole project, for all patches, whatever that may be.

Making the process stable and handle all the random things we need to handle 
takes a lot of time, for that reason I strongly feel we should start with trunk 
only and look to expand to other branches and/or handle multi-branch commits.  
I agree that each branch should NOT have a different process, but feel it's ok 
if we are evolving what the process should be.

About the merge commit thing, we can automate (think Josh wants to OSS my 
script) the current process so this isn’t a blocker for automation; the thing I 
hate about it is that I have not found any tool able to understand our history, 
so it forces me to go to CLI to figure out how the merge actually changed 
things (only the smallest version can be displayed properly), I am 100% in 
favor of removing, but don’t think it's a dependency on automating our merge 
process.




On Jan 4, 2022, at 11:58 AM, Joshua McKenzie 
<jmcken...@apache.org> wrote:

I put together a draft confluence wiki page (login required) for the Build Lead 
role covering what we discussed in the thread here. Link: 
https://cwiki.apache.org/confluence/pages/resumedraft.action?draftId=199527692=96dfa1ef-d927-427a-bff8-0cf711c790c9;

The only potentially controversial thing in there is text under what to do with 
a consistent test failure introduced by a diff to trunk: "If consistent, git 
revert the SHA that introduced the failure, re-open the original JIRA ticket, 
and leave a note for the original assignee about the breakage they introduced".

This would apply only to patches to trunk that introduce consistent failures to 
a test clearly attributable to that patch.

I am deferring on the topic of merge strategy as there's a lot of progress we 
can make without considering that more controversial topic yet.

On Tue, Dec 21, 2021 at 9:02 AM Henrik Ingo 
<henrik.i...@datastax.com> wrote:
FWIW, I thought I could link to an example MongoDB commit:

https://github.com/mongodb/mongo/commit/dec388494b652488259072cf61fd987af3fa8470

* Fixes start from trunk or whatever is the highest version that includes the 
bug
* It is then cherry picked to each stable version that needs the fix. Above link 
is an example of such a cherry pick. The original sha is referenced in the 
commit message.
* I found that it makes sense to always cherry pick from the immediate higher 
version, since if you had to make some changes to the previous commit, they 
probably need to be in the next one as well.
* There are no merge commits. Everything is always cherry picked or rebased to 
the top of a branch.
* Since this was mentioned, MongoDB indeed tracks the cherry picking process 
explicitly: The original SERVER ticket is closed when fix is committed to trunk 
branch. However, new BACKPORT tickets are created and linked to the SERVER 
ticket, one per stable version that will need a cherry-pick. This way 
backporting the fix is never forgotten, as the team can just track open BACKPORT 
tickets and work on them to close them.

henrik

On Tue, Dec 14, 2021 at 8:53 PM Joshua McKenzie 
<jmcken...@apache.org> wrote:
>
> I like a change originating from just one commit, and having tracking
> visible across the branches. This gives you immediate information about
> where and how the change was applied without having to go to the jira
> ticket (and relying on it being accurate)

I have the exact opposite experience right now (though this may be a
shortcoming of my env / workflow). When I'm showing annotations in intellij
and I see walls of merge commits as commit mes

Re: [DISCUSS] Releasable trunk and quality

2022-01-04 Thread bened...@apache.org
That all sounds terribly complicated to me.

My view is that we should switch to the branch strategy outlined by Henrik (I 
happen to prefer it anyway) and move to GitHub integrations to control merge 
for each branch independently. Simples.


From: David Capwell 
Date: Tuesday, 4 January 2022 at 23:33
To: Joshua McKenzie 
Cc: Henrik Ingo , dev 
Subject: Re: [DISCUSS] Releasable trunk and quality
The more I think on it, the more I am anyway strongly -1 on having some 
bifurcated commit process. We should decide on a uniform commit process for the 
whole project, for all patches, whatever that may be.

Making the process stable and handle all the random things we need to handle 
takes a lot of time, for that reason I strongly feel we should start with trunk 
only and look to expand to other branches and/or handle multi-branch commits.  
I agree that each branch should NOT have a different process, but feel it's ok 
if we are evolving what the process should be.

About the merge commit thing, we can automate (think Josh wants to OSS my 
script) the current process so this isn’t a blocker for automation; the thing I 
hate about it is that I have not found any tool able to understand our history, 
so it forces me to go to CLI to figure out how the merge actually changed 
things (only the smallest version can be displayed properly), I am 100% in 
favor of removing, but don't think it's a dependency on automating our merge 
process.



On Jan 4, 2022, at 11:58 AM, Joshua McKenzie 
<jmcken...@apache.org> wrote:

I put together a draft confluence wiki page (login required) for the Build Lead 
role covering what we discussed in the thread here. Link: 
https://cwiki.apache.org/confluence/pages/resumedraft.action?draftId=199527692=96dfa1ef-d927-427a-bff8-0cf711c790c9;

The only potentially controversial thing in there is text under what to do with 
a consistent test failure introduced by a diff to trunk: "If consistent, git 
revert the SHA that introduced the failure, re-open the original JIRA ticket, 
and leave a note for the original assignee about the breakage they introduced".

This would apply only to patches to trunk that introduce consistent failures to 
a test clearly attributable to that patch.

I am deferring on the topic of merge strategy as there's a lot of progress we 
can make without considering that more controversial topic yet.

On Tue, Dec 21, 2021 at 9:02 AM Henrik Ingo 
<henrik.i...@datastax.com> wrote:
FWIW, I thought I could link to an example MongoDB commit:

https://github.com/mongodb/mongo/commit/dec388494b652488259072cf61fd987af3fa8470

* Fixes start from trunk or whatever is the highest version that includes the 
bug
* It is then cherry picked to each stable version that needs the fix. Above link 
is an example of such a cherry pick. The original sha is referenced in the 
commit message.
* I found that it makes sense to always cherry pick from the immediate higher 
version, since if you had to make some changes to the previous commit, they 
probably need to be in the next one as well.
* There are no merge commits. Everything is always cherry picked or rebased to 
the top of a branch.
* Since this was mentioned, MongoDB indeed tracks the cherry picking process 
explicitly: The original SERVER ticket is closed when fix is committed to trunk 
branch. However, new BACKPORT tickets are created and linked to the SERVER 
ticket, one per stable version that will need a cherry-pick. This way 
backporting the fix is never forgotten, as the team can just track open BACKPORT 
tickets and work on them to close them.

henrik

On Tue, Dec 14, 2021 at 8:53 PM Joshua McKenzie 
<jmcken...@apache.org> wrote:
>
> I like a change originating from just one commit, and having tracking
> visible across the branches. This gives you immediate information about
> where and how the change was applied without having to go to the jira
> ticket (and relying on it being accurate)

I have the exact opposite experience right now (though this may be a
shortcoming of my env / workflow). When I'm showing annotations in intellij
and I see walls of merge commits as commit messages and have to bounce over
to a terminal or open the git panel to figure out what actual commit on a
different branch contains the minimal commit message pointing to the JIRA
to go to the PR and actually finally find out _why_ we did a thing, then
dig around to see if we changed the impl inside a merge commit SHA from the
original base impl...

Well, that is not my favorite.  :D

All ears on if there's a cleaner way to do the archaeology here.


On Tue, Dec 14, 2021 at 1:34 PM Stefan Miklosovic 
<stefan.mikloso...@instaclustr.com> wrote:

> Does somebody else use the git workflow we do as of now in Apache
> universe? Are not we quite unique? While I do share the same opinion
> Mick has in his last response, I also see the disadvantage in having
> the commit history polluted by merges. I am 

Re: [DISCUSS] Periodic snapshot publishing with minor version bumps

2021-12-22 Thread bened...@apache.org
> You were part of that slack thread, so it was a bad presumption on my behalf.

I am flattered, but I’m sure your intention was in fact to involve everyone in 
this discussion. As it happens, I commented only on the end of that lengthy 
thread and did not participate in the section you linked, so was unaware of it 
– as I’m sure were most folk here.

> the complaint raised by you is that it doesn't case-sensitively lexically 
> order and undermines the proposal choice you want to see go forward

Actually my complaint was more general, but I was letting another pet peeve of 
mine leak into this discussion. We should have a separate discussion around 
dependency policy in the new year. I think new dependencies should not be 
included without discussion on list, as they introduce significant new code to 
the project that is rarely audited even cursorily either on inclusion to the 
project or update. For such a trivial feature as this, that was adequately 
implemented in the project already, I consider the inclusion of a dependency to 
be a mistake.

As it happens, I don’t think this problem you raise is a concern, even with 
this recently introduced faulty implementation of Semver. A2 is zero cost to 
implement, but even A1 would be fine without any work. It is unlikely we would 
ever need to compare a -pre version to -alpha or any other pre-release version, 
as we are unlikely to perform upgrade tests across these versions since we will 
have no users deploying them.


From: Mick Semb Wever 
Date: Wednesday, 22 December 2021 at 16:02
To:
Cc: dev@cassandra.apache.org 
Subject: Re: [DISCUSS] Periodic snapshot publishing with minor version bumps
> > Yeah, not described enough in this thread, it is part of the motivation to 
> > the proposal
>
> I don’t believe it has been mentioned once in this thread. This should have 
> been clearly stated upfront as a motivation. Thus far no positive case has 
> been made on this topic, we have instead wasted a lot of time discussing 
> clearly peripheral topics, demonstrating that the more obvious approach for 
> anyone without this motivation is indeed fine.
>


Apologies for not previously stating and explaining this additional
motivation in this thread. You were part of that slack thread, so it
was a bad presumption on my behalf.


> > Setting up versioning to be extensible in this manner is not endorsing such 
> > artefacts and distributions.
>
> Yes, setting up versioning in this way with the intention of permitting 
> comparisons between these “not Cassandra” releases and actual Cassandra 
> releases is the same thing as endorsing this behaviour. It’s equally bad if 
> this “internal” release is, say, used to support some cloud service that is 
> advertised as Cassandra-compatible.


We have many versioning schemes at play, and they are used between codebases
in our ecosystem. Forcing people to use their own versioning may well
require them to adapt other codebases/components
unnecessarily. In turn we might lose some of the benefits of their
testing efforts.

Here I disagree with you, let's leave it at that.


>
> You broke the spec as part of this work. The “NIH approach” was more 
> standards compliant prior to this work (as it correctly sorted prior to 
> 16649, except for SNAPSHOT releases).
>


Please, we are chasing each other's tails here. I understand that you
wish to follow the SemVer spec strictly with respect to pre-release
fields being case-sensitively, lexically ordered, despite the authors
of the spec stating that this was an oversight, recommending the field
be kept lowercase, and suggesting it might become case-insensitively,
naturally ordered in version 3 of the spec, while implementations of
the spec also take different approaches because of the ambiguity
caused here.

But, I wish not to be dragged into having to defend my contributions
made when we were fixing tests and trying to get 4.0.0 over the line.
Especially when those contributions moved things forward on a pure bug
count, and the complaint raised by you is that it doesn't
case-sensitively lexically order, undermining the proposal choice
you want to see go forward. I hear your opinion and stand by your right
to express it and vote by it, but I do not wish to engage in this way.


>
> I think if anything the recent log4j issues have hopefully demonstrated that 
> “NIH” is not the pejorative people think it should be.
>


I certainly do not stand firm on NIH or DRY - they are only
food-for-thought in any code discussion, nothing more. The comparison
to log4j isn't fair: this is a small library of five small classes that
does exactly what we needed. There's no risk here, and I'd not be
interested in rewriting every small and safe library we use. And I
agree with the sentiment that we need to be vigilant about the
libraries we introduce into the project.

But again, I really wish not to go around in circles.
For now I hope we can engage on more "open" fronts, so we can continue
to explore and understand…


Re: [DISCUSS] Periodic snapshot publishing with minor version bumps

2021-12-22 Thread bened...@apache.org
> Yeah, not described enough in this thread, it is part of the motivation to 
> the proposal

I don’t believe it has been mentioned once in this thread. This should have 
been clearly stated upfront as a motivation. Thus far no positive case has been 
made on this topic, we have instead wasted a lot of time discussing clearly 
peripheral topics, demonstrating that the more obvious approach for anyone 
without this motivation is indeed fine.

> Setting up versioning to be extensible in this manner is not endorsing such 
> artefacts and distributions.

Yes, setting up versioning in this way with the intention of permitting 
comparisons between these “not Cassandra” releases and actual Cassandra 
releases is the same thing as endorsing this behaviour. It’s equally bad if 
this “internal” release is, say, used to support some cloud service that is 
advertised as Cassandra-compatible.

Given the above, I am rescinding my support for either B or C, and now only 
endorse approach A.

> Adding that library made things much easier to deal with, as adhering to a 
> spec and standard should

You broke the spec as part of this work. The “NIH approach” was more standards 
compliant prior to this work (as it correctly sorted prior to 16649, except for 
SNAPSHOT releases).

I think if anything the recent log4j issues have hopefully demonstrated that 
“NIH” is not the pejorative people think it should be.  Independent of this 
discussion, I was disappointed to see this unnecessary Semver dependency 
introduced at the time that it was, creating unnecessary churn and 
compatibility work to no apparent advantage.


From: Mick Semb Wever 
Date: Wednesday, 22 December 2021 at 12:14
To:
Cc: dev@cassandra.apache.org 
Subject: Re: [DISCUSS] Periodic snapshot publishing with minor version bumps
>
> Do you intend to use this capability, and if so could you point out where you 
> highlighted this motivation previously?
>


Yeah, not described enough in this thread, it is part of the motivation
to the proposal, and was discussed in the slack thread:
 
https://the-asf.slack.com/archives/CK23JSY2K/p1638975919339400?thread_ts=1638950961.325900=CK23JSY2K


>
> These snapshots are not releases for broad consumption, and definitely are 
> not meant to be consumed by third-parties for release as Cassandra-like 
> software.
>


They wouldn't be.


>
> Third-parties releasing such software are _not offering Apache Cassandra_. 
> Helping these entities to release software that might be interpreted as 
> Apache Cassandra, and to be consistent with Apache Cassandra’s release 
> schedule, is almost certainly a problem with this approach, not an asset.
>


Setting up versioning to be extensible in this manner is not endorsing
such artefacts and distributions. I think of it as similar to
how open-sourcing the code in the first place isn't an endorsement of
those that extend it.

We have rules in place to avoid abuse and confusion around identities.
Let's not conflate that with appreciating the ecosystem we have and
the benefits of extending interoperability.


>
> Such software should have its own versioning that is distinct from Apache 
> Cassandra.
>


Versioning does not need to be distinct for the offering to be
distinct. The versioning is entirely up to them, but sharing the same
versioning comes with benefits, as mentioned above. The offering does
not need to be something public either, it could be related to
internal deployments.


> >
> > (A) does not work with the codebase as it is today. It requires additional 
> > work.
>
>  A1’s only issue is that we use a standards-non-compliant SemVer 
> implementation that was introduced in May for some unspecified reason. If we 
> simply used CassandraVersion (which is broadly equivalent, but 
> standards-compliant) everything would seem to be fine.
>


Being one of those that dealt with the upgrade bugs and added in the
SemVer library, I can say it is a decent amount of work (and testing)
to revert and redo, and that the upgrade tests have different needs
than CassandraVersion. Adding that library made things much easier to
deal with, as adhering to a spec and standard should. It feels like
reverting that work would be going down a NIH path (and we would have
to copy the CassandraVersion class rather than being able to re-use
it) for the sake of wanting to lexically instead of naturally order
pre-release fields: which the spec folk see as a mistake and oversight
(and have recommended users stick with lowercase for general system
compatibility). If folk really wanted A1 then I'd be pushing for A2 for
this reason (and then to the limitations of A2 as described above).


Re: [DISCUSS] Periodic snapshot publishing with minor version bumps

2021-12-21 Thread bened...@apache.org
> If we simply used CassandraVersion (which is broadly equivalent, but 
> standards-compliant)

Actually it’s got the same issue, but it’s a one line fix.


From: Mick Semb Wever 
Date: Tuesday, 21 December 2021 at 22:06
To:
Cc: dev@cassandra.apache.org 
Subject: Re: [DISCUSS] Periodic snapshot publishing with minor version bumps
> These can be further subdivided to:
>
> A1.   4.1.0-PRE{1,2,3,4} -> 4.1.0-alpha1
> A2.   4.1.0-alpha{1,2,3,4} -> 4.1.0-alpha5
> B1. 4.1.{0,1,2,3} -> 4.1.4-alpha1
> B2. 4.{1,2,3,4} -> 4.5.0-alpha1
> B3. 4.{1,2,3,4} -> 5.0.0-alpha1
> C1. 4.1.{0,1,2,3}-pre -> 4.1.4-alpha1
> C2. 4.{1,2,3,4}.0-pre -> 4.5.0-alpha1
> C3. 4.{1,2,3,4}.0-pre -> 5.0.0-alpha1
>


(A) does not work with the codebase as it is today. It requires additional work.

(B) has not been suggested anywhere in the thread, so I find it odd
that it was proposed.

(C2) is the original proposal. (Though the build metadata suffix is
not "pre", it is either SNAPSHOT or a timestamp.)

(B3) and (C3) don't follow SemVer.

(C1) is interesting; I wouldn't object to it, though I share Josh's
initial reaction.


Re: [DISCUSS] Periodic snapshot publishing with minor version bumps

2021-12-21 Thread bened...@apache.org
>The problem I have with (A2) is that third-parties, vendors, etc, can only 
>clumsily extend and continue on those version numbers. 4.1.0-alpha2-myvendor-3 
>is awkward.

Do you intend to use this capability, and if so could you point out where you 
highlighted this motivation previously?

These snapshots are not releases for broad consumption, and definitely are not 
meant to be consumed by third-parties for release as Cassandra-like software. 
Third-parties releasing such software are _not offering Apache Cassandra_. 
Helping these entities to release software that might be interpreted as Apache 
Cassandra, and to be consistent with Apache Cassandra’s release schedule, is 
almost certainly a problem with this approach, not an asset. Such software 
should have its own versioning that is distinct from Apache Cassandra.

>(B) has not been suggested anywhere in the thread, so I find it odd that it 
>was proposed.

Ok, I misunderstood your proposal; I had not seen anywhere that you intended to 
include a -SNAPSHOT suffix. C{1,2,3} can be interpreted as supporting -SNAPSHOT 
rather than -pre, and I guess (B) will get very few votes, and/or can be 
ignored.

> (A) does not work with the codebase as it is today. It requires additional 
> work.

This is untrue. A2 requires no additional work, and A1’s only issue is that we 
use a standards-non-compliant SemVer implementation that was introduced in May 
for some unspecified reason. If we simply used CassandraVersion (which is 
broadly equivalent, but standards-compliant) everything would seem to be fine.


From: Mick Semb Wever 
Date: Tuesday, 21 December 2021 at 22:06
To:
Cc: dev@cassandra.apache.org 
Subject: Re: [DISCUSS] Periodic snapshot publishing with minor version bumps
> These can be further subdivided to:
>
> A1.   4.1.0-PRE{1,2,3,4} -> 4.1.0-alpha1
> A2.   4.1.0-alpha{1,2,3,4} -> 4.1.0-alpha5
> B1. 4.1.{0,1,2,3} -> 4.1.4-alpha1
> B2. 4.{1,2,3,4} -> 4.5.0-alpha1
> B3. 4.{1,2,3,4} -> 5.0.0-alpha1
> C1. 4.1.{0,1,2,3}-pre -> 4.1.4-alpha1
> C2. 4.{1,2,3,4}.0-pre -> 4.5.0-alpha1
> C3. 4.{1,2,3,4}.0-pre -> 5.0.0-alpha1
>


(A) does not work with the codebase as it is today. It requires additional work.

(B) has not been suggested anywhere in the thread, so I find it odd
that it was proposed.

(C2) is the original proposal. (Though the build metadata suffix is
not "pre", it is either SNAPSHOT or a timestamp.)

(B3) and (C3) don't follow SemVer.

(C1) is interesting; I wouldn't object to it, though I share Josh's
initial reaction.


Re: [DISCUSS] Periodic snapshot publishing with minor version bumps

2021-12-21 Thread bened...@apache.org
The purpose of indicative votes is to seek input from the broader community. 
There is no deadline, it is not an official vote, and can run across the 
holiday period. Discussion can continue in parallel, but I do not get the 
impression many others are very invested in this discussion. Certainly, I have 
said all I think needs to be said in support of my position. You have expressed 
your concerns repeatedly also; anybody who is interested is likely well aware 
of your position.


From: Mick Semb Wever 
Date: Tuesday, 21 December 2021 at 19:25
To:
Cc: dev@cassandra.apache.org 
Subject: Re: [DISCUSS] Periodic snapshot publishing with minor version bumps
Benedict, I had said above in the thread to let it run through till
January; can we please respect that? I do not think the week before
Christmas is a great time to push it into a vote, when this is not urgent.

I pointed to the code where we sort versions in a case-insensitive
way. That means PRE1 breaks.
Such breakages occurred in the lead up to 4.0.0, and we would have to
custom patch the semver4j library to make PRE1 sort before the alphas.

The test run you provided did not include upgrade tests, and for those
to break you would need a wrongly ordered incompatibility, e.g. from
an alpha to a PRE. It has also been mentioned that drivers (and other
test systems) break.

Currently our snapshots do not come with a pre-release label, but they do
come with a build metadata label (being either the explicit "SNAPSHOT"
or the timestamp). This is visible in the nightlies link I provided.
So we do have a precedent in place for distinguishing between released
versions and snapshot builds; I say let's continue that.


Re: [DISCUSS] Disabling MIME-part filtering on this mailing list

2021-12-21 Thread bened...@apache.org
(I’ve taken this off list for now)

From: Bowen Song 
Date: Tuesday, 21 December 2021 at 18:29
To: bened...@apache.org 
Cc: dev@cassandra.apache.org 
Subject: Re: [DISCUSS] Disabling MIME-part filtering on this mailing list

Hmm, that's a bit unexpected.

Could you please have a look at the email headers and see what's the reason for 
it being treated that way? Have a look at the "Authentication-Results" headers, 
and the headers with "spam" in their names, especially "X-Spam-Status" if it's 
present.

I suspect it was treated as spam for other reasons, not because failing the 
DMARC, which would result in the email being rejected completely.
On 21/12/2021 17:49, bened...@apache.org wrote:
Unfortunately it still arrived in my junk mail folder ☹


From: Bowen Song <bo...@bso.ng>
Date: Tuesday, 21 December 2021 at 12:02
To: dev@cassandra.apache.org 
Subject: Re: [DISCUSS] Disabling MIME-part filtering on this mailing list
I have just received a confirmation from Infra informing me that this
change has been made. I'm sending this email as an update but also a
test. Hopefully it arrives in your inbox without trouble, and my email
address no longer has the ".INVALID" append to it.

On 04/12/2021 17:15, Bowen Song wrote:
> Hello,
>
>
> Currently this mailing list has MIME-part filtering turned on, which
> results in "From:" address munging (appending ".INVALID" to the
> sender's email address) for domains enforcing strict DMARC rules, such
> as apple.com, zoho.com and all Yahoo.** domains. This behaviour may
> cause some emails to be treated as spam by the recipients' email
> service providers, because the resulting "From:" address, such as
> "some...@yahoo.com.INVALID", is not valid 
> and cannot be verified.
>
> I have created a Jira ticket INFRA-22548
> <https://issues.apache.org/jira/browse/INFRA-22548> asking to change
> this, but the Infra team said dropping certain MIME part types is to
> prevent spam and harmful attachments, and would require a consensus
> from the project before they can make the change. Therefore I'm
> sending this email asking for your opinions on this.
>
> To be clear, turning off the MIME-part filtering will not turn off the
> anti-spam and anti-virus feature on the mailing list, all emails sent
> to the list will still need to pass the checks before being forwarded
> to subscribers. Modern (since the 90s?) anti-spam and anti-virus software
> will scan the MIME parts too, in addition to the plain-text and/or
> HTML email body. Your email service provider is also almost certainly
> going to have their own anti-spam and anti-virus software, in addition
> to the one on the mailing list. The difference is whether the mailing
> list proactively removes MIME parts not in the predefined whitelist.
>
> To help you understand the change, here's the difference between the
> two behaviours:
>
>
> With the MIME-part filtering enabled (current behaviour)
>
> * the mailing list will remove certain MIME-part types, such as
> executable file attachments, before forwarding it
>
> * the mailing list will append ".INVALID" to some senders' email address
>
> * the emails from the "*@*.INVALID"<mailto:*@*.INVALID> sender address are 
> more likely to
> end up in recipients' spam folder
>
> * it's harder for people to directly reply to someone whose email
> address has been modified in this way
>
> * recipients running their own email server without anti-spam and/or
> anti-virus software on it have some extra protections
>
>
> With MIME-part filtering disabled
>
> * the mailing list forwards all non-spam and non-infected emails as
> they are, without changing them
>
> * the mailing list will not change senders' email address
>
> * the emails from this mailing list are less likely to end up in
> recipients' spam folder
>
> * it's easier for people to directly reply to anyone in this mailing list
>
> * recipients running their own email server without anti-spam and/or
> anti-virus software on it may be exposed to some threats
>
>
> What's your opinion on this? Do you support or oppose disabling the
> MIME-part filtering on the Cassandra-dev mailing list?
>
>
> p.s.: as you can see, my email address has the ".INVALID" appended to
> it by this mailing list.
>
>
> Regards,
>
> Bowen
>


Re: [DISCUSS] Periodic snapshot publishing with minor version bumps

2021-12-21 Thread bened...@apache.org
After much discussion, I see three basic categories of approach:

A) distinguish releases using unstable release suffixes only
B) distinguish releases using some version number modification
C) distinguish releases using some version number modification AND unstable 
release suffixes to indicate these builds are unsupported

These can be further subdivided to:

A1.   4.1.0-PRE{1,2,3,4} -> 4.1.0-alpha1
A2.   4.1.0-alpha{1,2,3,4} -> 4.1.0-alpha5
B1. 4.1.{0,1,2,3} -> 4.1.4-alpha1
B2. 4.{1,2,3,4} -> 4.5.0-alpha1
B3. 4.{1,2,3,4} -> 5.0.0-alpha1
C1. 4.1.{0,1,2,3}-pre -> 4.1.4-alpha1
C2. 4.{1,2,3,4}.0-pre -> 4.5.0-alpha1
C3. 4.{1,2,3,4}.0-pre -> 5.0.0-alpha1

I vote, in order of preference, for A{1,2},C{1,2,3},B{1,2,3}


From: Mick Semb Wever 
Date: Tuesday, 21 December 2021 at 15:48
To:
Cc: dev@cassandra.apache.org 
Subject: Re: [DISCUSS] Periodic snapshot publishing with minor version bumps
> My preference is to get our versioning as close to standard Semantic 
> Versioning as possible, to avoid any precedence rules that depend on a fine 
> reading of the spec and aren't otherwise popular.  Requiring the ordering of 
> the pre-release tag to be case-sensitive alphanumeric is an example of this, 
> but only one.


I am not comfortable relying upon the pre-release tag being
case-sensitive alphanumerically ordered.

Standardising on lower-case pre-release labels and build metadata can
be beneficial, especially when
cross-platform scenarios are considered.  (One of the SemVer authors' own words.)

One example implementation exists here (in a library that we already use)
https://github.com/vdurmont/semver4j/blob/master/src/main/java/com/vdurmont/semver4j/Semver.java#L235

And when it comes to cognitive load, I would question the load of
knowing what the difference between PRE1 and SNAPSHOT is (when there
is none), and why we use a custom pre-release tag on builds that are
not releases.
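The ordering dispute above hinges on SemVer's precedence rules (spec item 11): alphanumeric pre-release identifiers compare in ASCII sort order, so uppercase labels such as PRE1 sort before lowercase ones such as alpha1. A minimal sketch of that comparison, as a hypothetical helper (not the project's CassandraVersion nor the semver4j implementation):

```python
# Sketch of SemVer pre-release precedence (spec item 11), illustrating
# why "PRE1" sorts before "alpha1" under strict case-sensitive ordering.

def prerelease_key(tag):
    """Build a sort key for a dot-separated pre-release tag."""
    parts = []
    for ident in tag.split("."):
        if ident.isdigit():
            # Numeric identifiers compare numerically and rank
            # lower than any alphanumeric identifier.
            parts.append((0, int(ident), ""))
        else:
            # Alphanumeric identifiers compare in ASCII sort order,
            # so uppercase letters rank before lowercase ones.
            parts.append((1, 0, ident))
    return parts

tags = ["alpha1", "PRE1", "beta2", "rc1", "SNAPSHOT"]
print(sorted(tags, key=prerelease_key))
# → ['PRE1', 'SNAPSHOT', 'alpha1', 'beta2', 'rc1']
```

Under this strict reading 4.1.0-PRE1 precedes 4.1.0-alpha1, which is what approach A1 relies on; a case-insensitive implementation would instead place "pre" between the betas and the rc builds.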


Re: [DISCUSS] Periodic snapshot publishing with minor version bumps

2021-12-21 Thread bened...@apache.org
> Nevertheless, it requires fixes

I have run all tests successfully against 4.1.0-PRE1, without modification[1].

> more importantly requires others in the ecosystem to adapt

There is no such requirement for publishing these as alphas, but without 
evidence to the contrary I doubt the downstream impact of this change will 
greatly exceed the time we are spending discussing this.

> I am not comfortable relying upon the pre-release tag being case-sensitive 
> alphanumerically ordered.

Whereas I am very uncomfortable publishing pre-release snapshots that are not 
marked as such.

I think we’ve hashed this out more than enough, so let’s get some indicative 
votes to determine which one the community endorses. I will send a separate 
email for that purpose.

[1] 
https://app.circleci.com/pipelines/github/belliottsmith/cassandra?branch=pre1=all




From: Mick Semb Wever 
Date: Tuesday, 21 December 2021 at 15:48
To:
Cc: dev@cassandra.apache.org 
Subject: Re: [DISCUSS] Periodic snapshot publishing with minor version bumps
> My preference is to get our versioning as close to standard Semantic 
> Versioning as possible, to avoid any precedence rules that depend on a fine 
> reading of the spec and aren't otherwise popular.  Requiring the ordering of 
> the pre-release tag to be case-sensitive alphanumeric is an example of this, 
> but only one.


I am not comfortable relying upon the pre-release tag being
case-sensitive alphanumerically ordered.

Standardising on lower-case pre-release labels and build metadata can
be beneficial, especially when
cross-platform scenarios are considered.  (One of the SemVer authors' own words.)

One example implementation exists here (in a library that we already use)
https://github.com/vdurmont/semver4j/blob/master/src/main/java/com/vdurmont/semver4j/Semver.java#L235

And when it comes to cognitive load, I would question the load of
knowing what the difference between PRE1 and SNAPSHOT is (when there
is none), and why we use a custom pre-release tag on builds that are
not releases.


Re: [DISCUSS] Disabling MIME-part filtering on this mailing list

2021-12-21 Thread bened...@apache.org
Unfortunately it still arrived in my junk mail folder ☹


From: Bowen Song 
Date: Tuesday, 21 December 2021 at 12:02
To: dev@cassandra.apache.org 
Subject: Re: [DISCUSS] Disabling MIME-part filtering on this mailing list
I have just received a confirmation from Infra informing me that this
change has been made. I'm sending this email as an update but also a
test. Hopefully it arrives in your inbox without trouble, and my email
address no longer has the ".INVALID" append to it.

On 04/12/2021 17:15, Bowen Song wrote:
> Hello,
>
>
> Currently this mailing list has MIME-part filtering turned on, which
> results in "From:" address munging (appending ".INVALID" to the
> sender's email address) for domains enforcing strict DMARC rules, such
> as apple.com, zoho.com and all Yahoo.** domains. This behaviour may
> cause some emails to be treated as spam by the recipients' email
> service providers, because the resulting "From:" address, such as
> "some...@yahoo.com.INVALID", is not valid and cannot be verified.
>
> I have created a Jira ticket INFRA-22548
> <https://issues.apache.org/jira/browse/INFRA-22548> asking to change
> this, but the Infra team said dropping certain MIME part types is to
> prevent spam and harmful attachments, and would require a consensus
> from the project before they can make the change. Therefore I'm
> sending this email asking for your opinions on this.
>
> To be clear, turning off the MIME-part filtering will not turn off the
> anti-spam and anti-virus feature on the mailing list, all emails sent
> to the list will still need to pass the checks before being forwarded
> to subscribers. Modern (since the 90s?) anti-spam and anti-virus software
> will scan the MIME parts too, in addition to the plain-text and/or
> HTML email body. Your email service provider is also almost certainly
> going to have their own anti-spam and anti-virus software, in addition
> to the one on the mailing list. The difference is whether the mailing
> list proactively removes MIME parts not in the predefined whitelist.
>
> To help you understand the change, here's the difference between the
> two behaviours:
>
>
> With the MIME-part filtering enabled (current behaviour)
>
> * the mailing list will remove certain MIME-part types, such as
> executable file attachments, before forwarding it
>
> * the mailing list will append ".INVALID" to some senders' email address
>
> * the emails from the "*@*.INVALID" sender address are more likely to
> end up in recipients' spam folder
>
> * it's harder for people to directly reply to someone whose email
> address has been modified in this way
>
> * recipients running their own email server without anti-spam and/or
> anti-virus software on it have some extra protections
>
>
> With MIME-part filtering disabled
>
> * the mailing list forwards all non-spam and non-infected emails as
> they are, without changing them
>
> * the mailing list will not change senders' email address
>
> * the emails from this mailing list are less likely to end up in
> recipients' spam folder
>
> * it's easier for people to directly reply to anyone in this mailing list
>
> * recipients running their own email server without anti-spam and/or
> anti-virus software on it may be exposed to some threats
>
>
> What's your opinion on this? Do you support or oppose disabling the
> MIME-part filtering on the Cassandra-dev mailing list?
>
>
> p.s.: as you can see, my email address has the ".INVALID" appended to
> it by this mailing list.
>
>
> Regards,
>
> Bowen
>


Re: [DISCUSS] Periodic snapshot publishing with minor version bumps

2021-12-17 Thread bened...@apache.org
> I would like to point out that the code and tests do not support "pre" as a
> pre-release label. 4.1.0-pre1 would break the code.

If true this can easily be fixed, but AFAICT CassandraVersion is happy to parse 
this just fine, so I doubt there would be many breakages.

> using a pre-release version needs a label that is alphanumerically before
"alpha"

4.0.0-PRE1

> not having to rollback version numbers and changelogs

What is unique about this situation versus alpha, beta and rc? Because those 
are again much more common, so whatever we do to handle them can surely be 
applied here? Why can’t we leave the 4.0.0-PRE1 changelog and release notes in, 
if this is such a big deal? What’s so different about using 4.1.0 that permits 
avoiding extra work?

If this is truly impossible, why not use patch numbers rather than minors (with 
additional PRE1)? i.e. we could go 4.0.0-PRE1, 4.0.1-PRE2, 4.0.2-PRE3, 
4.0.4-alpha. I don’t like this, but I dislike it a lot less than using 
unqualified minors.

> We still have only one proposal on the table that works, as was first
> raised in this thread.

I’m afraid I’m still flummoxed by this. Could you enumerate precisely what 
makes this proposal not work, as I still don’t see it?


From: Mick Semb Wever 
Date: Friday, 17 December 2021 at 09:18
To: dev@cassandra.apache.org 
Subject: Re: [DISCUSS] Periodic snapshot publishing with minor version bumps
> "During the lead up to 4.0.0 there was plenty of headache and fixes going
> in to deal with how we parse version numbers in different places and
> alpha|beta|rc etc. I would rather bump the versions during the dev cycle and
> work on fixing it, than have that headache again at release time. I also
> feel for third-parties that have to parse our own way of versioning."
> Thank you Mick for sharing again the release management point of view. It
> is always a challenge to find a release manager who will have the time to
> spend on those things and often those efforts are not even really visible
> so it is easy to underestimate them. (All the break that goes with it)
>


Thanks for the summary run-through and support Ekaterina, much appreciated.


I would like to point out that the code and tests do not support "pre" as a
pre-release label.
4.1.0-pre1 would break the code.

Furthermore, the pre-release version is alphanumerically sorted, therefore
"pre" would land between the last beta and the first rc version. Such a
proposal using a pre-release version needs a label that is alphanumerically
before "alpha". And the code would need to be fixed to accept and sort the
new label. Maybe the drivers too, Jeremiah?
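The claimed ordering is easy to check: with lowercase labels, plain lexical sorting does put "pre" after the alphas and betas, between beta and rc. A quick illustration (plain Python sorting, not the project's version-parsing code):

```python
# With lowercase labels, plain lexical ordering places "pre1" between
# the betas and the rc builds, i.e. *after* the alphas and betas.
labels = ["rc1", "pre1", "alpha1", "beta1"]
print(sorted(labels))
# → ['alpha1', 'beta1', 'pre1', 'rc1']
```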



"For the release manager this is a simpler approach (not having to rollback
> version numbers and changelogs), and for those using development published
> artefacts (nightlies, staging, etc) (not having versions clobbered).
> Release manager practices aside, as a user I agree with Brandon, what
> matters is the version is greater and whether major/minor/patch numbers are
> greater."
> This is a very important point. Release management is time consuming enough
> and from what I've seen there are not many people who have that time to
> dedicate it. If there are suggestions for different ways to improve that
> experience, please, share them.



Such a change (replacing takeX with version increments when a vote fails)
wasn't part of my proposal here. It was only meant as anecdotal. It is
still useful to know that this situation can arise for the release manager,
e.g. if the artefacts were accidentally published.



After carefully reading the thread, it seems to me we need to find the
> right balance between:
> 1) users' understanding about versions; also usability
> Please, people, share your experience and feedback, we want to hear it!
> 2) no breaking changes for the ecosystem (or at least as little as
> possible)
> 3) efficient release management (minimal maintenance).
>



We still have only one proposal on the table that works, as was first
raised in this thread.

The only valid objection raised so far is cosmetic, touching on (1). I want
to emphasise that its being cosmetic doesn't make it trivial or something to
be ignored: the image of the project belongs to the community; it's an
acceptable objection.  But I hope that objections can be followed up with
working proposals.

Reiterating, the cosmetic change would be that our next yearly release be
4.1 or 4.2 or 5.0 or 5.1 or 5.2 (as we would not be doing more than two
periodic snapshots before next May).

Another concern raised was that the released artefacts can have a quality
pre-release label attached (alpha|beta|rc) while other unreleased artefacts
would have no such pre-release label, indicating that the latter has a
stability the former does not. This isn't true: these unreleased periodic
artefacts are only available via dev/snapshot channels. They would be the
same as builds off trunk are today, which currently is "4.1" without any
such pre-release label.

There's no rush on this 

Re: [DISCUSS] Periodic snapshot publishing with minor version bumps

2021-12-16 Thread bened...@apache.org
4.1.0-pre1 sounds good to me.

From: Jeremiah D Jordan 
Date: Thursday, 16 December 2021 at 16:37
To: dev@cassandra.apache.org 
Subject: Re: [DISCUSS] Periodic snapshot publishing with minor version bumps
If we want to have “called out development snapshots” then I think we need some 
way to distinguish builds of those commits from ongoing work, in the version 
number that is in the build file.  I do not think the “development 
snapshots” being 4.1.0-SNAPSHOT and current trunk also being 4.1.0-SNAPSHOT 
achieves that goal.

How we accomplish that, I don’t have any preference.  Bumping the version 
number such that trunk becomes 4.2.0-SNAPSHOT is a pretty simple way to 
accomplish it.

Another option would be to push and tag a commit outside of the trunk branch 
that has the version as 4.1.0-pre1 or something.  From my look at the 
CassandraVersion code, anything after SNAPSHOT will not parse correctly; 
-SNAPSHOT needs to be the last thing.  So 4.1.0-pre1-SNAPSHOT is valid, 
4.1.0-SNAPSHOT-pre1 is not.
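The constraint described above can be illustrated with a rough sketch. This is not the actual CassandraVersion parser, just a simplified regex assuming a semver-style pattern in which a SNAPSHOT label, if present, must be the final identifier:

```python
import re

# Accepts X.Y.Z with optional dotted pre-release identifiers, where a
# -SNAPSHOT label (if present) must come last.
VERSION = re.compile(
    r"^\d+\.\d+\.\d+"                      # major.minor.patch
    r"(-[0-9A-Za-z]+(\.[0-9A-Za-z]+)*)?"   # optional pre-release labels
    r"(-SNAPSHOT)?$"                       # SNAPSHOT only as the final label
)

def is_valid(version: str) -> bool:
    return VERSION.match(version) is not None

print(is_valid("4.1.0-pre1-SNAPSHOT"))  # True
print(is_valid("4.1.0-SNAPSHOT-pre1"))  # False
```

Under this sketch, anything trailing the SNAPSHOT label fails to parse, matching the behaviour Jeremiah describes.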

-Jeremiah

> On Dec 16, 2021, at 9:33 AM, bened...@apache.org wrote:
>
> I don’t really see the advantage to this over 4.1.0-SNAPSHOT1
>
> From: Mick Semb Wever 
> Date: Thursday, 16 December 2021 at 15:04
> To: dev@cassandra.apache.org 
> Subject: [DISCUSS] Periodic snapshot publishing with minor version bumps
> Back in January¹ we agreed to do periodic snapshot publishing, as we
> move to yearly major+minor releases. But (it's come to light²) it
> wasn't clear how we would do that.
>
> ¹) https://lists.apache.org/thread/vzx10600o23mrp9t2k55gofmsxwtng8v
> ²) https://the-asf.slack.com/archives/CK23JSY2K/p1638950961325900
>
>
> The following is a proposal on doing such snapshot publishing by
> bumping the minor version number.
>
> The idea is to every ~quarter in trunk bump the minor version in
> build.xml. No release or branch would be cut. But the last SHA on the
> previous snapshot version can be git tagged. It does not need to
> happen every quarter, we can make that call as we go depending on how
> much has landed in trunk.
>
> The idea of this approach is that it provides a structured way to
> reference these periodic published snapshots. That is, the semantic
> versioning that our own releases abide by extends to these periodic
> snapshots. This can be helpful as the codebase (and drivers) do not
> handle funky versions well (such as pre-release or vendor labels), and
> we want to be inclusive to the ecosystem.
>
> A negative reaction of this approach is that our released versions
> will jump minor versions. For example, next year's release could be
> 4.3.0 and users might ask what happened to 4.1 and 4.2. This should
> only be a cosmetic concern, and general feedback seems to be that
> users don't care so long as version numbers are going up, and that we
> use semantic versioning so that major version increments mean
> something (we would never jump a major version).
>
> A valid question is how would this impact our supported upgrade paths.
> Per semantic versioning any 4.x to 4.y (where y > x) is always safe,
> and any major upgrade like 4.z to 5.x is safe (where z is the last
> 4.minor). Nonetheless we should document this to make it clear and
> explicit how it works (and that we are adhering to semver).
> https://semver.org/
>
> What are people's thoughts on this?
> Are there objections to bumping trunk so that base.version=4.2 ? (We
> can try this trunk and re-evaluate after our next annual release.)
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: dev-h...@cassandra.apache.org
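The upgrade-path policy in the quoted proposal (any 4.x to 4.y where y > x, and a major bump only from the last 4.minor) can be sketched as below. The LAST_MINOR table and the concrete version numbers are hypothetical, since the actual final minor of each line is not yet known:

```python
# Hypothetical table of the final minor release in each major line.
LAST_MINOR = {4: 3}

def upgrade_supported(frm: tuple, to: tuple) -> bool:
    """Sketch of the stated policy: minor upgrades within a major line
    are always safe; a major upgrade is safe only from the last minor."""
    fmaj, fmin = frm
    tmaj, tmin = to
    if fmaj == tmaj:
        return tmin > fmin                    # e.g. 4.1 -> 4.3
    if tmaj == fmaj + 1:
        return fmin == LAST_MINOR.get(fmaj)   # e.g. 4.3 -> 5.0
    return False

print(upgrade_supported((4, 1), (4, 3)))  # True
print(upgrade_supported((4, 1), (5, 0)))  # False: must reach 4.3 first
```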


-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org


Re: [DISCUSS] Periodic snapshot publishing with minor version bumps

2021-12-16 Thread bened...@apache.org
> No. You refer to Pre-release but my statement was about Build Metadata. The
timestamping of snapshots is the latter.

So you agree the proposal is compatible with semver? If so, what’s the problem? 
I’m genuinely perplexed.

> I would rather bump the versions during the dev cycle and work on fixing it

I wouldn’t. I’m surprised to hear we broke things between 3.0 and 4.0, but if 
this is happening then these releases are a great opportunity to prevent those 
kinds of problems, not an opportunity to sidestep them.



From: Mick Semb Wever 
Date: Thursday, 16 December 2021 at 19:00
To: dev@cassandra.apache.org 
Subject: Re: [DISCUSS] Periodic snapshot publishing with minor version bumps
>
> I poked around a tiny bit - Spark and Flink both interpret "periodic" as
> "nightly", and fwiw that's what I'm most familiar with. Ruminating on this
> a bit, the implications of a quarterly (or other cadence) snapshot seem to
> be the developers on a project providing more guarantees of support and/or
> integrity than a nightly release which I don't think is what we're trying
> for here.
>
> So I guess my question: why not keep it at 4.1.0 and have 4.1.0-SNAPSHOT be
> a nightly snapshot, and folks downstream can integrate on whichever cadence
> is most appropriate for them using timestamped builds (monthly, quarterly,
> etc)?



It becomes clumsy and limited distinguishing between a Q1 4.1.0-SNAPSHOT and
a Q2 4.1.0-SNAPSHOT. Our ecosystem (which depends on us for this) and our
versioning just haven't matured enough yet.
We can rely on the nightly approach and the timestamps in the build
metadata like the example you provide. But we have the challenge that this
breaks code, and we want to keep the ecosystem and vendors close and able
to re-use the same semantic versioning scheme. By ecosystem I'm referring
to drivers, test frameworks, libraries and tools. These (as we know
already) can all parse our version numbers, so there's value in using
semver and keeping it simple.



> What benefit do people gain by having us provide both nightly
> snapshots, quarterly snapshots, and yearly releases? Or only quarterly
> snapshots and yearly releases?
>
> Are we trying to encourage more testing?



Yes. Folks have expressed that they don't have the resources to test snapshots
or nightlies. But such quarterlies they could test, and would appreciate
doing so. This in turn benefits us.


Are we looking to provide a more
> "blessed" batch of work for downstream maintainers to integrate with?
>


No. From the community's perspective, this comes with only our stable-trunk
guarantees.
We do not release or distribute them. They are only available via our
snapshot/dev channels.
We are relying on a known ASF precedent here.


> AFAIK we publish 4.0.0-rc1 builds and everyone consumes those just fine,
but if everything is so fragile why is it not more reasonable to fix that?


During the lead-up to 4.0.0 there was plenty of headache, with fixes going in
to deal with how we parse version numbers in different places, and
alpha|beta|rc etc.

I would rather bump the versions during the dev cycle and work on fixing
it, than have that headache again at release time. I also feel for
third-parties that have to parse our own way of versioning.



> > It's not semantic versioning
>
> Thought I’d check this, and this appears to be incorrect.


No. You refer to Pre-release but my statement was about Build Metadata. The
timestamping of snapshots is the latter.

Aside from not breaking code, I'm in favour of re-using a popular
specification, and ensuring our versioning is simple and extendable in the
ecosystem.

My suggestion is that we try this proposal this trunk cycle, and
re-evaluate after our next release.
The positive to this, beyond making a decision based on experience, is that it
will help improve the code's (and ecosystem's) ability to deal with version
changes outside of release time, putting us in a better position to try
alternatives when we re-evaluate.


Re: [DISCUSS] Periodic snapshot publishing with minor version bumps

2021-12-16 Thread bened...@apache.org
Yes it is, see my prior email.

Even if it weren’t, I would struggle to understand the argument. We publish alpha, 
beta and rc just fine and the world hasn’t collapsed. I can’t imagine anyone 
advocating that we publish these as their own minor versions.

From: Brandon Williams 
Date: Thursday, 16 December 2021 at 17:43
To: dev@cassandra.apache.org 
Subject: Re: [DISCUSS] Periodic snapshot publishing with minor version bumps
On Thu, Dec 16, 2021 at 11:38 AM bened...@apache.org
 wrote:
>
> > Oh yeah, that's a dealbreaker then. Wasn't aware.
>
> Is this a dealbreaker?

It's not semver, so I would say so, unless we want to keep doing that poorly.

-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org


Re: [DISCUSS] Periodic snapshot publishing with minor version bumps

2021-12-16 Thread bened...@apache.org
> It's not semantic versioning

Thought I’d check this, and this appears to be incorrect. From 
https://semver.org:

A pre-release version MAY be denoted by appending a hyphen and a series of dot 
separated identifiers immediately following the patch version. Identifiers MUST 
comprise only ASCII alphanumerics and hyphens [0-9A-Za-z-]. Identifiers MUST 
NOT be empty. Numeric identifiers MUST NOT include leading zeroes. Pre-release 
versions have a lower precedence than the associated normal version. A 
pre-release version indicates that the version is unstable and might not 
satisfy the intended compatibility requirements as denoted by its associated 
normal version. Examples: 1.0.0-alpha, 1.0.0-alpha.1, 1.0.0-0.3.7, 
1.0.0-x.7.z.92, 1.0.0-x-y-z.--.
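The precedence rules the spec attaches to those pre-release identifiers can be sketched as a sort key. This is a simplified illustration of SemVer spec item 11 (numeric identifiers compare numerically and rank below alphanumeric ones; a pre-release version ranks below its normal version), not a full SemVer implementation, as it ignores build metadata and input validation:

```python
def precedence_key(version: str):
    """Sort key approximating SemVer precedence for pre-release labels."""
    core, _, pre = version.partition("-")
    nums = tuple(int(x) for x in core.split("."))
    if not pre:
        return nums, (1,)  # a normal release sorts after any pre-release
    ids = []
    for ident in pre.split("."):
        if ident.isdigit():
            ids.append((0, int(ident), ""))   # numeric: lower, compared numerically
        else:
            ids.append((1, 0, ident))         # alphanumeric: compared lexically
    return nums, (0, tuple(ids))

versions = ["1.0.0", "1.0.0-rc.1", "1.0.0-alpha.1", "1.0.0-alpha"]
print(sorted(versions, key=precedence_key))
# ['1.0.0-alpha', '1.0.0-alpha.1', '1.0.0-rc.1', '1.0.0']
```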




From: Mick Semb Wever 
Date: Thursday, 16 December 2021 at 16:31
To: dev@cassandra.apache.org 
Subject: Re: [DISCUSS] Periodic snapshot publishing with minor version bumps
> >
> > general feedback seems to be that users don't care so long as version
> > numbers are going up
>
> Curious to hear more about this. It doesn't match my intuition or
> experience running systems but I'm also n=1 and there's a lot of opinions
> in the world.
>
> Leap-frogged by Benedict's response here, but I'm in favor of something
> like:
> 4.1.0-SNAPSHOT-22Q1
> 4.1.0-SNAPSHOT-22Q2
> ...
>
> Keeps lexicographical comparison but also embeds the intent of the release
> and when it hit all in one bundle. And doesn't blow up our minor #'s and
> lead to confusion.
>


It's not semantic versioning, and it breaks the code and drivers.

The proposal is about publishing a semantic version, providing a structure
that is inclusive to the ecosystem and vendors, combined with the
limitations/challenges within the project around what version numbers are
valid.

The SNAPSHOT suffix/label here does not belong to semantic versioning. In
maven repositories it gets replaced with timestamps which is "Build
metadata" (#spec-item-10) for SemVer. Both suffixing and extending this
becomes messy (and again would not be SemVer).
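A minimal sketch of the spec-item-10 point above: build metadata (anything after a "+") is ignored when determining precedence, so two timestamped builds of the same version are equal as far as SemVer is concerned. The timestamp values below are invented for illustration:

```python
def strip_build_metadata(version: str) -> str:
    """SemVer spec item 10: build metadata (after '+') carries no
    precedence, so it is dropped before any version comparison."""
    return version.split("+", 1)[0]

# Two hypothetical timestamped snapshot builds of the same version:
a = "4.1.0+20211216.093300"
b = "4.1.0+20220101.120000"
print(strip_build_metadata(a) == strip_build_metadata(b))  # True: equal precedence
```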

Jumping version numbers can already happen when releases that don't pass
staging/voting are cast aside and a new release with a new version is cut
instead of using a "takeX" approach. This is simpler for the release
manager (no rolling back of version numbers and changelogs), and better
for those using development published artefacts (nightlies, staging, etc),
whose versions are not clobbered. Release manager practices aside, as a
user I agree with Brandon: what matters is that the version is greater and
whether the major/minor/patch numbers are greater.

I am interested in others' PoV, but I see this boiling down to a)
extending SemVer to these published dev artefacts (which enables and keeps
close the ecosystem and vendors), and b) accepting that the code and
drivers are fragile with versions and we need to keep it simple.

