Re: [VOTE] Release Apache Cassandra 4.0.6

2022-08-22 Thread Tommy Stendahl via dev
+1 nb

-----Original Message-----
From: Brandon Williams <dri...@gmail.com>
Reply-To: dev@cassandra.apache.org
To: dev <...@cassandra.apache.org>
Subject: Re: [VOTE] Release Apache Cassandra 4.0.6
Date: Mon, 22 Aug 2022 17:47:59 -0500


+1


On Sun, Aug 21, 2022 at 7:44 AM Mick Semb Wever <m...@apache.org> wrote:



Proposing the test build of Cassandra 4.0.6 for release.


sha1: eb2375718483f4c360810127ae457f2a26ccce67

Git:
https://gitbox.apache.org/repos/asf?p=cassandra.git;a=shortlog;h=refs/tags/4.0.6-tentative

Maven Artifacts:
https://repository.apache.org/content/repositories/orgapachecassandra-/org/apache/cassandra/cassandra-all/4.0.6/

The Source and Build Artifacts, and the Debian and RPM packages and
repositories, are available here:
https://dist.apache.org/repos/dist/dev/cassandra/4.0.6/



The vote will be open for 72 hours (longer if needed). Everyone who has tested 
the build is invited to vote. Votes by PMC members are considered binding. A 
vote passes if there are at least three binding +1s and no -1's.
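
The passing rule stated above can be sketched in a few lines of Python. This is only an illustration, not project tooling: the `Vote` type and `vote_passes` function are hypothetical names, and the sketch assumes that any -1 (binding or not) blocks the vote, which is one reading of "no -1's".

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Vote:
    voter: str
    value: int     # +1, 0 or -1
    binding: bool  # True when cast by a PMC member


def vote_passes(votes: list[Vote]) -> bool:
    """Return True if there are at least three binding +1s and no -1s.

    Assumption: a -1 from anyone blocks the vote, binding or not.
    """
    binding_plus_ones = sum(1 for v in votes if v.binding and v.value == 1)
    any_minus_one = any(v.value == -1 for v in votes)
    return binding_plus_ones >= 3 and not any_minus_one
```

Non-binding +1s (the "+1 nb" replies in this thread) are counted as participation but do not contribute to the threshold of three.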


[1]: CHANGES.txt:



https://gitbox.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=CHANGES.txt;hb=refs/tags/4.0.6-tentative


[2]: NEWS.txt:



https://gitbox.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=NEWS.txt;hb=refs/tags/4.0.6-tentative



Re: [DISCUSS] CEP-21: Transactional Cluster Metadata

2022-08-22 Thread Jeff Jirsa
“The proposed mechanism for dealing with both of these failure types is to
enable a manual operator override mode. This would allow operators to inject 
metadata changes (potentially overriding the complete metadata state) directly 
on any and all nodes in a cluster. At the most extreme end of the spectrum, 
this could allow an unrecoverably corrupt state to be rectified by composing a 
custom snapshot of cluster metadata and uploading it to all nodes in the 
cluster”

What do you expect this to look like in practice? JSON representation of the 
ring? Would reads and writes have halted? In what situations would the database 
be entirely unavailable? 



> On Aug 22, 2022, at 11:15 AM, Derek Chen-Becker  wrote:
> 
> 
> This looks really interesting; thanks for putting this together! Just so I'm 
> clear on CEP nomenclature, having external management of metadata as a 
> non-goal doesn't preclude some future use, correct? Coincidentally, I'm 
> working on my ApacheCon talk on improving modularity in Cassandra and one of 
> the ideas I'm discussing is pluggably (?) replacing gossip with something(s) 
> that allow us to externalize some of the complexity of maintaining 
> consistency. I need to digest the proposal you've made, but I don't see the 
> two ideas being at odds on my first read. 
> 
> Cheers,
> 
> Derek
> 
>> On Mon, Aug 22, 2022 at 6:45 AM Sam Tunnicliffe  wrote:
>> Hi,
>> 
>> I'd like to open discussion about this CEP: 
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-21%3A+Transactional+Cluster+Metadata
>>
>> Cluster metadata in Cassandra comprises a number of disparate elements 
>> including, but not limited to, distributed schema, topology and token 
>> ownership. Following the general design principles of Cassandra, the 
>> mechanisms for coordinating updates to cluster state have favoured eventual 
>> consistency, with probabilistic delivery via gossip being a prime example.
>> Undoubtedly, this approach has benefits, not least in terms of resilience, 
>> particularly in highly fluid distributed environments. However, this is not 
>> the reality of most Cassandra deployments, where the total number of nodes 
>> is relatively small (i.e. in the low thousands) and the rate of change tends 
>> to be low.  
>> 
>> Historically, a significant proportion of issues affecting operators and 
>> users of Cassandra have been due, at least in part, to a lack of strongly 
>> consistent cluster metadata. In response to this, we propose a design which 
>> aims to provide linearizability of metadata changes whilst ensuring that the 
>> effects of those changes are made visible to all nodes in a strongly 
>> consistent manner. At its core, it is also pluggable, enabling 
>> Cassandra-derived projects to supply their own implementations if desired.
>> 
>> In addition to the CEP document itself, we aim to publish a working 
>> prototype of the proposed design. Obviously, this does not implement the 
>> entire proposal and there are several parts which remain only partially 
>> complete. It does include the core of the system, including a good deal of 
>> test infrastructure, so may serve as both illustration of the design and a 
>> starting point for real implementation. 
>> 
> 
> 
> -- 
> +---+
> | Derek Chen-Becker |
> | GPG Key available at https://keybase.io/dchenbecker and   |
> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
> +---+
> 


[Marketing] For Review: Learn How CommitLog Works in Apache Cassandra

2022-08-22 Thread Chris Thornett
Opening up Alex Sorokoumov's guide 'Learn How CommitLog Works in Apache
Cassandra' for a 72-hr community review by lazy consensus.

Please add any amends and suggestions in the comments:
https://docs.google.com/document/d/1cyOi-IeU_I9GBkpQbJS6IIrmemAesEqvzLb-eeFs_rM/edit#

Thanks!

-- 

Chris Thornett
Senior Content Strategist, Constantia.io


Re: [VOTE] Release Apache Cassandra 4.0.6

2022-08-22 Thread Brandon Williams
+1

On Sun, Aug 21, 2022 at 7:44 AM Mick Semb Wever  wrote:
>
>
> Proposing the test build of Cassandra 4.0.6 for release.
>
> sha1: eb2375718483f4c360810127ae457f2a26ccce67
> Git: 
> https://gitbox.apache.org/repos/asf?p=cassandra.git;a=shortlog;h=refs/tags/4.0.6-tentative
> Maven Artifacts: 
> https://repository.apache.org/content/repositories/orgapachecassandra-/org/apache/cassandra/cassandra-all/4.0.6/
>
> The Source and Build Artifacts, and the Debian and RPM packages and 
> repositories, are available here: 
> https://dist.apache.org/repos/dist/dev/cassandra/4.0.6/
>
> The vote will be open for 72 hours (longer if needed). Everyone who has 
> tested the build is invited to vote. Votes by PMC members are considered 
> binding. A vote passes if there are at least three binding +1s and no -1's.
>
> [1]: CHANGES.txt: 
> https://gitbox.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=CHANGES.txt;hb=refs/tags/4.0.6-tentative
> [2]: NEWS.txt: 
> https://gitbox.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=NEWS.txt;hb=refs/tags/4.0.6-tentative


Re: [VOTE] Release Apache Cassandra 4.0.6

2022-08-22 Thread Francisco Guerrero
+1 (nb)

On 2022/08/22 20:08:11 Sylwester Lachiewicz wrote:
> +1 nb
> 
> On Mon, 22 Aug 2022 at 21:28, C. Scott Andreas wrote:
> >
> > +1nb
> >
> > On Aug 22, 2022, at 8:55 AM, Mick Semb Wever  wrote:
> >
> >
> >>
> >> The vote will be open for 72 hours (longer if needed). Everyone who has 
> >> tested the build is invited to vote. Votes by PMC members are considered 
> >> binding. A vote passes if there are at least three binding +1s and no -1's.
> >
> >
> >
> > +1
> >
> > Checked
> > - signing correct
> > - checksums are correct
> > - source artefact builds
> > - binary artefact runs
> > - debian package runs
> > - redhat package runs
> >
> 


Re: [VOTE] Release Apache Cassandra 4.0.6

2022-08-22 Thread Sylwester Lachiewicz
+1 nb

On Mon, 22 Aug 2022 at 21:28, C. Scott Andreas wrote:
>
> +1nb
>
> On Aug 22, 2022, at 8:55 AM, Mick Semb Wever  wrote:
>
>
>>
>> The vote will be open for 72 hours (longer if needed). Everyone who has 
>> tested the build is invited to vote. Votes by PMC members are considered 
>> binding. A vote passes if there are at least three binding +1s and no -1's.
>
>
>
> +1
>
> Checked
> - signing correct
> - checksums are correct
> - source artefact builds
> - binary artefact runs
> - debian package runs
> - redhat package runs
>


Re: [VOTE] Release Apache Cassandra 4.0.6

2022-08-22 Thread C. Scott Andreas

+1nb

On Aug 22, 2022, at 8:55 AM, Mick Semb Wever wrote:
>> The vote will be open for 72 hours (longer if needed). Everyone who has
>> tested the build is invited to vote. Votes by PMC members are considered
>> binding. A vote passes if there are at least three binding +1s and no -1's.
>
> +1
>
> Checked
> - signing correct
> - checksums are correct
> - source artefact builds
> - binary artefact runs
> - debian package runs
> - redhat package runs

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-22 Thread Henrik Ingo
One thought: The way the CEP is currently written, it is only possible to
mask a column one way. You can only define one masking function for a
column, and since you use the original column name, you could only return
one version of it in the result set, even if you had a way to define
several functions.

I'm not proposing this should change, just calling it out.

henrik

On Fri, Aug 19, 2022 at 2:50 PM Andrés de la Peña 
wrote:

> Hi everyone,
>
> I'd like to start a discussion about this proposal for dynamic data
> masking:
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-20%3A+Dynamic+Data+Masking
>
> Dynamic data masking allows obscuring sensitive information without
> changing the stored data. It would be based on a set of native CQL
> functions providing different types of masking, such as replacing the
> column value with "". These functions could be used as regular functions
> or attached to table columns with CREATE/ALTER TABLE. There would be a new
> UNMASK permission, so only the users with this permission would be able to
> see the unmasked column values. It would be possible to customize masking
> by using UDFs as masking functions.
>
> Thanks,
>


-- 

Henrik Ingo

+358 40 569 7354



Re: [DISCUSS] CEP-21: Transactional Cluster Metadata

2022-08-22 Thread Derek Chen-Becker
This looks really interesting; thanks for putting this together! Just so
I'm clear on CEP nomenclature, having external management of metadata as a
non-goal doesn't preclude some future use, correct? Coincidentally, I'm
working on my ApacheCon talk on improving modularity in Cassandra and one
of the ideas I'm discussing is pluggably (?) replacing gossip with
something(s) that allow us to externalize some of the complexity of
maintaining consistency. I need to digest the proposal you've made, but I
don't see the two ideas being at odds on my first read.

Cheers,

Derek

On Mon, Aug 22, 2022 at 6:45 AM Sam Tunnicliffe  wrote:

> Hi,
>
> I'd like to open discussion about this CEP:
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-21%3A+Transactional+Cluster+Metadata
> 
>
> Cluster metadata in Cassandra comprises a number of disparate elements
> including, but not limited to, distributed schema, topology and token
> ownership. Following the general design principles of Cassandra, the
> mechanisms for coordinating updates to cluster state have favoured eventual
> consistency, with probabilistic delivery via gossip being a prime example.
> Undoubtedly, this approach has benefits, not least in terms of resilience,
> particularly in highly fluid distributed environments. However, this is not
> the reality of most Cassandra deployments, where the total number of nodes
> is relatively small (i.e. in the low thousands) and the rate of change
> tends to be low.
>
> Historically, a significant proportion of issues affecting operators and
> users of Cassandra have been due, at least in part, to a lack of strongly
> consistent cluster metadata. In response to this, we propose a design which
> aims to provide linearizability of metadata changes whilst ensuring that
> the effects of those changes are made visible to all nodes in a strongly
> consistent manner. At its core, it is also pluggable, enabling
> Cassandra-derived projects to supply their own implementations if desired.
> In addition to the CEP document itself, we aim to publish a working
> prototype of the proposed design. Obviously, this does not implement the
> entire proposal and there are several parts which remain only partially
> complete. It does include the core of the system, including a good deal of
> test infrastructure, so may serve as both illustration of the design and a
> starting point for real implementation.
>
>

-- 
+---+
| Derek Chen-Becker |
| GPG Key available at https://keybase.io/dchenbecker and   |
| https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
| Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
+---+


Re: CEP-15 multi key transaction syntax

2022-08-22 Thread Avi Kivity via dev
I wasn't referring to specific syntax but to the concept. If a SQL 
dialect (or better, the standard) has a way to select data into a 
variable, let's adopt it.


If such syntax doesn't exist, LET (a, b, c) = (SELECT x, y, z FROM tab) 
is my preference.
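
To make the comparison concrete, here is a small Python sketch of the mechanical rewrite described in the quoted message further down (LET columns turned into SELECT ... AS columns). It is not part of CEP-15 or of any Cassandra code; `let_to_select` is a hypothetical helper, and it assumes upper-case keywords and column expressions that contain no commas.

```python
def let_to_select(let_stmt: str) -> str:
    """Rewrite 'LET a = x, b = y FROM t' as 'SELECT x AS a, y AS b FROM t'.

    Assumptions: keywords are upper case, and no column expression
    contains a comma (so function calls with several arguments break it).
    """
    body = let_stmt.strip()
    if not body.upper().startswith("LET "):
        raise ValueError("expected a statement starting with LET")
    columns, sep, rest = body[4:].partition(" FROM ")
    selectors = []
    for assignment in columns.split(","):
        # Split on the FIRST '=' only: everything after it is the expression.
        name, _, expr = assignment.partition("=")
        selectors.append(f"{expr.strip()} AS {name.strip()}")
    return "SELECT " + ", ".join(selectors) + sep + rest


print(let_to_select("LET key = pk, value = v FROM table"))
print(let_to_select("LET q = a = b FROM t"))
```

The second call shows the ambiguity raised in the thread: once "=" is also a comparison selector, the rewrite only works if the first "=" is always treated as the assignment.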


On 8/22/22 19:13, Patrick McFadin wrote:

The replies got trashed pretty badly in the responses.
When you say: "Agree it's better to reuse existing syntax than invent 
new syntax."


Which syntax are you referring to?

Patrick


On Mon, Aug 22, 2022 at 1:36 AM Avi Kivity via dev 
 wrote:


Agree it's better to reuse existing syntax than invent new syntax.

On 8/21/22 16:52, Konstantin Osipov wrote:
> * Avi Kivity via dev  [22/08/14 15:59]:
>
> MySQL supports SELECT  INTO  FROM ... WHERE
> ...
>
> PostgreSQL supports pretty much the same syntax.
>
> Maybe instead of LET use the ANSI/MySQL/PostgreSQL DECLARE var TYPE and
> MySQL/PostgreSQL SELECT ... INTO?
>
>> On 14/08/2022 01.29, Benedict Elliott Smith wrote:
>>> 
>>> I’ll do my best to express my thinking, as well as how I would
>>> explain the feature to a user.
>>>
>>> My mental model for LET statements is that they are simply SELECT
>>> statements where the columns that are selected become variables
>>> accessible anywhere in the scope of the transaction. That is to say, you
>>> should be able to run something like s/LET/SELECT and
>>> s/([^=]+)=([^,]+)(,|$)/\2 AS \1\3/g on the columns of a LET statement
>>> and produce a valid SELECT statement, and vice versa. Both should
>>> perform identically.
>>>
>>> e.g.
>>> SELECT pk AS key, v AS value FROM table
>>>
>>> =>
>>> LET key = pk, value = v FROM table
>>
>> "=" is a CQL/SQL operator. Cassandra doesn't support it yet, but SQL
>> supports selecting comparisons:
>>
>>
>> $ psql
>> psql (14.3)
>> Type "help" for help.
>>
>> avi=# SELECT 1 = 2, 3 = 3, NULL = NULL;
>>  ?column? | ?column? | ?column?
>> ----------+----------+----------
>>  f        | t        |
>> (1 row)
>>
>>
>> Using "=" as a syntactic element in LET would make SELECT and LET
>> incompatible once comparisons become valid selectors. Unless they become
>> mandatory (and then you'd write "LET q = a = b" if you wanted to select a
>> comparison).
>>
>>
>> I personally prefer the nested query syntax:
>>
>>
>>      LET (a, b, c) = (SELECT foo, bar, x+y FROM ...);
>>
>>
>> So there aren't two similar-but-not-quite-the-same syntaxes. SELECT is
>> immediately recognizable by everyone as a query, LET is not.
>>
>>
>>> Identical form, identical behaviour. Every statement should be directly
>>> translatable with some simple text manipulation.
>>>
>>> We can then make this more powerful for users by simply expanding SELECT
>>> statements, e.g. by permitting them to declare constants and tuples in
>>> the column results. In this scheme LET x = * is simply syntactic sugar
>>> for LET x = (pk, ck, field1, …) This scheme then supports options 2, 4
>>> and 5 all at once, consistently alongside each other.
>>>
>>> Option 6 is in fact very similar, but is strictly less flexible for the
>>> user as they have no way to declare multiple scalar variables without
>>> scoping them inside a tuple.
>>>
>>> e.g.
>>> LET key = pk, value = v FROM table
>>> IF key > 1 AND value > 1 THEN...
>>>
>>> =>
>>> LET row = SELECT pk AS key, v AS value FROM table
>>> IF row.key > 1 AND row.value > 1 THEN…
>>>
>>> However, both are expressible in the existing proposal, as if you prefer
>>> this naming scheme you can simply write
>>>
>>> LET row = (pk AS key, v AS value) FROM table
>>> IF row.key > 1 AND row.value > 1 THEN…
>>>
>>> With respect to auto converting single column results to a scalar, we do
>>> need a way for the user to say they care whether the row was null or the
>>> column. I think an implicit conversion here could be surprising. However
>>> we could implement tuple expressions anyway and let the user explicitly
>>> declare v as a tuple as Caleb has suggested for the existing proposal as
>>> well.
>>>
>>> Assigning constants or other values not selected from a table would also
>>> be a little clunky:
>>>
>>> LET v1 = someFunc(), v2 = someOtherFunc(?)
>>> IF v1 > 1 AND v2 > 1 THEN…
>>>
>>> =>
>>> LET row = SELECT someFunc() AS v1, someOtherFunc(?) AS v2
>>> IF row.v1 > 1 AND row.v2 > 1 THEN...
>>>
>>> That said, the proposals are /close/ to identical, it is just slightly
>>> more verbose and slightly less flexible.
>>>
>>> Which one would be most intuitive to users is hard to predict.
 

Re: CEP-15 multi key transaction syntax

2022-08-22 Thread Patrick McFadin
The replies got trashed pretty badly in the responses.
When you say: "Agree it's better to reuse existing syntax than invent new
syntax."

Which syntax are you referring to?

Patrick


On Mon, Aug 22, 2022 at 1:36 AM Avi Kivity via dev 
wrote:

> Agree it's better to reuse existing syntax than invent new syntax.
>
> On 8/21/22 16:52, Konstantin Osipov wrote:
> > * Avi Kivity via dev  [22/08/14 15:59]:
> >
> > MySQL supports SELECT  INTO  FROM ... WHERE
> > ...
> >
> > PostgreSQL supports pretty much the same syntax.
> >
> > Maybe instead of LET use the ANSI/MySQL/PostgreSQL DECLARE var TYPE and
> > MySQL/PostgreSQL SELECT ... INTO?
> >
> >> On 14/08/2022 01.29, Benedict Elliott Smith wrote:
> >>> 
> >>> I’ll do my best to express my thinking, as well as how I would
> >>> explain the feature to a user.
> >>>
> >>> My mental model for LET statements is that they are simply SELECT
> >>> statements where the columns that are selected become variables
> >>> accessible anywhere in the scope of the transaction. That is to say, you
> >>> should be able to run something like s/LET/SELECT and
> >>> s/([^=]+)=([^,]+)(,|$)/\2 AS \1\3/g on the columns of a LET statement
> >>> and produce a valid SELECT statement, and vice versa. Both should
> >>> perform identically.
> >>>
> >>> e.g.
> >>> SELECT pk AS key, v AS value FROM table
> >>>
> >>> =>
> >>> LET key = pk, value = v FROM table
> >>
> >> "=" is a CQL/SQL operator. Cassandra doesn't support it yet, but SQL
> >> supports selecting comparisons:
> >>
> >>
> >> $ psql
> >> psql (14.3)
> >> Type "help" for help.
> >>
> >> avi=# SELECT 1 = 2, 3 = 3, NULL = NULL;
> >>  ?column? | ?column? | ?column?
> >> ----------+----------+----------
> >>  f        | t        |
> >> (1 row)
> >>
> >>
> >> Using "=" as a syntactic element in LET would make SELECT and LET
> >> incompatible once comparisons become valid selectors. Unless they become
> >> mandatory (and then you'd write "LET q = a = b" if you wanted to select a
> >> comparison).
> >>
> >>
> >> I personally prefer the nested query syntax:
> >>
> >>
> >>  LET (a, b, c) = (SELECT foo, bar, x+y FROM ...);
> >>
> >>
> >> So there aren't two similar-but-not-quite-the-same syntaxes. SELECT is
> >> immediately recognizable by everyone as a query, LET is not.
> >>
> >>
> >>> Identical form, identical behaviour. Every statement should be directly
> >>> translatable with some simple text manipulation.
> >>>
> >>> We can then make this more powerful for users by simply expanding SELECT
> >>> statements, e.g. by permitting them to declare constants and tuples in
> >>> the column results. In this scheme LET x = * is simply syntactic sugar
> >>> for LET x = (pk, ck, field1, …) This scheme then supports options 2, 4
> >>> and 5 all at once, consistently alongside each other.
> >>>
> >>> Option 6 is in fact very similar, but is strictly less flexible for the
> >>> user as they have no way to declare multiple scalar variables without
> >>> scoping them inside a tuple.
> >>>
> >>> e.g.
> >>> LET key = pk, value = v FROM table
> >>> IF key > 1 AND value > 1 THEN...
> >>>
> >>> =>
> >>> LET row = SELECT pk AS key, v AS value FROM table
> >>> IF row.key > 1 AND row.value > 1 THEN…
> >>>
> >>> However, both are expressible in the existing proposal, as if you prefer
> >>> this naming scheme you can simply write
> >>>
> >>> LET row = (pk AS key, v AS value) FROM table
> >>> IF row.key > 1 AND row.value > 1 THEN…
> >>>
> >>> With respect to auto converting single column results to a scalar, we do
> >>> need a way for the user to say they care whether the row was null or the
> >>> column. I think an implicit conversion here could be surprising. However
> >>> we could implement tuple expressions anyway and let the user explicitly
> >>> declare v as a tuple as Caleb has suggested for the existing proposal as
> >>> well.
> >>>
> >>> Assigning constants or other values not selected from a table would also
> >>> be a little clunky:
> >>>
> >>> LET v1 = someFunc(), v2 = someOtherFunc(?)
> >>> IF v1 > 1 AND v2 > 1 THEN…
> >>>
> >>> =>
> >>> LET row = SELECT someFunc() AS v1, someOtherFunc(?) AS v2
> >>> IF row.v1 > 1 AND row.v2 > 1 THEN...
> >>>
> >>> That said, the proposals are /close/ to identical, it is just slightly
> >>> more verbose and slightly less flexible.
> >>>
> >>> Which one would be most intuitive to users is hard to predict. It might
> >>> be that Option 6 would be slightly easier, but I’m unsure if there would
> >>> be a huge difference.
> >>>
> >>>
>  On 13 Aug 2022, at 16:59, Patrick McFadin  wrote:
> 
>  I'm really happy to see CEP-15 getting closer to a final
>  implementation. I'm going to walk through my reasoning for your
>  proposals wrt trying to explain this to somebody new.
> 
>  Looking at all the options, the first thing that comes up for me is
>  the Cassandra project's complicated relationship with NULL.  We have
>  prior art with EXISTS/NOT EXISTS when 

Re: [VOTE] Release Apache Cassandra 4.0.6

2022-08-22 Thread Mick Semb Wever
>
>
> The vote will be open for 72 hours (longer if needed). Everyone who has
> tested the build is invited to vote. Votes by PMC members are considered
> binding. A vote passes if there are at least three binding +1s and no -1's.
>
>

+1

Checked
- signing correct
- checksums are correct
- source artefact builds
- binary artefact runs
- debian package runs
- redhat package runs
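
For anyone repeating the checksum step, a minimal Python sketch follows. It is not the script used for this vote; the function names and file layout are assumptions, and it expects the checksum file to hold the hex digest as its first whitespace-separated token.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, algorithm: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large artifacts need not fit in memory."""
    h = hashlib.new(algorithm)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def checksum_matches(artifact: Path, checksum_file: Path, algorithm: str = "sha256") -> bool:
    """Compare the artifact's digest with the first token of the checksum file."""
    expected = checksum_file.read_text().split()[0].lower()
    return file_digest(artifact, algorithm) == expected
```

A matching checksum only shows the download is intact; the signature check listed above is still needed to establish who produced the artifact.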


Re: [DISCUSS] CEP-21: Transactional Cluster Metadata

2022-08-22 Thread Benedict
I just want to say I’m really excited about this work. It’s one of the last 
remaining major inadequacies of the project that makes it hard for people to 
deploy, and hard for us to develop.

Can’t wait for it to be fixed.

> On 22 Aug 2022, at 13:45, Sam Tunnicliffe  wrote:
> Hi,
> 
> I'd like to open discussion about this CEP: 
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-21%3A+Transactional+Cluster+Metadata
> 
> Cluster metadata in Cassandra comprises a number of disparate elements 
> including, but not limited to, distributed schema, topology and token 
> ownership. Following the general design principles of Cassandra, the 
> mechanisms for coordinating updates to cluster state have favoured eventual 
> consistency, with probabilistic delivery via gossip being a prime example. 
> Undoubtedly, this approach has benefits, not least in terms of resilience, 
> particularly in highly fluid distributed environments. However, this is not 
> the reality of most Cassandra deployments, where the total number of nodes is 
> relatively small (i.e. in the low thousands) and the rate of change tends to 
> be low.  
> 
> Historically, a significant proportion of issues affecting operators and 
> users of Cassandra have been due, at least in part, to a lack of strongly 
> consistent cluster metadata. In response to this, we propose a design which 
> aims to provide linearizability of metadata changes whilst ensuring that the 
> effects of those changes are made visible to all nodes in a strongly 
> consistent manner. At its core, it is also pluggable, enabling 
> Cassandra-derived projects to supply their own implementations if desired.
> 
> In addition to the CEP document itself, we aim to publish a working prototype 
> of the proposed design. Obviously, this does not implement the entire 
> proposal and there are several parts which remain only partially complete. It 
> does include the core of the system, including a good deal of test 
> infrastructure, so may serve as both illustration of the design and a 
> starting point for real implementation. 


[DISCUSS] CEP-21: Transactional Cluster Metadata

2022-08-22 Thread Sam Tunnicliffe
Hi,

I'd like to open discussion about this CEP: 
https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-21%3A+Transactional+Cluster+Metadata
 

   
Cluster metadata in Cassandra comprises a number of disparate elements 
including, but not limited to, distributed schema, topology and token 
ownership. Following the general design principles of Cassandra, the mechanisms 
for coordinating updates to cluster state have favoured eventual consistency, 
with probabilistic delivery via gossip being a prime example. Undoubtedly, 
this approach has benefits, not least in terms of resilience, particularly in 
highly fluid distributed environments. However, this is not the reality of most 
Cassandra deployments, where the total number of nodes is relatively small 
(i.e. in the low thousands) and the rate of change tends to be low.  

Historically, a significant proportion of issues affecting operators and users 
of Cassandra have been due, at least in part, to a lack of strongly consistent 
cluster metadata. In response to this, we propose a design which aims to 
provide linearizability of metadata changes whilst ensuring that the effects of 
those changes are made visible to all nodes in a strongly consistent manner. At 
its core, it is also pluggable, enabling Cassandra-derived projects to supply 
their own implementations if desired.

In addition to the CEP document itself, we aim to publish a working prototype 
of the proposed design. Obviously, this does not implement the entire proposal 
and there are several parts which remain only partially complete. It does 
include the core of the system, including a good deal of test infrastructure, 
so may serve as both illustration of the design and a starting point for real 
implementation. 



Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-22 Thread Andrés de la Peña
>
> Isn't there an assumption here that encryption can not be used?  Would we
> not be better served to build in an encryption strategy that keeps the data
> encrypted until the user shows permissions to decrypt, like the unmask
> property?  An encryption strategy that can work within the Cassandra
> internals?
> I think that issue is that there are some data fields that should not be
> discoverable by unauthorized users/systems, and I think this solution masks
> that issue.  I fear that this capability will be seized upon by pointy
> haired managers as a cheaper alternative to encryption, regardless of the
> warnings otherwise, and that as a whole will harm the Cassandra ecosystem.


Data encryption, access permissions and data masking are different
solutions to different problems. We don't have to choose between them, and
indeed we should aim to support the three of them at some point. None of
these features impedes the implementation of the others. Actually, it is quite
common for popular databases to provide all of them.

Data encryption should protect the data files from anyone that has direct
access to the data files, such as sstables, commitlog, etc. It offers
protection outside the interfaces of the database. Of course there is also
encryption of communications.

Permissions should completely prevent the access of unauthorized users to
the data within the database interface. Currently we have permissions on
CQL at the keyspace and table level, but we are missing column-level
permissions.

Data masking obfuscates all or part of the data without totally forbidding
access to it. The key here is that the masked data can still contain parts
of the original information, or be representative enough. For example,
masking can obfuscate all the digits of a credit card number except the
last four, so the clear digits can be used for some degree of
identification. As another example, a masking function returning the hash
would allow joining the masked data of different sources without exposing
it.
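
As a rough illustration of these two examples, here is a Python sketch. The function names are hypothetical and are not the CQL functions proposed in the CEP; the hash variant uses plain SHA-256 purely for demonstration.

```python
import hashlib


def mask_all_but_last(value: str, keep_last: int = 4, pad: str = "*") -> str:
    """Replace every character except the last `keep_last` with `pad`."""
    if keep_last <= 0:
        return pad * len(value)
    hidden = max(len(value) - keep_last, 0)
    return pad * hidden + value[hidden:]


def mask_hash(value: str) -> str:
    """Return a deterministic digest, so equal inputs stay joinable."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


print(mask_all_but_last("4111111111111111"))
print(mask_hash("alice@example.com") == mask_hash("alice@example.com"))
```

Because the hash is deterministic, two sources masking the same value produce the same output and can be joined on it. Note that an unsalted hash of a low-entropy value (such as a short number) can be guessed by brute force, so this is obfuscation rather than protection.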

An example of how data masking and permissions can be used together could
be a company storing the social security numbers (SSN) of its customers.
The accounting team might need full access to the stored SSNs. Employees
attending phone calls might need to ask for the last two digits of SSN for
identification purposes, so they would need masked access. The rest of the
organization would need no access at all.
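
The three access levels in this example could be modelled as below. This is a sketch only: the role names, the `Access` enum and `read_ssn` are invented for illustration and do not correspond to Cassandra's permission model or to the UNMASK permission as it would be implemented.

```python
from enum import Enum


class Access(Enum):
    FULL = "full"      # sees the stored value
    MASKED = "masked"  # sees only the last two digits
    NONE = "none"      # cannot read the column at all


# Hypothetical role assignments matching the example in the text.
ROLE_ACCESS = {
    "accounting": Access.FULL,
    "phone_support": Access.MASKED,
}


def read_ssn(ssn: str, role: str) -> str:
    """Return the SSN as the given role is allowed to see it."""
    access = ROLE_ACCESS.get(role, Access.NONE)
    if access is Access.FULL:
        return ssn
    if access is Access.MASKED:
        return "*" * (len(ssn) - 2) + ssn[-2:]
    raise PermissionError(f"role {role!r} may not read this column")
```

The point of the sketch is that masking and permissions answer different questions: the permission decides whether a role may read the column at all, and the mask decides how much of the value a permitted reader sees.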

This CEP focuses exclusively on data masking, but there is no reason not to
start parallel work on other related-but-different features like
column-level permissions or on-disk data encryption.




On Mon, 22 Aug 2022 at 07:05, Claude Warren, Jr via dev <
dev@cassandra.apache.org> wrote:

> I am more interested in the motivation where it is stated:
>
> Many users have the need of masking sensitive data, such as contact info,
>> age, gender, credit card numbers, etc. Dynamic data masking (DDM) allows to
>> obscure sensitive information while still allowing access to the masked
>> columns, and without changing the stored data.
>
>
> There is an unspoken assumption that the stored data format can not be
> changed.  It feels like this solution is starting from a false premise.
> Throughout the document there are guard statements about how this does not
> replace encryption.  Isn't there an assumption here that encryption can not
> be used?  Would we not be better served to build in an encryption strategy
> that keeps the data encrypted until the user shows permissions to decrypt,
> like the unmask property?  An encryption strategy that can work within the
> Cassandra internals?
>
> I think that issue is that there are some data fields that should not be
> discoverable by unauthorized users/systems, and I think this solution masks
> that issue.  I fear that this capability will be seized upon by pointy
> haired managers as a cheaper alternative to encryption, regardless of the
> warnings otherwise, and that as a whole will harm the Cassandra ecosystem.
>
> Yes, encryption is more difficult to implement and will take longer, but
> this feels like a sticking plaster that distracts from that underlying
> issue.
>
> my 0.02
>
> On Mon, Aug 22, 2022 at 12:30 AM Andrés de la Peña 
> wrote:
>
>> > If the column names are the same for masked and unmasked data, it would
>>> impact existing applications. I am curious what the transition plan look
>>> like for applications that expect unmasked data?
>>
>> For example, let’s say you store SSNs and Birth dates. Upon enabling this
>>> feature, let’s say the app user is not given the UNMASK permission. Now the
>>> app is receiving masked values for these columns. This is fine for most
>>> read only applications. However, a lot of times these columns may be used
>>> as primary keys or part of primary keys in other tables. This would break
>>> existing applications.
>>> How would this work in mixed mode when  ew nodes in the cluster are
>>> masking data and others aren’t? How would it 

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-22 Thread Andrés de la Peña
>
> Maybe a small improvement is the redacted value could be of the form
> `XXX1...1000` meaning XXX followed by a rand number from 1 to 1000: XXX54,
> XXX998, XXX456,... Some randomness would prevent some apps flattening all
> rows to a single XXX'ed one, giving a more realistic redacted data
> distribution/structure.


I'm not sure I understand why that would be useful. Why would random
suffixes give us a more realistic redacted data distribution? If we want to
avoid always returning the same value, we could use a function that just
returns a random value, without the XXX part, so we can use any data
type. Microsoft's SQL Server and Azure SQL have this function among their
masking functions.

Nevertheless, it would be quite easy to keep adding new masking functions
when we need them.
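
To make the two options above concrete, here is a minimal Python sketch.
The function names (mask_with_suffix, mask_random) and the 1 to 1000 range
are hypothetical and chosen only for illustration; they are neither the
functions proposed in the CEP nor the SQL Server API.

```python
import random


def mask_with_suffix(value, rng, prefix="XXX", upper=1000):
    """Redact any value as PREFIX plus a random integer in [1, upper].

    The result is always text, whatever the type of the input column.
    """
    return f"{prefix}{rng.randint(1, upper)}"


def mask_random(value, rng, upper=1000):
    """Replace an integer with a random integer in [1, upper].

    The result keeps the numeric type of the column; no prefix is added.
    This sketch only covers integers, other types would need their own
    random generator.
    """
    if not isinstance(value, int):
        raise TypeError("this sketch only handles integer columns")
    return rng.randint(1, upper)


rng = random.Random(0)  # seeded only to make the demo repeatable
ages = [31, 47, 31]
print([mask_with_suffix(a, rng) for a in ages])
print([mask_random(a, rng) for a in ages])
```

In both variants the two equal inputs (31) map to unrelated outputs, so
neither form preserves keys or cross-references between rows.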

On Mon, 22 Aug 2022 at 06:52, Berenguer Blasi 
wrote:

> Maybe a small improvement is the redacted value could be of the form
> `XXX1...1000` meaning XXX followed by a rand number from 1 to 1000: XXX54,
> XXX998, XXX456,... Some randomness would prevent some apps flattening all
> rows to a single XXX'ed one, giving a more realistic redacted data
> distribution/structure.
>
> I am not sure about its value either, as that would still break any key
> or other cross-referencing.
>
> My 2cts.
> On 22/8/22 1:30, Andrés de la Peña wrote:
>
> > If the column names are the same for masked and unmasked data, it would
>> impact existing applications. I am curious what the transition plan looks
>> like for applications that expect unmasked data?
>
> For example, let’s say you store SSNs and Birth dates. Upon enabling this
>> feature, let’s say the app user is not given the UNMASK permission. Now the
>> app is receiving masked values for these columns. This is fine for most
>> read only applications. However, a lot of times these columns may be used
>> as primary keys or part of primary keys in other tables. This would break
>> existing applications.
>> How would this work in mixed mode when  ew nodes in the cluster are
>> masking data and others aren’t? How would it impact the driver?
>> How would the application learn that the column values are masked? This
>> is important in case a user has UNMASK permission and it is later taken
>> away. Again this would break a lot of applications.
>
>
> Changing the masking of a column is a schema change, and as such it can be
> risky for existing applications. However, unlike deleting a column
> or revoking a SELECT permission, suddenly activating masking might pass
> undetected for existing applications.
>
> Applications developed after the introduction of this feature can check
> the table schema to know if a column is masked or not. We can even add a
> specific system view to ease this, if we think it's worth it. However,
> administrators should not activate masking when there could be applications
> that are not aware of the feature. We should be clear about this in the
> documentation.
>
> This is the way data masking seems to work in the databases I've checked.
> I also thought that we could just change the name of the column when it's
> masked to something like "masked(column_name)", as it is discussed in the CEP
> document. This would make it impossible to miss that a column is masked.
> However, applications should be prepared to use different column names when
> reading result sets, depending on whether the data is masked for them or
> not. None of the databases mentioned in the "other databases" section of
> the CEP does this kind of column renaming, so it might be a kind of exotic
> behaviour. wdyt?
>
> On Fri, 19 Aug 2022 at 19:17, Andrés de la Peña 
> wrote:
>
>> > This type of feature is very useful, but it may be easier to analyze
>>> this proposal if it’s compared with other DDM implementations from other
>>> databases? Would it be reasonable to add a table to the proposal comparing
>>> syntax and output from eg Azure SQL vs Cassandra vs whatever ?
>>
>>
>> Good idea. I have added a section at the end of the document briefly
>> describing how some other databases deal with data masking, and with links
>> to their documentation for the topic. I am not an expert in any of those
>> databases, so please take my comments there with a grain of salt.
>>
>> On Fri, 19 Aug 2022 at 17:30, Jeff Jirsa  wrote:
>>
>>> This type of feature is very useful, but it may be easier to analyze
>>> this proposal if it’s compared with other DDM implementations from other
>>> databases? Would it be reasonable to add a table to the proposal comparing
>>> syntax and output from eg Azure SQL vs Cassandra vs whatever ?
>>>
>>>
>>> On Aug 19, 2022, at 4:50 AM, Andrés de la Peña 
>>> wrote:
>>>
>>> 
>>> Hi everyone,
>>>
>>> I'd like to start a discussion about this proposal for dynamic data
>>> masking:
>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-20%3A+Dynamic+Data+Masking
>>>
>>> Dynamic data masking allows obscuring sensitive information without
>>> changing the stored data. It 

Re: [VOTE] Release Apache Cassandra 4.0.6

2022-08-22 Thread Mick Semb Wever
> Maven Artifacts:
> https://repository.apache.org/content/repositories/orgapachecassandra-/org/apache/cassandra/cassandra-all/4.0.6/
>


Correction.
Maven Artifacts are at:
https://repository.apache.org/content/repositories/orgapachecassandra-1275/org/apache/cassandra/cassandra-all/4.0.6/


Re: CEP-15 multi key transaction syntax

2022-08-22 Thread Avi Kivity via dev

Agree it's better to reuse existing syntax than invent new syntax.

On 8/21/22 16:52, Konstantin Osipov wrote:

* Avi Kivity via dev  [22/08/14 15:59]:

MySQL supports SELECT ... INTO ... FROM ... WHERE ...

PostgreSQL supports pretty much the same syntax.

Maybe instead of LET use the ANSI/MySQL/PostgreSQL DECLARE var TYPE and
MySQL/PostgreSQL SELECT ... INTO?


On 14/08/2022 01.29, Benedict Elliott Smith wrote:


I’ll do my best to express my thinking, as well as how I would
explain the feature to a user.

My mental model for LET statements is that they are simply SELECT
statements where the columns that are selected become variables
accessible anywhere in the scope of the transaction. That is to say, you
should be able to run something like s/LET/SELECT and
s/([^=]+)=([^,]+)(,|$)/\2 AS \1\3/g on the columns of a LET statement
and produce a valid SELECT statement, and vice versa. Both should
perform identically.

e.g.
SELECT pk AS key, v AS value FROM table

=>
LET key = pk, value = v FROM table
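
The text manipulation described above can be sketched in Python. This is
an illustration under stated assumptions, not the CEP-15 parser: the
helper names let_to_select and select_to_let are hypothetical, and the
column list is split naively on commas, so a selector containing a comma
(for example a function call with several arguments) is not handled.

```python
import re

# Splits "<KEYWORD> <columns> FROM <rest>" at the first FROM.
_STATEMENT = r"{keyword}\s+(.*?)\s+(FROM\s+.*)"
_FLAGS = re.IGNORECASE | re.DOTALL


def let_to_select(statement):
    """Rewrite 'LET a = x, b = y FROM t' as 'SELECT x AS a, y AS b FROM t'."""
    match = re.fullmatch(_STATEMENT.format(keyword="LET"), statement, _FLAGS)
    if match is None:
        raise ValueError("expected LET ... FROM ...")
    columns, tail = match.groups()
    parts = []
    for item in columns.split(","):
        # Split on the first '=' only: the name is on the left,
        # the selected expression on the right.
        name, sep, expr = item.partition("=")
        if not sep or not name.strip() or not expr.strip():
            raise ValueError(f"expected 'name = expression', got {item!r}")
        parts.append(f"{expr.strip()} AS {name.strip()}")
    return "SELECT " + ", ".join(parts) + " " + tail


def select_to_let(statement):
    """Rewrite 'SELECT x AS a, y AS b FROM t' as 'LET a = x, b = y FROM t'."""
    match = re.fullmatch(_STATEMENT.format(keyword="SELECT"), statement, _FLAGS)
    if match is None:
        raise ValueError("expected SELECT ... FROM ...")
    columns, tail = match.groups()
    parts = []
    for item in columns.split(","):
        aliased = re.fullmatch(r"\s*(.+?)\s+AS\s+(\w+)\s*", item, _FLAGS)
        if aliased is None:
            raise ValueError(f"expected 'expression AS name', got {item!r}")
        expr, name = aliased.groups()
        parts.append(f"{name} = {expr}")
    return "LET " + ", ".join(parts) + " " + tail


print(let_to_select("LET key = pk, value = v FROM table"))
print(select_to_let("SELECT pk AS key, v AS value FROM table"))
```

Applied to the example above, each helper produces the other form of the
statement, so the two round-trip.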


"=" is a CQL/SQL operator. Cassandra doesn't support it yet, but SQL
supports selecting comparisons:


$ psql
psql (14.3)
Type "help" for help.

avi=# SELECT 1 = 2, 3 = 3, NULL = NULL;
 ?column? | ?column? | ?column?
----------+----------+----------
 f        | t        |
(1 row)


Using "=" as a syntactic element in LET would make SELECT and LET
incompatible once comparisons become valid selectors. Unless they become
mandatory (and then you'd write "LET q = a = b" if you wanted to select a
comparison).
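
The three-valued result in the psql session can also be reproduced from
Python with the standard library's sqlite3 module. SQLite stands in for
PostgreSQL here, which is an assumption of this sketch: it has no boolean
type, so false and true come back as 0 and 1, and SQL NULL is returned as
Python's None.

```python
import sqlite3

# In-memory database; no table is needed to select constant comparisons.
conn = sqlite3.connect(":memory:")
row = conn.execute("SELECT 1 = 2, 3 = 3, NULL = NULL").fetchone()
conn.close()

# "=" acts as a comparison operator in the select list:
# false, true, and NULL (unknown) respectively.
print(row)  # (0, 1, None)
```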


I personally prefer the nested query syntax:


     LET (a, b, c) = (SELECT foo, bar, x+y FROM ...);


So there aren't two similar-but-not-quite-the-same syntaxes. SELECT is
immediately recognizable by everyone as a query, LET is not.



Identical form, identical behaviour. Every statement should be directly
translatable with some simple text manipulation.

We can then make this more powerful for users by simply expanding SELECT
statements, e.g. by permitting them to declare constants and tuples in
the column results. In this scheme LET x = * is simply syntactic sugar
for LET x = (pk, ck, field1, …) This scheme then supports options 2, 4
and 5 all at once, consistently alongside each other.

Option 6 is in fact very similar, but is strictly less flexible for the
user as they have no way to declare multiple scalar variables without
scoping them inside a tuple.

e.g.
LET key = pk, value = v FROM table
IF key > 1 AND value > 1 THEN...

=>
LET row = SELECT pk AS key, v AS value FROM table
IF row.key > 1 AND row.value > 1 THEN…

However, both are expressible in the existing proposal, as if you prefer
this naming scheme you can simply write

LET row = (pk AS key, v AS value) FROM table
IF row.key > 1 AND row.value > 1 THEN…

With respect to auto converting single column results to a scalar, we do
need a way for the user to say they care whether the row was null or the
column. I think an implicit conversion here could be surprising. However
we could implement tuple expressions anyway and let the user explicitly
declare v as a tuple as Caleb has suggested for the existing proposal as
well.

Assigning constants or other values not selected from a table would also
be a little clunky:

LET v1 = someFunc(), v2 = someOtherFunc(?)
IF v1 > 1 AND v2 > 1 THEN…

=>
LET row = SELECT someFunc() AS v1, someOtherFunc(?) AS v2
IF row.v1 > 1 AND row.v2 > 1 THEN...

That said, the proposals are /close/ to identical, it is just slightly
more verbose and slightly less flexible.

Which one would be most intuitive to users is hard to predict. It might
be that Option 6 would be slightly easier, but I’m unsure if there would
be a huge difference.



On 13 Aug 2022, at 16:59, Patrick McFadin  wrote:

I'm really happy to see CEP-15 getting closer to a final
implementation. I'm going to walk through my reasoning for your
proposals wrt trying to explain this to somebody new.

Looking at all the options, the first thing that comes up for me is
the Cassandra project's complicated relationship with NULL.  We have
prior art with EXISTS/NOT EXISTS when creating new tables. IS
NULL/IS NOT NULL is used in materialized views similarly to
proposals 2, 4 and 5.

CREATE MATERIALIZED VIEW [ IF NOT EXISTS ] [keyspace_name.]view_name
   AS SELECT [ (column_list) ]
   FROM [keyspace_name.]table_name
   [ WHERE column_name IS NOT NULL
   [ AND column_name IS NOT NULL ... ] ]
   [ AND relation [ AND ... ] ]
   PRIMARY KEY ( column_list )
   [ WITH [ table_properties ]
   [ [ AND ] CLUSTERING ORDER BY (cluster_column_name order_option) ] ] ;

  Based on that, I believe 1 and 3 would just confuse users, so -1 on
those.

Trying to explain the difference between row and column operations
with LET, I can't see the difference between a row and column in #2.

#4 introduces a boolean instead of column names and just adds more
syntax.

#5 is verbose and, in my opinion, easier to reason about when writing a
query. Thinking top down, I need to know if these exact rows and/or

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-22 Thread Claude Warren, Jr via dev
I am more interested in the motivation where it is stated:

> Many users have the need of masking sensitive data, such as contact info,
> age, gender, credit card numbers, etc. Dynamic data masking (DDM) allows to
> obscure sensitive information while still allowing access to the masked
> columns, and without changing the stored data.


There is an unspoken assumption that the stored data format can not be
changed.  It feels like this solution is starting from a false premise.
Throughout the document there are guard statements about how this does not
replace encryption.  Isn't there an assumption here that encryption can not
be used?  Would we not be better served to build in an encryption strategy
that keeps the data encrypted until the user shows permissions to decrypt,
like the unmask property?  An encryption strategy that can work within the
Cassandra internals?

I think the issue is that there are some data fields that should not be
discoverable by unauthorized users/systems, and I think this solution masks
that issue.  I fear that this capability will be seized upon by pointy
haired managers as a cheaper alternative to encryption, regardless of the
warnings otherwise, and that as a whole will harm the Cassandra ecosystem.

Yes, encryption is more difficult to implement and will take longer, but
this feels like a sticking plaster that distracts from that underlying
issue.

my 0.02

On Mon, Aug 22, 2022 at 12:30 AM Andrés de la Peña 
wrote:

> > If the column names are the same for masked and unmasked data, it would
>> impact existing applications. I am curious what the transition plan looks
>> like for applications that expect unmasked data?
>
> For example, let’s say you store SSNs and Birth dates. Upon enabling this
>> feature, let’s say the app user is not given the UNMASK permission. Now the
>> app is receiving masked values for these columns. This is fine for most
>> read only applications. However, a lot of times these columns may be used
>> as primary keys or part of primary keys in other tables. This would break
>> existing applications.
>> How would this work in mixed mode when  ew nodes in the cluster are
>> masking data and others aren’t? How would it impact the driver?
>> How would the application learn that the column values are masked? This
>> is important in case a user has UNMASK permission and it is later taken
>> away. Again this would break a lot of applications.
>
>
> Changing the masking of a column is a schema change, and as such it can be
> risky for existing applications. However, unlike deleting a column
> or revoking a SELECT permission, suddenly activating masking might pass
> undetected for existing applications.
>
> Applications developed after the introduction of this feature can check
> the table schema to know if a column is masked or not. We can even add a
> specific system view to ease this, if we think it's worth it. However,
> administrators should not activate masking when there could be applications
> that are not aware of the feature. We should be clear about this in the
> documentation.
>
> This is the way data masking seems to work in the databases I've checked.
> I also thought that we could just change the name of the column when it's
> masked to something like "masked(column_name)", as it is discussed in the CEP
> document. This would make it impossible to miss that a column is masked.
> However, applications should be prepared to use different column names when
> reading result sets, depending on whether the data is masked for them or
> not. None of the databases mentioned in the "other databases" section of
> the CEP does this kind of column renaming, so it might be a kind of exotic
> behaviour. wdyt?
>
> On Fri, 19 Aug 2022 at 19:17, Andrés de la Peña 
> wrote:
>
>> > This type of feature is very useful, but it may be easier to analyze
>>> this proposal if it’s compared with other DDM implementations from other
>>> databases? Would it be reasonable to add a table to the proposal comparing
>>> syntax and output from eg Azure SQL vs Cassandra vs whatever ?
>>
>>
>> Good idea. I have added a section at the end of the document briefly
>> describing how some other databases deal with data masking, and with links
>> to their documentation for the topic. I am not an expert in any of those
>> databases, so please take my comments there with a grain of salt.
>>
>> On Fri, 19 Aug 2022 at 17:30, Jeff Jirsa  wrote:
>>
>>> This type of feature is very useful, but it may be easier to analyze
>>> this proposal if it’s compared with other DDM implementations from other
>>> databases? Would it be reasonable to add a table to the proposal comparing
>>> syntax and output from eg Azure SQL vs Cassandra vs whatever ?
>>>
>>>
>>> On Aug 19, 2022, at 4:50 AM, Andrés de la Peña 
>>> wrote:
>>>
>>> 
>>> Hi everyone,
>>>
>>> I'd like to start a discussion about this proposal for dynamic data
>>> masking:
>>>