Re: Welcome Maxim Muzafarov as Cassandra Committer

2024-01-09 Thread Andrés de la Peña
Congrats, Maxim!

On Tue, 9 Jan 2024 at 03:45, guo Maxwell  wrote:

> Congratulations, Maxim!
>
> On Tue, 9 Jan 2024 at 09:00, Francisco Guerrero wrote:
>
>> Congratulations, Maxim! Well deserved!
>>
>> On 2024/01/08 18:19:04 Josh McKenzie wrote:
>> > The Apache Cassandra PMC is pleased to announce that Maxim Muzafarov
>> has accepted
>> > the invitation to become a committer.
>> >
>> > Thanks for all the hard work and collaboration on the project thus far,
>> and we're all looking forward to working more with you in the future.
>> Congratulations and welcome!
>> >
>> > The Apache Cassandra PMC members
>> >
>> >
>>
>


Re: Welcome Mike Adamson as Cassandra committer

2023-12-08 Thread Andrés de la Peña
Congrats Mike!

On Fri, 8 Dec 2023 at 14:53, Jeremiah Jordan 
wrote:

> Congrats Mike!  Thanks for all your work on SAI and Vector index.  Well
> deserved!
>
> On Dec 8, 2023 at 8:52:07 AM, Brandon Williams  wrote:
>
>> Congratulations Mike!
>>
>> Kind Regards,
>> Brandon
>>
>> On Fri, Dec 8, 2023 at 8:41 AM Benjamin Lerer  wrote:
>>
>>
>> The PMC members are pleased to announce that Mike Adamson has accepted
>>
>> the invitation to become committer.
>>
>>
>> Thanks a lot, Mike, for everything you have done for the project.
>>
>>
>> Congratulations and welcome
>>
>>
>> The Apache Cassandra PMC members
>>
>>


Re: Welcome Francisco Guerrero Hernandez as Cassandra Committer

2023-11-29 Thread Andrés de la Peña
Congrats Francisco!

On Wed, 29 Nov 2023 at 11:37, Benjamin Lerer  wrote:

> Congratulations!!! Well deserved!
>
> On Wed, 29 Nov 2023 at 07:31, Berenguer Blasi wrote:
>
>> Welcome!
>> On 29/11/23 2:24, guo Maxwell wrote:
>>
>> Congrats!
>>
>> On Wed, 29 Nov 2023 at 06:16, Jacek Lewandowski wrote:
>>
>>> Congrats!!!
>>>
>>> On Tue, 28 Nov 2023 at 23:08, Abe Ratnofsky wrote:
>>>
 Congrats Francisco!

 > On Nov 28, 2023, at 1:56 PM, C. Scott Andreas 
 wrote:
 >
 > Congratulations, Francisco!
 >
 > - Scott
 >
 >> On Nov 28, 2023, at 10:53 AM, Dinesh Joshi 
 wrote:
 >>
 >> The PMC members are pleased to announce that Francisco Guerrero
 Hernandez has accepted
 >> the invitation to become committer today.
 >>
 >> Congratulations and welcome!
 >>
 >> The Apache Cassandra PMC members




Re: Push TCM (CEP-21) and Accord (CEP-15) to 5.1 (and cut an immediate 5.1-alpha1)

2023-10-31 Thread Andrés de la Peña
I'd add that even if we commit to running CI to verify that we are not
introducing new test failures, we can always inadvertently introduce new
flakies. Those flakies can be hit long after the original commit,
for example while trying to make a release.

On Tue, 31 Oct 2023 at 17:08, Paulo Motta  wrote:

> Even if it was not formally prescribed as far as I understand, we have
> been following the "only merge on Green CI" custom as much as possible for
> the past several years. Is the proposal to relax this rule for 5.0?
>
> On Tue, Oct 31, 2023 at 1:02 PM Jeremiah Jordan 
> wrote:
>
>> You are free to argue validity.  I am just stating what I see on the
>> mailing list and in the wiki.  We had a vote which was called passing and
>> was not contested at that time.  The vote was on a process which includes
>> as #3 in the list:
>>
>>
>>1. Before a merge, a committer needs either a non-regressing (i.e. no
>>new failures) run of circleci with the required test suites (TBD; see
>>below) or of ci-cassandra.
>>   1. Non-regressing is defined here as "Doesn't introduce any new
>>   test failures; any new failures in CI are clearly not attributable to 
>> this
>>   diff"
>>   2. (NEW) After merging tickets, ci-cassandra runs against the SHA
>>   and the author gets an advisory update on the related JIRA for any new
>>   errors on CI. The author of the ticket will take point on triaging 
>> this new
>>   failure and either fixing (if clearly reproducible or related to their
>>   work) or opening a JIRA for the intermittent failure and linking it in
>>   butler (https://butler.cassandra.apache.org/#/)
>>
>>
>> Which clearly says that before merge we ensure there are no known new
>> regressions to CI.
>>
>> The allowance for releases without CI being green, and merges without the
>> CI being completely green are from the fact that our trunk CI has rarely
>> been completely green, so we allow merging things which do not introduce
>> NEW regressions, and we allow releases with known regressions that are
>> deemed acceptable.
>>
>> We can indeed always vote to override it, and if it comes to that we can
>> consider that as an option.
>>
>> -Jeremiah
>>
>> On Oct 31, 2023 at 11:41:29 AM, Benedict  wrote:
>>
>>> That vote thread also did not reach the threshold; it was incorrectly
>>> counted, as committer votes are not binding for procedural changes. I
>>> counted at most 8 PMC +1 votes.
>>>
>>> The focus of that thread was also clearly GA releases and merges on such
>>> branches, since there was a focus on releases being failure-free. But this
>>> predates the more general release lifecycle vote that allows for alphas to
>>> have failing tests - which logically would be impossible if nothing were
>>> merged with failing or flaky tests.
>>>
>>> Either way, the vote and discussion specifically allow for this to be
>>> overridden.
>>>
>>> 🤷‍♀️
>>>
>>> On 31 Oct 2023, at 16:29, Jeremiah Jordan 
>>> wrote:
>>>
>>> 
>>> I never said there was a need for green CI for alpha.  We do have a
>>> requirement for not merging things to trunk that have known regressions in
>>> CI.
>>> Vote here:
>>> https://lists.apache.org/thread/j34mrgcy9wrtn04nwwymgm6893h0xwo9
>>>
>>>
>>>
>>> On Oct 31, 2023 at 3:23:48 AM, Benedict  wrote:
>>>
 There is no requirement for green CI on alpha. We voted last year to
 require running all tests before commit and to require green CI for beta
 releases. This vote was invalid because it didn’t reach the vote floor for
 a procedural change but anyway is not inconsistent with knowingly and
 selectively merging work without green CI.

 If we reach the summit we should take a look at the state of the PRs
 and make a decision about if they are alpha quality; if so, and we want a
 release, we should simply merge it and release. Making up a new release
 type when the work meets alpha standard to avoid an arbitrary and not
 mandated commit bar seems the definition of silly.

 On 31 Oct 2023, at 04:34, J. D. Jordan 
 wrote:

 
 That is my understanding as well. If the TCM and Accord based on TCM
 branches are ready to commit by ~12/1 we can cut a 5.1 branch and then a
 5.1-alpha release.
 Where “ready to commit” means our usual things of two committer +1 and
 green CI etc.

 If we are not ready to commit then I propose that as long as everything
 in the accord+tcm Apache repo branch has had two committer +1’s, but maybe
 people are still working on fixes for getting CI green or similar, we cut a
 5.1-preview build from the feature branch to vote on with known issues
 documented.  This would not be the preferred path, but would be a way to
 have a voted on release for summit.

 -Jeremiah

 On Oct 30, 2023, at 5:59 PM, Mick Semb Wever  wrote:

 

 Hoping we can get clarity on this.

 The proposal was, once TCM and Accord 

Re: [VOTE] Accept java-driver

2023-10-04 Thread Andrés de la Peña
+1

On Wed, 4 Oct 2023 at 05:44, Berenguer Blasi 
wrote:

> +1
> On 4/10/23 4:43, Erick Ramirez wrote:
>
> +1 
>
>>


Re: [DISCUSS] Vector type and empty value

2023-09-22 Thread Andrés de la Peña
I have just created CASSANDRA-18876 for this. I'll post a patch very soon.

On Wed, 20 Sept 2023 at 19:41, David Capwell  wrote:

> I don’t think we can readily migrate old types away from this however,
> without breaking backwards compatibility.
>
>
> Given that java driver has a different behavior from server, I wouldn’t be
> shocked to see that other drivers also have their own custom behaviors… so
> not clear how to migrate unless we actually hand a user facing standard per
> type… if all drivers use a “default value” and is consistent, I do think we
> could migrate, but would need to live with this till at least 6.0+
>
> We can only prevent its use in the CQL layer where support isn’t required.
>
>
> +1
>
> On Sep 20, 2023, at 7:38 AM, Benedict  wrote:
>
> Yes, if this is what was meant by empty I agree. It’s nonsensical for most
> types. Apologies for any confusion.
>
> I don’t think we can readily migrate old types away from this however,
> without breaking backwards compatibility. We can only prevent its use in
> the CQL layer where support isn’t required. My understanding was that we
> had at least tried to do this for all non-thrift schemas, but perhaps we
> did not do so thoroughly and now may have some CQL legacy support
> requirements as well.
>
> On 20 Sep 2023, at 15:30, Aleksey Yeshchenko  wrote:
>
> Allowing zero-length byte arrays for most old types is just a legacy from
> Darker Days. It’s a distinct concern from columns being nullable or not.
>
> There are a couple types where this makes sense: strings and blobs. All
> else should not allow this except for backward compatibility reasons. So,
> not for new types.
>
> On 20 Sep 2023, at 00:08, David Capwell  wrote:
>
> When does empty mean null?
>
>
>
> Most types are this way
>
> @Test
> public void nullExample()
> {
>     createTable("CREATE TABLE %s (pk int primary key, cuteness int)");
>     execute("INSERT INTO %s (pk, cuteness) VALUES (0, ?)", ByteBuffer.wrap(new byte[0]));
>     Row result = execute("SELECT * FROM %s WHERE pk=0").one();
>     if (result.has("cuteness")) System.out.println("Cuteness score: " + result.getInt("cuteness"));
>     else System.out.println("Cuteness score is undefined");
> }
>
>
> This test will NPE in getInt as the returned BB is seen as “null” for
> the int32 type; you can make it “safer” by changing to the following
>
> if (result.has("cuteness")) System.out.println("Cuteness score: " + Int32Type.instance.compose(result.getBlob("cuteness")));
>
> Now we get the log "Cuteness score: null"
>
> What’s even better (just found this out) is that the client isn’t
> consistent or correct in these cases!
>
> com.datastax.driver.core.Row result = executeNet(ProtocolVersion.CURRENT, "SELECT * FROM %s WHERE pk=0").one();
> if (result.getBytesUnsafe("cuteness") != null) System.out.println("Cuteness score: " + result.getInt("cuteness"));
> else System.out.println("Cuteness score is undefined");
>
> This prints "Cuteness score: 0"
>
> So for Cassandra we think the value is “null” but the Java driver thinks
> it’s 0?
>
> Do we have types where writing an empty value creates a tombstone?
>
>
> Empty does not generate a tombstone for any type, but empty has a similar
> user experience as we return null in both cases (but just found out that
> the drivers may not be consistent with this…)
>
> On Sep 19, 2023, at 3:33 PM, J. D. Jordan 
> wrote:
>
>
> When does empty mean null?  My understanding was that empty is a valid
> value for the types that support it, separate from null (aka a tombstone).
> Do we have types where writing an empty value creates a tombstone?
>
> I agree with David that my preference would be for only blob and string
> like types to support empty. It’s too late for the existing types, but we
> should hold to this going forward. Which is what I think the idea was in
> https://issues.apache.org/jira/browse/CASSANDRA-8951 as well?  That it
> was sad the existing numerics were emptiable, but too late to change, and
> we could correct it for newer types.
>
> On Sep 19, 2023, at 12:12 PM, David Capwell  wrote:
>
> 
>
>
> When we introduced TINYINT and SMALLINT (CASSANDRA-8951) we started making
> types non-emptiable. This approach makes more sense to me as having to
> deal with empty values is error prone in my opinion.
>
>
> I agree it’s confusing, and in the patch I found that different code paths
> didn’t handle things correctly, as we have some types (most) that support
> empty bytes, and some that do not…. Empty also has different meanings in
> different code paths; for most it means “null”, and for some other types it
> means “empty”…. To try to make things more clear I added
> org.apache.cassandra.db.marshal.AbstractType#isNull(V,
> org.apache.cassandra.db.marshal.ValueAccessor) to the type system so
> each type can define if empty is null or not.
>
> I also think that it would be good to standardize on one approach to avoid
> confusion.
>
>
> I agree, but also don’t feel it’s a perfect one-size-fits-all thing….
> Let’s 

Re: [VOTE] Release Apache Cassandra 5.0-alpha1 (take3)

2023-09-07 Thread Andrés de la Peña
+1

On Thu, 7 Sept 2023 at 12:52, Jacek Lewandowski 
wrote:

> Mick, is the documentation / website ok?
>
> If so, +1
>
> Best Regards,
> - - -- --- -  -
> Jacek Lewandowski
>
>
> On Thu, 7 Sep 2023 at 12:58, Brandon Williams wrote:
>
>> +1
>>
>> Kind Regards,
>> Brandon
>>
>> On Mon, Sep 4, 2023 at 3:26 PM Mick Semb Wever  wrote:
>> >
>> >
>> > Proposing the test build of Cassandra 5.0-alpha1 for release.
>> >
>> > DISCLAIMER, this alpha release does not contain the expected 5.0
>> > features: Vector Search (CEP-30), Transactional Cluster Metadata
>> > (CEP-21) and Accord Transactions (CEP-15).  These features will land
>> > in a later alpha release.
>> >
>> > Please also note that this is an alpha release and what that means,
>> further info at
>> https://cwiki.apache.org/confluence/display/CASSANDRA/Release+Lifecycle
>> >
>> > sha1: bc5e3741d475e2e99fd7a10450681fd708431a89
>> > Git: https://github.com/apache/cassandra/tree/5.0-alpha1-tentative
>> > Maven Artifacts:
>> https://repository.apache.org/content/repositories/orgapachecassandra-1316/org/apache/cassandra/cassandra-all/5.0-alpha1/
>> >
>> > The Source and Build Artifacts, and the Debian and RPM packages and
>> repositories, are available here:
>> https://dist.apache.org/repos/dist/dev/cassandra/5.0-alpha1/
>> >
>> > The vote will be open for 72 hours (longer if needed). Everyone who has
>> tested the build is invited to vote. Votes by PMC members are considered
>> binding. A vote passes if there are at least three binding +1s and no -1's.
>> >
>> > [1]: CHANGES.txt:
>> https://github.com/apache/cassandra/blob/5.0-alpha1-tentative/CHANGES.txt
>> > [2]: NEWS.txt:
>> https://github.com/apache/cassandra/blob/5.0-alpha1-tentative/NEWS.txt
>> >
>>
>


Re: Tokenization and SAI query syntax

2023-07-24 Thread Andrés de la Peña
`column = term` is definitely problematic because it creates an ambiguity
when the queried column belongs to the primary key. For some queries we
wouldn't know whether the user wants a primary key query using regular
equality or an index query using the analyzer.

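To make the ambiguity concrete, here's a quick sketch (the schema and
index options are hypothetical, just for illustration):

    CREATE TABLE sketches (body text PRIMARY KEY);
    CREATE CUSTOM INDEX ON sketches(body) USING 'StorageAttachedIndex'
        WITH OPTIONS = {'index_analyzer': 'standard'};

    -- Is this the row whose key is exactly 'Quick Fox', or the rows whose
    -- analyzed tokens contain 'quick' and 'fox'? With = both readings are
    -- plausible.
    SELECT * FROM sketches WHERE body = 'Quick Fox';
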
`term_matches(column, term)` seems quite clear and hard to misinterpret,
but it's quite long to write and its implementation will be challenging
since we would need a bunch of special casing around SelectStatement and
functions.

LIKE, MATCHES and CONTAINS could be a bit misleading since they seem to
evoke behaviours different from what they would actually have.

`column LIKE :term:` seems a bit redundant compared to just using `column :
term`, and we are still introducing a new symbol.

I think I like `column : term` the most, because it's brief, it's similar
to the equivalent Lucene's syntax, and it doesn't seem to clash with other
different meanings that I can think of.

On Mon, 24 Jul 2023 at 13:13, Jonathan Ellis  wrote:

> Hi all,
>
> With phase 1 of SAI wrapping up, I’d like to start the ball rolling on
> aligning around phase 2 features.
>
> In particular, we need to nail down the syntax for doing non-exact string
> matches.  We have a proof of concept that includes full Lucene analyzer and
> filter functionality – just the text transformation pieces, none of the
> storage parts – which is the gold standard in this space.  For example, the
> StandardAnalyzer [1] lowercases all terms and removes stopwords (common
> words like “a”, “is”, “the” that are usually not useful to search
> against).  Lucene also has classes that offer stemming, special case
> handling for email, and many languages besides English [2].
>
> What syntax should we use to express “rows whose analyzed tokens match
> this search term?”
>
> The syntax must be clear that we want to look for this term within the
> column data using the configured index with corresponding query-time
> tokenization and analysis.  This means that the query term is not always a
> substring of the original string!  Besides obvious transformations like
> lowercasing, you have things like PhoneticFilter available as well.
>
> Here are my thoughts on some of the options:
>
> `column = term`.  This is what the POC does today and it’s super confusing
> to overload = to mean something other than exact equality.  I am not a fan.
>
> `column LIKE term` or `column LIKE %term%`. The closest SQL operator, but
> neither the wildcarded nor unwildcarded syntax matches the semantics of
> term-based search.
>
> `column MATCHES term`. I rather like this one, although Mike points out
> that “match” has a meaning in the context of regular expressions that could
> cause confusion here.
>
> `column CONTAINS term`. Contains is used by both Java and Python for
> substring searches, so at least some users will be surprised by term-based
> behavior.
>
> `term_matches(column, term)`. Postgresql FTS makes you use functions like
> this for everything.  It’s pretty clunky, and we would need to make the
> amazingly hairy SelectStatement even hairier to handle “use a function
> result in a predicate” like this.
>
> `column : term`. Inspired by Lucene’s syntax.  I don’t actually hate it.
>
> `column LIKE :term:`. Stick with the LIKE operator but add a new symbol to
> indicate term matching.  Arguably more SQL-ish than a new bare symbol
> operator.
>
> [1]
> https://lucene.apache.org/core/9_7_0/core/org/apache/lucene/analysis/standard/StandardAnalyzer.html
> [2] https://lucene.apache.org/core/9_7_0/analysis/common/index.html
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
>


Re: Fwd: [DISCUSS] Formalizing requirements for pre-commit patches on new CI

2023-07-07 Thread Andrés de la Peña
I think 500 runs combining all configs could be reasonable, since it's
unlikely to have config-specific flaky tests. As in five configs with 100
repetitions each.

On Fri, 7 Jul 2023 at 16:14, Josh McKenzie  wrote:

> Maybe. Kind of depends on how long we write our tests to run doesn't it? :)
>
> But point taken. Any non-trivial test would start to be something of a
> beast under this approach.
>
> On Fri, Jul 7, 2023, at 11:12 AM, Brandon Williams wrote:
>
> On Fri, Jul 7, 2023 at 10:09 AM Josh McKenzie 
> wrote:
> > 3. Multiplexed tests (changed, added) run against all JDK's and a
> broader range of configs (no-vnode, vnode default, compression, etc)
>
> I think this is going to be too heavy...we're taking 500 iterations
> and multiplying that by like 4 or 5?
>
>
>


Re: [DISCUSS] Remove deprecated keyspace_count_warn_threshold and table_count_warn_threshold

2023-06-16 Thread Andrés de la Peña
It seems we agree on removing the default value for the old thresholds, and
on not counting system keyspaces/tables on the new ones.

The old thresholds were on active duty for around ten months, and they have
been deprecated for around a year. They will have been deprecated for
longer by the time we release 5.0. If we want to keep them in perpetuity, I
guess the plan would be:

- Remove the default value of the old thresholds in Config.java to make
them disabled by default.
- Remove the old thresholds from the default cassandra.yaml, although old
yamls can still have them.
- Use converters (@Replaces tag in Config.java) to read the old threshold
values (if present) and apply them to the new guardrails.
- During the conversion from the old thresholds to the new guardrails,
subtract the current number of system keyspace/tables from the old value.
For example, 150 tables in the old threshold translate to 103 tables in the
new guardrail, considering that there are 47 system tables.

Does this sound good?

On Wed, 14 Jun 2023 at 17:26, David Capwell  wrote:

> That's problematic because the new thresholds we added in CASSANDRA-17147
> don't include system tables. Do you think we should change that?
>
>
> I wouldn’t change the semantics of the config as it’s already live.  I
> guess where I am coming from is that logically we have to think about the
> system tables, so to your point, if we think 150 is too much and the system
> already exposes 50… then we should recommend no more than 100….
>
> I find it's better for usability to not count the system tables and just
> say "It's recommended not to have more than 100 tables. This doesn't
> include system tables.”
>
>
> I am fine with this framing… internally we think about 150 but
> publicly speak 100 (due to our 50 tables)...
>
>
> On Jun 14, 2023, at 8:29 AM, Josh McKenzie  wrote:
>
> In my opinion including system tables defeats that purpose because it
> forces users to know details about the system tables.
>
> Perhaps having a unit test that caps our system tables at some value and
> keeping the guardrail user-scope specific would be a better approach. I see
> your point about leaking internal details to users, specifically on things
> they can't control at this point.
>
> On Wed, Jun 14, 2023, at 8:19 AM, Andrés de la Peña wrote:
>
> > Default value I agree with you; features should be off by default!  If
> we remove the default then we disable the feature by default (which im cool
> with) and for anyone who changed the config, they would keep their behavior
>
>
> I'm glad we agree on at least removing the default value if we keep the
> deprecated properties.
>
> > With that, I kinda don’t agree that including system tables is a
> mistake, as we add more we allow less for user tables before we start to
> have issues….
>
>
> That's problematic because the new thresholds we added in CASSANDRA-17147
> don't include system tables. Do you think we should change that?
>
> I still think it's better not to include the system tables in the count.
> The thresholds on the number of keyspaces/tables/rows/columns/tombstones
> are just guidance since they cannot be exactly related to exact resource
> consumption. The main purpose of those thresholds is to prevent obvious
> antipatterns such as creating thousands of tables. A benefit of expressing
> the guardrails in terms of the number of schema entities, rather than
> counting the memory usage of those entities, is that they are easy to
> understand and reason about. In my opinion including system tables defeats
> that purpose because it forces users to know details about the system
> tables. The fact that those details change between versions doesn't help.
> Including system tables is not going to make the thresholds precise in
> terms of measuring memory consumption because that depends on other
> factors, such as the columns they store.
>
> Including system tables also imposes a minimum threshold value, like in
> 5.0 you cannot set a threshold value under 45 tables without triggering it
> with an empty db. For other thresholds, this can be more tricky. That would
> be the case of the guardrail on the number of columns in a partition, where
> you would need to know the size of the widest row in the system tables,
> which can change over time.
>
> I guess that if system tables were to be counted, a recommendation for the
> threshold would say something like "It's recommended not to have more than
> 150 tables. The system already includes 45 tables for internal usage, so
> you shouldn't create more than 105 user tables". I find it's better for
> usability to not count the system tables and just say "It's recommended not
> to have more than 100 tables. This doesn't include system tables."

Re: [DISCUSS] Remove deprecated keyspace_count_warn_threshold and table_count_warn_threshold

2023-06-14 Thread Andrés de la Peña
>
> > Default value I agree with you; features should be off by default!  If
> we remove the default then we disable the feature by default (which im cool
> with) and for anyone who changed the config, they would keep their behavior


I'm glad we agree on at least removing the default value if we keep the
deprecated properties.

> With that, I kinda don’t agree that including system tables is a mistake,
> as we add more we allow less for user tables before we start to have
> issues….


That's problematic because the new thresholds we added in CASSANDRA-17147
don't include system tables. Do you think we should change that?

I still think it's better not to include the system tables in the count.
The thresholds on the number of keyspaces/tables/rows/columns/tombstones
are just guidance since they cannot be exactly related to exact resource
consumption. The main purpose of those thresholds is to prevent obvious
antipatterns such as creating thousands of tables. A benefit of expressing
the guardrails in terms of the number of schema entities, rather than
counting the memory usage of those entities, is that they are easy to
understand and reason about. In my opinion including system tables defeats
that purpose because it forces users to know details about the system
tables. The fact that those details change between versions doesn't help.
Including system tables is not going to make the thresholds precise in
terms of measuring memory consumption because that depends on other
factors, such as the columns they store.

Including system tables also imposes a minimum threshold value, like in 5.0
you cannot set a threshold value under 45 tables without triggering it with
an empty db. For other thresholds, this can be more tricky. That would be
the case of the guardrail on the number of columns in a partition, where
you would need to know the size of the widest row in the system tables,
which can change over time.

I guess that if system tables were to be counted, a recommendation for the
threshold would say something like "It's recommended not to have more than
150 tables. The system already includes 45 tables for internal usage, so
you shouldn't create more than 105 user tables". I find it's better for
usability to not count the system tables and just say "It's recommended not
to have more than 100 tables. This doesn't include system tables."

On Tue, 13 Jun 2023 at 23:51, Josh McKenzie  wrote:

> Warning that too many tables (including system) may have negative behavior
> I think is fine
>
> This reminds me of the current situation with our tests where we just keep
> adding more and more without really considering the value of the current
> set and the costs of that body of work as it keeps growing.
>
> Having some kind of signal that we need to do some housekeeping with our
> system tables, or *something* in the feedback loop that helps us keep on
> top of this hygiene over time, seems like a clear benefit to me.
>
> On Tue, Jun 13, 2023, at 1:42 PM, David Capwell wrote:
>
> I think that the combined decision of using a default value and counting
> system tables was a mistake
>
>
> Default value I agree with you; features should be off by default!  If we
> remove the default then we disable the feature by default (which im cool
> with) and for anyone who changed the config, they would keep their behavior
>
> As for system tables… each table adds a cost to our bookkeeping, so when
> we add new tables the cost grows and the memory per table decreases, does
> it not?  Warning that too many tables (including system) may have negative
> behavior I think is fine, its only if we start to fail is when things
> become a problem (upgrading to 5.0 can’t happen due to too many tables
> added in the release?); think the feature was warn only, so that should be
> fine.  With that, I kinda don’t agree that including system tables is a
> mistake, as we add more we allow less for user tables before we start to
> have issues…. At the same time, if we have improvements in newer versions
> that allows higher number of tables, the user then has to update their
> configs (well, as long as we don’t make things worse a smaller limit than
> needed is fine…)
>
> we would need to know how many system keyspaces/tables were on the version
> we are upgrading from
>
>
> Do we?  The logic was pulling from local schema, so to keep the same
> behavior we would need to do the same; being version dependent would
> actually break the semantics as far as I can tell.
>
> On Jun 13, 2023, at 9:50 AM, Andrés de la Peña 
> wrote:
>
> Indeed "keyspace_count_warn_threshold" and "table_count_warn_threshold"
> include system keyspaces and tables. Also, differently to the newer
> guardrails, they are enabled by default.
>
> I find t

Re: [DISCUSS] Remove deprecated keyspace_count_warn_threshold and table_count_warn_threshold

2023-06-13 Thread Andrés de la Peña
Indeed "keyspace_count_warn_threshold" and "table_count_warn_threshold"
include system keyspaces and tables. Also, unlike the newer guardrails,
they are enabled by default.

I find that problematic because users need to know how many system
keyspaces/tables there are to know if they need to set the threshold value.
Moreover, if a new release adds some system tables, the threshold can start
to be triggered without changing the number of user tables. That would
force some users to update the threshold values during an upgrade. Even if
they are using the defaults. That situation would happen again in any
release adding new keyspaces/tables. I think adding new system tables is
not that uncommon, and indeed 5.0 does it.

I think that the combined decision of using a default value and counting
system tables was a mistake. If that's the case, I don't know for how long
we want to remain tied to that mistake. Especially when the old thresholds
tend to create upgrade issues on their own.

If we were going to use converters, we would need to know how many system
keyspaces/tables were on the version we are upgrading from. I don't know if
that information is available. Or perhaps we could assume that counting
system keyspaces/tables was a bug, and just translate the values, changing
the meaning to not include them.

On Tue, 13 Jun 2023 at 16:51, David Capwell  wrote:

> > Have we been dropping support entirely for old params or using the
> @Replaces annotation into perpetuity?
>
>
> My understanding is that the goal is to keep things around in perpetuity
> unless it actively causes us harm… and with @Replaces, there tends to be no
> harm to keep around…
>
> Looking at
> https://github.com/apache/cassandra/commit/bae92ee139b411c94228f8fd5bb8befb4183ca9f
> we just marked them deprecated and created a brand new config that matched
> the old… which I feel was a bad idea…. Renaming configs are fine with
> @Replaces, but asking users to migrate with the idea of breaking them in
> the future is bad…
>
> The table_count_warn_threshold config is used at
> org.apache.cassandra.cql3.statements.schema.CreateTableStatement#clientWarnings
> The tables_warn_threshold config is used at
> org.apache.cassandra.cql3.statements.schema.CreateTableStatement#validate
>
> The only difference I see is that table_count_warn_threshold includes
> system tables where as tables_warn_threshold is only user tables…
>
> > I would like to propose removing the non-guardrail thresholds
> 'keyspace_count_warn_threshold' and 'table_count_warn_threshold'
> configuration settings on the trunk branch for the next major release.
>
> Deprecate in 4.1 is way too new for me to accept that, and its low effort
> to keep; breaking users is always a bad idea and doing it when not needed
> is bad…
>
> Honestly, I don’t see why we couldn’t use @Replaces here to solve the
> semantic gap… table_count_warn_threshold includes the system tables, so we
> just need a Converter that takes w/e the value the user put in and
> subtracts the system tables… which then gives us the user tables (matching
> tables_warn_threshold)
>
> > On Jun 13, 2023, at 7:57 AM, Josh McKenzie  wrote:
> >
> >> have subsequently been deprecated since 4.1-alpha in CASSANDRA-17195
> when they were replaced/migrated to guardrails as part of CEP-3
> (Guardrails).
> > Have we been dropping support entirely for old params or using the
> @Replaces annotation into perpetuity?
> >
> > I dislike the idea of operators having to remember to update things
> between versions and being surprised when things change, roughly equally
> to us carrying along undocumented deprecated param name mappings. :)
> >
> > On Mon, Jun 12, 2023, at 5:56 PM, Dan Jatnieks wrote:
> >> Hello everyone,
> >>
> >> I would like to propose removing the non-guardrail thresholds
> 'keyspace_count_warn_threshold' and 'table_count_warn_threshold'
> configuration settings on the trunk branch for the next major release.
> >>
> >> These thresholds were first added with CASSANDRA-16309 in 4.0-beta4 and
> have subsequently been deprecated since 4.1-alpha in CASSANDRA-17195 when
> they were replaced/migrated to guardrails as part of CEP-3 (Guardrails).
> >>
> >> I'd appreciate any thoughts about this. I will open a ticket to get
> started if there is support for doing this.
> >>
> >> Reference:
> >> https://issues.apache.org/jira/browse/CASSANDRA-16309
> >> https://issues.apache.org/jira/browse/CASSANDRA-17195
> >> CEP-3: Guardrails
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-3%3A+Guardrails
> >>
> >>
> >> Thanks,
> >> Dan Jatnieks
>
>
>


Re: [VOTE] CEP-8 Datastax Drivers Donation

2023-06-13 Thread Andrés de la Peña
+1

On Tue, 13 Jun 2023 at 16:40, Yifan Cai  wrote:

> +1
> --
> *From:* David Capwell 
> *Sent:* Tuesday, June 13, 2023 8:37:10 AM
> *To:* dev 
> *Subject:* Re: [VOTE] CEP-8 Datastax Drivers Donation
>
> +1
>
> On Jun 13, 2023, at 7:59 AM, Josh McKenzie  wrote:
>
> +1
>
> On Tue, Jun 13, 2023, at 10:55 AM, Jeremiah Jordan wrote:
>
> +1 nb
>
> On Jun 13, 2023 at 9:14:35 AM, Jeremy Hanna 
> wrote:
>
>
> Calling for a vote on CEP-8 [1].
>
> To clarify the intent, as Benjamin said in the discussion thread [2], the
> goal of this vote is simply to ensure that the community is in favor of
> the donation. Nothing more.
> The plan is to introduce the drivers, one by one. Each driver donation
> will need to be accepted first by the PMC members, as it is the case for
> any donation. Therefore the PMC should have full control on the pace at
> which new drivers are accepted.
>
> If this vote passes, we can start this process for the Java driver under
> the direction of the PMC.
>
> Jeremy
>
> 1.
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-8%3A+Datastax+Drivers+Donation
> 2. https://lists.apache.org/thread/opt630do09phh7hlt28odztxdv6g58dp
>
>
>


Re: [VOTE] CEP-30 ANN Vector Search

2023-05-26 Thread Andrés de la Peña
+1

On Fri, 26 May 2023 at 12:59, Mike Adamson  wrote:

> +1 (nb)
>
> On Fri, 26 May 2023 at 12:50, Stefania Alborghetti 
> wrote:
>
>> +1
>>
>> On Fri, May 26, 2023 at 7:31 AM Aleksey Yeshchenko 
>> wrote:
>>
>>> +1
>>>
>>> On 26 May 2023, at 07:19, Berenguer Blasi 
>>> wrote:
>>>
>>> +1
>>> On 26/5/23 6:07, guo Maxwell wrote:
>>>
>>> +1
>>>
>>> On Fri, 26 May 2023 at 11:08, Dinesh Joshi wrote:
>>>
 +1


 On May 25, 2023, at 8:45 AM, Jonathan Ellis  wrote:

 

 Let's make this official.

 CEP:
 https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-30%3A+Approximate+Nearest+Neighbor%28ANN%29+Vector+Search+via+Storage-Attached+Indexes

 POC that demonstrates all the big rocks, including distributed queries:
 https://github.com/datastax/cassandra/tree/cep-vsearch

 --
 Jonathan Ellis
 co-founder, http://www.datastax.com
 @spyced

 --
>>> you are the apple of my eye !
>>>
>>>
>>>
>
> --
> *Mike Adamson*
> Engineering
>
> +1 650 389 6000 <16503896000> | datastax.com
>
>


Re: [VOTE] Release dtest-api 0.0.14

2023-05-16 Thread Andrés de la Peña
+1

On Tue, 16 May 2023 at 07:24, Alex Petrov  wrote:

> +1
>
> On Tue, May 16, 2023, at 4:45 AM, Doug Rohrer wrote:
>
> +1 (nb)
>
> Doug Rohrer
>
> > On May 15, 2023, at 7:17 PM, Brandon Williams  wrote:
> >
> > +1
> >
> > Kind Regards,
> > Brandon
> >
> >> On Mon, May 15, 2023 at 5:12 PM Dinesh Joshi  wrote:
> >>
> >> Proposing the test build of in-jvm dtest API 0.0.14 for release.
> >>
> >> Repository:
> >> https://gitbox.apache.org/repos/asf?p=cassandra-in-jvm-dtest-api.git
> >>
> >> Candidate SHA:
> >>
> https://github.com/apache/cassandra-in-jvm-dtest-api/commit/ea4b44e0ed0a4f0bbe9b18fb40ad927b49a73a32
> >> tagged with 0.0.14
> >>
> >> Artifacts:
> >>
> https://repository.apache.org/content/repositories/orgapachecassandra-1289/org/apache/cassandra/dtest-api/0.0.14/
> >>
> >> Key signature: 53371F9B1B425A336988B6A03B6042413D323470
> >>
> >> Changes since last release:
> >>
> >> * CASSANDRA-18511: Add support for JMX in jvm-dtest
> >>
> >> The vote will be open for 24 hours. Everyone who has tested the build
> >> is invited to vote. Votes by PMC members are considered binding. A
> >> vote passes if there are at least three binding +1s.
>
>


Re: [POLL] Vector type for ML

2023-05-05 Thread Andrés de la Peña
My vote is:

1. VECTOR
2. DENSE VECTOR
3. type[dimension]

If we ever add sparse vectors, we can assume that DENSE is the default and
allow using either DENSE, SPARSE or nothing.

Perhaps the dimension could be separated from the type, such as in
VECTOR[dimension] or VECTOR(dimension).

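For illustration only, here's a hypothetical embeddings table under my
first and third choices (none of this is committed syntax):

    CREATE TABLE items (id int PRIMARY KEY, embedding VECTOR<FLOAT, 384>);  -- VECTOR
    CREATE TABLE items (id int PRIMARY KEY, embedding FLOAT[384]);          -- type[dimension]
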
On Fri, 5 May 2023 at 19:05, David Capwell  wrote:

> ...where, just to be clear, VECTOR means a frozen fixed
>> size array w/ no null values?
>>
> Assuming this is the case
>
>
> The current agreed requirements are:
>
> 1) non-null elements
> 2) fixed length
> 3) frozen
>
> You pointed out 3 isn’t actually required, but that would be a different
> conversation to remove =)… maybe defer this to JIRA as long as all parties
> agree in the ticket?
>
> With all votes in, this is what I see
>
> (Voters: Jonathan Ellis, David Capwell, Josh McKenzie, Caleb Rackliffe,
> Patrick McFadin, Brandon Williams, Mike Adamson, Benedict, Mick Semb
> Wever, Derek Chen-Becker; cells are each voter's rank for the option.)
>
> Syntax               | Ellis | Capwell | McKenzie | Rackliffe | McFadin | Williams | Adamson | Benedict | Semb Wever | Chen-Becker
> ---------------------+-------+---------+----------+-----------+---------+----------+---------+----------+------------+------------
> VECTOR               |   1   |    2    |    2     |           |    2    |    1     |    1    |    3     |     2      |
> DENSE VECTOR         |   2   |    1    |          |           |    1    |          |    2    |          |            |
> type[dimension]      |   3   |    3    |    3     |     1     |         |    3     |         |    2     |            |
> DENSE_VECTOR         |       |         |    1     |           |         |          |         |          |            |      3
> NON NULL [dimension] |       |    1    |          |           |         |          |         |    1     |            |      2
> VECTOR type[n]       |       |         |          |           |         |    2     |         |          |     1      |
> ARRAY                |       |         |          |           |    3    |          |         |          |            |
> NON-NULL FROZEN      |       |         |          |           |         |          |         |          |            |      1
>
> Rank | Weight
> -----+-------
>   1  |   3
>   2  |   2
>   3  |   1
>   ?  |   3
>
> Syntax               | Score
> ---------------------+------
> VECTOR               |  18
> DENSE VECTOR         |  10
> type[dimension]      |   9
> NON NULL [dimension] |   8
> VECTOR type[n]       |   5
> DENSE_VECTOR         |   4
> NON-NULL FROZEN      |   3
> ARRAY                |   1
>
> Syntax               | Round 1 | Round 2
> ---------------------+---------+--------
> VECTOR               |    3    |    4
> DENSE VECTOR         |    2    |    2
> NON NULL [dimension] |    2    |    1
> VECTOR type[n]       |    1    |
> type[dimension]      |    1    |
> DENSE_VECTOR         |    1    |
> NON-NULL FROZEN      |    1    |
> ARRAY                |    0    |
>
>
> Under 2 different voting systems vector is in the lead
> and by a good amount… I have updated the patch locally to reflect this
> change as well.
>
> On May 5, 2023, at 10:41 AM, Mike Adamson  wrote:
>
> ...where, just to be clear, VECTOR means a frozen fixed
>> size array w/ no null values?
>>
> Assuming this is the case, my vote is:
>
> 1. VECTOR
> 2. DENSE VECTOR
>
> I don't really have a 3rd vote because I think that *type[dimension]* is
> too ambiguous.
>
>
> On Fri, 5 May 2023 at 18:32, Derek Chen-Becker 
> wrote:
>
>> LOL, I'm holding you to that at the summit :) In all seriousness, I'm
>> glad to see a robust debate around it. I guess for completeness, my order
>> of preference is
>>
>> 1 - NONNULL FROZEN>
>> 2 - NONNULL TYPE (which part of this implies frozen? The NONNULL or
>> the cardinality?)
>> 3 - DENSE_VECTOR
>>
>> I guess my main concern with just "VECTOR" is that it's such an
>> overloaded term. Maybe in ML it means something specific, but for anyone
>> coming from C++, Rust, Java, etc, a Vector is both mutable and can carry
>> null (or equivalent, e.g. None, in Rust). If the argument hadn't also been
>> made that we should be working toward something that's not ML-specific
>> maybe I would be less concerned.
>>
>> Cheers,
>>
>> Derek
>>
>>
>> Cheers,
>>
>> Derek
>>
>> On Fri, May 5, 2023 at 11:14 AM Patrick McFadin 
>> wrote:
>>
>>> Derek, despite your preference, I would hang out with you at a party.
>>>
>>> On Fri, May 5, 2023 at 9:44 AM Derek Chen-Becker 
>>> wrote:
>>>
 Speaking as someone who likes Erlang, maybe that's why I also like
 NONNULL FROZEN>. It's unambiguous what Cassandra is going to do
 with that type. DENSE VECTOR means I need to go read docs (and then
 probably double-check in the source) to be sure what exactly is going on.

 Cheers,

 Derek

 On Fri, May 5, 2023 at 9:54 AM Patrick McFadin 
 wrote:

> I hope we are willing to consider developers that use our system
> because if I had to teach people to use "NON-NULL FROZEN" I'm
> pretty sure the response would be:
>
> Did you tell me to go write a distributed map-reduce job in Erlang? I
> believe I did, Bob.
>
> On Fri, May 5, 2023 at 8:05 AM Josh McKenzie 
> wrote:
>
>> Idiomatically, to my mind, there's a question of "what space are we
>> thinking about this datatype in"?
>>
>> - In the context of mathematics, nullability in a vector would be 0
>> - In the context of Cassandra, nullability tends to mean a tombstone
>> (or nothing)
>> - In the context of programming languages, it's all over the place
>>
>> Given many models are exploring quantizing to int8 and other data
>> types, there's definitely the "support other data types easily in the
>> future" piece to me we need to keep in mind.
>>
>> So with the above and the "meet the user where they are and don't
>> make them understand more of Cassandra than absolutely critical to 

Re: [POLL] Vector type for ML

2023-05-02 Thread Andrés de la Peña
A > B > C

I don't think that ML is such a niche application that it can't have its
own CQL data type. Also, vectors are mathematical elements that have more
applications than ML.

On Tue, 2 May 2023 at 19:15, Mick Semb Wever  wrote:

>
>
> On Tue, 2 May 2023 at 17:14, Jonathan Ellis  wrote:
>
>> Should we add a vector type to Cassandra designed to meet the needs of
>> machine learning use cases, specifically feature and embedding vectors for
>> training, inference, and vector search?
>>
>> ML vectors are fixed-dimension (fixed-length) sequences of numeric types,
>> with no nulls allowed, and with no need for random access. The ML industry
>> overwhelmingly uses float32 vectors, to the point that the industry-leading
>> special-purpose vector database ONLY supports that data type.
>>
>> This poll is to gauge consensus subsequent to the recent discussion
>> thread at
>> https://lists.apache.org/thread/0lj1nk9jbhkf1rlgqcvxqzfyntdjrnk0.
>>
>> Please rank the discussed options from most preferred option to least,
>> e.g., A > B > C (A is my preference, followed by B, followed by C) or C > B
>> = A (C is my preference, followed by B or A approximately equally.)
>>
>> (A) I am in favor of adding a vector type for floats; I do not believe we
>> need to tie it to any particular implementation details.
>>
>> (B) I am okay with adding a vector type but I believe we must add array
>> types that compose with all Cassandra types first, and make vectors a
>> special case of arrays-without-null-elements.
>>
>> (C) I am not in favor of adding a built-in vector type.
>>
>
>
>
> A  > B > C
>
> B is stated as "must add array types…".  I think this is a bit loaded.  If
> B was the (A + the implementation needs to be a non-null frozen float32
> array, serialisation forward compatible with other frozen arrays later
> implemented) I would put this before (A).  Especially because it's been
> shown already this is easy to implement.
>
>
>


Re: [DISCUSS] New data type for vector search

2023-04-26 Thread Andrés de la Peña
If we are going to use FLOAT[N] as sugar for another CQL data type, maybe
tuples are more convenient than lists. So FLOAT[N] could be equivalent to a
TUPLE of N FLOATs.

Differently to collections, tuples have a fixed size, they are always
frozen and I think they don't support random access. These properties seem
desirable for vectors.

Tuples however support null values, whereas collections don't. I mean, you
can remove elements from a collection, but I think you are never going to
see an explicit null in the collection. Tuples don't allow removing a
value, but the entire tuple can be written with null values. Like in INSERT
INTO t (key, tuple) VALUES (0, (1, null, 3)).

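A minimal sketch of the idea, assuming FLOAT[3] were sugar for a tuple of
three floats (the FLOAT[N] spelling itself is hypothetical):

    CREATE TABLE t (key int PRIMARY KEY, v tuple<float, float, float>);

    -- fine under the vector contract: fixed size, no nulls
    INSERT INTO t (key, v) VALUES (0, (1.0, 2.0, 3.0));

    -- legal for a tuple today, but would have to be rejected for a vector
    INSERT INTO t (key, v) VALUES (1, (1.0, null, 3.0));
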
On Wed, 26 Apr 2023 at 21:53, Mick Semb Wever  wrote:

> My inclination then would be to say you declare an ARRAY (which
>> is semantic sugar for FROZEN>). This is very consistent with
>> our existing style. We then simply permit such columns to define ANN
>> indexes.
>>
>
>
> So long as nulls aren't a problem as David questions, an alternative is:
>
>  FLOAT[N] as semantic sugar for LIST
>
> And ANN requiring FROZEN
>
> Maybe taking a poll in a few days will be positive to keep this
> moving forward.
>


Re: [DISCUSS] CEP-29 CQL NOT Operator

2023-04-13 Thread Andrés de la Peña
Indeed requiring AF for "select * from ks.tb where p1 = 1 and c1 = 2 and
col2 = 1", where p1 and c1 are all the columns in the primary key, sounds
like a bug.

I think the criterion in the code is that we require AF if there is any
column restriction that cannot be processed by the primary key or a
secondary index. The error message indeed seems to reject any kind of
filtering, independently of primary key filters. We can see this even
without defined clustering keys:

CREATE TABLE t (k int PRIMARY KEY, v int);
SELECT * FROM t WHERE k = 1 AND v = 1; -- requires AF

That clashes with documentation, where it's said that AF is required for
filters that require scanning all partitions. If we were to adapt the code
to the behaviour described in documentation we shouldn't require AF if
there are restrictions specifying a partition key. Or possibly a group of
partition keys, if a IN restriction is used. So both within row and within
partition filtering wouldn't require AF.

Regarding adding a new ALLOW FILTERING WITHIN PARTITION, I think we could
just add a guardrail to directly disallow those queries, without needing to
add the WITHIN PARTITION clause to the CQL grammar.

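A small sketch of the distinction, using the table from Stefan's example
quoted below:

    CREATE TABLE ks.tb (p1 int, c1 int, col1 int, col2 int, PRIMARY KEY (p1, c1));

    -- No partition restriction: this scans all partitions on all nodes,
    -- so requiring ALLOW FILTERING seems justified.
    SELECT * FROM ks.tb WHERE col2 = 1 ALLOW FILTERING;

    -- Fully restricted primary key: this filters within a single row, yet
    -- today it still demands ALLOW FILTERING, which is the suspected bug.
    SELECT * FROM ks.tb WHERE p1 = 1 AND c1 = 2 AND col2 = 1;
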
On Thu, 13 Apr 2023 at 11:11, Henrik Ingo  wrote:

>
>
> On Thu, Apr 13, 2023 at 10:20 AM Miklosovic, Stefan <
> stefan.mikloso...@netapp.com> wrote:
>
>> Somebody correct me if I am wrong but "partition key" itself is not
>> enough (primary keys = partition keys + clustering columns). It will
>> require ALLOW FILTERING when clustering columns are not specified either.
>>
>> create table ks.tb (p1 int, c1 int, col1 int, col2 int, primary key (p1,
>> c1));
>> select * from ks.tb where p1 = 1 and col1 = 2; // this will require
>> allow filtering
>>
>> The documentation seems to omit this fact.
>>
>
> It does seem so.
>
> That said, personally I was assuming, and would still argue it's the
> optimal choice, that the documentation was right and reality is wrong.
>
> If there is a partition key, then the query can avoid scanning the entire
> table, across all nodes, potentially petabytes.
>
> If a query specifies a partition key but not the full clustering key, of
> course there will be some scanning needed, but this is marginal compared to
> the need to scan the entire table. Even in the worst case, a partition with
> 2 billion cells, we are talking about seconds to filter the result from the
> single partition.
>
> > Aha I get what you all mean:
>
> No, I actually think both are unnecessary. But yeah, certainly this latter
> case is a bug?
>
> henrik
>
> --
>
> Henrik Ingo
>
> c. +358 40 569 7354
>
> w. www.datastax.com
>
>
>


Re: [DISCUSS] Introduce DATABASE as an alternative to KEYSPACE

2023-04-04 Thread Andrés de la Peña
I think supporting DATABASE is a great idea.

It's better aligned with SQL databases, and can save new users one of the
first troubles they find.

Probably anyone starting to use Cassandra for the first time is going to
face the "what is a keyspace?" question in the first minutes. Sparing users
that with a more common name would be a victory for usability IMO.

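Under the proposal the two spellings would be interchangeable; a
hypothetical sketch, since DATABASE is not yet in the grammar:

    -- today
    CREATE KEYSPACE cycling
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

    -- proposed equivalent
    CREATE DATABASE cycling
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
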
On Tue, 4 Apr 2023 at 16:48, Mike Adamson  wrote:

> Hi,
>
> I'd like to propose that we add DATABASE to the CQL grammar as an
> alternative to KEYSPACE.
>
> Background: While TABLE was introduced as an alternative for COLUMNFAMILY
> in the grammar we have kept KEYSPACE for the container name for a group of
> tables. Nearly all traditional SQL databases use DATABASE as the container
> name for a group of tables so it would make sense for Cassandra to adopt
> this naming as well.
>
> KEYSPACE would be kept in the grammar but we would update some logging and
> documentation to encourage use of the new name.
>
> Mike Adamson
>
> --
> *Mike Adamson*
> Engineering
>
> +1 650 389 6000 <16503896000> | datastax.com
>
>
>


Re: [VOTE] CEP-26: Unified Compaction Strategy

2023-04-04 Thread Andrés de la Peña
+1

On Tue, 4 Apr 2023 at 15:09, Jeremy Hanna 
wrote:

> +1 nb, will be great to have this in the codebase - it will make nearly
> every table's compaction work more efficiently.  The only possible
> exception is tables that are well suited for TWCS.
>
> On Apr 4, 2023, at 8:00 AM, Berenguer Blasi 
> wrote:
>
> +1
> On 4/4/23 14:36, J. D. Jordan wrote:
>
> +1
>
> On Apr 4, 2023, at 7:29 AM, Brandon Williams 
>  wrote:
>
> 
> +1
>
> On Tue, Apr 4, 2023, 7:24 AM Branimir Lambov  wrote:
>
>> Hi everyone,
>>
>> I would like to put CEP-26 to a vote.
>>
>> Proposal:
>>
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-26%3A+Unified+Compaction+Strategy
>>
>> JIRA and draft implementation:
>> https://issues.apache.org/jira/browse/CASSANDRA-18397
>>
>> Up-to-date documentation:
>>
>> https://github.com/blambov/cassandra/blob/CASSANDRA-18397/src/java/org/apache/cassandra/db/compaction/UnifiedCompactionStrategy.md
>>
>> Discussion:
>> https://lists.apache.org/thread/8xf5245tclf1mb18055px47b982rdg4b
>>
>> The vote will be open for 72 hours.
>> A vote passes if there are at least three binding +1s and no binding
>> vetoes.
>>
>> Thanks,
>> Branimir
>>
>
>


Re: Welcome our next PMC Chair Josh McKenzie

2023-03-23 Thread Andrés de la Peña
Congratulations Josh! Thanks for everything Mick!

On Thu, 23 Mar 2023 at 11:36, Ekaterina Dimitrova 
wrote:

> Congrats Josh!  Thanks Mick!
>
> On Thu, 23 Mar 2023, Brandon Williams wrote:
>
>> Congrats Josh!  Thanks Mick!
>>
>> Kind Regards,
>> Brandon
>>
>> On Thu, Mar 23, 2023 at 3:22 AM Mick Semb Wever  wrote:
>> >
>> > It is time to pass the baton on, and on behalf of the Apache Cassandra
>> Project Management Committee (PMC) I would like to welcome and congratulate
>> our next PMC Chair Josh McKenzie (jmckenzie).
>> >
>> > Most of you already know Josh, especially through his regular and
>> valuable project oversight and status emails, always presenting a balance
>> and understanding to the various views and concerns incoming.
>> >
>> > Repeating Paulo's words from last year: The chair is an administrative
>> position that interfaces with the Apache Software Foundation Board, by
>> submitting regular reports about project status and health. Read more about
>> the PMC chair role on Apache projects:
>> > - https://www.apache.org/foundation/how-it-works.html#pmc
>> > - https://www.apache.org/foundation/how-it-works.html#pmc-chair
>> > -
>> https://www.apache.org/foundation/faq.html#why-are-PMC-chairs-officers
>> >
>> > The PMC as a whole is the entity that oversees and leads the project
>> and any PMC member can be approached as a representative of the committee.
>> A list of Apache Cassandra PMC members can be found on:
>> https://cassandra.apache.org/_/community.html
>>
>


Re: [DISCUSS] Remove deprecated CQL functions dateof and unixtimestampof on 5.0

2023-03-14 Thread Andrés de la Peña
I have created https://issues.apache.org/jira/browse/CASSANDRA-18328 for
removing the deprecated functions, as part of CASSANDRA-18306
<https://issues.apache.org/jira/browse/CASSANDRA-18306> epic.

On Mon, 13 Mar 2023 at 12:28, Miklosovic, Stefan <
stefan.mikloso...@netapp.com> wrote:

> Actually, this one https://issues.apache.org/jira/browse/CASSANDRA-18306
>
> 
> From: Miklosovic, Stefan 
> Sent: Monday, March 13, 2023 13:26
> To: dev@cassandra.apache.org
> Subject: Re: [DISCUSS] Remove deprecated CQL functions dateof and
> unixtimestampof on 5.0
>
> NetApp Security WARNING: This is an external email. Do not click links or
> open attachments unless you recognize the sender and know the content is
> safe.
>
>
>
>
> I am +1.
>
> Could you please link the ticket to
> https://issues.apache.org/jira/browse/CASSANDRA-17973 ?
>
> Thanks
>
> 
> From: Andrés de la Peña 
> Sent: Monday, March 13, 2023 13:22
> To: dev@cassandra.apache.org
> Subject: [DISCUSS] Remove deprecated CQL functions dateof and
> unixtimestampof on 5.0
>
> NetApp Security WARNING: This is an external email. Do not click links or
> open attachments unless you recognize the sender and know the content is
> safe.
>
>
>
> The CQL functions "dateof" and "unixtimestampof" were deprecated on
> Cassandra 2.2.0, almost eight years ago [1]. They were deprecated in favour
> of the then new "totimestamp" and "tounixtimestamp" functions.
>
> I think that we can finally remove those functions in 5.0, since they have
> been deprecated for so long.
>
> A note about their deprecation was added to NEWS.txt [2], and they were
> marked as deprecated on CQL.textile [3]. They are also listed as deprecated
> on the new doc [4].
>
> I came to this while working on the adoption of snake case conventions for
> CQL function names on CASSANDRA-18037. It probably doesn't make sense to
> add new "date_of" and "unix_timestamp_of" aliases for them.
>
> What do you think? Should we remove them?
>
> [1]
> https://github.com/apache/cassandra/commit/c08aaabd95d4872593c29807de6ec1485cefa7fa
> [2] https://github.com/apache/cassandra/blob/trunk/NEWS.txt#L1421-L1423
> [3]
> https://github.com/apache/cassandra/blob/trunk/doc/cql3/CQL.textile#time-conversion-functions
> [4]
> https://github.com/apache/cassandra/blob/trunk/doc/modules/cassandra/pages/cql/functions.adoc#time-conversion-functions
>


[DISCUSS] Remove deprecated CQL functions dateof and unixtimestampof on 5.0

2023-03-13 Thread Andrés de la Peña
The CQL functions "dateof" and "unixtimestampof" were deprecated on
Cassandra 2.2.0, almost eight years ago [1]. They were deprecated in favour
of the then new "totimestamp" and "tounixtimestamp" functions.

I think that we can finally remove those functions in 5.0, since they have
been deprecated for so long.

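For anyone still using them, migration is a direct substitution. A quick
sketch, assuming a timeuuid column named "id" (table and column names are
illustrative):

    -- deprecated since 2.2.0
    SELECT dateof(id), unixtimestampof(id) FROM t;

    -- replacements available since then
    SELECT totimestamp(id), tounixtimestamp(id) FROM t;
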
A note about their deprecation was added to NEWS.txt [2], and they were
marked as deprecated on CQL.textile [3]. They are also listed as deprecated
on the new doc [4].

I came to this while working on the adoption of snake case conventions for
CQL function names on CASSANDRA-18037. It probably doesn't make sense to
add new "date_of" and "unix_timestamp_of" aliases for them.

What do you think? Should we remove them?

[1]
https://github.com/apache/cassandra/commit/c08aaabd95d4872593c29807de6ec1485cefa7fa
[2] https://github.com/apache/cassandra/blob/trunk/NEWS.txt#L1421-L1423
[3]
https://github.com/apache/cassandra/blob/trunk/doc/cql3/CQL.textile#time-conversion-functions
[4]
https://github.com/apache/cassandra/blob/trunk/doc/modules/cassandra/pages/cql/functions.adoc#time-conversion-functions


Re: [EXTERNAL] Re: [DISCUSS] Next release date

2023-03-13 Thread Andrés de la Peña
>
> Should we clarify that part first by getting an idea of the status of the
> different CEPs and other big pieces of work?


CEP-20 (dynamic data masking) should hopefully be ready by the end of this
month.

It's composed of seven small tickets. Five of those tickets are ready, and
two are under review. All together it will be ~6K LOC, involving around 100
files.

On Thu, 9 Mar 2023 at 21:17, Mick Semb Wever  wrote:

> > > > One place we've been weak historically is in distinguishing between
> tickets we consider "nice to have" and things that are "blockers". We don't
> have any metadata that currently distinguishes those two, so determining
> what our burndown leading up to 5.0 looks like is a lot more data massaging
> and hand-waving than I'd prefer right now.
> > >
> > > We distinguish "blockers" with `Priority=Urgent` or
> `Severity=Critical`, or by linking the ticket as blocking to a specific
> ticket that spells it out. We do have the metadata, but yes it requires
> some work…
> >
> > For everything not urgent or a blocker, does it matter whether something
> has a fixver of where we think it's going to land or where we'd like to see
> it land? At the end of the day, neither of those scenarios will actually
> shift a release date if we're proactively putting "blocker / urgent" status
> on new features, improvements, and bugs we think are significant enough to
> delay a release right?
>
>
> Ooops, actually we were using the -beta, and -rc fixVersion
> placeholders to denote the blockers once "the bridge was crossed"
> (while Urgent and Critical is used more broadly, e.g. patch releases).
> If we use this approach, then we could add a 5.0-alpha placeholder
> that indicates a consensus on tickets blocking the branching (if we
> agree alpha1 should be cut at the same time we branch…). IMHO such
> tickets should also still be marked as Urgent, but I suggest we use
> Urgent/Critical as an initial state, and the fixVersion placeholders
> where we have consensus or it is according to our release criteria
> :shrug:
>


Re: [VOTE] CEP-21 Transactional Cluster Metadata

2023-02-07 Thread Andrés de la Peña
+1

On Tue, 7 Feb 2023 at 09:52, Jacek Lewandowski 
wrote:

> +1
>
> - - -- --- -  -
> Jacek Lewandowski
>
>
> wt., 7 lut 2023 o 10:12 Benjamin Lerer  napisał(a):
>
>> +1
>>
>> Le mar. 7 févr. 2023 à 06:21,  a écrit :
>>
>>> +1 nb
>>>
>>> On Feb 6, 2023, at 9:05 PM, Ariel Weisberg  wrote:
>>>
>>> +1
>>>
>>> On Mon, Feb 6, 2023, at 11:15 AM, Sam Tunnicliffe wrote:
>>>
>>> Hi everyone,
>>>
>>> I would like to start a vote on this CEP.
>>>
>>> Proposal:
>>>
>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-21%3A+Transactional+Cluster+Metadata
>>>
>>> Discussion:
>>> https://lists.apache.org/thread/h25skwkbdztz9hj2pxtgh39rnjfzckk7
>>>
>>> The vote will be open for 72 hours.
>>> A vote passes if there are at least three binding +1s and no binding
>>> vetoes.
>>>
>>> Thanks,
>>> Sam
>>>
>>>
>>>


Re: Implicitly enabling ALLOW FILTERING on virtual tables

2023-02-03 Thread Andrés de la Peña
For those eventual big virtual tables there is the mentioned flag
indicating whether the table allows filtering without AF.

I guess the question is how a user can know whether a certain virtual table
is one of the big ones. That could be specified in the doc for each table,
and it could also be included in the table properties, so it's displayed by
DESCRIBE TABLE queries.

On Fri, 3 Feb 2023 at 20:56, Chris Lohfink  wrote:

> Just to 2nd what Scott says. While everything is in memory now, it may not
> be in the future, and if we add it implicitly, we are tying ourselves to be
> in memory only. However, I wouldn't -1 the idea.
>
> Another option may be a cqlsh option (ie like expand on/off) to always
> include a flag so it doesnt need to be added or something.
>
> Chris
>
> On Fri, Feb 3, 2023 at 1:24 PM C. Scott Andreas 
> wrote:
>
>> There are some ideas that development community members have kicked
>> around that may falsify the assumption that "virtual tables are tiny and
>> will fit in memory."
>>
>> One example is CASSANDRA-14629: Abstract Virtual Table for very large
>> result sets
>> https://issues.apache.org/jira/browse/CASSANDRA-14629
>>
>> Chris's proposal here is to enable query results from virtual tables to
>> be streamed to the client rather than being fully materialized. There are
>> some neat possibilities suggested in this ticket, such as debug
>> functionality to dump the contents of a raw SSTable via the CQL interface,
>> or the contents of the database's internal caches. One could also imagine a
>> feature like this providing functionality similar to a foreign data wrapper
>> in other databases.
>>
>> I don't think the assumption that "virtual tables will always be small
>> and always fit in memory" is a safe one.
>>
>> I don't think we should implicitly add "ALLOW FILTERING" to all queries
>> against virtual tables because of this, in addition to concern with
>> departing from standard CQL semantics for a type of tables deemed special.
>>
>> – Scott
>>
>> On Feb 3, 2023, at 6:52 AM, Maxim Muzafarov  wrote:
>>
>>
>> Hello Stefan,
>>
>> Regarding the decision to implicitly enable ALLOW FILTERING for
>> virtual tables, which also makes sense to me, it may be necessary to
>> consider changing the clustering columns in the virtual table metadata
>> to regular columns as well. The reasons are the same as mentioned
>> earlier: the virtual tables hold their data in memory, thus we do not
>> benefit from the advantages of ordered data (e.g. the ClientsTable and
>> its ClusteringColumn(PORT)).
>>
>> Changing the clustering column to a regular column may simplify the
>> virtual table data model, but I'm afraid it may affect users who rely
>> on the table metadata.
>>
>>
>>
>> On Fri, 3 Feb 2023 at 12:32, Andrés de la Peña 
>> wrote:
>>
>>
>> I think removing the need for ALLOW FILTERING on virtual tables makes
>> sense and would be quite useful for operators.
>>
>> That guard exists for performance issues that shouldn't occur on virtual
>> tables. We also have a flag in case some future virtual table
>> implementation has limitations regarding filtering, although it seems it's
>> not the case with any of the existing virtual tables.
>>
>> It is not like we would promote bad habits because virtual tables are
>> meant to be queried by operators / administrators only.
>>
>>
>> It might even be quite the opposite, since in the current situation users
>> might get used to routinely using ALLOW FILTERING for querying their virtual
>> tables.
>>
>> It has been mentioned on the #cassandra-dev Slack thread where this
>> started (1) that it's kind of an API inconsistency to allow querying by
>> non-primary keys on virtual tables without ALLOW FILTERING, whereas it's
>> required for regular tables. I think that a simple doc update saying that
>> virtual tables, which are not regular tables, support filtering would be
>> enough. Virtual tables are well identified by both the keyspace they belong
>> to and doc, so users shouldn't have trouble knowing whether a table is
>> virtual. It would be similar to the current exception for ALLOW FILTERING,
>> where one needs to use it unless the table has an index for the queried
>> column.
>>
>> (1) https://the-asf.slack.com/archives/CK23JSY2K/p1675352759267329
>>
>> On Fri, 3 Feb 2023 at 09:09, Miklosovic, Stefan <
>> stefan.mikloso...@netapp.com> wrote:
>>
>>
>> Hi list,
>>
>> the content of virtu

Re: [DISCUSS] Merging incremental feature work

2023-02-03 Thread Andrés de la Peña
Yes, I think that some refactors can also be directly merged if they have a
value for the end user on their own. Changes providing cleaner, better
documented and less tightly coupled code can have that value, even if they
aren't a new feature. Things like new APIs without an implementation
probably don't have that value.

I guess the criteria could be that partial CEP changes can be merged if
they would still make sense if there were a release the next day. Or, even
better, as if the next steps on the CEP were to never be completed.

On Fri, 3 Feb 2023 at 13:13, Josh McKenzie  wrote:

> The deeply coupled nature of some areas of our codebase does have some
> constraints it imposes on us here to your point. Without sensible internal
> APIs a lot of this type of work expands into two phases, one to refactor
> out said APIs and the other to introduce new functionality.
>
> It probably depends on what systems we’re extending or replacing and how
> tightly coupled the original design is as to which approach is feasible
> given resourcing.
>
> On Fri, Feb 3, 2023, at 7:48 AM, Sam Tunnicliffe wrote:
>
> This is quite timely as we're just gearing up to begin pushing the work
> we've been doing on CEP-21 into the public domain.
>
> This CEP is slightly different from others that have gone before in that
> it touches almost every area of the system. This presents a few
> implementation challenges, most obviously around feature flagging and
> incremental merging. When we began prototyping and working on the design
> presented in CEP-21 it quickly became apparent that doing things
> incrementally would push an already large changeset into gargantuan
> proportions. Keeping changes isolated and abstracted would itself have
> required a vast amount of refactoring and rework of existing code and
> tests.
>
> I'll go into more detail in a CEP-21 specific mail shortly, but the plan
> we were hoping to follow was to work in a long lived topic branch, with
> JIRAs, sensible commit history and CI, and defer merging to trunk until the
> work as a whole is useable and meets all the existing bars for quality,
> review and the like.
>
>
> On 3 Feb 2023, at 12:43, Josh McKenzie  wrote:
>
> Anything we either a) have to do (JDK support) or b) have all agreed up
> front we think we should do (CEP). I.e. things with a lower risk of being
> left dead in the codebase partially implemented.
>
> I don't think it's a coincidence we've set up other processes to help
> de-risk and streamline the consensus building portion of this work given
> our history with it. We haven't taken steps to optimize the tactical
> execution of it yet.
>
> On Fri, Feb 3, 2023, at 7:09 AM, Brandon Williams wrote:
>
> On Fri, Feb 3, 2023 at 6:06 AM Josh McKenzie  wrote:
> >
> > My current thinking: I'd like to propose we all agree to move to merge
> work into trunk incrementally if it's either:
> > 1) New JDK support
> > 2) An approved CEP
>
> So basically everything?  I'm not sure what large complex bodies of
> work would be left.
>
>
>


Re: [DISCUSS] Merging incremental feature work

2023-02-03 Thread Andrés de la Peña
I'm not sure a CEP is necessarily a big, complex piece of work that needs
to be split into multiple tickets. There could be single-ticket CEPs that
don't need multiple tickets, and still need the bureaucracy of a CEP due to
the impact of the change. However that probably isn't the common case.

Also, there could be complex CEPs with multiple steps that can be
incrementally merged to trunk because some of the steps have a value on
their own. For example, dynamic data masking (CEP-20) first creates CQL
functions for masking data, then it allows attaching those functions to the
schema, and then it allows using UDFs as attached masking functions [1].
Each incremental step has a value on its own. For example, the first step
alone is what MySQL dynamic data masking consists of, just a bunch of
functions.
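
As a rough sketch of that first step, the masking functions are called like
any other CQL function (function names as proposed in CEP-20; the "users"
table and "phone" column here are hypothetical):

    -- Mask everything but the first and last two characters:
    SELECT mask_inner(phone, 2, 2) FROM users;

    -- Replace the value with a constant:
    SELECT mask_replace(phone, '****') FROM users;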

I think that CEP components that offer value on their own to the users can
perfectly be merged to trunk. That's because we could suddenly cut a
release including those changes and be happy with having included them.
However, if there are partial changes that don't give value to the end user
those changes should probably be in a feature branch. And the feature
branch could be merged every time it contains a releasable piece of work.
If the changes on that feature branch turn out to be massive, it could be a
good
idea to create a separate ticket just for the merge, so reviewers can jump
at it and give a final pass with a global perspective.

[1]
https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-20%3A+Dynamic+Data+Masking#CEP20:DynamicDataMasking-Timeline

On Fri, 3 Feb 2023 at 12:48, Sam Tunnicliffe  wrote:

> This is quite timely as we're just gearing up to begin pushing the work
> we've been doing on CEP-21 into the public domain.
>
> This CEP is slightly different from others that have gone before in that
> it touches almost every area of the system. This presents a few
> implementation challenges, most obviously around feature flagging and
> incremental merging. When we began prototyping and working on the design
> presented in CEP-21 it quickly became apparent that doing things
> incrementally would push an already large changeset into gargantuan
> proportions. Keeping changes isolated and abstracted would itself have
> required a vast amount of refactoring and rework of existing code and
> tests.
>
> I'll go into more detail in a CEP-21 specific mail shortly, but the plan
> we were hoping to follow was to work in a long lived topic branch, with
> JIRAs, sensible commit history and CI, and defer merging to trunk until the
> work as a whole is useable and meets all the existing bars for quality,
> review and the like.
>
>
> On 3 Feb 2023, at 12:43, Josh McKenzie  wrote:
>
> Anything we either a) have to do (JDK support) or b) have all agreed up
> front we think we should do (CEP). I.e. things with a lower risk of being
> left dead in the codebase partially implemented.
>
> I don't think it's a coincidence we've set up other processes to help
> de-risk and streamline the consensus building portion of this work given
> our history with it. We haven't taken steps to optimize the tactical
> execution of it yet.
>
> On Fri, Feb 3, 2023, at 7:09 AM, Brandon Williams wrote:
>
> On Fri, Feb 3, 2023 at 6:06 AM Josh McKenzie  wrote:
> >
> > My current thinking: I'd like to propose we all agree to move to merge
> work into trunk incrementally if it's either:
> > 1) New JDK support
> > 2) An approved CEP
>
> So basically everything?  I'm not sure what large complex bodies of
> work would be left.
>
>
>


Re: Implicitly enabling ALLOW FILTERING on virtual tables

2023-02-03 Thread Andrés de la Peña
I think removing the need for ALLOW FILTERING on virtual tables makes sense
and would be quite useful for operators.

That guard exists for performance issues that shouldn't occur on virtual
tables. We also have a flag in case some future virtual table
implementation has limitations regarding filtering, although it seems it's
not the case with any of the existing virtual tables.

It is not like we would promote bad habits because virtual tables are meant
to be queried by operators / administrators only.


It might even be quite the opposite, since in the current situation users
might get used to routinely using ALLOW FILTERING for querying their virtual
tables.

It has been mentioned on the #cassandra-dev Slack thread where this started
(1) that it's kind of an API inconsistency to allow querying by non-primary
keys on virtual tables without ALLOW FILTERING, whereas it's required for
regular tables. I think that a simple doc update saying that virtual
tables, which are not regular tables, support filtering would be enough.
Virtual tables are well identified by both the keyspace they belong to and
doc, so users shouldn't have trouble knowing whether a table is virtual. It
would be similar to the current exception for ALLOW FILTERING, where one
needs to use it unless the table has an index for the queried column.

(1) https://the-asf.slack.com/archives/CK23JSY2K/p1675352759267329

On Fri, 3 Feb 2023 at 09:09, Miklosovic, Stefan <
stefan.mikloso...@netapp.com> wrote:

> Hi list,
>
> the content of virtual tables is held in memory (and / or is fetched every
> time upon request). While doing queries against such a table for a column
> outside of the primary key, users are normally required to specify ALLOW
> FILTERING. This makes total sense for "ordinary tables", for applications to
> have performant and effective queries, but it kind of loses its
> applicability for virtual tables when a table literally holds just a handful
> of entries in memory and it just does not matter, does it?
>
> What do you think about implicitly allowing filtering for virtual tables
> so we save ourselves from these pesky errors when we want to query
> arbitrary column and we need to satisfy CQL spec just to do that?
>
> It is not like we would promote bad habits because virtual tables are
> meant to be queried by operators / administrators only.
>
> We can also explicitly document this behavior.
>
> Among other options, we may try to implement secondary indices on virtual
> tables but I am not completely sure this is what we want because of its
> complexity, etc. Is it even necessary to put such complex logic in place
> just to be able to select any column on a few entries in memory?
>
> I put together a draft here (1). It would even be possible to implicitly
> allow filtering on virtual tables only, and it would be the implementer's
> responsibility to decide that, per table.
>
> For all virtual tables we currently have, I would enable this everywhere.
> I do not think there is any virtual table where we would not want to enable
> it or where people HAVE TO specify that.
>
> (1) https://github.com/apache/cassandra/pull/2131
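
To make the pain point concrete, here is a hedged sketch against the
existing system_views.clients virtual table (assuming "username" is one of
its non-primary-key columns, as in recent versions):

    -- Rejected today with an InvalidRequest error asking for ALLOW FILTERING:
    SELECT * FROM system_views.clients WHERE username = 'alice';

    -- Accepted, even though the table holds a handful of in-memory rows:
    SELECT * FROM system_views.clients WHERE username = 'alice' ALLOW FILTERING;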


Re: Welcome Patrick McFadin as Cassandra Committer

2023-02-02 Thread Andrés de la Peña
Congratulations, Patrick!

On Thu, 2 Feb 2023 at 19:54, Francisco Guerrero  wrote:

> Congrats, Patrick!
>
> On 2023/02/02 19:52:35 Jean-Armel Luce wrote:
> > Congrats, Patrick
> >
> > Le jeu. 2 févr. 2023 à 20:46, David Capwell  a
> écrit :
> >
> > > Congrats and welcome! =)
> > >
> > > On Feb 2, 2023, at 10:53 AM, J. D. Jordan 
> > > wrote:
> > >
> > > Congrats!
> > >
> > > On Feb 2, 2023, at 12:47 PM, Christopher Bradford <
> bradfor...@gmail.com>
> > > wrote:
> > >
> > > 
> > > Congrats Patrick! Well done.
> > >
> > > On Thu, Feb 2, 2023 at 10:44 AM Aaron Ploetz 
> > > wrote:
> > >
> > >> Patrick FTW!!!
> > >>
> > >> On Thu, Feb 2, 2023 at 12:32 PM Joseph Lynch 
> > >> wrote:
> > >>
> > >>> W! Congratulations Patrick!!
> > >>>
> > >>> -Joey
> > >>>
> > >>> On Thu, Feb 2, 2023 at 9:58 AM Benjamin Lerer 
> wrote:
> > >>>
> >  The PMC members are pleased to announce that Patrick McFadin has
> >  accepted
> >  the invitation to become committer today.
> > 
> >  Thanks a lot, Patrick, for everything you have done for this project
> >  and its community through the years.
> > 
> >  Congratulations and welcome!
> > 
> >  The Apache Cassandra PMC members
> > 
> > >>> --
> > >
> > > Christopher Bradford
> > >
> > >
> > >
> >
>


Re: [DISCUSS] API modifications and when to raise a thread on the dev ML

2023-02-02 Thread Andrés de la Peña
I guess that depends on the type of change, and what we consider an API.

If it's a breaking change, like removing a method or property, I think we
would need a DISCUSS API thread prior to making changes. However, if the
change is an addition, like adding a new yaml property or a JMX method, I
think JIRA suffices.

As for what we consider a public API, we usually include extensible
interfaces in this. For example, we can agree that the Index interface for
secondary indexes is a public API. However, that interface exposes many
other interfaces and classes through its methods. For example, that Index
interface exposes ColumnFamilyStore, SSTableReader, ColumnMetadata,
AbstractType, PartitionUpdate, etc. Changes into those indirectly exposed
classes can easily break custom index implementations out there. Should we
consider them public APIs too, and require a DISCUSS thread for every
change on them? Should that include new methods that wouldn't break
compatibility?

On Thu, 2 Feb 2023 at 09:29, Benedict Elliott Smith 
wrote:

> Closing the loop on seeking consensus for UX/UI/API changes, I see a few
> options. Can we rank choice vote please?
>
> A - Jira suffices
> B - Post a DISCUSS API thread prior to making changes
> C - Periodically publish a list of API changes for retrospective
> consideration by the community
>
> Points raised in the discussion included: lowering the bar for
> participation and avoiding unnecessary burden to developers.
>
> I vote B (only) because I think broader participation is most important
> for these topics.
>
>
> On 7 Dec 2022, at 15:43, Mick Semb Wever  wrote:
>
> I think it makes sense to look into improving visibility of API changes,
>> so people can more easily review a summary of API changes versus reading
>> through the whole changelog (perhaps we need a summarized API change log?).
>>
>
>
> Agree Paulo.
>
> Observers should be able to see all API changes early. We can do better
> than telling downstream users/devs "you have to listen to all jira tickets"
> or "you have to watch the code and pick up changes". Watching CHANGES.txt
> or NEWS.txt or CEPs doesn't solve the need either.
>
> Observing such changes as early as possible can save a significant amount
> of effort and headache later on, and should be encouraged. If done
> correctly I can imagine it will help welcome more contributors.
>
> I can also see that we can improve at, and have a better shared
> understanding of, categorising the types of API changes:
> addition/change/deprecation/removal, signature/output/behavioural, API/SPI.
> So I can see value here for both observers and for ourselves.
>
>
>


Re: Introducing mockito-inline library among test dependencies

2023-01-12 Thread Andrés de la Peña
+1 for the same reasons.

On Wed, 11 Jan 2023 at 20:14, David Capwell  wrote:

> +1. We already use mockito. Also that library is basically empty, it's
> just defining configs for extensions (see
> https://github.com/mockito/mockito/tree/main/subprojects/inline/src/main/resources/mockito-extensions
> )
>
> On Jan 11, 2023, at 12:02 PM, Miklosovic, Stefan <
> stefan.mikloso...@netapp.com> wrote:
>
> Hi list,
>
> the test for (1) is using the mockito-inline dependency for mocking static
> methods, as mockito-core is not able to do that on its own. mockito-inline
> was not part of our test dependencies prior to this work. I want to ask if we
> are all OK with being able to mock static methods from now on with the help
> of this library.
>
> Please tell me if we are mocking static methods already by some other (to
> me yet unknown) means so we do not include this unnecessarily.
>
> G:A:V is org.mockito:mockito-inline:4.7.0
>
> (1) https://issues.apache.org/jira/browse/CASSANDRA-14361
>
> Thanks
>
>
>


Re: [VOTE] CEP-25: Trie-indexed SSTable format

2022-12-19 Thread Andrés de la Peña
+1

On Mon, 19 Dec 2022 at 15:11, Aleksey Yeshchenko  wrote:

> +1
>
> On 19 Dec 2022, at 13:42, Ekaterina Dimitrova 
> wrote:
>
> +1
>
> On Mon, 19 Dec 2022 at 8:30, J. D. Jordan 
> wrote:
>
>> +1 nb
>>
>> > On Dec 19, 2022, at 7:07 AM, Brandon Williams  wrote:
>> >
>> > +1
>> >
>> > Kind Regards,
>> > Brandon
>> >
>> >> On Mon, Dec 19, 2022 at 6:59 AM Branimir Lambov 
>> wrote:
>> >>
>> >> Hi everyone,
>> >>
>> >> I'd like to propose CEP-25 for approval.
>> >>
>> >> Proposal:
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-25%3A+Trie-indexed+SSTable+format
>> >> Discussion:
>> https://lists.apache.org/thread/3dpdg6dgm3rqxj96cyhn58b50g415dyh
>> >>
>> >> The vote will be open for 72 hours.
>> >> Votes by committers are considered binding.
>> >> A vote passes if there are at least three binding +1s and no binding
>> vetoes.
>> >>
>> >> Thank you,
>> >> Branimir
>>
>
>


Re: [VOTE] Release Apache Cassandra 4.1.0 (take2)

2022-12-09 Thread Andrés de la Peña
+1

On Fri, 9 Dec 2022 at 15:41, Josh McKenzie  wrote:

> +1
>
> On Fri, Dec 9, 2022, at 10:28 AM, Marcus Eriksson wrote:
>
> +1
>
> On Wed, Dec 07, 2022 at 10:40:21PM +0100, Mick Semb Wever wrote:
> > Proposing the (second) test build of Cassandra 4.1.0 for release.
> >
> > sha1: f9e033f519c14596da4dc954875756a69aea4e78
> > Git:
> >
> https://gitbox.apache.org/repos/asf?p=cassandra.git;a=shortlog;h=refs/tags/4.1.0-tentative
> > Maven Artifacts:
> >
> https://repository.apache.org/content/repositories/orgapachecassandra-1282/org/apache/cassandra/cassandra-all/4.1.0/
> >
> > The Source and Build Artifacts, and the Debian and RPM packages and
> > repositories, are available here:
> > https://dist.apache.org/repos/dist/dev/cassandra/4.1.0/
> >
> > The vote will be open for 96 hours (longer if needed). Everyone who has
> > tested the build is invited to vote. Votes by PMC members are considered
> > binding. A vote passes if there are at least three binding +1s and no
> -1's.
> >
> > [1]: CHANGES.txt:
> >
> https://gitbox.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=CHANGES.txt;hb=refs/tags/4.1.0-tentative
> > [2]: NEWS.txt:
> >
> https://gitbox.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=NEWS.txt;hb=refs/tags/4.1.0-tentative
>
>


Re: Aggregate functions on collections, collection functions and MAXWRITETIME

2022-12-09 Thread Andrés de la Peña
Note that specialized collection functions are also an opportunity for
optimization. For example, COLLECTION_COUNT reads only the first bytes of a
serialized collection, since those bytes contain the number of elements in
that collection. The simplest implementation of
COUNT(UNNEST(collection)) wouldn't do that. It would probably require
deserializing the entire collection. A fancy query optimizer could internally
translate COUNT(UNNEST(collection)) to COLLECTION_COUNT(collection) to get
a nice performance improvement. Unfortunately we don't have such an
optimizer at the moment.
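
A hedged sketch of the distinction, assuming a hypothetical table "t" with a
list<int> column "l":

    -- Scalar collection function: the number of elements in each row's list:
    SELECT COLLECTION_COUNT(l) FROM t;

    -- Aggregate function: the number of rows with a non-null "l":
    SELECT COUNT(l) FROM t;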

I don't see a reason not to leave those collection functions as they are,
renaming aside. They can be later complemented by the generic UNNEST, or
subqueries, when someone is willing to work on those features.

It's not clear to me how UNNEST + UDA would operate on maps. We would still
need a way to extract the keys or the values of maps, like the current
MAP_KEYS and MAP_VALUES functions, wouldn't we?

As for the impossibility of applying COLLECTION_MAX, COLLECTION_MIN, etc.
to maps, I wouldn't be against renaming those to SET_MAX, LIST_MAX,
SET_MIN, LIST_MIN, etc. Sets and lists have many things in common and it's a
pity that we don't have a common name for them. This lack of a common name
for lists and sets is something that permeates the code in multiple
places. This kind of problem is probably the reason why Java's maps
aren't collections.

On Fri, 9 Dec 2022 at 11:26, Benedict  wrote:

> Right, this is basically my view - it can be syntactic sugar for UNNEST
> subqueries as and when we offer those (if not now), but I think we should
> be able to apply any UDA or aggregate to collections with some syntax
> that’s ergonomic.
>
> I don’t think APPLY is the right way to express it, my version was
> MAX(column AS COLLECTION) which means bind this operator to the collection
> rather than the rows (and since this is CAST-like, I’d say this is also a
> reasonable way to apply aggregations to single values too)
>
> But perhaps there’s some other third option. Or, if not, let’s simply
> support UNNEST subqueries.
>
>
> On 9 Dec 2022, at 11:19, Claude Warren, Jr via dev <
> dev@cassandra.apache.org> wrote:
>
> 
> I still think that semantically it makes sense to have a function that
> applies an aggregate to various collection types.  So rather than building
> ARRAY_MAX do APPLY(MAX, column)) or APPLY(MAX(column)) it is clear what is
> being requested and APPLY can be the source of truth for which aggregate
> functions work on which column types.
>
> On Fri, Dec 9, 2022 at 10:28 AM Andrés de la Peña 
> wrote:
>
>> Indeed this discussion is useful now that we know that there is
>> dissension about these changes. However, as those changes were happening
>> none of the persons involved in them felt the need for a DISCUSS thread, and
>> I opened this thread as soon as Benedict objected to the changes. I think this
>> is perfectly in line with ASF's wise policy about lazy consensus:
>> https://community.apache.org/committers/lazyConsensus.html
>>
>> My point here was that none I could find use ARRAY_COUNT - either
>>> ARRAY_SIZE or ARRAY_LENGTH
>>
>>
>> A quick search on Google shows:
>> https://www.ibm.com/docs/en/psfa/7.2.1?topic=functions-array-count
>>
>> https://docs.couchbase.com/server/current/n1ql/n1ql-language-reference/arrayfun.html#fn-array-count
>>
>> https://www.vertica.com/docs/9.3.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Array/ARRAY_COUNT.htm
>>
>> We seem to be mixing and matching our databases we use as templates here.
>>> Most of the standard examples we use (Postgres, MySQL etc) do not offer
>>> this, and do not offer equivalent functions on any of their other
>>> collection types.
>>
>>
>> I don't know what's wrong with those databases, nor what makes MySQL and
>> Postgres more standard than others. AFAIK MySQL doesn't even have an array
>> type, so it hardly is going to have array functions. Postgres however does
>> have arrays. Those arrays can be manipulated with both subqueries and an
>> unnest function. However, this doesn't prevent Postgres from also having a set
>> of array functions:
>> https://www.postgresql.org/docs/12/functions-array.html
>>
>> It seems that it is difficult to find a DB supporting arrays that doesn't
>> also offer an assorted set of array functions. DBs can perfectly support
>> subqueries, unnesting and utility functions at the same time. For example,
>> with Postgres you can get the size of an array with a subquery, or with
>> UNNEST, or with the ARRAY_LENGTH function.
>>
>> The collection functions that we discuss here are mostly analogous to

Re: Aggregate functions on collections, collection functions and MAXWRITETIME

2022-12-09 Thread Andrés de la Peña
Indeed this discussion is useful now that we know that there is dissension
about these changes. However, as those changes were happening none of the
persons involved in them felt the need for a DISCUSS thread, and I opened
this thread as soon as Benedict objected to the changes. I think this is
perfectly in line with ASF's wise policy about lazy consensus:
https://community.apache.org/committers/lazyConsensus.html

My point here was that none I could find use ARRAY_COUNT - either
> ARRAY_SIZE or ARRAY_LENGTH


A quick search on Google shows:
https://www.ibm.com/docs/en/psfa/7.2.1?topic=functions-array-count
https://docs.couchbase.com/server/current/n1ql/n1ql-language-reference/arrayfun.html#fn-array-count
https://www.vertica.com/docs/9.3.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Array/ARRAY_COUNT.htm

We seem to be mixing and matching our databases we use as templates here.
> Most of the standard examples we use (Postgres, MySQL etc) do not offer
> this, and do not offer equivalent functions on any of their other
> collection types.


I don't know what's wrong with those databases, nor what makes MySQL and
Postgres more standard than others. AFAIK MySQL doesn't even have an array
type, so it hardly is going to have array functions. Postgres however does
have arrays. Those arrays can be manipulated with both subqueries and an
unnest function. However, this doesn't prevent Postgres from also having a set
of array functions: https://www.postgresql.org/docs/12/functions-array.html

It seems that it is difficult to find a DB supporting arrays that doesn't
also offer an assorted set of array functions. DBs can perfectly support
subqueries, unnesting and utility functions at the same time. For example,
with Postgres you can get the size of an array with a subquery, or with
UNNEST, or with the ARRAY_LENGTH function.

The collection functions that we discuss here are mostly analogous to those
sets of functions that we find anywhere. They are a quite small,
well-encapsulated and non-invasive feature that has the advantage of being
already done. Those functions don't seem to prevent us from adding support for
subqueries or unnesting whenever someone wants to work on them.

Adding subqueries at a later stage wouldn't involve the deprecation of the
collection functions, since those are still useful as a shortcut, as we see
in other DBs out there.



On Thu, 8 Dec 2022 at 20:47, J. D. Jordan  wrote:

> I think this thread proves the point that a DISCUSS thread for API changes
> on dev@ will get more viewpoints than just having something in JIRA. I
> think this thread has been useful and should result in us having a better
> user facing API than without it.
>
> On Dec 8, 2022, at 1:57 PM, Andrés de la Peña 
> wrote:
>
> 
>
>> I expect we’ll rehash it every API thread otherwise.
>
>
> Since you bring up the topic, I understand that opposing every single
> reviewed decision that has been taken on CASSANDRA-17811, CASSANDRA-8877,
> CASSANDRA-17425 and CASSANDRA-18085 could make an argument in favour of the
> policy demanding a DISCUSS thread for every new feature, big or small. One
> could argue that any change to the work on this area that comes out from
> the current discussion could have been prevented by a previous DISCUSS
> thread.
>
> However, those tickets have been around for a long time in a very public
> way, and
> there hasn't been any controversy around them until now. So I think that an
> initial thread wouldn't have prevented that anyone can resurrect the
> discussion at any point in time if they haven't put the time to look into
> the changes before. We have already seen those after-commit discussions
> even for discussed, voted, approved and reviewed CEPs. The only thing we
> would get is two discussion threads instead of one. By the looks of it, I
> doubt that the suggested policy about discuss threads is going to be
> accepted.
>
> In any case, this is a separate topic from what we're discussing here.
>
>
>
> On Thu, 8 Dec 2022 at 18:21, Benedict  wrote:
>
>> It feels like this is a recurring kind of discussion, and I wonder if
>> there’s any value in deciding on a general approach to guide these
>> discussions in future? Are we aiming to look like SQL as we evolve, and if
>> so which products do we want to be informed by?
>>
>> I expect we’ll rehash it every API thread otherwise.
>>
>> On 8 Dec 2022, at 17:37, Benedict  wrote:
>>
>> 
>>
>> 1) Do they offer ARRAY_SUM or ARRAY_AVG?
>>
>> Yes, a quick search on Google shows some examples:
>> https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/68fdFR3LWhx7KtHc9Iv5Qg

Re: Aggregate functions on collections, collection functions and MAXWRITETIME

2022-12-08 Thread Andrés de la Peña
> callers don't need to know the type of the column.
> That's the motivation behind the idea of doing the same with the collection
> functions, so they can entirely replace MAXWRITE.
>
> However I wouldn't be against leaving the collection functions working
> only on collections, as originally designed, and as they currently are on
> trunk. The question is what we do with MAXWRITETIME. That function is also
> only on trunk, and it might be repetitive given the more generic collection
> functions. It's also a bit odd that there isn't, for example, a similar
> MINTTL function. Maybe we should start a separate discussion thread about
> that new function?
>
>
> I think we should figure out our overall strategy - these are all pieces
> of the puzzle IMO. But I guess the above questions seem to come first and
> will shape this. I would be in favour of some general approach, however,
> such as either first casting to a collection, or passing an aggregation
> operator to WRITETIME.
>
>
> On 8 Dec 2022, at 17:13, Andrés de la Peña  wrote:
>
> 
>
>> 1) Do they offer ARRAY_SUM or ARRAY_AVG?
>
> Yes, a quick search on Google shows some examples:
> https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/68fdFR3LWhx7KtHc9Iv5Qg
> https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/gxz1nB7GclxNO5mBd~rn8g
>
> https://docs.upsolver.com/sqlake/functions-and-operators-reference/array-functions/array_sum
> https://docs.firebolt.io/sql-reference/functions-reference/array-sum.html
>
> 2) Do they define ARRAY_COUNT or ARRAY_LENGTH?
>
> Yes, again we can search for some examples:
> https://docs.snowflake.com/en/sql-reference/functions/array_size.html
> https://docs.databricks.com/sql/language-manual/functions/array_size.html
>
> 3) A map is a collection in C* parlance, but I gather from below you
>> expect these methods not to operate on them?
>
> Nope, only COLLECTION_COUNT works on sets, lists and maps. COLLECTION_MIN
> and COLLECTION_MAX require a set or list. COLLECTION_SUM and COLLECTION_AVG
> require a numeric collection, the same way that the ARRAY_SUM and ARRAY_AVG
> functions above require a numeric array.
>
> Does ARRAY_MAX operate on single values? If we are to base our decisions
>> on norms elsewhere, we should be consistent about it.
>
> It doesn't in any of the examples above. Those functions aren't a standard so
> I don't know if there are other DBs around that support it. In any case,
> the fact that we look for inspiration in other databases to minimize
> surprise etc. doesn't mean that we have to do exactly the same. After all,
> CQL is not SQL and our collections aren't SQL arrays.
>
> Note that the collection functions added by CASSANDRA-8877 don't operate
> on single values either. That idea was proposed by Yifan on CASSANDRA-18078
> and it looked good to Francisco and me. The patch is on CASSANDRA-18085,
> already reviewed and blocked waiting on the outcome of this discussion.
>
> The new collection functions can do the same as the new MAXWRITETIME
> function, but not only for getting max timestamps, but also min timestamps
> and min/max ttls. The missing part is that MAXWRITETIME can accept both
> collections and single elements, so callers don't need to know the type of
> the column. That's the motivation behind the idea of doing the same with the
> collection functions, so they can entirely replace MAXWRITETIME.
>
> However I wouldn't be against leaving the collection functions working
> only on collections, as originally designed, and as they currently are on
> trunk. The question is what we do with MAXWRITETIME. That function is also
> only on trunk, and it might be repetitive given the more generic collection
> functions. It's also a bit odd that there isn't, for example, a similar
> MINTTL function. Maybe we should start a separate discussion thread about
> that new function?
>
>
>
> On Thu, 8 Dec 2022 at 14:21, Benedict  wrote:
>
>> 1) Do they offer ARRAY_SUM or ARRAY_AVG?
>> 2) Do they define ARRAY_COUNT or ARRAY_LENGTH?
>> 3) A map is a collection in C* parlance, but I gather from below you
>> expect these methods not to operate on them?
>>
>> Does ARRAY_MAX operate on single values? If we are to base our decisions
>> on norms elsewhere, we should be consistent about it.
>>
>> It’s worth noting that ARRAY is an ISO SQL concept, as is MULTISET. Some
>> databases also have Set or Map types, such as MySQL’s Set and Postgres’
>> hstore. These databases only support ARRAY_ functions, seemingly, plus
>> special MULTISET operators defined by the SQL standard where that data type
>> is supported.
>>
>>
>>
>> On 8 Dec 2022, at 12:11, Andrés de la Peña  wrote:
>>
>> 
>

Re: Aggregate functions on collections, collection functions and MAXWRITETIME

2022-12-08 Thread Andrés de la Peña
>
> 1) Do they offer ARRAY_SUM or ARRAY_AVG?

Yes, a quick search on Google shows some examples:
https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/68fdFR3LWhx7KtHc9Iv5Qg
https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/gxz1nB7GclxNO5mBd~rn8g
https://docs.upsolver.com/sqlake/functions-and-operators-reference/array-functions/array_sum
https://docs.firebolt.io/sql-reference/functions-reference/array-sum.html

2) Do they define ARRAY_COUNT or ARRAY_LENGTH?

Yes, again we can search for some examples:
https://docs.snowflake.com/en/sql-reference/functions/array_size.html
https://docs.databricks.com/sql/language-manual/functions/array_size.html

3) A map is a collection in C* parlance, but I gather from below you expect
> these methods not to operate on them?

Nope, only COLLECTION_COUNT works on sets, lists and maps. COLLECTION_MIN
and COLLECTION_MAX require a set or list. COLLECTION_SUM and COLLECTION_AVG
require a numeric collection, the same way that the ARRAY_SUM and ARRAY_AVG
functions above require a numeric array.

Does ARRAY_MAX operate on single values? If we are to base our decisions on
> norms elsewhere, we should be consistent about it.

It doesn't in any of the examples above. Those functions aren't a standard so
I don't know if there are other DBs around that support it. In any case,
the fact that we look for inspiration in other databases to minimize
surprise etc. doesn't mean that we have to do exactly the same. After all,
CQL is not SQL and our collections aren't SQL arrays.

Note that the collection functions added by CASSANDRA-8877 don't operate on
single values either. That idea was proposed by Yifan on CASSANDRA-18078
and it looked good to Francisco and me. The patch is on CASSANDRA-18085,
already reviewed and blocked waiting on the outcome of this discussion.

The new collection functions can do the same as the new MAXWRITETIME function,
but not only for getting max timestamps, but also min timestamps and
min/max ttls. The missing part is that MAXWRITETIME can accept both collections
and single elements, so callers don't need to know the type of the column.
That's the motivation behind the idea of doing the same with the collection
functions, so they can entirely replace MAXWRITETIME.

However I wouldn't be against leaving the collection functions working only
on collections, as originally designed, and as they currently are on trunk.
The question is what we do with MAXWRITETIME. That function is also only on
trunk, and it might be repetitive given the more generic collection
functions. It's also a bit odd that there isn't, for example, a similar
MINTTL function. Maybe we should start a separate discussion thread about
that new function?



On Thu, 8 Dec 2022 at 14:21, Benedict  wrote:

> 1) Do they offer ARRAY_SUM or ARRAY_AVG?
> 2) Do they define ARRAY_COUNT or ARRAY_LENGTH?
> 3) A map is a collection in C* parlance, but I gather from below you
> expect these methods not to operate on them?
>
> Does ARRAY_MAX operate on single values? If we are to base our decisions
> on norms elsewhere, we should be consistent about it.
>
> It’s worth noting that ARRAY is an ISO SQL concept, as is MULTISET. Some
> databases also have Set or Map types, such as MySQL’s Set and Postgres’
> hstore. These databases only support ARRAY_ functions, seemingly, plus
> special MULTISET operators defined by the SQL standard where that data type
> is supported.
>
>
>
> On 8 Dec 2022, at 12:11, Andrés de la Peña  wrote:
>
> 
> "ARRAY_MAX" and "ARRAY_MIN" functions to get the max/min element in a list
> are not an uncommon practice. You can find them in SparkSQL, Amazon
> Timestream, Teradata, etc. Since we have what we call collections instead
> of arrays, it makes sense to call the analogous functions "COLLECTION_MAX",
> "COLLECTION_MIN", etc.
>
> As for maps, CASSANDRA-8877 also introduced "MAP_KEYS" and "MAP_VALUES"
> functions to get the keys or the values of a map, so one can feed them to
> "MAX", "COLLECTION_MAX", etc. That isn't anything too original either, you
> can find identical functions on SparkSQL for example.
>
> I find simple utility functions easier to use than subqueries. But we
> don't have to choose. We can also have subqueries if someone finds the time
> to work on them.
>
> On Thu, 8 Dec 2022 at 12:04, Claude Warren, Jr via dev <
> dev@cassandra.apache.org> wrote:
>
>>  I think the semantics of the situation is important here.
>>
>>
>> Let’s take MAX as our example aggregate function..
>>
>>
>> We all expect that in a DB context MAX(column) will return the value of
>> the column with the maximum value. That is the expected semantics of MAX.
>>
>>
>> The question here is that there are data types that are multi-valued and
>> the

Re: Aggregate functions on collections, collection functions and MAXWRITETIME

2022-12-08 Thread Andrés de la Peña
"ARRAY_MAX" and "ARRAY_MIN" functions to get the max/min element in a list
are not an uncommon practice. You can find them in SparkSQL, Amazon
Timestream, Teradata, etc. Since we have what we call collections instead
of arrays, it makes sense to call the analogous functions "COLLECTION_MAX",
"COLLECTION_MIN", etc.

As for maps, CASSANDRA-8877 also introduced "MAP_KEYS" and "MAP_VALUES"
functions to get the keys or the values of a map, so one can feed them to
"MAX", "COLLECTION_MAX", etc. That isn't anything too original either, you
can find identical functions on SparkSQL for example.
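
A hedged sketch of that combination, assuming a hypothetical table "t" with
a map<text, int> column "m":

    -- Largest value stored in each row's map:
    SELECT COLLECTION_MAX(MAP_VALUES(m)) FROM t;

    -- Number of keys in each row's map:
    SELECT COLLECTION_COUNT(MAP_KEYS(m)) FROM t;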

I find simple utility functions easier to use than subqueries. But we don't
have to choose. We can also have subqueries if someone finds the time to
work on them.

On Thu, 8 Dec 2022 at 12:04, Claude Warren, Jr via dev <
dev@cassandra.apache.org> wrote:

>  I think the semantics of the situation is important here.
>
>
> Let’s take MAX as our example aggregate function..
>
>
> We all expect that in a DB context MAX(column) will return the value of
> the column with the maximum value. That is the expected semantics of MAX.
>
>
> The question here is that there are data types that are multi-valued and
> there is a desire to apply MAX to the values within the column. I would
> expect that this would return the maximum value of the column for every row in
> the DB.
>
>
> So if there were a keyword that operated like the Java BiFunction class
> where the Function would apply a second function to the column data. For
> purposes of this discussion let’s call this Function APPLY.
>
>
> So APPLY( MAX, column ) would return the maximum value from the column for
> each row in the DB.
>
>
> MAX(APPLY(MAX,column)) would get the maximum value from the column across
> all the rows.
>
>
> Similarly APPLY could be used with other functions MAX(APPLY(MIN,column))
> the largest minimum value from the column across all rows.
>
>
> These statements make clear semantically what is being asked for.
>
> On Thu, Dec 8, 2022 at 10:57 AM Benedict  wrote:
>
>> I meant unnest, not unwrap.
>>
>> On 8 Dec 2022, at 10:34, Benedict  wrote:
>>
>> 
>> 
>>
>> I do not think we should have functions that aggregate across rows and
>> functions that operate within a row sharing the same name.
>>
>>
>> I’m sympathetic to that view for sure. I wouldn’t be too disappointed by
>> that outcome, and SQL engines seem to take a similar approach, however they
>> mostly rely on sub-queries to get around this problem, and the SQL standard
>> introduces UNWRAP for operating on arrays (by translating them into a
>> table), permitting subqueries to aggregate them. It seems to me we have
>> four options:
>>
>> 1) introduce functionality similar to UNWRAP and subqueries
>> 2) introduce new syntax to permit operating on collections with the same
>> functions
>> 3) permit the same functions to operate on both, with a precedence order,
>> and introduce syntax to permit breaking the precedence order
>> 4) introduce new functions
>>
>> (1) might look like SELECT (SELECT MAX(item) FROM UNWRAP(list)) AS
>> max_item FROM table
>>
>> (2) and (3) might look something like:
>>
>> SELECT MAX(list AS COLLECTION) or
>> SELECT MAX(list AS ROWS)
>>
>> (4) might look something like we have already, but perhaps with different
>> names
>>
>> The comparator for collections is the lexicographical compare on the
>> collection items
>>
>>
>> This is a fair point, I mistakenly thought it sorted first on size. Even
>> this definition is a little funkier for Map types, where the values of a
>> key may cause something to sort earlier than a map whose next key sorts
>> first. There are multiple potential lexicographical sorts for Maps (i.e.,
>> by keys first, then values, or by (key, value) pairs), so this is
>> particularly poorly defined IMO.
>>
>> The maximum of a blob type is pretty well defined I think, as are
>> boolean, inetaddress etc. However, even for List or Set collections there’s
>> multiple reasonable functions one could define for maximum, so it would
>> make more sense to me to permit the user to define the comparison as part
>> of the MAX function if we are to offer it. However, with the
>> lexicographical definition we have I am somewhat less concerned for Set and
>> List. Map seems like a real problem though, if we support these operators
>> (which perhaps we do not).
>>
>>
>> On 7 Dec 2022, at 12:13, Andrés de la Peña  wrote:
>>
>> 
>> 

Re: Aggregate functions on collections, collection functions and MAXWRITETIME

2022-12-07 Thread Andrés de la Peña
 I don’t think those functions being cross row aggregations for
> some column types, but within row collection operations for other types, is
> any more intuitive, and actually would be more confusing.  So I am -1 on
> using the same names.
> >>
> >>> 3. I think it is peculiar to permit methods named collection_ to
> operate over non-collection types when they are explicitly collection
> variants.
> >>
> >> While I could see some point to this, I do not think it would be
> confusing for something named collection_XXX to treat a non-collection as a
> collection of 1.  But maybe there is a better name for these functions.
> Rather than seeing them as collection variants, we should see them as
> variants that operate on the data in a single row, rather than aggregating
> across multiple rows.  But even with that perspective I don’t know what the
> best name would be.
> >>
> >>>> On Dec 6, 2022, at 7:30 AM, Benedict  wrote:
> >>>
> >>> Thanks Andres, I think community input on direction here will be
> invaluable. There’s a bunch of interrelated tickets, and my opinions are as
> follows:
> >>>
> >>> 1. I think it is a mistake to offer a function MAX that operates over
> rows containing collections, returning the collection with the most
> elements. This is just a nonsensical operation to support IMO. We should
> decide as a community whether we “fix” this aggregation, or remove it.
> >>> 2. I think “collection_" prefixed methods are non-intuitive for
> discovery, and all-else equal it would be better to use MAX,MIN, etc, same
> as for aggregations.
> >>> 3. I think it is peculiar to permit methods named collection_ to
> operate over non-collection types when they are explicitly collection
> variants.
> >>>
> >>> Given (1), (2) becomes simple except for COUNT which remains
> ambiguous, but this could be solved by either providing a separate method
> for collections (e.g. SIZE) which seems fine to me, or by offering a
> precedence order for matching and a keyword for overriding the precedence
> order (e.g. COUNT(collection AS COLLECTION)).
> >>>
> >>> Given (2), (3) is a little more difficult. However, I think this can
> be solved several ways.
> >>> - We could permit explicit casts to collection types, that for a
> collection type would be a no-op, and for a single value would create a
> collection
> >>> - With precedence orders, by always selecting the scalar function last
> >>> - By permitting WRITETIME to accept a binary operator reduce function
> to resolve multiple values
> >>>
> >>> These decisions all imply trade-offs on each other, and affect the
> evolution of CQL, so I think community input would be helpful.
> >>>
> >>>>> On 6 Dec 2022, at 12:44, Andrés de la Peña 
> wrote:
> >>>>
> >>>> 
> >>>> This will require some long introduction for context:
> >>>>
> >>>> The MAX/MIN functions aggregate rows to get the row with min/max
> column value according to their comparator. For collections, the comparison
> is on the lexicographical order of the collection elements. That's the very
> same comparator that is used when collections are used as clustering keys
> and for ORDER BY.
> >>>>
> >>>> However, a bug in the MIN/MAX aggregate functions meant that
> the results were presented in their unserialized form, although the row
> selection was correct. That bug was recently solved by CASSANDRA-17811.
> During that ticket the option of simply disabling MIN/MAX on collections
> was also considered, since applying those functions to collections doesn't
> seem super useful. However, that option was quickly discarded
> and the operation was fixed so the MIN/MAX functions correctly work for
> every data type.
> >>>>
> >>>> As a byproduct of the internal improvements of that fix,
> CASSANDRA-8877 introduced a new set of functions that can perform
> aggregations of the elements of a collection. Those where named "map_keys",
> "map_values", "collection_min", "collection_max", "collection_sum", and
> "collection_count". Those are the names mentioned on the mail list thread
> about function naming conventions. Despite doing a kind of
> within-collection aggregation, these functions are not what we usually call
> aggregate functions, since they don't aggregate multiple rows together.
> >>>>
> >>>> On a different line of work, CASSANDRA-17425 added to trunk a
> MAXWRITETIME

Aggregate functions on collections, collection functions and MAXWRITETIME

2022-12-06 Thread Andrés de la Peña
This will require some long introduction for context:

The MAX/MIN functions aggregate rows to get the row with min/max column
value according to their comparator. For collections, the comparison is on
the lexicographical order of the collection elements. That's the very same
comparator that is used when collections are used as clustering keys and
for ORDER BY.
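
For example, a minimal sketch assuming a hypothetical table "t" with a
list<int> column "l":

    -- With two rows holding l = [1, 9] and l = [2, 0]:
    SELECT MAX(l) FROM t;  -- returns [2, 0], by lexicographical comparison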

However, a bug in the MIN/MAX aggregate functions meant that the
results were presented in their unserialized form, although the row
selection was correct. That bug was recently solved by CASSANDRA-17811.
During that ticket the option of simply disabling MIN/MAX on collections
was also considered, since applying those functions to collections doesn't
seem super useful. However, that option was quickly discarded
and the operation was fixed so the MIN/MAX functions correctly work for
every data type.

As a byproduct of the internal improvements of that fix, CASSANDRA-8877
introduced a new set of functions that can perform aggregations of the
elements of a collection. Those were named "map_keys", "map_values",
"collection_min", "collection_max", "collection_sum", and
"collection_count". Those are the names mentioned on the mailing list thread
about function naming conventions. Despite doing a kind of
within-collection aggregation, these functions are not what we usually call
aggregate functions, since they don't aggregate multiple rows together.

On a different line of work, CASSANDRA-17425 added to trunk a MAXWRITETIME
function to get the max timestamp of a multi-cell column. However, the new
collection functions can be used in combination with the WRITETIME and TTL
functions to retrieve the min/max/sum/avg timestamp or ttl of a multi-cell
column. Since the new functions give a generic way of aggregating
timestamps and TTLs of multi-cell columns, CASSANDRA-18078 proposed to
remove that MAXWRITETIME function.

Yifan Cai, author of the MAXWRITETIME function, agreed to remove that
function in favour of the new generic collection functions. However, the
MAXWRITETIME function can work on both single-cell and multi-cell columns,
whereas "COLLECTION_MAX(WRITETIME(column))" would only work on multi-cell
columns, That's because MAXWRITETIME of a not-multicell column doesn't
return a collection, and one should simply use "WRITETIME(column)" instead.
So it was proposed in CASSANDRA-18037 that collections functions applied to
a not-collection value consider that value as the only element of a
singleton collection. So, for example, COLLECTION_MAX(7) =
COLLECTION_MAX([7]) = 7. That ticket has already been reviewed and it's
mostly ready to commit.

Now we can go straight to the point:

Recently Benedict brought back the idea of deprecating aggregate functions
applied to collections, the very same idea that was mentioned on
CASSANDRA-17811 description almost four months ago. That way we could
rename the new collection functions MIN/MAX/SUM/AVG, same as the classic
aggregate functions. That way MIN/MAX/SUM/AVG would be an aggregate
function when applied to not-collection columns, and a scalar function when
applied to collection. We can't do that with COUNT because there would be
an ambiguity, so the proposal for that case is renaming COLLECTION_COUNT to
SIZE. Benedict, please correct me if I'm not correctly exposing the
proposal.

I however would prefer to keep aggregate functions working on collections,
and keep the names of the new collection functions as "COLLECTION_*".
Reasons are:

1 - Making aggregate functions not work on collections might be considered
as breaking backward compatibility and require a deprecation plan.
2 - Keeping aggregate functions working on collections might not look
super useful, but they make the set of aggregate functions consistent and
applicable to every column type.
3 - Using the "COLLECTION_" prefix on collection functions establishes a
clear distinction between row aggregations and collection aggregations,
while at the same time exposing the analogy between each pair of functions.
4 - Not using the "COLLECTION_" prefix forces us to search for workarounds
such as using the column type when possible, or trying to figure out
synonyms like in the case of COUNT/SIZE. Even if that works for this case,
future functions can find more trouble when trying to figure out
workarounds to avoid clashing with existing function names. For example, we
might want to add a SIZE function that gets the size in bytes of any
column, or we might want to add a MAX function that gets the maximum of a
set of columns, etc. An example of the synonym-based approach that comes
to mind is MySQL's MAX and GREATEST functions, where MAX is for row
aggregation and GREATEST is for column aggregation.
5 - If MIN/MAX function selection is based on the column type, we can't
implement Yifan's proposal of making COLLECTION_MAX(7) =
COLLECTION_MAX([7]) = 7, which would be very useful for combining
collection functions with time functions.

What do others think?

Re: [DISCUSS] API modifications and when to raise a thread on the dev ML

2022-12-05 Thread Andrés de la Peña
should be seeking the broadest visibility,
>>>> including casual observers and non-contributors.
>>>>
>>>> On 5 Dec 2022, at 13:05, Paulo Motta  wrote:
>>>>
>>>> 
>>>>
>>>> It feels bit of overkill to me to require addition of any new virtual
>>>> tables/JMX/configuration/knob to go through a discuss thread. If this would
>>>> require 70 threads for the previous release I think this would easily
>>>> become spammy and counter-productive.
>>>>
>>>> I think the burden should be on the maintainer to keep up with changes
>>>> being added to the database and chime in any areas it feel responsible for,
>>>> as it has been the case and has worked relatively well.
>>>>
>>>> I think it makes sense to look into improving visibility of API
>>>> changes, so people can more easily review a summary of API changes versus
>>>> reading through the whole changelog (perhaps we need a summarized API
>>>> change log?).
>>>>
>>>> It would also help to have more explicit guidelines on what kinds of
>>>> API changes are riskier and might require additional  visibility via a
>>>> DISCUSS thread.
>>>>
>>>> Also, would it make sense to introduce a new API review stage during
>>>> release validation, and agree to revert/update any API changes that may be
>>>> controversial that were not caught during normal review?
>>>>
>>>> On Mon, 5 Dec 2022 at 06:49 Andrés de la Peña 
>>>> wrote:
>>>>
>>>>> Indeed that contribution policy should be clearer and not be on a page
>>>>> titled code style, thanks for bringing that up.
>>>>>
>>>>> If we consider all those things APIs, and additions are also
>>>>> considered changes that require a DISCUSS thread, it turns out that almost
>>>>> any non-bugfix ticket would require a mailing list thread. In fact, if one
>>>>> goes through CHANGES.txt it's easy to see that most entries would have
>>>>> required a DISCUSS thread.
>>>>>
>>>>> I think that such a strict policy would only make us lose agility and
>>>>> increase the burden of almost any contribution. After all, it's not that
>>>>> changes without a DISCUSS thread happen in secret. Changes are publicly
>>>>> visible on their tickets, those tickets are notified on Slack so anyone 
>>>>> can
>>>>> jump into the ticket discussions and set themselves as reviewers, and
>>>>> reviewers can ask for DISCUSS threads whenever they think more opinions or
>>>>> broader consensus are needed.
>>>>>
>>>>> Also, a previous DISCUSS thread is not going to prevent any changes
>>>>> from being questioned later. We have seen changes that are
>>>>> proposed, discussed and approved as CEPs, reviewed for weeks or months, 
>>>>> and
>>>>> finally committed, and still they are questioned shortly after that cycle,
>>>>> and asked to be changed or discussed again. I don't think that an 
>>>>> avalanche
>>>>> of DISCUSS threads is going to improve that, since usually the problem is
>>>>> that people don't have the time for deeply looking into the changes when
>>>>> they are happening. I doubt that more notification channels are going to
>>>>> improve that.
>>>>>
>>>>> Of course I'm not saying that there should never be DISCUSS threads
>>>>> before starting a change. Probably we can all agree that major changes and
>>>>> things that break compatibility would need previous discussion.
>>>>>
>>>>> On Mon, 5 Dec 2022 at 10:16, Benjamin Lerer  wrote:
>>>>>
>>>>>> Thanks for opening this thread Josh,
>>>>>>
>>>>>> It seems perfectly normal to me that for important changes or
>>>>>> questions we raise some discussion to the mailing list.
>>>>>>
>>>>>> My understanding of the current proposal  implies that for the 4.1
>>>>>> release we should have had to raise over 70 discussion threads.
>>>>>> We have a minimum of 2 committers required for every patch. Should we
>>>>>> not trust them to update nodetool, the virtual tables or other things on
>>>>>> their own?
>>>>>>

Re: [VOTE] Release Apache Cassandra 4.1.0 GA

2022-12-05 Thread Andrés de la Peña
+1

On Mon, 5 Dec 2022 at 11:37, Benedict  wrote:

> -0
>
> CASSANDRA-18086 should probably be fixed and merged first, as Paxos v2
> will be unlikely to work well for users without it. Either that or we need
> to update NEWS.txt to mention it.
>
> On 5 Dec 2022, at 11:01, Aleksey Yeshchenko  wrote:
>
> +1
>
> On 5 Dec 2022, at 10:17, Benjamin Lerer  wrote:
>
> +1
>
> Le lun. 5 déc. 2022 à 11:02, Berenguer Blasi  a
> écrit :
>
>> +1
>> On 5/12/22 10:53, guo Maxwell wrote:
>>
>> +1
>>
>> Mick Semb Wever 于2022年12月5日 周一下午5:33写道:
>>
>>>
>>> Proposing the test build of Cassandra 4.1.0 GA for release.
>>>
>>> sha1: b807f97b37933fac251020dbd949ee8ef245b158
>>> Git:
>>> https://gitbox.apache.org/repos/asf?p=cassandra.git;a=shortlog;h=refs/tags/4.1.0-tentative
>>> Maven Artifacts:
>>> https://repository.apache.org/content/repositories/orgapachecassandra-1281/org/apache/cassandra/cassandra-all/4.1.0/
>>>
>>> The Source and Build Artifacts, and the Debian and RPM packages and
>>> repositories, are available here:
>>> https://dist.apache.org/repos/dist/dev/cassandra/4.1.0/
>>>
>>> The vote will be open for 72 hours (longer if needed). Everyone who has
>>> tested the build is invited to vote. Votes by PMC members are considered
>>> binding. A vote passes if there are at least three binding +1s and no -1's.
>>>
>>> [1]: CHANGES.txt:
>>> https://gitbox.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=CHANGES.txt;hb=refs/tags/4.1.0-tentative
>>> [2]: NEWS.txt:
>>> https://gitbox.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=NEWS.txt;hb=refs/tags/4.1.0-tentative
>>>
>> --
>> you are the apple of my eye !
>>
>>
>


Re: [DISCUSS] API modifications and when to raise a thread on the dev ML

2022-12-05 Thread Andrés de la Peña
Indeed that contribution policy should be clearer and not be on a page
titled code style, thanks for bringing that up.

If we consider all those things APIs, and additions are also considered
changes that require a DISCUSS thread, it turns out that almost any
non-bugfix ticket would require a mailing list thread. In fact, if one goes
through CHANGES.txt it's easy to see that most entries would have required
a DISCUSS thread.

I think that such a strict policy would only make us lose agility and
increase the burden of almost any contribution. After all, it's not that
changes without a DISCUSS thread happen in secret. Changes are publicly
visible on their tickets, those tickets are notified on Slack so anyone can
jump into the ticket discussions and set themselves as reviewers, and
reviewers can ask for DISCUSS threads whenever they think more opinions or
broader consensus are needed.

Also, a previous DISCUSS thread is not going to prevent any changes from
being questioned later. We have seen changes that are proposed,
discussed and approved as CEPs, reviewed for weeks or months, and finally
committed, and still they are questioned shortly after that cycle, and
asked to be changed or discussed again. I don't think that an avalanche of
DISCUSS threads is going to improve that, since usually the problem is that
people don't have the time for deeply looking into the changes when they
are happening. I doubt that more notification channels are going to improve
that.

Of course I'm not saying that there should never be DISCUSS threads before
starting a change. Probably we can all agree that major changes and things
that break compatibility would need previous discussion.

On Mon, 5 Dec 2022 at 10:16, Benjamin Lerer  wrote:

> Thanks for opening this thread Josh,
>
> It seems perfectly normal to me that for important changes or questions we
> raise some discussion to the mailing list.
>
> My understanding of the current proposal  implies that for the 4.1 release
> we should have had to raise over 70 discussion threads.
> We have a minimum of 2 committers required for every patch. Should we not
> trust them to update nodetool, the virtual tables or other things on their
> own?
>
> There are already multiple existing ways to track changes in specific code
> areas. I am personally tracking the areas in which I am the most involved
> this way and I know that a lot of people do the same.
>
> To be transparent, it is not clear to me what the underlying issue is. Do
> we have some specific cases that illustrate the underlying problem? Thrift
> and JMX are from a different time in my opinion.
>
> Le lun. 5 déc. 2022 à 08:09, Berenguer Blasi  a
> écrit :
>
>> +1 to moving that into it's own section outside the coding style page.
>>
>> Dinesh I also thought in terms of backward compatibility here. But notice
>> the discussion is about _any change_ to the API such as adding new CQL
>> functions. Would adding or changing an exception type or a user warning
>> qualify for a DISCUSS thread also? I wonder if we're talking ourselves into
>> opening a DISCUSS for almost every ticket and something easy to miss.
>>
>> I wonder, you guys know the code better, if 'public APIs' could be
>> matched to a reasonable set of files (cql parsing, yaml, etc) and have
>> jenkins send an email when changes are detected on them. Overkill? bad
>> idea? :thinking:...
>> On 4/12/22 1:14, Dinesh Joshi wrote:
>>
>> We should also very clearly list out what is considered a public API. The
>> current statement that we have is insufficient:
>>
>> public APIs, including CQL, virtual tables, JMX, yaml, system
>> properties, etc.
>>
>>
>> The guidance on treatment of public APIs should also move out of "Code
>> Style" page as it isn't strictly related to code style. Backward
>> compatibility of public APIs is a best practice & project policy.
>>
>>
>> On Dec 2, 2022, at 2:08 PM, Benedict  wrote:
>>
>> I think some of that text also got garbled by mixing up how you approach
>> internal APIs and external APIs. We should probably clarify that there are
>> different burdens for each. Which is all my fault as the formulator. I
>> remember it being much clearer in my head.
>>
>> My view is the same as yours Josh. Evolving the database’s public APIs is
>> something that needs community consensus. The more visibility these
>> decisions get, the better the final outcome (usually). Even small API
>> changes need to be carefully considered to ensure the API evolves
>> coherently, and this is particularly true for something as complex and
>> central as CQL.
>>
>> A DISCUSS thread is a good forcing function to think about what you’re
>> trying to achieve and why, and to provide others a chance to spot potential
>> flaws, alternatives and interactions with work you may not be aware of.
>>
>> It would be nice if there were an easy rubric for whether something needs
>> feedback, but I don’t think there is. One person’s obvious change may be
>> another’s obvious problem. So I 

Re: [VOTE] Release Apache Cassandra 4.1-rc1

2022-11-22 Thread Andrés de la Peña
+1

On Mon, 21 Nov 2022 at 19:55, Josh McKenzie  wrote:

> +1
>
> On Mon, Nov 21, 2022, at 12:38 PM, Mick Semb Wever wrote:
>
>
>
> On Fri, 18 Nov 2022 at 13:10, Mick Semb Wever  wrote:
>
> Proposing the test build of Cassandra 4.1-rc1 for release.
>
> sha1: d6822c45ae3d476bc2ff674cedf7d4107b8ca2d0
> Git:
> https://gitbox.apache.org/repos/asf?p=cassandra.git;a=shortlog;h=refs/tags/4.1-rc1-tentative
> Maven Artifacts:
> https://repository.apache.org/content/repositories/orgapachecassandra-1280/org/apache/cassandra/cassandra-all/4.1-rc1/
>
> The Source and Build Artifacts, and the Debian and RPM packages and
> repositories, are available here:
> https://dist.apache.org/repos/dist/dev/cassandra/4.1-rc1/
>
> The vote will be open for 72 hours (longer if needed). Everyone who has
> tested the build is invited to vote. Votes by PMC members are considered
> binding. A vote passes if there are at least three binding +1s and no -1's.
>
>
>
> I plan to hold the vote open for an extra 24 hours, because it went over
> the weekend.
>
> And, if the vote passes, I will cut and stage the GA immediately after,
> but wait one week before starting a vote on it. (If anything arises that
> warrants fixing we toss it and go back to an rc2.)
>
>


Re: Some tests are never executed in CI due to their name

2022-11-15 Thread Andrés de la Peña
+1 to waiver

On Tue, 15 Nov 2022 at 05:54, Berenguer Blasi 
wrote:

> +1 to waiver
> On 15/11/22 2:07, Josh McKenzie wrote:
>
> +1 to waiver.
>
> We still don't have some kind of @flaky annotation that sequesters tests
> do we? :)
>
> On Mon, Nov 14, 2022, at 5:58 PM, Ekaterina Dimitrova wrote:
>
> +1
>
> On Mon, 14 Nov 2022 at 17:55, Brandon Williams  wrote:
>
> +1 to waiving these.
>
> On Mon, Nov 14, 2022, 4:49 PM Miklosovic, Stefan <
> stefan.mikloso...@netapp.com> wrote:
>
> Tickets for the flaky tests are here
>
> https://issues.apache.org/jira/browse/CASSANDRA-18047
> https://issues.apache.org/jira/browse/CASSANDRA-18048
>
> 
> From: Mick Semb Wever 
> Sent: Monday, November 14, 2022 23:28
> To: dev@cassandra.apache.org
> Subject: Re: Some tests are never executed in CI due to their name
>
> NetApp Security WARNING: This is an external email. Do not click links or
> open attachments unless you recognize the sender and know the content is
> safe.
>
>
>
>
> in CASSANDRA-18029, two flaky tests were committed by mistake due to my
> misunderstanding. We agreed on this thread that we should not commit flaky
> tests right before rc. So now the rc is technically blocked by them. To
> unblock it, what is needed is to have a waiver on them. If there is not a
> waiver, I need to go back to that test and remove the two test methods
> which are flaky. (In practice they will be probably just @Ignore-ed with
> comment about flakiness so we can fix them later).
>
> Flaky tests are
>
>
> org.apache.cassandra.distributed.test.PaxosRepair2Test.paxosRepairHistoryIsntUpdatedInForcedRepair
>
> org.apache.cassandra.distributed.test.PaxosRepair2Test.legacyPurgeRepairLoop
>
>
> +1 to a waiver on these two 4.1 flaky regressions to the RC and GA
> releases.
>
> Thanks for bringing it back to dev@ Stefan. Waivers should be done on dev@
> (build/release managers can't be keeping up with every ticket), and dev
> threads and tickets should be kept (reasonably) in-sync, for the sake of
> inclusiveness.
>
> I believe there will be follow up tickets to address these flakies in
> 4.1.x ?
>
>
>


Re: Naming conventions for CQL native functions

2022-11-11 Thread Andrés de la Peña
Thanks for the feedback. Adopting snake case and using aliases seems pretty
uncontroversial so far. I have created CASSANDRA-18037
<https://issues.apache.org/jira/browse/CASSANDRA-18037> for doing it.

On Thu, 10 Nov 2022 at 19:06, Jeremy Hanna 
wrote:

> +1 (nb) mixed case is a miserable experience and snake case makes it
> readable.
>
> > On Nov 10, 2022, at 10:57 AM, Francisco Guerrero 
> wrote:
> >
> > +1 (nb) as well
> >
> >> On 2022/11/10 17:16:21 Caleb Rackliffe wrote:
> >> +100 on snake case for built-in functions  given I think MySQL and
> Postgres
> >> use that convention as well.
> >>
> >> ex. https://www.postgresql.org/docs/9.2/functions-string.html
> >>
> >>> On Thu, Nov 10, 2022 at 7:51 AM Brandon Williams 
> wrote:
> >>>
> >>> I too meant snake case and need coffee.
> >>>
> >>>> On Thu, Nov 10, 2022, 7:26 AM Brandon Williams 
> wrote:
> >>>
> >>>> +1 on camel case and aliases for compatibility.
> >>>>
> >>>> On Thu, Nov 10, 2022, 6:21 AM Andrés de la Peña  >
> >>>> wrote:
> >>>>
> >>>>> It seems we don't have a clear convention on how to name CQL native
> >>>>> functions.
> >>>>>
> >>>>> Most native functions are named all lower case, without underscore
> nor
> >>>>> hyphen to separate words. That's the case, for example, of
> "intasblob" or
> >>>>> "blobasint".
> >>>>>
> >>>>> We also have some functions using camel case, as in "castAsInt" or
> >>>>> "castAsTimestamp". Note that the came cased names require quoting
> due to
> >>>>> CQL's case insensitivity.
> >>>>>
> >>>>> Differently to CQL native functions, system keyspaces, tables and
> >>>>> columns consistently use snake case. For example, we have
> "system_schema",
> >>>>> "dropped_columns", "default_time_to_live".
> >>>>>
> >>>>> I think it would be good to adopt a convention on how to name CQL
> native
> >>>>> functions, at least the new ones. IMO camel case would make sense
> because
> >>>>> it plays well with CQL's case insensitivity, it makes long names
> easier to
> >>>>> read and it's consistent with the names used for most other things.
> >>>>>
> >>>>> For example, in CASSANDRA-17811 I'm working on a set of functions to
> do
> >>>>> within-collection operations, which would be named "map_keys",
> >>>>> "map_values", "collection_min", "collection_max", "collection_sum",
> >>>>> "collection_count", etc. Also, CEP-20 will add a set of functions
> that
> >>>>> would be named "mask_null", "mask_default", "mask_replace",
> "mask_inner",
> >>>>> "mask_outer", "mask_hash", etc.
> >>>>>
> >>>>> As for the already existing functions, we could either let them be or
> >>>>> add snake case aliases for them, so for example we'd have both
> "castAsInt"
> >>>>> and "cast_as_int", at least for a time.
> >>>>>
> >>>>> What do you think?
> >>>>>
> >>>>
> >>
>


Re: Naming conventions for CQL native functions

2022-11-10 Thread Andrés de la Peña
>
> IMO camel case would make sense because it plays well with CQL's case
> insensitivity, it makes long names easier to read and it's consistent with
> the names used for most other things.


I meant that we should use snake case, as in "collection_max" and the other
examples I gave, but I wrongly wrote camel case instead. I'm sorry for the
confusion. I understand that Ekaterina and Brandon meant the same: adopting
snake_case as the convention and using camelCase aliases for compatibility.

On Thu, 10 Nov 2022 at 13:26, Brandon Williams  wrote:

> +1 on camel case and aliases for compatibility.
>
> On Thu, Nov 10, 2022, 6:21 AM Andrés de la Peña 
> wrote:
>
>> It seems we don't have a clear convention on how to name CQL native
>> functions.
>>
>> Most native functions are named all lower case, without underscore nor
>> hyphen to separate words. That's the case, for example, of "intasblob" or
>> "blobasint".
>>
>> We also have some functions using camel case, as in "castAsInt" or
>> "castAsTimestamp". Note that the came cased names require quoting due to
>> CQL's case insensitivity.
>>
>> Differently to CQL native functions, system keyspaces, tables and columns
>> consistently use snake case. For example, we have "system_schema",
>> "dropped_columns", "default_time_to_live".
>>
>> I think it would be good to adopt a convention on how to name CQL native
>> functions, at least the new ones. IMO camel case would make sense because
>> it plays well with CQL's case insensitivity, it makes long names easier to
>> read and it's consistent with the names used for most other things.
>>
>> For example, in CASSANDRA-17811 I'm working on a set of functions to do
>> within-collection operations, which would be named "map_keys",
>> "map_values", "collection_min", "collection_max", "collection_sum",
>> "collection_count", etc. Also, CEP-20 will add a set of functions that
>> would be named "mask_null", "mask_default", "mask_replace", "mask_inner",
>> "mask_outer", "mask_hash", etc.
>>
>> As for the already existing functions, we could either let them be or add
>> snake case aliases for them, so for example we'd have both "castAsInt" and
>> "cast_as_int", at least for a time.
>>
>> What do you think?
>>
>


Naming conventions for CQL native functions

2022-11-10 Thread Andrés de la Peña
It seems we don't have a clear convention on how to name CQL native
functions.

Most native functions are named all lower case, without underscore nor
hyphen to separate words. That's the case, for example, of "intasblob" or
"blobasint".

We also have some functions using camel case, as in "castAsInt" or
"castAsTimestamp". Note that the came cased names require quoting due to
CQL's case insensitivity.
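
For example (illustrative; "cast_as_int" is a hypothetical snake case name):

    SELECT "castAsInt"(v) FROM t;   -- quoting needed to preserve the camel case
    SELECT cast_as_int(v) FROM t;   -- a snake case name needs no quoting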

Differently to CQL native functions, system keyspaces, tables and columns
consistently use snake case. For example, we have "system_schema",
"dropped_columns", "default_time_to_live".

I think it would be good to adopt a convention on how to name CQL native
functions, at least the new ones. IMO camel case would make sense because
it plays well with CQL's case insensitivity, it makes long names easier to
read and it's consistent with the names used for most other things.

For example, in CASSANDRA-17811 I'm working on a set of functions to do
within-collection operations, which would be named "map_keys",
"map_values", "collection_min", "collection_max", "collection_sum",
"collection_count", etc. Also, CEP-20 will add a set of functions that
would be named "mask_null", "mask_default", "mask_replace", "mask_inner",
"mask_outer", "mask_hash", etc.

As for the already existing functions, we could either let them be or add
snake case aliases for them, so for example we'd have both "castAsInt" and
"cast_as_int", at least for a time.

What do you think?


Re: Some tests are never executed in CI due to their name

2022-10-25 Thread Andrés de la Peña
Note that the test multiplexer also searches for tests ending with "Test",
so it will also miss new or modified tests with nonstandard names. The
automatically detected tests are listed when running generate.sh, and they
are added to the repeated run jobs. That gives us another opportunity to
detect missed tests when we introduce something like "MyTestSplit1.java"
and see that it's not been included in the mandatory repeated runs.

On Tue, 25 Oct 2022 at 10:31, Miklosovic, Stefan <
stefan.mikloso...@netapp.com> wrote:

> Hi Berenguer,
>
> I am glad you asked. I was expecting this question.
>
> I think there is a fundamental difference in how we approach this problem.
> I do not say mine is better, I just find it important to describe it
> clearly.
>
> Let's say we are on a ship which leaks. When I detect that, my course of
> action is to fix the leakage with the biggest patches I can find around
> with minimal effort necessary so we are not taking water anymore. There
> might still be occasional leakages which are very rare but I have a crew
> who is checking the ship constantly anyway (review) and the risk that big
> leakages like we just fixed happen again are relatively very low.
>
> You spot a leakage and instead of fixing it with big patches and calling
> it a day, you are trying to remodel the cabins so they are completely
> waterproof but while doing so, you renamed them so everyone needs to get
> used to how these cabins are called and where they are located, because
> there is this minimal chance that some cadet comes around and starts to
> live in a cabin he is not supposed to and we need to explain it to him.
> Thousands of cadets found their cabins just fine. Occasionally, there are
> few people who just miss the right board completely (Test at start, Tests
> at the end) and they are automatically navigated around (checkstyle) but
> once in five years there is this guy who just completely missed it.
>
> I believe we can just navigate him when that happens. You want to cover
> that guy too.
>
> I just find my approach easier.
>
> You can remodel it all, for sure but I am afraid I will not be a part of
> that. I just do not find it necessary to do that.
>
> 
> From: Berenguer Blasi 
> Sent: Tuesday, October 25, 2022 11:08
> To: dev@cassandra.apache.org
> Subject: Re: Some tests are never executed in CI due to their name
>
> NetApp Security WARNING: This is an external email. Do not click links or
> open attachments unless you recognize the sender and know the content is
> safe.
>
>
>
>
> IIUC we're relying on catching the word 'Split' in the file name for
> option 1. If somebody named his test i.e. 'MyTestGroup1', 'MyTestGroup2',
> 'TTLTestPre2038', 'TTLTestPost2038',... I think we would leak tests again?
> or any other word that is not specifically accounted for. Unless I am
> missing sthg ofc! :-)
>
> On 25/10/22 10:39, Miklosovic, Stefan wrote:
> > I think that what you wrote is not entirely correct. It will prevent it
> from happening again when there are tests ending on "Tests" or starting on
> "Test". The only case it will not cover is "SplitN" issue we plan to cover
> with relaxed test.name property.
> >
> > It seems like what you wrote means that we will fix it and tests will
> leak again. That is not true.
> >
> > 
> > From: Berenguer Blasi 
> > Sent: Tuesday, October 25, 2022 7:26
> > To: dev@cassandra.apache.org
> > Subject: Re: Some tests are never executed in CI due to their name
> >
> > NetApp Security WARNING: This is an external email. Do not click links
> or open attachments unless you recognize the sender and know the content is
> safe.
> >
> >
> >
> >
> > The problem with using the first approach is that it fixes the current
> > situation but it doesn't prevent it from happening again. The second
> > proposal prevents it from happening again but at the cost of a bigger
> > rename I'd volunteer to if needed.
> >
> > Regards
> >
> > On 24/10/22 20:38, Miklosovic, Stefan wrote:
> >> Yeah, that is what the branch in my original email actually already
> solved. I mean ...
> >>
> >> CassandraAuthorizerTruncatingTests
> >> BatchTests
> >> UUIDTests
> >>
> >> these are ending on Tests which is illegal and they are fixed there.
> >>
> >> Other cases are either "TestBase" or "Tester" which should be legal.
> >>
> >> I think using the first approach and fixing all "SplitN" PLUS adding
> Split* to test.name regexp should do it.
> >>
> >> I think we can "automate" at most like you suggest and scan it
> manually, fix the stuff and from then incorporate checkstyle to take care
> of that.
> >>
> >> There are also some classes which do include @Test methods but I think
> they are abstract as they are meant to be extended as the real test is just
> wrapping that. This might happen when there are slight variations across
> test classes. This is fine as well.
> >>
> >> 
> >> 

Re: [DISCUSS] Potential circleci config and workflow changes

2022-10-24 Thread Andrés de la Peña
>
> - Ticket for: remove -h, have -f and -p (free and paid)


+1 to this, probably there isn't anyone using -h. There are some jobs that
can't pass with the free option. Maybe we should remove them from the
workflow when the free option is used. Perhaps that could save new
contributors some confusion. Or should we leave them because a subset of
the tests inside those jobs can still pass even with the free tier?

By the way, the generate.sh script already accepts a -f flag. It's used to
stop checking that the specified environment variables are known. It was
meant to be a kind of general "--force" flag.

On Mon, 24 Oct 2022 at 20:07, Ekaterina Dimitrova 
wrote:

> Seems like my email crashed with Andres’ one.
> My understanding is we will use the ticket CASSANDRA-17113 as
> placeholder, the work there will be rebased/reworked etc depending on what
> we agree with.
> I also agree with the other points he made. Sounds reasonable to me
>
> On Mon, 24 Oct 2022 at 15:03, Ekaterina Dimitrova 
> wrote:
>
>> Thank you Josh
>>
>> So about push with/without a single click, I guess you mean to
>> parameterize whether the step build needs approval or not? Pre-commit the
>> new flag will use the “no-approval” version, but during development we
>> still will be able to push the tests without immediately starting all
>> tests, right?
>> - parallelism + -h being removed - just to confirm, that means we will
>> not use xlarge containers. As David confirmed, this is not needed for all
>> jibs and it is important as otherwise whoever uses paid account will burn
>> their credits time faster for very similar duration runs.
>>
>> CASSANDRA-17930 - I will use the opportunity also to mention that many of
>> the identified missing jobs in CircleCI will be soon there - Andres is
>> working on all variations unit tests, I am doing final testing on fixing
>> the Python upgrade tests (we weren’t using the right parameters and running
>> way more jobs then we should) and Derek is looking into the rest of the
>> Python test. I still need to check whether we need something regarding
>> in-jvm etc, the simulator ones are running only for jdk8 for now,
>> confirmed. All this should unblock us to be able to do next releases based
>> on CircleCI as we agreed. Then we move to do some
>> changes/additions/improvements to Jenkins. And of course, the future
>> improvements we agreed on.
>>
>> On Mon, 24 Oct 2022 at 14:10, Josh McKenzie  wrote:
>>
>>> Auto-run on push? Can you elaborate?
>>>
>>> Yep - instead of having to go to circle and click, when you push your
>>> branch the circle hook picks it up and kicks off the top level job
>>> automatically. I tend to be paranoid and push a lot of incremental work
>>> that's not ready for CI remotely so it's not great for me, but I think
>>> having it be optional is the Right Thing.
>>>
>>> So here's the outstanding work I've distilled from this thread:
>>> - Create an epic for circleci improvement work (we have a lot of little
>>> augments to do here; keep it organized and try and avoid redundancy)
>>> - Include CASSANDRA-17600 in epic umbrella
>>> - Include CASSANDRA-17930 in epic umbrella
>>> - Ticket to tune parallelism per job
>>> -
>>> > def java_parallelism(src_dir, kind, num_file_in_worker, include =
>>> lambda a, b: True):
>>> > d = os.path.join(src_dir, 'test', kind)
>>> > num_files = 0
>>> > for root, dirs, files in os.walk(d):
>>> > for f in files:
>>> > if f.endswith('Test.java') and
>>> include(os.path.join(root, f), f):
>>> > num_files += 1
>>> > return math.floor(num_files / num_file_in_worker)
>>> >
>>> > def fix_parallelism(args, contents):
>>> > jobs = contents['jobs']
>>> >
>>> > unit_parallelism= java_parallelism(args.src,
>>> 'unit', 20)
>>> > jvm_dtest_parallelism   = java_parallelism(args.src,
>>> 'distributed', 4, lambda full, name: 'upgrade' not in full)
>>> > jvm_dtest_upgrade_parallelism   = java_parallelism(args.src,
>>> 'distributed', 2, lambda full, name: 'upgrade' in full)
>>> - `TL;DR - I find all test files we are going to run, and based off
>>> a pre-defined variable that says “ideal” number of files per worker, I then
>>> calculate how many workers we need.  So unit tests are num_files / 20 ~= 35
>>> workers.  Can I be “smarter” by knowing which files have higher cost?
>>> Sure… but the “perfect” and the “average” are too similar that it wasn’t
>>> worth it...`
>>> - Ticket to combine pre-commit jobs into 1 pipeline for all JDK's
>>> - Path to activate all supported JDK's for pre-commit at root
>>> (one-click pre-merge full validation)
>>> - Path to activate per JDK below that (interim work partial
>>> validation)
>>> - Ticket to rename jobs in circleci
>>> - Reference comment:
>>> 

Re: [DISCUSS] Potential circleci config and workflow changes

2022-10-24 Thread Andrés de la Peña
>
> Yep - instead of having to go to circle and click, when you push your
> branch the circle hook picks it up and kicks off the top level job
> automatically. I tend to be paranoid and push a lot of incremental work
> that's not ready for CI remotely so it's not great for me, but I think
> having it be optional is the Right Thing.

- Ticket for flag in generate.sh to support auto run on push (see response
> above)


CASSANDRA-17113 was created almost a year ago for this. While we can have
flags to specify whether the runs start automatically or not, we'd still
need to have a default. I think the default should be not starting anything
without either manual approval or the usage of those flags when generating
the config, as we decided during CASSANDRA-16882 and the discussions around
it.

- Ticket to combine pre-commit jobs into 1 pipeline for all JDK's
> - Ticket to rename jobs in circleci


I'd say these two things should be in a single ticket, since the problems
with naming appear when we try to unify the two workflows.


On Mon, 24 Oct 2022 at 19:10, Josh McKenzie  wrote:

> Auto-run on push? Can you elaborate?
>
> Yep - instead of having to go to circle and click, when you push your
> branch the circle hook picks it up and kicks off the top level job
> automatically. I tend to be paranoid and push a lot of incremental work
> that's not ready for CI remotely so it's not great for me, but I think
> having it be optional is the Right Thing.
>
> So here's the outstanding work I've distilled from this thread:
> - Create an epic for circleci improvement work (we have a lot of little
> augments to do here; keep it organized and try and avoid redundancy)
> - Include CASSANDRA-17600 in epic umbrella
> - Include CASSANDRA-17930 in epic umbrella
> - Ticket to tune parallelism per job
> -
> > def java_parallelism(src_dir, kind, num_file_in_worker, include =
> lambda a, b: True):
> > d = os.path.join(src_dir, 'test', kind)
> > num_files = 0
> > for root, dirs, files in os.walk(d):
> > for f in files:
> > if f.endswith('Test.java') and
> include(os.path.join(root, f), f):
> > num_files += 1
> > return math.floor(num_files / num_file_in_worker)
> >
> > def fix_parallelism(args, contents):
> > jobs = contents['jobs']
> >
> > unit_parallelism= java_parallelism(args.src,
> 'unit', 20)
> > jvm_dtest_parallelism   = java_parallelism(args.src,
> 'distributed', 4, lambda full, name: 'upgrade' not in full)
> > jvm_dtest_upgrade_parallelism   = java_parallelism(args.src,
> 'distributed', 2, lambda full, name: 'upgrade' in full)
> - `TL;DR - I find all test files we are going to run, and based off a
> pre-defined variable that says “ideal” number of files per worker, I then
> calculate how many workers we need.  So unit tests are num_files / 20 ~= 35
> workers.  Can I be “smarter” by knowing which files have higher cost?
> Sure… but the “perfect” and the “average” are too similar that it wasn’t
> worth it...`
> - Ticket to combine pre-commit jobs into 1 pipeline for all JDK's
> - Path to activate all supported JDK's for pre-commit at root
> (one-click pre-merge full validation)
> - Path to activate per JDK below that (interim work partial validation)
> - Ticket to rename jobs in circleci
> - Reference comment:
> https://issues.apache.org/jira/browse/CASSANDRA-17939?focusedCommentId=17617016=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17617016
> - (buildjdk)_(runjdk)_(testsuite) format:
> - j8_j8_jvm_dtests
> - j8_j11_jvm_dtests
> - j11_j11_jvm_dtest_vnode
> etc
> - Ticket for flag in generate.sh to support auto run on push (see response
> above)
> - Ticket for: remove -h, have -f and -p (free and paid) (probably
> intersects with https://issues.apache.org/jira/browse/CASSANDRA-17600)
>
> Anything wrong w/the above or anything missed? If not, I'll go do some
> JIRA'ing.
>
> ~Josh
>
>
> On Fri, Oct 21, 2022, at 3:50 PM, Josh McKenzie wrote:
>
> I am cool with removing circle if apache CI is stable and works, we do
> need to solve the non-committer issue but would argue that partially exists
> in circle today (you can be a non-commuter with a paid account, but you
> can’t be a non-committer with a free account)
>
> There's a few threads here:
> 1. non-committers should be able to run ci
> 2. People that have resources and want to run ci faster should be able to
> do so (assuming the ci of record could serve to be faster)
> 3. ci should be stable
>
> Thus far we haven't landed on 1 system that satisfies all 3. There's some
> background discussions brainstorming how to get there; when / if things
> come from that they'll as always be brought to the list for discussion.
>
> On Fri, Oct 21, 2022, at 1:44 PM, Ekaterina Dimitrova wrote:
>
> I agree with David with one caveat - last time I checked only some Python
> 

Re: New CircleCI test multiplexer

2022-10-18 Thread Andrés de la Peña
The -h profile works but it spends a lot of resources for slightly faster
results. The -m profile is better value in terms of speed per resources. I
guess -h can be used if one wants to get results as soon as possible, no
matter the cost. Ekaterina might be better informed than me, given her work
on CASSANDRA-15712.

In any case, the new multiplexer doesn't change the resources configuration
at all. We might want to reevaluate that config in the future, and probably
follow David's suggestion of deciding parallelism as a function of the
number of tests to be run.

On Tue, 18 Oct 2022 at 18:13, Josh McKenzie  wrote:

> * Running the .circleci/generate.sh script with -l/-m/-h flags will use
> git diff to automatically detect the new or modified tests and will add
> them to the lists of tests to be repeated. The pre-commit workflow will
> automatically start repeated runs for these tests. The only exception to
> this are Python dtests, that should be specified manually.
>
> Of note: the -h profile should not be used (correct me if I'm wrong here
> Andres). Use -l for the free tier on circle or -m for paid.
>
> Will have some follow up tickets regarding job naming, default config
> type, and updating documentation shortly.
>
> On Tue, Oct 18, 2022, at 12:33 PM, Andrés de la Peña wrote:
>
> Just to let you know that CASSANDRA-17939 has just been committed.
>
> It changes the way the CircleCI multiplexer works, in line with the recent
> changes in our release criteria:
>
> * The default number of repeated tests iterations is 500, except for long
> and upgrade tests.
> * It is possible to specify multiple test classes and methods to be
> repeated into the same config push. So patches altering dozens of tests
> won't require dozens of config pushes anymore.
> * Running the .circleci/generate.sh script with -l/-m/-h flags will use
> git diff to automatically detect the new or modified tests and will add
> them to the lists of tests to be repeated. The pre-commit workflow will
> automatically start repeated runs for these tests. The only exception to
> this are Python dtests, that should be specified manually.
> * The CircleCI jobs are rearranged so for every regular job there is a
> companion job to run the repeated tests associated to that job. Those
> companion jobs will only be visible if there are repeated tests to run.
> Here
> <https://app.circleci.com/pipelines/github/adelapena/cassandra/2278/workflows/e339f5d4-0e16-4d4a-bde8-f0d9b9f3912d>
> is an example run with repeated tests for all the test suites, and here
> <https://app.circleci.com/pipelines/github/adelapena/cassandra/2269/workflows/d8907cbc-dbca-4d21-bdb9-1a4c58e1a412>
> is the same workflow without any repeated tests.
>
> Some documentation on how to use it can be found here:
> https://github.com/apache/cassandra/blob/trunk/.circleci/readme.md#running-tests-in-a-loop
>
>
>


New CircleCI test multiplexer

2022-10-18 Thread Andrés de la Peña
Just to let you know that CASSANDRA-17939 has just been committed.

It changes the way the CircleCI multiplexer works, in line with the recent
changes in our release criteria:

* The default number of repeated tests iterations is 500, except for long
and upgrade tests.
* It is possible to specify multiple test classes and methods to be
repeated into the same config push. So patches altering dozens of tests
won't require dozens of config pushes anymore.
* Running the .circleci/generate.sh script with -l/-m/-h flags will use git
diff to automatically detect the new or modified tests and will add them to
the lists of tests to be repeated. The pre-commit workflow will
automatically start repeated runs for these tests. The only exception to
this are Python dtests, that should be specified manually.
* The CircleCI jobs are rearranged so for every regular job there is a
companion job to run the repeated tests associated to that job. Those
companion jobs will only be visible if there are repeated tests to run. Here
<https://app.circleci.com/pipelines/github/adelapena/cassandra/2278/workflows/e339f5d4-0e16-4d4a-bde8-f0d9b9f3912d>
is an example run with repeated tests for all the test suites, and here
<https://app.circleci.com/pipelines/github/adelapena/cassandra/2269/workflows/d8907cbc-dbca-4d21-bdb9-1a4c58e1a412>
is the same workflow without any repeated tests.

Some documentation on how to use it can be found here:
https://github.com/apache/cassandra/blob/trunk/.circleci/readme.md#running-tests-in-a-loop


Re: [VOTE] Revising release gating criteria and CI systems

2022-10-11 Thread Andrés de la Peña
+1

On Tue, 11 Oct 2022 at 11:57, Brandon Williams  wrote:

> +1
>
> On Sat, Oct 8, 2022 at 7:30 AM Josh McKenzie  wrote:
> >
> > DISCUSS thread:
> https://lists.apache.org/thread/o166v7nr9lxnzdy5511tv40rr9t6zbrw
> >
> > Revise Release Lifecycle cwiki page (
> https://cwiki.apache.org/confluence/display/CASSANDRA/Release+Lifecycle):
> >  - Ensure we have parity on jobs run and coverage between circle and
> asf-ci
> >  - Allow usage of circleci as gatekeeper for releases. A release will
> require 1 green run for beta, 3 green runs consecutively for ga
> >  - No new consistent regressions on CI for asf compared to prior branches
> >  - Explicitly do not consider ci-cassandra asf flaky tests as release
> blockers
> >
> > Changes to codify into documentation (
> https://cassandra.apache.org/_/development/how_to_commit.html):
> >  - On patch before commit, multiplex @500 all new tests, changed tests,
> or expected to be impacted tests ("expected to be impacted" piece pending
> multi-class multiplexing support).
> >  - Add support for / documentation for multi-class specification in
> multiplexer and document
> >
> > Add informal project commitment during next major release lifecycle to
> continue working on bringing asf ci-cassandra up to where it can be formal
> gatekeeper for release.
> >
> > ---
> > The vote for these revisions will run through EoD 10/12/22 to give us
> the weekend + 72 business hours.
>


Re: [DISCUSS] Revising our release criteria, commit guidelines, and the role of circleci vs. ASF CI

2022-10-05 Thread Andrés de la Peña
The proposal looks good to me. I have created tickets for:
 - Increasing the default number of repeated test iterations to 500 (
CASSANDRA-17937 <https://issues.apache.org/jira/browse/CASSANDRA-17937>,
ready to commit)
 - Automatically detecting and repeating new or modified JUnit tests (
CASSANDRA-17939 <https://issues.apache.org/jira/browse/CASSANDRA-17939>,
patch available)
 - Allowing to specify multiple tests in the test multiplexer (
CASSANDRA-17938 <https://issues.apache.org/jira/browse/CASSANDRA-17938>, in
progress)

On Mon, 3 Oct 2022 at 15:23, Josh McKenzie  wrote:

> Any further revisions or objections to this or are we good to take it to a
> vote?
>
> On Wed, Sep 28, 2022, at 10:54 AM, Josh McKenzie wrote:
>
> So revised proposal:
>
> On Release Lifecycle cwiki page:
>  - Ensure we have parity on jobs run between circle and asf-ci
>  - Allow usage of circleci as gatekeeper for releases. 1 green run ->
> beta, 3 green runs consecutive -> ga
>  - No new consistent regressions on CI for asf compared to prior branches
>  - Explicitly do not consider ci-cassandra asf flaky tests as release
> blockers
>
> Changes to codify into documentation:
>  - On patch before commit, multiplex @500 all new tests, changed tests, or
> expected to be impacted tests ("expected to be impacted" piece pending
> multi-class multiplexing support):
>  - Add support for multi-class specification in multiplexer and document
>
> Add informal project commitment during next major release lifecycle to
> continue working on bringing asf ci-cassandra up to where it can be formal
> gatekeeper for release.
>
> On Wed, Sep 28, 2022, at 10:13 AM, Ekaterina Dimitrova wrote:
>
> If we talk blockers nothing more than ensuring we see all tests we want
> pre-release, IMHO.
> The other points sound to me like future important improvements that will
> help us significantly in the flaky test fight.
>
> On Wed, 28 Sep 2022 at 10:08, Josh McKenzie  wrote:
>
>
> I'm receptive to that but I wouldn't gate our ability to get 4.1 out the
> door based on circle on that. Honestly probably only need to have the
> parity of coverage be the blocker for its use in retrospect.
>
> On Wed, Sep 28, 2022, at 1:32 AM, Berenguer Blasi wrote:
>
> I would add an option for generate.sh to detect all changed *Test.java
> files, that would be handy imo.
> On 28/9/22 4:29, Josh McKenzie wrote:
>
> So:
>
>1. 500 iterations on multiplexer
>2. Augmenting generate.sh to allow providing multiple class names and
>generating a single config that'll multiplex all the tests provided
>3. Test parity / pre-release config added on circleci (see
>https://issues.apache.org/jira/browse/CASSANDRA-17930),
>specifically dtest-large, dtest-offheap, test-large-novnode
>
> If we get the above 3, are we at a place where we're good to consider
> vetting releases on circleci for beta / rc / ga?
>
> On Tue, Sep 27, 2022, at 11:28 AM, Ekaterina Dimitrova wrote:
>
> “I have plans on modifying the multiplexer to allow specifying a list of
> classes per test target, so we don't have to needlessly suffer with this”
>
>
> That would be great, I was thinking of that the other day too. With that
> said I’ll be happy to support you in that effort too :-)
>
>
> On Tue, 27 Sep 2022 at 11:18, Josh McKenzie  wrote:
>
>
> I have plans on modifying the multiplexer to allow specifying a list of
> classes per test target, so we don't have to needlessly suffer with this
>
> This sounds integral to us multiplexing tests on large diffs whether we go
> with circle for releases or not and would be a great addition!
>
> On Tue, Sep 27, 2022, at 6:19 AM, Andrés de la Peña wrote:
>
> 250 iterations isn't enough; I use 500 as a low water mark.
>
>
> I agree that 500 iterations would be a reasonable minimum. We have seen
> flaky unit tests requiring far more iterations, but that's not very common.
> We could use 500 iterations as the default, and at our discretion use a
> higher limit in tests that are quick and might be prone to concurrency issues.
> I can change the defaults in the CircleCI config file if we agree to a new limit;
> the current default of 100 iterations is quite arbitrary.
>
> The test multiplexer allows running either individual test methods or
> entire classes. It is quite frequent to see test methods that pass
> individually but fail when they are run together with the other tests in
> the same class. Because of this, I think that we should always run entire
> classes when repeating new or modified tests. The only exception to this
> would be Python dtests, which usually are more resource intensive and not
> so prone to that type of issues.
>
> For CI on a 

Re: [VOTE] Release Apache Cassandra 4.1-beta1

2022-09-30 Thread Andrés de la Peña
+1

On Fri, 30 Sept 2022 at 09:26, Benjamin Lerer  wrote:

> +1
>
> Le ven. 30 sept. 2022 à 08:11, Miklosovic, Stefan <
> stefan.mikloso...@netapp.com> a écrit :
>
>> +1
>>
>> 
>> From: Mick Semb Wever 
>> Sent: Tuesday, September 27, 2022 15:13
>> To: dev
>> Subject: [VOTE] Release Apache Cassandra 4.1-beta1
>>
>> NetApp Security WARNING: This is an external email. Do not click links or
>> open attachments unless you recognize the sender and know the content is
>> safe.
>>
>>
>>
>>
>> Proposing the test build of Cassandra 4.1-beta1 for release.
>>
>> sha1: 5d9d93ea08d9c76402aa1d14bad54bf9ec875686
>> Git:
>> https://gitbox.apache.org/repos/asf?p=cassandra.git;a=shortlog;h=refs/tags/4.1-beta1-tentative
>> Maven Artifacts:
>> https://repository.apache.org/content/repositories/orgapachecassandra-1276/org/apache/cassandra/cassandra-all/4.1-beta1/
>>
>> The Source and Build Artifacts, and the Debian and RPM packages and
>> repositories, are available here:
>> https://dist.apache.org/repos/dist/dev/cassandra/4.1-beta1/
>>
>> The vote will be open for 72 hours (longer if needed). Everyone who has
>> tested the build is invited to vote. Votes by PMC members are considered
>> binding. A vote passes if there are at least three binding +1s and no -1's.
>>
>> [1]: CHANGES.txt:
>> https://gitbox.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=CHANGES.txt;hb=refs/tags/4.1-beta1-tentative
>> [2]: NEWS.txt:
>> https://gitbox.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=NEWS.txt;hb=refs/tags/4.1-beta1-tentative
>>
>


Re: [DISCUSS] Revising our release criteria, commit guidelines, and the role of circleci vs. ASF CI

2022-09-27 Thread Andrés de la Peña
>
> 250 iterations isn't enough; I use 500 as a low water mark.


I agree that 500 iterations would be a reasonable minimum. We have seen
flaky unit tests requiring far more iterations, but that's not very common.
We could use 500 iterations as the default, and at our discretion use a
higher limit in tests that are quick and might be prone to concurrency issues.
I can change the defaults in the CircleCI config file if we agree to a new limit;
the current default of 100 iterations is quite arbitrary.

The test multiplexer allows running either individual test methods or
entire classes. It is quite frequent to see test methods that pass
individually but fail when they are run together with the other tests in
the same class. Because of this, I think that we should always run entire
classes when repeating new or modified tests. The only exception to this
would be Python dtests, which usually are more resource intensive and not
so prone to that type of issues.

For CI on a patch, run the pre-commit suite and also run multiplexer with
> 250 runs on new, changed, or related tests to ensure not flaky


The multiplexer only allows running a single test class per push. This is ok
for fixing existing flakies (its original purpose), and for most minor
changes, but it can be quite inconvenient for testing large patches that
add or modify many tests. For example, the patch for CEP-19 directly
modifies 31 test classes, which means 31 CircleCI config pushes. This
number can be somewhat reduced with some wildcards on the class names, but
the process is still quite inconvenient. I guess that other large patches
will find the same problem. I have plans on modifying the multiplexer to
allow specifying a list of classes per test target, so we don't have to
needlessly suffer with this.

On Mon, 26 Sept 2022 at 22:44, Brandon Williams  wrote:

> On Mon, Sep 26, 2022 at 1:31 PM Josh McKenzie 
> wrote:
> >
> > 250 iterations isn't enough; I use 500 as a low water mark.
> >
> > Say more here. I originally had it at 500 but neither Mick nor I knew
> why and figured we could suss this out on this thread.
>
> I've seen flakies that passed with fewer iterations later exhibit failures
> at that point.
>
> > This is also assuming that circle and ASF CI run the same tests, which
> > is not entirely true.
> >
> > +1: we need to fix this. My intuition is the path to getting circle-ci
> in parity on coverage is a shorter path than getting ASF CI to 3 green runs
> for GA. That consistent w/your perception as well or do you disagree?
>
> I agree that bringing parity to the coverage will be the shorter path.
>


Re: [Discuss] CEP-24 Password validation and generation

2022-09-23 Thread Andrés de la Peña
I think that custom, pluggable type of guardrail will be a great addition
to the framework.

The first guardrails prototype included a factory of guardrails that was
able to provide different guardrail instances depending on the specified
class and client state. That was discarded during review in favour of a
pluggable provider of guardrail configurations, so the guardrail instances
are always the same but the source of configuration can be customized. That
allows providing different configurations for different users with minimal
hassle. Although that works for most of the guardrails, it doesn't give the
kind of flexibility that the ability to plug custom guardrail
implementations does.

I think that the proposed approach for making specific guardrail
implementations pluggable allows us to keep the simplicity of the current
approach while also giving us flexibility for particular cases like
password validation. The generic CustomGuardrail that will be added to the
framework should ease (and standardize!) validation pluggability, so I
think it can be useful in the future.

On Mon, 19 Sept 2022 at 12:27, Miklosovic, Stefan <
stefan.mikloso...@netapp.com> wrote:

> Hi list,
>
> together with my colleague Jackson Fleming we put together CEP-24 about
> password validation and password generation in Cassandra.
>
> https://cwiki.apache.org/confluence/x/QoueDQ
>
> We are looking forward to discuss this CEP with you in depth.
>
> The outcome of this thread would be to sort out any issues / concerns you
> have so we might eventually vote and implement that in upstream if our
> contribution is found to be useful.
>
> There is a reference implementation provided that we would like to build
> our solution on top of.
>
> Regards
>
> Stefan Miklosovic
>


Re: [VOTE] CEP-20: Dynamic Data Masking

2022-09-23 Thread Andrés de la Peña
Vote passes with eight +1s (seven binding) and no vetoes.

Thanks everyone.

On Thu, 22 Sept 2022 at 20:50, Josh McKenzie  wrote:

> +1
>
> On Thu, Sep 22, 2022, at 4:28 AM, Mick Semb Wever wrote:
>
>
>
> I'd like to propose CEP-20 for approval.
>
> Proposal:
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-20%3A+Dynamic+Data+Masking
> Discussion:
> https://lists.apache.org/thread/qsmxsymozymy6dy9tp5xw9gn5fhz9nt4
>
> The vote will be open for 72 hours.
> Votes by committers are considered binding.
> A vote passes if there are at least three binding +1s and no binding
> vetoes.
>
>
>
>
> +1
>
>
>


[VOTE] CEP-20: Dynamic Data Masking

2022-09-19 Thread Andrés de la Peña
Hi everyone,

I'd like to propose CEP-20 for approval.

Proposal:
https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-20%3A+Dynamic+Data+Masking
Discussion: https://lists.apache.org/thread/qsmxsymozymy6dy9tp5xw9gn5fhz9nt4

The vote will be open for 72 hours.
Votes by committers are considered binding.
A vote passes if there are at least three binding +1s and no binding vetoes.

Thank you,


Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-09-16 Thread Andrés de la Peña
It's been 9 days since we started the poll, and we haven't had any new vote
since Monday. So we are still on 5 votes for A and 2 votes for B.

The poll results don't seem to oppose the CEP. If no one has anything
else to add, I'll start the actual vote thread.

On Tue, 13 Sept 2022 at 15:05, Andrés de la Peña 
wrote:

> That's 5 votes for A and 2 votes for B so far. None of these options
> opposes the CEP, so I think we can probably start the vote, unless we
> want to wait longer for the poll.
>
> On Mon, 12 Sept 2022 at 13:51, Benjamin Lerer  wrote:
>
>> A
>>
>> Le mer. 7 sept. 2022 à 17:02, Jeremiah D Jordan <
>> jeremiah.jor...@gmail.com> a écrit :
>>
>>> A
>>>
>>> On Sep 7, 2022, at 8:58 AM, Benedict  wrote:
>>>
>>> Well, I am not convinced these changes will materially impact the
>>> outcome, but at least we’ll have some extra fun collating the votes.
>>>
>>>
>>> On 7 Sep 2022, at 14:05, Andrés de la Peña  wrote:
>>>
>>> 
>>> The poll makes sense to me. I would slightly change it to:
>>>
>>> A) We shouldn't prefer either approach, and I agree to the implementor
>>> selecting the table schema approach for this CEP
>>> B) We should prefer the view approach, but I am not opposed to the
>>> implementor selecting the table schema approach for this CEP
>>> C) We should NOT implement the table schema approach, and should
>>> implement the view approach
>>> D) We should NOT implement the table view approach, and should implement
>>> the schema approach
>>> E) We should NOT implement the table schema approach, and should
>>> implement some other scheme (or not implement this feature)
>>>
>>> Where my vote is for A.
>>>
>>>
>>> On Wed, 7 Sept 2022 at 13:12, Benedict  wrote:
>>>
>>>> I’m not convinced there’s been adequate resolution over which approach
>>>> is adopted. I know you have expressed a preference for the table schema
>>>> approach, but the weight of other opinion so far appears to be against this
>>>> approach - even if it is broadly adopted by other databases. I will note
>>>> that Postgres does not adopt this approach, it has a more sophisticated
>>>> security label approach that has not been proposed by anybody so far.
>>>>
>>>> I think extra weight should be given to the implementer’s preference,
>>>> so while I personally do not like the table schema approach, I am happy to
>>>> accept this is an industry norm, and leave the decision to you.
>>>>
>>>> However, we should ensure the community as a whole endorses this. I
>>>> think an indicative poll should be undertaken first, eg:
>>>>
>>>> A) We should implement the table schema approach, as proposed
>>>> B) We should prefer the view approach, but I am not opposed to the
>>>> implementor selecting the table schema approach for this CEP
>>>> C) We should NOT implement the table schema approach, and should
>>>> implement the view approach
>>>> D) We should NOT implement the table schema approach, and should
>>>> implement some other scheme (or not implement this feature)
>>>>
>>>> Where my vote is B
>>>>
>>>> On 7 Sep 2022, at 12:50, Andrés de la Peña 
>>>> wrote:
>>>>
>>>> 
>>>> If nobody has more concerns regarding the CEP I will start the vote
>>>> tomorrow.
>>>>
>>>> On Wed, 31 Aug 2022 at 13:18, Andrés de la Peña 
>>>> wrote:
>>>>
>>>>> Is there enough support here for VIEWS to be the implementation
>>>>>> strategy for displaying masking functions?
>>>>>
>>>>>
>>>>> I'm not sure that views should be "the" strategy for masking
>>>>> functions. We have multiple approaches here:
>>>>>
>>>>> 1) CQL functions only. Users can decide to use the masking functions
>>>>> on their own. I think most dbs allow this pattern of usage, which is
>>>>> quite straightforward. Obviously, it doesn't allow admins to enforce
>>>>> users seeing only masked data. Nevertheless, it's still useful for trusted
>>>>> database users generating masked data that will be consumed by the end
>>>>> users of the application.
>>>>>
>>>>> 2) Masking functions attached to specific columns. This way the same
>>>>> queries will see different data (masked or not) depending on the
>>>>> permissions of the user running the query.

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-09-13 Thread Andrés de la Peña
That's 5 votes for A and 2 votes for B so far. None of these options
opposes the CEP, so I think we can probably start the vote, unless we
want to wait longer for the poll.

On Mon, 12 Sept 2022 at 13:51, Benjamin Lerer  wrote:

> A
>
> Le mer. 7 sept. 2022 à 17:02, Jeremiah D Jordan 
> a écrit :
>
>> A
>>
>> On Sep 7, 2022, at 8:58 AM, Benedict  wrote:
>>
>> Well, I am not convinced these changes will materially impact the
>> outcome, but at least we’ll have some extra fun collating the votes.
>>
>>
>> On 7 Sep 2022, at 14:05, Andrés de la Peña  wrote:
>>
>> 
>> The poll makes sense to me. I would slightly change it to:
>>
>> A) We shouldn't prefer either approach, and I agree to the implementor
>> selecting the table schema approach for this CEP
>> B) We should prefer the view approach, but I am not opposed to the
>> implementor selecting the table schema approach for this CEP
>> C) We should NOT implement the table schema approach, and should
>> implement the view approach
>> D) We should NOT implement the table view approach, and should implement
>> the schema approach
>> E) We should NOT implement the table schema approach, and should
>> implement some other scheme (or not implement this feature)
>>
>> Where my vote is for A.
>>
>>
>> On Wed, 7 Sept 2022 at 13:12, Benedict  wrote:
>>
>>> I’m not convinced there’s been adequate resolution over which approach
>>> is adopted. I know you have expressed a preference for the table schema
>>> approach, but the weight of other opinion so far appears to be against this
>>> approach - even if it is broadly adopted by other databases. I will note
>>> that Postgres does not adopt this approach; it has a more sophisticated
>>> security label approach that has not been proposed by anybody so far.
>>>
>>> I think extra weight should be given to the implementer’s preference, so
>>> while I personally do not like the table schema approach, I am happy to
>>> accept this is an industry norm, and leave the decision to you.
>>>
>>> However, we should ensure the community as a whole endorses this. I
>>> think an indicative poll should be undertaken first, eg:
>>>
>>> A) We should implement the table schema approach, as proposed
>>> B) We should prefer the view approach, but I am not opposed to the
>>> implementor selecting the table schema approach for this CEP
>>> C) We should NOT implement the table schema approach, and should
>>> implement the view approach
>>> D) We should NOT implement the table schema approach, and should
>>> implement some other scheme (or not implement this feature)
>>>
>>> Where my vote is B
>>>
>>> On 7 Sep 2022, at 12:50, Andrés de la Peña  wrote:
>>>
>>> 
>>> If nobody has more concerns regarding the CEP I will start the vote
>>> tomorrow.
>>>
>>> On Wed, 31 Aug 2022 at 13:18, Andrés de la Peña 
>>> wrote:
>>>
>>>> Is there enough support here for VIEWS to be the implementation
>>>>> strategy for displaying masking functions?
>>>>
>>>>
>>>> I'm not sure that views should be "the" strategy for masking functions.
>>>> We have multiple approaches here:
>>>>
>>>> 1) CQL functions only. Users can decide to use the masking functions on
>>>> their own. I think most dbs allow this pattern of usage, which is
>>>> quite straightforward. Obviously, it doesn't allow admins to enforce
>>>> users seeing only masked data. Nevertheless, it's still useful for trusted
>>>> database users generating masked data that will be consumed by the end
>>>> users of the application.
>>>>
>>>> 2) Masking functions attached to specific columns. This way the same
>>>> queries will see different data (masked or not) depending on the
>>>> permissions of the user running the query. It has the advantage of not
>>>> requiring to change the queries that users with different permissions run.
>>>> The downside is that users would need to query the schema if they need to
>>>> know whether a column is masked, unless we change the names of the returned
>>>> columns. This is the approach offered by Azure/SQL Server, PostgreSQL, IBM
>>>> Db2, Oracle, MariaDB/MaxScale and SnowFlake. All these databases support
>>>> applying the masking function to columns on the base table, and some of
>>>> them also allow to apply masking to views.

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-09-07 Thread Andrés de la Peña
The poll makes sense to me. I would slightly change it to:

A) We shouldn't prefer either approach, and I agree to the implementor
selecting the table schema approach for this CEP
B) We should prefer the view approach, but I am not opposed to the
implementor selecting the table schema approach for this CEP
C) We should NOT implement the table schema approach, and should implement
the view approach
D) We should NOT implement the table view approach, and should implement
the schema approach
E) We should NOT implement the table schema approach, and should implement
some other scheme (or not implement this feature)

Where my vote is for A.


On Wed, 7 Sept 2022 at 13:12, Benedict  wrote:

> I’m not convinced there’s been adequate resolution over which approach is
> adopted. I know you have expressed a preference for the table schema
> approach, but the weight of other opinion so far appears to be against this
> approach - even if it is broadly adopted by other databases. I will note
> that Postgres does not adopt this approach; it has a more sophisticated
> security label approach that has not been proposed by anybody so far.
>
> I think extra weight should be given to the implementer’s preference, so
> while I personally do not like the table schema approach, I am happy to
> accept this is an industry norm, and leave the decision to you.
>
> However, we should ensure the community as a whole endorses this. I think
> an indicative poll should be undertaken first, eg:
>
> A) We should implement the table schema approach, as proposed
> B) We should prefer the view approach, but I am not opposed to the
> implementor selecting the table schema approach for this CEP
> C) We should NOT implement the table schema approach, and should implement
> the view approach
> D) We should NOT implement the table schema approach, and should implement
> some other scheme (or not implement this feature)
>
> Where my vote is B
>
> On 7 Sep 2022, at 12:50, Andrés de la Peña  wrote:
>
> 
> If nobody has more concerns regarding the CEP I will start the vote
> tomorrow.
>
> On Wed, 31 Aug 2022 at 13:18, Andrés de la Peña 
> wrote:
>
>> Is there enough support here for VIEWS to be the implementation strategy
>>> for displaying masking functions?
>>
>>
>> I'm not sure that views should be "the" strategy for masking functions.
>> We have multiple approaches here:
>>
>> 1) CQL functions only. Users can decide to use the masking functions on
>> their own. I think most dbs allow this pattern of usage, which is
>> quite straightforward. Obviously, it doesn't allow admins to enforce
>> users seeing only masked data. Nevertheless, it's still useful for trusted
>> database users generating masked data that will be consumed by the end
>> users of the application.
>>
>> 2) Masking functions attached to specific columns. This way the same
>> queries will see different data (masked or not) depending on the
>> permissions of the user running the query. It has the advantage of not
>> requiring to change the queries that users with different permissions run.
>> The downside is that users would need to query the schema if they need to
>> know whether a column is masked, unless we change the names of the returned
>> columns. This is the approach offered by Azure/SQL Server, PostgreSQL, IBM
>> Db2, Oracle, MariaDB/MaxScale and SnowFlake. All these databases support
>> applying the masking function to columns on the base table, and some of
>> them also allow to apply masking to views.
>>
>> 3) Masking functions as part of projected views. This way users might
>> need to query the view appropriate for their permissions instead of the
>> base table. This might mean changing the queries if the masking policy is
>> changed by the admin. MySQL recommends this approach on a blog entry,
>> although it's not part of its main documentation for data masking, and the
>> implementation has security issues. Some of the other databases offering
>> the approach 2) as their main option also support masking on view columns.
>>
>> Each approach has its own advantages and limitations, and I don't think
>> we necessarily have to choose. The CEP proposes implementing 1) and 2), but
>> nothing impedes us from also having 3) if we get to have projected views.
>> However, I think that projected views is a new general-purpose feature with
>> its own complexities, so it would deserve its own CEP, if someone is
>> willing to work on the implementation.
>>
>>
>>
>> On Wed, 31 Aug 2022 at 12:03, Claude Warren via dev <
>> dev@cassandra.apache.org> wrote:
>>
>>> Is there enough support here for VIEWS to be the implementation strategy
>>> for displaying masking functions?

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-09-07 Thread Andrés de la Peña
If nobody has more concerns regarding the CEP I will start the vote
tomorrow.

On Wed, 31 Aug 2022 at 13:18, Andrés de la Peña 
wrote:

> Is there enough support here for VIEWS to be the implementation strategy
>> for displaying masking functions?
>
>
> I'm not sure that views should be "the" strategy for masking functions. We
> have multiple approaches here:
>
> 1) CQL functions only. Users can decide to use the masking functions on
> their own. I think most dbs allow this pattern of usage, which is
> quite straightforward. Obviously, it doesn't allow admins to enforce
> users seeing only masked data. Nevertheless, it's still useful for trusted
> database users generating masked data that will be consumed by the end
> users of the application.
>
> 2) Masking functions attached to specific columns. This way the same
> queries will see different data (masked or not) depending on the
> permissions of the user running the query. It has the advantage of not
> requiring to change the queries that users with different permissions run.
> The downside is that users would need to query the schema if they need to
> know whether a column is masked, unless we change the names of the returned
> columns. This is the approach offered by Azure/SQL Server, PostgreSQL, IBM
> Db2, Oracle, MariaDB/MaxScale and SnowFlake. All these databases support
> applying the masking function to columns on the base table, and some of
> them also allow to apply masking to views.
>
> 3) Masking functions as part of projected views. This way users might
> need to query the view appropriate for their permissions instead of the
> base table. This might mean changing the queries if the masking policy is
> changed by the admin. MySQL recommends this approach on a blog entry,
> although it's not part of its main documentation for data masking, and the
> implementation has security issues. Some of the other databases offering
> the approach 2) as their main option also support masking on view columns.
>
> Each approach has its own advantages and limitations, and I don't think we
> necessarily have to choose. The CEP proposes implementing 1) and 2), but
> nothing impedes us from also having 3) if we get to have projected views. However,
> I think that projected views is a new general-purpose feature with its own
> complexities, so it would deserve its own CEP, if someone is willing to
> work on the implementation.
>
>
>
> On Wed, 31 Aug 2022 at 12:03, Claude Warren via dev <
> dev@cassandra.apache.org> wrote:
>
>> Is there enough support here for VIEWS to be the implementation strategy
>> for displaying masking functions?
>>
>> It seems to me the view would have to store the query and apply a where
>> clause to it, so the same PK would be in play.
>>
>> It has data leaking properties.
>>
>> It has more use cases as it can be used to
>>
>>- construct views that filter out sensitive columns
>>- apply transforms to convert units of measure
>>
>> Are there more thoughts along this line?
>>
>


Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-31 Thread Andrés de la Peña
>
> Is there enough support here for VIEWS to be the implementation strategy
> for displaying masking functions?


I'm not sure that views should be "the" strategy for masking functions. We
have multiple approaches here:

1) CQL functions only. Users can decide to use the masking functions on
their own. I think most dbs allow this pattern of usage, which is
quite straightforward. Obviously, it doesn't allow admins to enforce
users seeing only masked data. Nevertheless, it's still useful for trusted
database users generating masked data that will be consumed by the end
users of the application.

2) Masking functions attached to specific columns. This way the same
queries will see different data (masked or not) depending on the
permissions of the user running the query. It has the advantage of not
requiring to change the queries that users with different permissions run.
The downside is that users would need to query the schema if they need to
know whether a column is masked, unless we change the names of the returned
columns. This is the approach offered by Azure/SQL Server, PostgreSQL, IBM
Db2, Oracle, MariaDB/MaxScale and SnowFlake. All these databases support
applying the masking function to columns on the base table, and some of
them also allow to apply masking to views.

3) Masking functions as part of projected views. This way users might need
to query the view appropriate for their permissions instead of the base
table. This might mean changing the queries if the masking policy is
changed by the admin. MySQL recommends this approach on a blog entry,
although it's not part of its main documentation for data masking, and the
implementation has security issues. Some of the other databases offering
the approach 2) as their main option also support masking on view columns.

Each approach has its own advantages and limitations, and I don't think we
necessarily have to choose. The CEP proposes implementing 1) and 2), but
nothing impedes us from also having 3) if we get to have projected views. However,
I think that projected views is a new general-purpose feature with its own
complexities, so it would deserve its own CEP, if someone is willing to
work on the implementation.
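
For illustration, a minimal CQL sketch of approaches 1) and 2), assuming the
mask_inner function and the MASKED WITH column syntax proposed in the CEP
(names and syntax are the CEP's proposals, not final):

-- Approach 1): the masking function is applied explicitly in the query
SELECT id, mask_inner(name, 1, null) FROM employees;

-- Approach 2): the masking function is attached to the column, so results
-- are masked transparently for users without the UNMASK permission
CREATE TABLE employees (
    id int PRIMARY KEY,
    name text MASKED WITH mask_inner(1, null)
);
SELECT id, name FROM employees;  -- unprivileged users get masked names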



On Wed, 31 Aug 2022 at 12:03, Claude Warren via dev <
dev@cassandra.apache.org> wrote:

> Is there enough support here for VIEWS to be the implementation strategy
> for displaying masking functions?
>
> It seems to me the view would have to store the query and apply a where
> clause to it, so the same PK would be in play.
>
> It has data leaking properties.
>
> It has more use cases as it can be used to
>
>- construct views that filter out sensitive columns
>- apply transforms to convert units of measure
>
> Are there more thoughts along this line?
>


Re: [DISCUSS] LWT UPDATE semantics with + and - when null

2022-08-31 Thread Andrés de la Peña
I think I'd prefer 2), the SQL behaviour. We could also get the convenience
of 3) by adding CQL functions such as "ifNull(column, default)" or
"zeroIfNull(column)", as it's done by other dbs. So we could do things like
"UPDATE ... SET name = zeroIfNull(name) + 42".

On Wed, 31 Aug 2022 at 04:54, Caleb Rackliffe 
wrote:

> Also +1 on the SQL behavior here. I was uneasy w/ coercing to "" / 0 / 1
> (depending on the type) in our previous discussion, but for some reason
> didn't bring up the SQL analog :-|
>
> On Tue, Aug 30, 2022 at 5:38 PM Benedict  wrote:
>
>> I’m a bit torn here, as consistency with counters is important. But they
>> are a unique eventually consistent data type, and I am inclined to default
>> standard numeric types to behave as SQL does, since they write a new value
>> rather than a “delta”
>>
>> It is far from optimal to have divergent behaviours, but also suboptimal
>> to diverge from relational algebra, and probably special casing counters is
>> the least bad outcome IMO.
>>
>>
>> On 30 Aug 2022, at 22:52, David Capwell  wrote:
>>
>> 
>> 4.1 added the ability for LWT to support "UPDATE ... SET name = name +
>> 42", but we never really fleshed out with the larger community what the
>> semantics should be in the case where the column or row are NULL; I opened
>> up https://issues.apache.org/jira/browse/CASSANDRA-17857 for this issue.
>>
>> As I see it there are 3 possible outcomes:
>> 1) fail the query
>> 2) null + 42 = null (matches SQL)
>> 3) null + 42 == 0 + 42 = 42 (matches counters)
>>
>> In SQL you get NULL (option 2), but CQL counters treat NULL as 0 (option
>> 3) meaning we already do not match SQL (though counters are not a standard
>> SQL type so might not be applicable).  Personally I lean towards option 3
>> as the "zero" for addition and subtraction is 0 (1 for multiplication and
>> division).
>>
>> So looking for feedback so we can update in CASSANDRA-17857 before 4.1
>> release.
>>
>>
>>


Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-30 Thread Andrés de la Peña
>
> GRANT SELECT ON foo.unmasked_name TO top_secret;


Note that Cassandra doesn't have support for column-level permissions.
There was an initiative to add them in 2016, CASSANDRA-12859
<https://issues.apache.org/jira/browse/CASSANDRA-12859>. However, the
ticket has been inactive since 2017. The last comments seem to be some
discussions about the design.

Also, generated columns in PostgreSQL are always stored, so if they were
used for masking they would constitute static data masking, not dynamic.

The approach for dynamic data masking that PostgreSQL suggests on its
documentation
<https://postgresql-anonymizer.readthedocs.io/en/latest/dynamic_masking/>
doesn't
seem to be based on generating a masked copy of the column, neither on a
generated column nor on a view. Instead, it uses security labels to
associate columns with users and masking functions. That way, the same column
will be seen masked or unmasked depending on the user.

I'd say that applying the masking rule to the base column itself, and not
to a copy, is the most common approach among the discussed databases so
far. Also, it has the advantage for us of not being based on other
relatively complex features that we miss, such as column-level permissions
or non-materialized views. If someday we add those features I think they
would play well with what is proposed in the CEP.

On Tue, 30 Aug 2022 at 11:46, Avi Kivity via dev 
wrote:

> Agree with views, or alternatively, column permissions together with
> computed columns:
>
>
> CREATE TABLE foo (
>
>   id int PRIMARY KEY,
>
>   unmasked_name text,
>
>   name text GENERATED ALWAYS AS (some_mask_function(unmasked_name, 'xxx', 7)) STORED
>
> )
>
>
> (syntax from postgresql)
>
>
> GRANT SELECT ON foo.name TO general_use;
>
> GRANT SELECT ON foo.unmasked_name TO top_secret;
>
>
> On 26/08/2022 00.10, Benedict wrote:
>
> I’m inclined to agree that this seems a more straightforward approach that
> makes fewer implied promises.
>
> Perhaps we could deliver simple views backed by virtual tables, and model
> our approach on that of Postgres, MySQL et al?
>
> Views in C* would be very simple, just offering a subset of fields with
> some UDFs applied. It would allow users to define roles with access only to
> the views, or for applications to use the views for presentation purposes.
>
> It feels like a cleaner approach to me, and we’d get two features for the
> price of one. BUT I don’t feel super strongly about this.
>
> On 25 Aug 2022, at 20:16, Derek Chen-Becker 
>  wrote:
>
> 
> To make sure I understand, if I wanted to use a masked column for a
> conditional update, you're saying we would need SELECT_MASKED to use it in
> the IF clause? I worry that this proposal is increasing in complexity; I
> would actually be OK starting with something smaller in scope. Perhaps just
> providing the masking functions and not tying masking to schema would be
> sufficient for an initial goal? That wouldn't preclude additional
> permissions, schema integration, or perhaps just plain Views in the future.
>
> Cheers,
>
> Derek
>
> On Thu, Aug 25, 2022 at 11:12 AM Andrés de la Peña 
> wrote:
>
>> I have modified the proposal adding a new SELECT_MASKED permission. Using
>> masked columns on WHERE/IF clauses would require having SELECT and either
>> UNMASK or SELECT_MASKED permissions. Seeing the unmasked values in the
>> query results would always require both SELECT and UNMASK.
>>
>> This way we can have the best of both worlds, allowing admins to decide
>> whether they trust their immediate users or not. wdyt?
>>
>> On Wed, 24 Aug 2022 at 16:06, Henrik Ingo 
>> wrote:
>>
>>> This is the difference between security and compliance I guess :-D
>>>
>>> The way I see this, the attacker or threat in this concept is not the
>>> developer with access to the database. Rather a feature like this is just a
>>> convenient way to apply some masking rule in a centralized way. The
>>> protection is against an end user of the application, who should not be
>>> able to see the personal data of someone else. Or themselves, even. As long
>>> as the application end user doesn't have access to run arbitrary CQL, then
>>> these forms of masking prevent accidental unauthorized use/leaking of
>>> personal data.
>>>
>>> henrik
>>>
>>>
>>>
>>> On Wed, Aug 24, 2022 at 10:40 AM Benedict  wrote:
>>>
>>>> Is it typical for a masking feature to make no effort to prevent
>>>> unmasking? I’m just struggling to see the value of this without such
>>>> mechanisms. Otherwise it’s just a default formatter, and we should consider
>>>>

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-26 Thread Andrés de la Peña
>
> Yes, I was thinking that simple projection views (essentially a SELECT
> statement with application of transform functions) would complement masking
> functions, and from the discussion it sounds like this is basically what
> some of the other databases do.


I don't see that the mentioned databases in general suggest using views for
dynamic data masking. So far, I have only seen this blog post entry
<https://dev.mysql.com/blog-archive/data-masking-in-mysql/> suggesting to
use MySQL's non-materialized views with masking functions, probably because
MySQL lacks the more sophisticated mechanisms for data masking that other
databases offer.

However, using MySQL views can allow malicious users to run queries to
infer the masked data, which is what we were trying to avoid. For example:

CREATE TABLE employees(
 id INT NOT NULL AUTO_INCREMENT,
 name VARCHAR(100) NOT NULL,
 PRIMARY KEY (id));

CREATE VIEW employee_mask AS SELECT
  id,
  mask_inner(name, 1, 0, _binary'*') AS name
  FROM employees;

INSERT INTO employees(name) SELECT "Joseph";
INSERT INTO employees(name) SELECT "Olivia";

SELECT * FROM employee_mask WHERE name="Joseph";
+----+--------+
| id | name   |
+----+--------+
|  1 | J***** |
+----+--------+

On Fri, 26 Aug 2022 at 02:45, Derek Chen-Becker 
wrote:

> Yes, I was thinking that simple projection views (essentially a SELECT
> statement with application of transform functions) would complement masking
> functions, and from the discussion it sounds like this is basically what
> some of the other databases do. Projection views seem like they would be
> useful in their own right, so would it be proper to write a separate CEP
> for that? I would be happy to help drive that document and discussion. I'm
> not sure if it's the best name, but I'm trying to distinguish views that
> expose a subset of an existing schema vs materialized views, which offer
> more complex capabilities.
>
> Cheers,
>
> Derek
>
> On Thu, Aug 25, 2022, 3:11 PM Benedict  wrote:
>
>> I’m inclined to agree that this seems a more straightforward approach
>> that makes fewer implied promises.
>>
>> Perhaps we could deliver simple views backed by virtual tables, and model
>> our approach on that of Postgres, MySQL et al?
>>
>> Views in C* would be very simple, just offering a subset of fields with
>> some UDFs applied. It would allow users to define roles with access only to
>> the views, or for applications to use the views for presentation purposes.
>>
>> It feels like a cleaner approach to me, and we’d get two features for the
>> price of one. BUT I don’t feel super strongly about this.
>>
>> On 25 Aug 2022, at 20:16, Derek Chen-Becker 
>> wrote:
>>
>> 
>> To make sure I understand, if I wanted to use a masked column for a
>> conditional update, you're saying we would need SELECT_MASKED to use it in
>> the IF clause? I worry that this proposal is increasing in complexity; I
>> would actually be OK starting with something smaller in scope. Perhaps just
>> providing the masking functions and not tying masking to schema would be
>> sufficient for an initial goal? That wouldn't preclude additional
>> permissions, schema integration, or perhaps just plain Views in the future.
>>
>> Cheers,
>>
>> Derek
>>
>> On Thu, Aug 25, 2022 at 11:12 AM Andrés de la Peña 
>> wrote:
>>
>>> I have modified the proposal adding a new SELECT_MASKED permission.
>>> Using masked columns on WHERE/IF clauses would require having SELECT and
>>> either UNMASK or SELECT_MASKED permissions. Seeing the unmasked values in
>>> the query results would always require both SELECT and UNMASK.
>>>
>>> This way we can have the best of both worlds, allowing admins to decide
>>> whether they trust their immediate users or not. wdyt?
>>>
>>> On Wed, 24 Aug 2022 at 16:06, Henrik Ingo 
>>> wrote:
>>>
>>>> This is the difference between security and compliance I guess :-D
>>>>
>>>> The way I see this, the attacker or threat in this concept is not the
>>>> developer with access to the database. Rather a feature like this is just a
>>>> convenient way to apply some masking rule in a centralized way. The
>>>> protection is against an end user of the application, who should not be
>>>> able to see the personal data of someone else. Or themselves, even. As long
>>>> as the application end user doesn't have access to run arbitrary CQL, then
>>>> these forms of masking prevent accidental unauthorized use/leaking of
>>>> personal data.
>>>>
>>>> henrik

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-25 Thread Andrés de la Peña
>
> Perhaps we could deliver simple views backed by virtual tables, and model
> our approach on that of Postgres, MySQL et al?


The approach of PostgreSQL
<https://postgresql-anonymizer.readthedocs.io/en/latest/dynamic_masking/>
allows attaching masking functions to columns and users with commands such
as:

SECURITY LABEL FOR anon ON COLUMN people.phone
IS 'MASKED WITH FUNCTION anon.partial(phone,2,$$**$$,2)';

MySQL, however, only provides the masking functions, without the ability
to attach them to either columns or users, as far as I know.

The approach most similar to the proposed one is that of Azure/SQL Server,
which is almost identical except for the CEP trying to address the recent
concerns about querying masked columns.



On Thu, 25 Aug 2022 at 22:10, Benedict  wrote:

> I’m inclined to agree that this seems a more straightforward approach that
> makes fewer implied promises.
>
> Perhaps we could deliver simple views backed by virtual tables, and model
> our approach on that of Postgres, MySQL et al?
>
> Views in C* would be very simple, just offering a subset of fields with
> some UDFs applied. It would allow users to define roles with access only to
> the views, or for applications to use the views for presentation purposes.
>
> It feels like a cleaner approach to me, and we’d get two features for the
> price of one. BUT I don’t feel super strongly about this.
>
> On 25 Aug 2022, at 20:16, Derek Chen-Becker  wrote:
>
> 
> To make sure I understand, if I wanted to use a masked column for a
> conditional update, you're saying we would need SELECT_MASKED to use it in
> the IF clause? I worry that this proposal is increasing in complexity; I
> would actually be OK starting with something smaller in scope. Perhaps just
> providing the masking functions and not tying masking to schema would be
> sufficient for an initial goal? That wouldn't preclude additional
> permissions, schema integration, or perhaps just plain Views in the future.
>
> Cheers,
>
> Derek
>
> On Thu, Aug 25, 2022 at 11:12 AM Andrés de la Peña 
> wrote:
>
>> I have modified the proposal adding a new SELECT_MASKED permission. Using
>> masked columns on WHERE/IF clauses would require having SELECT and either
>> UNMASK or SELECT_MASKED permissions. Seeing the unmasked values in the
>> query results would always require both SELECT and UNMASK.
>>
>> This way we can have the best of both worlds, allowing admins to decide
>> whether they trust their immediate users or not. wdyt?
>>
>> On Wed, 24 Aug 2022 at 16:06, Henrik Ingo 
>> wrote:
>>
>>> This is the difference between security and compliance I guess :-D
>>>
>>> The way I see this, the attacker or threat in this concept is not the
>>> developer with access to the database. Rather a feature like this is just a
>>> convenient way to apply some masking rule in a centralized way. The
>>> protection is against an end user of the application, who should not be
>>> able to see the personal data of someone else. Or themselves, even. As long
>>> as the application end user doesn't have access to run arbitrary CQL, then
>>> these forms of masking prevent accidental unauthorized use/leaking of
>>> personal data.
>>>
>>> henrik
>>>
>>>
>>>
>>> On Wed, Aug 24, 2022 at 10:40 AM Benedict  wrote:
>>>
>>>> Is it typical for a masking feature to make no effort to prevent
>>>> unmasking? I’m just struggling to see the value of this without such
>>>> mechanisms. Otherwise it’s just a default formatter, and we should consider
>>>> renaming the feature IMO
>>>>
>>>> On 23 Aug 2022, at 21:27, Andrés de la Peña 
>>>> wrote:
>>>>
>>>> 
>>>> As mentioned in the CEP document, dynamic data masking doesn't try to
>>>> prevent malicious users with SELECT permissions from indirectly guessing the
>>>> real value of the masked data. This can easily be done by just trying
>>>> values on the WHERE clause of SELECT queries. DDM would not be a
>>>> replacement for proper column-level permissions.
>>>>
>>>> The data served by the database is usually consumed by applications
>>>> that present this data to end users. These end users are not necessarily
>>>> the users directly connecting to the database. With DDM, it would be easy
>>>> for applications to mask sensitive data that is going to be consumed by the
>>>> end users. However, the users directly connecting to the database should be
>>>> trusted, provided that they have the right SELECT permissions.

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-25 Thread Andrés de la Peña
Note that conditional updates return true or false to notify whether the
update has happened or not. That can also be exploited to infer the masked
data. Indeed, at the moment they also require SELECT permissions.

The masking functions can always be used on their own, as any other CQL
function, and without necessarily associating them with the schema.

You would only need either UNMASK or SELECT_MASKED permissions for a
conditional update if the masking function is attached to the column
declaration in the schema of the table.

There is a timeline section
<https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-20%3A+Dynamic+Data+Masking#CEP20:DynamicDataMasking-Timeline>
of the CEP listing the planned development steps. The first step is adding
the functions on their own. The next steps are to allow attaching those
functions to the columns with the mentioned permissions.
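
To make the inference risk concrete, a sketch with a made-up table of how the
[applied] flag of a conditional update can be used to probe a masked column:

-- The attacker never sees the clear phone value, but the result row
-- reveals whether each guess was right:
UPDATE users SET note = 'x' WHERE id = 42 IF phone = '555-0100';
--  [applied] = False  ->  wrong guess
UPDATE users SET note = 'x' WHERE id = 42 IF phone = '555-0101';
--  [applied] = True   ->  phone inferred despite the mask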

On Thu, 25 Aug 2022 at 20:16, Derek Chen-Becker 
wrote:

> To make sure I understand, if I wanted to use a masked column for a
> conditional update, you're saying we would need SELECT_MASKED to use it in
> the IF clause? I worry that this proposal is increasing in complexity; I
> would actually be OK starting with something smaller in scope. Perhaps just
> providing the masking functions and not tying masking to schema would be
> sufficient for an initial goal? That wouldn't preclude additional
> permissions, schema integration, or perhaps just plain Views in the future.
>
> Cheers,
>
> Derek
>
> On Thu, Aug 25, 2022 at 11:12 AM Andrés de la Peña 
> wrote:
>
>> I have modified the proposal adding a new SELECT_MASKED permission. Using
>> masked columns on WHERE/IF clauses would require having SELECT and either
>> UNMASK or SELECT_MASKED permissions. Seeing the unmasked values in the
>> query results would always require both SELECT and UNMASK.
>>
>> This way we can have the best of both worlds, allowing admins to decide
>> whether they trust their immediate users or not. wdyt?
>>
>> On Wed, 24 Aug 2022 at 16:06, Henrik Ingo 
>> wrote:
>>
>>> This is the difference between security and compliance I guess :-D
>>>
>>> The way I see this, the attacker or threat in this concept is not the
>>> developer with access to the database. Rather a feature like this is just a
>>> convenient way to apply some masking rule in a centralized way. The
>>> protection is against an end user of the application, who should not be
>>> able to see the personal data of someone else. Or themselves, even. As long
>>> as the application end user doesn't have access to run arbitrary CQL, then
>>> these forms of masking prevent accidental unauthorized use/leaking of
>>> personal data.
>>>
>>> henrik
>>>
>>>
>>>
>>> On Wed, Aug 24, 2022 at 10:40 AM Benedict  wrote:
>>>
>>>> Is it typical for a masking feature to make no effort to prevent
>>>> unmasking? I’m just struggling to see the value of this without such
>>>> mechanisms. Otherwise it’s just a default formatter, and we should consider
>>>> renaming the feature IMO
>>>>
>>>> On 23 Aug 2022, at 21:27, Andrés de la Peña 
>>>> wrote:
>>>>
>>>> 
>>>> As mentioned in the CEP document, dynamic data masking doesn't try to
>>>> prevent malicious users with SELECT permissions from indirectly guessing the
>>>> real value of the masked data. This can easily be done by just trying
>>>> values on the WHERE clause of SELECT queries. DDM would not be a
>>>> replacement for proper column-level permissions.
>>>>
>>>> The data served by the database is usually consumed by applications
>>>> that present this data to end users. These end users are not necessarily
>>>> the users directly connecting to the database. With DDM, it would be easy
>>>> for applications to mask sensitive data that is going to be consumed by the
>>>> end users. However, the users directly connecting to the database should be
>>>> trusted, provided that they have the right SELECT permissions.
>>>>
>>>> In other words, DDM doesn't directly protect the data, but it eases the
>>>> production of protected data.
>>>>
>>>>> That said, we could later go one step further and add a way to prevent
>>>>> untrusted users from inferring the masked data. That could be done by adding a
>>>> new permission required to use certain columns on WHERE clauses, different
>>>> to the current SELECT permission. That would play especially well with
>>>> column-level permissions, which is something that we still have pending.

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-25 Thread Andrés de la Peña
I have modified the proposal adding a new SELECT_MASKED permission. Using
masked columns on WHERE/IF clauses would require having SELECT and either
UNMASK or SELECT_MASKED permissions. Seeing the unmasked values in the
query results would always require both SELECT and UNMASK.

This way we can have the best of both worlds, allowing admins to decide
whether they trust their immediate users or not. wdyt?
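
A minimal sketch of the resulting permission combinations, assuming the
table-level GRANT syntax proposed in the CEP for UNMASK and SELECT_MASKED:

-- SELECT alone: results are masked, and masked columns cannot be used
-- in WHERE/IF clauses
GRANT SELECT ON TABLE ks.patients TO app_role;

-- SELECT + SELECT_MASKED: results are still masked, but masked columns
-- can be used in WHERE/IF clauses
GRANT SELECT_MASKED ON TABLE ks.patients TO app_role;

-- SELECT + UNMASK: clear values in results and in WHERE/IF clauses
GRANT SELECT ON TABLE ks.patients TO admin_role;
GRANT UNMASK ON TABLE ks.patients TO admin_role;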

On Wed, 24 Aug 2022 at 16:06, Henrik Ingo  wrote:

> This is the difference between security and compliance I guess :-D
>
> The way I see this, the attacker or threat in this concept is not the
> developer with access to the database. Rather a feature like this is just a
> convenient way to apply some masking rule in a centralized way. The
> protection is against an end user of the application, who should not be
> able to see the personal data of someone else. Or themselves, even. As long
> as the application end user doesn't have access to run arbitrary CQL, then
> these forms of masking prevent accidental unauthorized use/leaking of
> personal data.
>
> henrik
>
>
>
> On Wed, Aug 24, 2022 at 10:40 AM Benedict  wrote:
>
>> Is it typical for a masking feature to make no effort to prevent
>> unmasking? I’m just struggling to see the value of this without such
>> mechanisms. Otherwise it’s just a default formatter, and we should consider
>> renaming the feature IMO
>>
>> On 23 Aug 2022, at 21:27, Andrés de la Peña  wrote:
>>
>> 
>> As mentioned in the CEP document, dynamic data masking doesn't try to
>> prevent malicious users with SELECT permissions from indirectly guessing the
>> real value of the masked data. This can easily be done by just trying
>> values on the WHERE clause of SELECT queries. DDM would not be a
>> replacement for proper column-level permissions.
>>
>> The data served by the database is usually consumed by applications that
>> present this data to end users. These end users are not necessarily the
>> users directly connecting to the database. With DDM, it would be easy for
>> applications to mask sensitive data that is going to be consumed by the end
>> users. However, the users directly connecting to the database should be
>> trusted, provided that they have the right SELECT permissions.
>>
>> In other words, DDM doesn't directly protect the data, but it eases the
>> production of protected data.
>>
>> That said, we could later go one step further and add a way to prevent
>> untrusted users from inferring the masked data. That could be done by adding a
>> new permission required to use certain columns on WHERE clauses, different
>> to the current SELECT permission. That would play especially well with
>> column-level permissions, which is something that we still have pending.
>>
>> On Tue, 23 Aug 2022 at 19:13, Aaron Ploetz  wrote:
>>
>>> Applying this should prevent querying on a field, else you could leak
>>>> its contents, surely?
>>>>
>>>
>>> In theory, yes.  Although I could see folks doing something like this:
>>>
>>> SELECT COUNT(*) FROM patients
>>> WHERE year_of_birth = 2002
>>> AND date_of_birth >= '2002-04-01'
>>> AND date_of_birth < '2002-11-01';
>>>
>>> In this case, the rows containing the masked key column(s) could be
>>> filtered on without revealing the actual data.  But again, that's probably
>>> better for a "phase 2" of the implementation.
>>>
>>> Agreed on not being a queryable field. That would also preclude
>>>> secondary indexing, right?
>>>
>>>
>>> Yes, that's my thought as well.
>>>
>>> On Tue, Aug 23, 2022 at 12:42 PM Derek Chen-Becker <
>>> de...@chen-becker.org> wrote:
>>>
>>>> Agreed on not being a queryable field. That would also preclude
>>>> secondary indexing, right?
>>>>
>>>> On Tue, Aug 23, 2022 at 11:20 AM Benedict  wrote:
>>>>
>>>>> Applying this should prevent querying on a field, else you could leak
>>>>> its contents, surely? This pretty much prohibits using it in a clustering
>>>>> key, and a partition key with the ordered partitioner - but probably also 
>>>>> a
>>>>> hashed partitioner since we do not use a cryptographic hash and the hash
>>>>> function is well defined.
>>>>>
>>>>> We probably also need to ensure that any ALLOW FILTERING queries on
>>>>> such a field are disabled.
>>>>>
>>>>> Plausibly the data could be cryptographical

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-24 Thread Andrés de la Peña
Where does MySQL suggest that? As far as I can tell MySQL only offers a set of
functions for masking. I can't see a way to force users or tables to use
those functions, and it is up to the users to use those functions or not. I'm
reading this documentation
<https://dev.mysql.com/doc/refman/8.0/en/data-masking.html>.

As for broadening the scope of the proposal to prevent malicious users from
inferring the masked data, I guess that the additional rule would simply be
that a user with READ but not UNMASK permissions cannot use masked columns
on WHERE or IF clauses. That would include both SELECT and UPDATE
statements. That would differentiate us from many popular databases out
there, where data masking usually is a simpler thing.
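
Under that rule, a query like the following would be rejected for a role
holding SELECT but not UNMASK (made-up table; the error text is hypothetical):

SELECT * FROM users WHERE phone = '555-0100' ALLOW FILTERING;
-- Unauthorized: masked column phone cannot be restricted without the
-- UNMASK permission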

On Wed, 24 Aug 2022 at 14:08, Benedict  wrote:

> I can’t tell for sure, but the documentation on Postgres’ feature suggests
> to me that it does apply the masking to all possible uses of the data,
> including joining and querying.
>
> Snowflake’s documentation explicitly says that it does.
>
> MySQL’s documentation suggests that it does this.
>
> Oracle, AWS and MS SQL do not.
>
> My inclination would be to - at least by default - forbid querying on
> columns that are masked, unless the mask permits it.
>
>
> On 24 Aug 2022, at 11:06, Andrés de la Peña  wrote:
>
> 
Here are the names of the feature on some databases out there, errors and
omissions excepted:
>
>- Microsoft SQL Server / Azure SQL: Dynamic data masking
>- MySQL: Enterprise data masking and de-identification
>- PostgreSQL: Dynamic masking
>- MongoDB: Data masking
>- IBM Db2: Masks
>- Oracle: Redaction
>- MariaDB/MaxScale: Data masking
>- Snowflake: Dynamic data masking
>
>
> On Wed, 24 Aug 2022 at 10:40, Benedict  wrote:
>
>> Right, but we get to decide how we offer such features and what we call
>> them. I can’t imagine a good reason to call this a masking feature,
>> especially one that applies differentially to certain users, when it is
>> trivial to unmask.
>>
>> I’m ok offering a feature called “default formatter” or something that
>> applies some UDF to a field before returning to the client, and if users
>> wish to “mask” their data in this way that’s fine. But calling it a data
>> mask when it is trivial to circumvent is IMO dangerous, and I’d at least
>> want to see evidence that all other equivalent features in the industry are
>> similarly poorly named and offer similarly poor protection.
>>
>> On 24 Aug 2022, at 09:50, Benjamin Lerer  wrote:
>>
>> 
>>
>>> The PCI DSS Standard v4_0
>>> <https://docs-prv.pcisecuritystandards.org/PCI%20DSS/Standard/PCI-DSS-v4_0.pdf>
>>>  requires
>>> that credit card numbers stored on the system must be "rendered
>>> unreadable", thus this proposal is _NOT_ a good way to protect credit card
>>> numbers.
>>
>>
>> My point was simply about the fact that Dynamic Data Masking like any
>> other feature made sense for some scenario but not for others. I apologise
>> if my example was a bad one.
>>
>> Le mer. 24 août 2022 à 10:36, Claude Warren, Jr via dev <
>> dev@cassandra.apache.org> a écrit :
>>
>>> This change appears to be looking at two aspects:
>>>
>>>1. Add metadata to columns
>>>2. Add functionality based on the metadata.
>>>
>>> If the system had generic user-defined metadata and the ability to
>>> define filter functions at the point where data are being returned to the
>>> client, it would be possible for users to implement this filter, or any other
>>> filter on the data.
>>>
>>> The concept of user-defined metadata and filters could be applied to
>>> other parts of the system as well.  For example, if the metadata were
>>> accessible from UDFs the metadata could be used in low level filters to
>>> remove rows from queries before they were returned.
>>>
>>>
>>>
>>>
>>> On Wed, Aug 24, 2022 at 9:29 AM Claude Warren, Jr <
>>> claude.war...@aiven.io> wrote:
>>>
>>>> The PCI DSS Standard v4_0
>>>> <https://docs-prv.pcisecuritystandards.org/PCI%20DSS/Standard/PCI-DSS-v4_0.pdf>
>>>>  requires
>>>> that credit card numbers stored on the system must be "rendered
>>>> unreadable", thus this proposal is _NOT_ a good way to protect credit card
>>>> numbers.  In fact, for any critically sensitive data this is not an
>>>> appropriate solution.  However, there seems to be agreement that it is
>>>> appropriate for obfuscating some data in some queries by some users.

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-24 Thread Andrés de la Peña
Here are the names of the feature on some databases out there, errors and
omissions excepted:

   - Microsoft SQL Server / Azure SQL: Dynamic data masking
   - MySQL: Enterprise data masking and de-identification
   - PostgreSQL: Dynamic masking
   - MongoDB: Data masking
   - IBM Db2: Masks
   - Oracle: Redaction
   - MariaDB/MaxScale: Data masking
   - Snowflake: Dynamic data masking


On Wed, 24 Aug 2022 at 10:40, Benedict  wrote:

> Right, but we get to decide how we offer such features and what we call
> them. I can’t imagine a good reason to call this a masking feature,
> especially one that applies differentially to certain users, when it is
> trivial to unmask.
>
> I’m ok offering a feature called “default formatter” or something that
> applies some UDF to a field before returning to the client, and if users
> wish to “mask” their data in this way that’s fine. But calling it a data
> mask when it is trivial to circumvent is IMO dangerous, and I’d at least
> want to see evidence that all other equivalent features in the industry are
> similarly poorly named and offer similarly poor protection.
>
> On 24 Aug 2022, at 09:50, Benjamin Lerer  wrote:
>
> 
>
>> The PCI DSS Standard v4_0
>> <https://docs-prv.pcisecuritystandards.org/PCI%20DSS/Standard/PCI-DSS-v4_0.pdf>
>>  requires
>> that credit card numbers stored on the system must be "rendered
>> unreadable", thus this proposal is _NOT_ a good way to protect credit card
>> numbers.
>
>
> My point was simply about the fact that Dynamic Data Masking like any
> other feature made sense for some scenario but not for others. I apologise
> if my example was a bad one.
>
> Le mer. 24 août 2022 à 10:36, Claude Warren, Jr via dev <
> dev@cassandra.apache.org> a écrit :
>
>> This change appears to be looking at two aspects:
>>
>>1. Add metadata to columns
>>2. Add functionality based on the metadata.
>>
>> If the system had generic user-defined metadata and the ability to
>> define filter functions at the point where data are being returned to the
>> client, it would be possible for users to implement this filter, or any other
>> filter on the data.
>>
>> The concept of user-defined metadata and filters could be applied to
>> other parts of the system as well.  For example, if the metadata were
>> accessible from UDFs the metadata could be used in low level filters to
>> remove rows from queries before they were returned.
>>
>>
>>
>>
>> On Wed, Aug 24, 2022 at 9:29 AM Claude Warren, Jr 
>> wrote:
>>
>>> The PCI DSS Standard v4_0
>>> <https://docs-prv.pcisecuritystandards.org/PCI%20DSS/Standard/PCI-DSS-v4_0.pdf>
>>>  requires
>>> that credit card numbers stored on the system must be "rendered
>>> unreadable", thus this proposal is _NOT_ a good way to protect credit card
>>> numbers.  In fact, for any critically sensitive data this is not an
>>> appropriate solution.  However, there seems to be agreement that it is
>>> appropriate for obfuscating some data in some queries by some users.
>>>
>>>
>>>
>>> On Wed, Aug 24, 2022 at 9:02 AM Benjamin Lerer 
>>> wrote:
>>>
>>>> Is it typical for a masking feature to make no effort to prevent
>>>>> unmasking? I’m just struggling to see the value of this without such
>>>>> mechanisms. Otherwise it’s just a default formatter, and we should 
>>>>> consider
>>>>> renaming the feature IMO
>>>>
>>>>
>>>> The security that Dynamic Data Masking is bringing is related to how
>>>> you make use of the feature. It is somehow the same with passwords. If you
>>>> use a weak password it does not bring much security.
>>>> Masking a field like people's gender is useless because you will be
>>>> able to determine its value in one query. On the other hand masking credit
>>>> card numbers makes a lot of sense as it will complicate the life of the
>>>> person trying to have access to it and the queries needed to reach the
>>>> information will leave some clear traces in the audit log.
>>>>
>>>> Dynamic Data Masking is not a magic bullet. Nevertheless, it is a good
>>>> way to protect sensitive data like credit card numbers or passwords.
>>>>
>>>>
>>>> Le mer. 24 août 2022 à 09:40, Benedict  a écrit :
>>>>
>>>>> Is it typical for a masking feature to make no effort to prevent
>>>>> unmasking? I’m just struggling to see the value of this without such
>>>>> mechanisms.

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-24 Thread Andrés de la Peña
>>>>> Is it typical for a masking feature to make no effort to prevent
>>>>> unmasking? I’m just struggling to see the value of this without such
>>>>> mechanisms. Otherwise it’s just a default formatter, and we should 
>>>>> consider
>>>>> renaming the feature IMO
>>>>
>>>>
>>>> The security that Dynamic Data Masking is bringing is related to how
>>>> you make use of the feature. It is somehow the same with passwords. If you
>>>> use a weak password it does not bring much security.
>>>> Masking a field like people's gender is useless because you will be
>>>> able to determine its value in one query. On the other hand masking credit
>>>> card numbers makes a lot of sense as it will complicate the life of the
>>>> person trying to have access to it and the queries needed to reach the
>>>> information will leave some clear traces in the audit log.
>>>>
>>>> Dynamic Data Masking is not a magic bullet. Nevertheless, it is a good
>>>> way to protect sensitive data like credit card numbers or passwords.
>>>>
>>>>
>>>> Le mer. 24 août 2022 à 09:40, Benedict  a écrit :
>>>>
>>>>> Is it typical for a masking feature to make no effort to prevent
>>>>> unmasking? I’m just struggling to see the value of this without such
>>>>> mechanisms. Otherwise it’s just a default formatter, and we should 
>>>>> consider
>>>>> renaming the feature IMO
>>>>>
>>>>> On 23 Aug 2022, at 21:27, Andrés de la Peña 
>>>>> wrote:
>>>>>
>>>>> 
>>>>> As mentioned in the CEP document, dynamic data masking doesn't try to
>>>>> prevent malicious users with SELECT permissions from indirectly guessing the
>>>>> real value of the masked data. This can easily be done by just trying
>>>>> values on the WHERE clause of SELECT queries. DDM would not be a
>>>>> replacement for proper column-level permissions.
>>>>>
>>>>> The data served by the database is usually consumed by applications
>>>>> that present this data to end users. These end users are not necessarily
>>>>> the users directly connecting to the database. With DDM, it would be easy
>>>>> for applications to mask sensitive data that is going to be consumed by 
>>>>> the
>>>>> end users. However, the users directly connecting to the database should 
>>>>> be
>>>>> trusted, provided that they have the right SELECT permissions.
>>>>>
>>>>> In other words, DDM doesn't directly protect the data, but it eases
>>>>> the production of protected data.
>>>>>
>>>>> That said, we could later go one step further and add a way to prevent
>>>>> untrusted users from inferring the masked data. That could be done by adding a
>>>>> new permission required to use certain columns on WHERE clauses, different
>>>>> to the current SELECT permission. That would play especially well with
>>>>> column-level permissions, which is something that we still have pending.
>>>>>
>>>>> On Tue, 23 Aug 2022 at 19:13, Aaron Ploetz 
>>>>> wrote:
>>>>>
>>>>>> Applying this should prevent querying on a field, else you could leak
>>>>>>> its contents, surely?
>>>>>>>
>>>>>>
>>>>>> In theory, yes.  Although I could see folks doing something like this:
>>>>>>
>>>>>> SELECT COUNT(*) FROM patients
>>>>>> WHERE year_of_birth = 2002
>>>>>> AND date_of_birth >= '2002-04-01'
>>>>>> AND date_of_birth < '2002-11-01';
>>>>>>
>>>>>> In this case, the rows containing the masked key column(s) could be
>>>>>> filtered on without revealing the actual data.  But again, that's 
>>>>>> probably
>>>>>> better for a "phase 2" of the implementation.
>>>>>>
>>>>>> Agreed on not being a queryable field. That would also preclude
>>>>>>> secondary indexing, right?
>>>>>>
>>>>>>
>>>>>> Yes, that's my thought as well.
>>>>>>
>>>>>> On Tue, Aug 23, 2022 at 12:42 PM Derek Chen-Becker <
>>>>>> de...@chen-becker.org> wrote:
>>>>>>
>>>>>>> Agreed on not being a queryable field. That would also preclude
>>>>>>> secondary indexing, right?

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-23 Thread Andrés de la Peña
As mentioned in the CEP document, dynamic data masking doesn't try to
prevent malicious users with SELECT permissions from indirectly guessing the
real value of the masked data. This can easily be done by just trying
values on the WHERE clause of SELECT queries. DDM would not be a
replacement for proper column-level permissions.

The data served by the database is usually consumed by applications that
present this data to end users. These end users are not necessarily the
users directly connecting to the database. With DDM, it would be easy for
applications to mask sensitive data that is going to be consumed by the end
users. However, the users directly connecting to the database should be
trusted, provided that they have the right SELECT permissions.

In other words, DDM doesn't directly protect the data, but it eases the
production of protected data.

That said, we could later go one step further and add a way to prevent
untrusted users from inferring the masked data. That could be done by adding a
new permission required to use certain columns on WHERE clauses, different
to the current SELECT permission. That would play especially well with
column-level permissions, which is something that we still have pending.
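
For example, with a made-up patients table whose ssn column is masked, a user
with plain SELECT permissions could recover values by repeated guessing, even
though every result set only shows masked data:

SELECT count(*) FROM patients WHERE ssn = '123-45-6789' ALLOW FILTERING;
-- count = 0  ->  wrong guess; count > 0  ->  ssn recovered for some row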

On Tue, 23 Aug 2022 at 19:13, Aaron Ploetz  wrote:

> Applying this should prevent querying on a field, else you could leak its
>> contents, surely?
>>
>
> In theory, yes.  Although I could see folks doing something like this:
>
> SELECT COUNT(*) FROM patients
> WHERE year_of_birth = 2002
> AND date_of_birth >= '2002-04-01'
> AND date_of_birth < '2002-11-01';
>
> In this case, the rows containing the masked key column(s) could be
> filtered on without revealing the actual data.  But again, that's probably
> better for a "phase 2" of the implementation.
>
> Agreed on not being a queryable field. That would also preclude secondary
>> indexing, right?
>
>
> Yes, that's my thought as well.
>
> On Tue, Aug 23, 2022 at 12:42 PM Derek Chen-Becker 
> wrote:
>
>> Agreed on not being a queryable field. That would also preclude secondary
>> indexing, right?
>>
>> On Tue, Aug 23, 2022 at 11:20 AM Benedict  wrote:
>>
>>> Applying this should prevent querying on a field, else you could leak
>>> its contents, surely? This pretty much prohibits using it in a clustering
>>> key, and a partition key with the ordered partitioner - but probably also a
>>> hashed partitioner since we do not use a cryptographic hash and the hash
>>> function is well defined.
>>>
>>> We probably also need to ensure that any ALLOW FILTERING queries on such
>>> a field are disabled.
>>>
>>> Plausibly the data could be cryptographically jumbled before using it in
>>> a primary key component (or permitting filtering), but it is probably
>>> easier and safer to exclude for now…
>>>
>>> On 23 Aug 2022, at 18:13, Aaron Ploetz  wrote:
>>>
>>> 
>>> Some thoughts on this one:
>>>
>>> In a prior job, we'd give app teams access to a single keyspace, and two
>>> roles: a read-write role and a read-only role.  In some cases, a
>>> "privileged" application role was also requested.  Depending on the
>>> requirements, I could see the UNMASK permission being applied to the RW or
>>> privileged roles.  But if there's a problem on the table and the operators
>>> go in to investigate, they will likely use a SUPERUSER account, and they'll
>>> see that data.
>>>
>>> How hard would it be for SUPERUSERs to *not* automatically get the
>>> UNMASK permission?
>>>
>>> I'll also echo the concerns around masking primary key components.  It's
>>> highly likely that certain personal data properties would be used as a
>>> partition or clustering key (ex: range query for people born within a
>>> certain timeframe).  In addition to the "breaks existing" concern, I'm
>>> curious about the challenges around getting that to work with the current
>>> primary key implementation.
>>>
>>> Does this first implementation only apply to payload (non-key) columns?
>>> The examples in the CEP currently do not show primary key components being
>>> masked.
>>>
>>> Thanks,
>>>
>>> Aaron
>>>
>>>
>>> On Tue, Aug 23, 2022 at 6:44 AM Henrik Ingo 
>>> wrote:
>>>
>>>> On Tue, Aug 23, 2022 at 1:10 PM Andrés de la Peña 
>>>> wrote:
>>>>
>>>>> One thought: The way the CEP is currently written, it is only possible
>>>>>> to mask a column one way. You can only d

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-23 Thread Andrés de la Peña
>
> One thought: The way the CEP is currently written, it is only possible to
> mask a column one way. You can only define one masking function for a
> column, and since you use the original column name, you could only return
> one version of it in the result set, even if you had a way to define
> several functions.
>

Right, it's a single type of mapping per column, declared on CREATE/ALTER
TABLE statements. Also, users can manually specify their own masking
function in SELECT statements if they have permission to see the clear
data.

For those cases where the data is automatically masked for an unprivileged
user, I don't see the use of including different types of masking for the
same column in the same result set. Instead, we might be interested in
having different types of masking associated with different roles. We could
do so with dedicated CREATE/DROP/LIST MASK statements, instead of using the
CREATE/ALTER/DESCRIBE TABLE statements. That CREATE MASK statement would
associate a masking function to a column and role. However, I'm not sure we
need that type of granularity instead of the simplicity of attaching the
masking to the column declaration. wdyt?
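
For illustration, such hypothetical statements could look something like
this (this syntax is not part of the CEP; it only sketches the per-role
granularity discussed above):

    CREATE MASK ON ks.users (email) USING mask_hash FOR ROLE analysts;
    DROP MASK ON ks.users (email) FOR ROLE analysts;
    LIST MASKS OF analysts;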


On Mon, 22 Aug 2022 at 19:31, Henrik Ingo  wrote:

> One thought: The way the CEP is currently written, it is only possible to
> mask a column one way. You can only define one masking function for a
> column, and since you use the original column name, you could only return
> one version of it in the result set, even if you had a way to define
> several functions.
>
> I'm not proposing this should change, just calling it out.
>
> henrik
>
> On Fri, Aug 19, 2022 at 2:50 PM Andrés de la Peña 
> wrote:
>
>> Hi everyone,
>>
>> I'd like to start a discussion about this proposal for dynamic data
>> masking:
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-20%3A+Dynamic+Data+Masking
>>
>> Dynamic data masking allows obscuring sensitive information without
>> changing the stored data. It would be based on a set of native CQL
>> functions providing different types of masking, such as replacing the
>> column value by "". These functions could be used as regular functions
>> or attached to table columns with CREATE/ALTER table. There would be a new
>> UNMASK permission, so only the users with this permission would be able to
>> see the unmasked column values. It would be possible to customize masking
>> by using UDFs as masking functions.
>>
>> Thanks,
>>
>
>
> --
>
> Henrik Ingo
>
> +358 40 569 7354 <358405697354>
>
>


Re: [VOTE] Release Apache Cassandra 4.0.6

2022-08-23 Thread Andrés de la Peña
+1 (nb)

On Tue, 23 Aug 2022 at 06:14, Tommy Stendahl via dev <
dev@cassandra.apache.org> wrote:

> +1 nb
>
> -Original Message-
> *From*: Brandon Williams  >
> *Reply-To*: dev@cassandra.apache.org
> *To*: dev 
> >
> *Subject*: Re: [VOTE] Release Apache Cassandra 4.0.6
> *Date*: Mon, 22 Aug 2022 17:47:59 -0500
>
> +1
>
>
> On Sun, Aug 21, 2022 at 7:44 AM Mick Semb Wever <
>
> m...@apache.org
>
> > wrote:
>
>
>
> Proposing the test build of Cassandra 4.0.6 for release.
>
>
> sha1: eb2375718483f4c360810127ae457f2a26ccce67
>
> Git:
>
> https://gitbox.apache.org/repos/asf?p=cassandra.git;a=shortlog;h=refs/tags/4.0.6-tentative
>
>
> Maven Artifacts:
>
> https://repository.apache.org/content/repositories/orgapachecassandra-/org/apache/cassandra/cassandra-all/4.0.6/
>
>
>
> The Source and Build Artifacts, and the Debian and RPM packages and 
> repositories, are available here:
>
> https://dist.apache.org/repos/dist/dev/cassandra/4.0.6/
>
>
>
> The vote will be open for 72 hours (longer if needed). Everyone who has 
> tested the build is invited to vote. Votes by PMC members are considered 
> binding. A vote passes if there are at least three binding +1s and no -1's.
>
>
> [1]: CHANGES.txt:
>
> https://gitbox.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=CHANGES.txt;hb=refs/tags/4.0.6-tentative
>
>
> [2]: NEWS.txt:
>
> https://gitbox.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=NEWS.txt;hb=refs/tags/4.0.6-tentative
>
>
>


Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-22 Thread Andrés de la Peña
>
> Isn't there an assumption here that encryption can not be used?  Would we
> not be better served to build in an encryption strategy that keeps the data
> encrypted until the user shows permissions to decrypt, like the unmask
> property?  An encryption strategy that can work within the Cassandra
> internals?
> I think the issue is that there are some data fields that should not be
> discoverable by unauthorized users/systems, and I think this solution masks
> that issue.  I fear that this capability will be seized upon by pointy
> haired managers as a cheaper alternative to encryption, regardless of the
> warnings otherwise, and that as a whole will harm the Cassandra ecosystem.


Data encryption, access permissions and data masking are different
solutions to different problems. We don't have to choose between them, and
indeed we should aim to support the three of them at some point. None of
these features impedes the implementation of the others. Actually, it is quite
common for popular databases to provide all of them.

Data encryption should protect the data files (sstables, commitlog, etc.)
from anyone who has direct access to them. It offers
protection outside the interfaces of the database. Of course there is also
encryption of communications.

Permissions should completely prevent the access of unauthorized users to
the data within the database interface. Currently we have permissions on
CQL at the keyspace and table level, but we are missing column-level
permissions.

Data masking obfuscates all or part of the data without totally forbidding
access to it. The key here is that the masked data can still contain parts
of the original information, or be representative enough. For example,
masking can obfuscate all the digits of a credit card number except the
last four, so the clear digits can be used for some degree of
identification. As another example, a masking function returning the hash
would allow to join the masked data of different sources without exposing
it.

An example of how data masking and permissions can be used together could
be a company storing the social security numbers (SSN) of its customers.
The accounting team might need full access to the stored SSNs. Employees
attending phone calls might need to ask for the last two digits of SSN for
identification purposes, so they would need masked access. The rest of the
organization would need no access at all.

This CEP focuses exclusively on data masking, but there is no reason not to
start parallel work on other related-but-different features like
column-level permissions or on-disk data encryption.
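
As a sketch of that SSN example with the syntax drafted in the CEP (the
function and permission names could still change; mask_inner is assumed
here to leave only the trailing characters clear):

    CREATE TABLE ks.customers (
        id int PRIMARY KEY,
        ssn text MASKED WITH mask_inner(0, 2)  -- only last two digits clear
    );
    GRANT SELECT ON TABLE ks.customers TO phone_support;  -- sees masked SSNs
    GRANT SELECT ON TABLE ks.customers TO accounting;
    GRANT UNMASK ON TABLE ks.customers TO accounting;     -- sees clear SSNs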




On Mon, 22 Aug 2022 at 07:05, Claude Warren, Jr via dev <
dev@cassandra.apache.org> wrote:

> I am more interested in the motivation where it is stated:
>
> Many users have the need of masking sensitive data, such as contact info,
>> age, gender, credit card numbers, etc. Dynamic data masking (DDM) allows to
>> obscure sensitive information while still allowing access to the masked
>> columns, and without changing the stored data.
>
>
> There is an unspoken assumption that the stored data format can not be
> changed.  It feels like this solution is starting from a false premise.
> Throughout the document there are guard statements about how this does not
> replace encryption.  Isn't there an assumption here that encryption can not
> be used?  Would we not be better served to build in an encryption strategy
> that keeps the data encrypted until the user shows permissions to decrypt,
> like the unmask property?  An encryption strategy that can work within the
> Cassandra internals?
>
> I think the issue is that there are some data fields that should not be
> discoverable by unauthorized users/systems, and I think this solution masks
> that issue.  I fear that this capability will be seized upon by pointy
> haired managers as a cheaper alternative to encryption, regardless of the
> warnings otherwise, and that as a whole will harm the Cassandra ecosystem.
>
> Yes, encryption is more difficult to implement and will take longer, but
> this feels like a sticking plaster that distracts from that underlying
> issue.
>
> my 0.02
>
> On Mon, Aug 22, 2022 at 12:30 AM Andrés de la Peña 
> wrote:
>
>> > If the column names are the same for masked and unmasked data, it would
>>> impact existing applications. I am curious what the transition plan look
>>> like for applications that expect unmasked data?
>>
>> For example, let’s say you store SSNs and Birth dates. Upon enabling this
>>> feature, let’s say the app user is not given the UNMASK permission. Now the
>>> app is receiving masked values for these columns. This is fine for most
>>> read only applications. However, a lot of times these columns may be used
>>>

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-22 Thread Andrés de la Peña
>
> Maybe a small improvement is the redacted value could be of the form
> `XXX1...1000` meaning XXX followed by a rand number from 1 to 1000: XXX54,
> XXX998, XXX456,... Some randomness would prevent some apps flattening all
> rows to a single XXX'ed one, giving a more realistic redacted data
> distribution/structure.


I'm not sure I understand why that would be useful. Why would random
suffixes give us a more realistic redacted data distribution? If we want to
avoid always returning the same value, we could use a function that just
returns a random value, without the XXX part, so we can use any data
type. Microsoft's SQL Server and Azure SQL have this function among their
masking functions.

Nevertheless, it would be quite easy to keep adding new masking functions
when we need them.

On Mon, 22 Aug 2022 at 06:52, Berenguer Blasi 
wrote:

> Maybe a small improvement is the redacted value could be of the form
> `XXX1...1000` meaning XXX followed by a rand number from 1 to 1000: XXX54,
> XXX998, XXX456,... Some randomness would prevent some apps flattening all
> rows to a single XXX'ed one, giving a more realistic redacted data
> distribution/structure.
>
> I am not sure either about it's value, as that would still break any key
> or other cross-referencing.
>
> My 2cts.
> On 22/8/22 1:30, Andrés de la Peña wrote:
>
> > If the column names are the same for masked and unmasked data, it would
>> impact existing applications. I am curious what the transition plan look
>> like for applications that expect unmasked data?
>
> For example, let’s say you store SSNs and Birth dates. Upon enabling this
>> feature, let’s say the app user is not given the UNMASK permission. Now the
>> app is receiving masked values for these columns. This is fine for most
>> read only applications. However, a lot of times these columns may be used
>> as primary keys or part of primary keys in other tables. This would break
>> existing applications.
>> How would this work in mixed mode when new nodes in the cluster are
>> masking data and others aren’t? How would it impact the driver?
>> How would the application learn that the column values are masked? This
>> is important in case a user has UNMASK permission and then later taken
>> away. Again this would break a lot of applications.
>
>
> Changing the masking of a column is a schema change, and as such it can be
> risky for existing applications. However, unlike deleting a column
> or revoking a SELECT permission, suddenly activating masking might go
> unnoticed by existing applications.
>
> Applications developed after the introduction of this feature can check
> the table schema to know if a column is masked or not. We can even add a
> specific system view to ease this, if we think it's worth it. However,
> administrators should not activate masking when there could be applications
> that are not aware of the feature. We should be clear about this in the
> documentation.
>
> This is the way data masking seems to work in the databases I've checked.
> I also thought that we could just change the name of the column when it's
> masked to something like "masked(column_name)", as discussed in the CEP
> document. This would make it impossible to miss that a column is masked.
> However, applications should be prepared to use different column names when
> reading result sets, depending on whether the data is masked for them or
> not. None of the databases mentioned in the "other databases" section of
> the CEP does this kind of column renaming, so it might be a kind of exotic
> behaviour. wdyt?
>
> On Fri, 19 Aug 2022 at 19:17, Andrés de la Peña 
> wrote:
>
>> > This type of feature is very useful, but it may be easier to analyze
>>> this proposal if it’s compared with other DDM implementations from other
>>> databases? Would it be reasonable to add a table to the proposal comparing
>>> syntax and output from eg Azure SQL vs Cassandra vs whatever ?
>>
>>
>> Good idea. I have added a section at the end of the document briefly
>> describing how some other databases deal with data masking, and with links
> to their documentation for the topic. I am not an expert in any of those
>> databases, so please take my comments there with a grain of salt.
>>
>> On Fri, 19 Aug 2022 at 17:30, Jeff Jirsa  wrote:
>>
>>> This type of feature is very useful, but it may be easier to analyze
>>> this proposal if it’s compared with other DDM implementations from other
>>> databases? Would it be reasonable to add a table to the proposal comparing
>>> syntax and output from eg Azure SQL vs Cassandra vs whatever ?
>>>
>

Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-21 Thread Andrés de la Peña
>
> > If the column names are the same for masked and unmasked data, it would
> impact existing applications. I am curious what the transition plan look
> like for applications that expect unmasked data?

For example, let’s say you store SSNs and Birth dates. Upon enabling this
> feature, let’s say the app user is not given the UNMASK permission. Now the
> app is receiving masked values for these columns. This is fine for most
> read only applications. However, a lot of times these columns may be used
> as primary keys or part of primary keys in other tables. This would break
> existing applications.
> How would this work in mixed mode when new nodes in the cluster are
> masking data and others aren’t? How would it impact the driver?
> How would the application learn that the column values are masked? This is
> important in case a user has UNMASK permission and then later taken away.
> Again this would break a lot of applications.


Changing the masking of a column is a schema change, and as such it can be
risky for existing applications. However, unlike deleting a column
or revoking a SELECT permission, suddenly activating masking might go
unnoticed by existing applications.

Applications developed after the introduction of this feature can check the
table schema to know if a column is masked or not. We can even add a
specific system view to ease this, if we think it's worth it. However,
administrators should not activate masking when there could be applications
that are not aware of the feature. We should be clear about this in the
documentation.

This is the way data masking seems to work in the databases I've checked. I
also thought that we could just change the name of the column when it's
masked to something like "masked(column_name)", as discussed in the CEP
document. This would make it impossible to miss that a column is masked.
However, applications should be prepared to use different column names when
reading result sets, depending on whether the data is masked for them or
not. None of the databases mentioned in the "other databases" section of
the CEP does this kind of column renaming, so it might be a kind of exotic
behaviour. wdyt?

On Fri, 19 Aug 2022 at 19:17, Andrés de la Peña 
wrote:

> > This type of feature is very useful, but it may be easier to analyze
>> this proposal if it’s compared with other DDM implementations from other
>> databases? Would it be reasonable to add a table to the proposal comparing
>> syntax and output from eg Azure SQL vs Cassandra vs whatever ?
>
>
> Good idea. I have added a section at the end of the document briefly
> describing how some other databases deal with data masking, and with links
> to their documentation for the topic. I am not an expert in any of those
> databases, so please take my comments there with a grain of salt.
>
> On Fri, 19 Aug 2022 at 17:30, Jeff Jirsa  wrote:
>
>> This type of feature is very useful, but it may be easier to analyze this
>> proposal if it’s compared with other DDM implementations from other
>> databases? Would it be reasonable to add a table to the proposal comparing
>> syntax and output from eg Azure SQL vs Cassandra vs whatever ?
>>
>>
>> On Aug 19, 2022, at 4:50 AM, Andrés de la Peña 
>> wrote:
>>
>> 
>> Hi everyone,
>>
>> I'd like to start a discussion about this proposal for dynamic data
>> masking:
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-20%3A+Dynamic+Data+Masking
>>
>> Dynamic data masking allows obscuring sensitive information without
>> changing the stored data. It would be based on a set of native CQL
>> functions providing different types of masking, such as replacing the
>> column value by "". These functions could be used as regular functions
>> or attached to table columns with CREATE/ALTER table. There would be a new
>> UNMASK permission, so only the users with this permission would be able to
>> see the unmasked column values. It would be possible to customize masking
>> by using UDFs as masking functions.
>>
>> Thanks,
>>
>>


Re: [DISCUSS] CEP-20: Dynamic Data Masking

2022-08-19 Thread Andrés de la Peña
>
> > This type of feature is very useful, but it may be easier to analyze
> this proposal if it’s compared with other DDM implementations from other
> databases? Would it be reasonable to add a table to the proposal comparing
> syntax and output from eg Azure SQL vs Cassandra vs whatever ?


Good idea. I have added a section at the end of the document briefly
describing how some other databases deal with data masking, and with links
to their documentation for the topic. I am not an expert in any of those
databases, so please take my comments there with a grain of salt.

On Fri, 19 Aug 2022 at 17:30, Jeff Jirsa  wrote:

> This type of feature is very useful, but it may be easier to analyze this
> proposal if it’s compared with other DDM implementations from other
> databases? Would it be reasonable to add a table to the proposal comparing
> syntax and output from eg Azure SQL vs Cassandra vs whatever ?
>
>
> On Aug 19, 2022, at 4:50 AM, Andrés de la Peña 
> wrote:
>
> 
> Hi everyone,
>
> I'd like to start a discussion about this proposal for dynamic data
> masking:
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-20%3A+Dynamic+Data+Masking
>
> Dynamic data masking allows obscuring sensitive information without
> changing the stored data. It would be based on a set of native CQL
> functions providing different types of masking, such as replacing the
> column value by "". These functions could be used as regular functions
> or attached to table columns with CREATE/ALTER table. There would be a new
> UNMASK permission, so only the users with this permission would be able to
> see the unmasked column values. It would be possible to customize masking
> by using UDFs as masking functions.
>
> Thanks,
>
>


[DISCUSS] CEP-20: Dynamic Data Masking

2022-08-19 Thread Andrés de la Peña
Hi everyone,

I'd like to start a discussion about this proposal for dynamic data
masking:
https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-20%3A+Dynamic+Data+Masking

Dynamic data masking allows obscuring sensitive information without
changing the stored data. It would be based on a set of native CQL
functions providing different types of masking, such as replacing the
column value by "". These functions could be used as regular functions
or attached to table columns with CREATE/ALTER table. There would be a new
UNMASK permission, so only the users with this permission would be able to
see the unmasked column values. It would be possible to customize masking
by using UDFs as masking functions.
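
To make the idea concrete, a quick sketch with illustrative names (the
exact functions and syntax are detailed in the CEP and may still evolve):

    -- Attach a masking function to a column:
    ALTER TABLE ks.users ALTER email MASKED WITH mask_replace('redacted');
    -- Or call a masking function like any regular function:
    SELECT mask_hash(email) FROM ks.users;
    -- Only roles granted the new permission would see clear values:
    GRANT UNMASK ON TABLE ks.users TO trusted_role;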

Thanks,


Re: Cassandra project status update 2022-08-03

2022-08-11 Thread Andrés de la Peña
>
> > I think if we want to do this, it should be extremely easy - by which I
> mean automatic, really. This shouldn’t be too tricky I think? We just need
> to produce a diff of new test classes and methods within existing classes.


Having a CircleCI job that automatically runs all new/modified tests would
be a great way to prevent most of the new flakies. We would still miss some
cases, like unmodified tests that turn flaky after changing the tested
code, but I'd say that's not as common.

> I can probably help out by putting together something to output @Test
> annotated methods within a source tree, if others are able to turn this
> into a part of the CircleCI pre-commit task (i.e. to pick the common
> ancestor with trunk, 4.1 etc, and run this task for each of the outputs)


I think we would need a bash/sh shell script taking a diff file and test
directory, and returning the file path and qualified class name of every
modified test class. I'd say we don't need the method names for Java tests
because quite often we see flaky tests that only fail when running their
entire class, so it's probably better to repeatedly run entire test classes
instead of particular methods.

We would also need a similar script for Python dtests. We would probably
want it to provide the full path of the modified tests (as
in cqlsh_tests/test_cqlsh.py::TestCqlshSmoke::test_create_index) because
those tests can be quite resource-intensive.

I think once we have those scripts we could plug their output to the
CircleCI commands for repeating tests.

Putting all this together seems relatively involved, so it can take us some
time to get it ready. In the meantime, I think it's a good practice to just
manually include any new/modified tests in the CircleCI config. Doing so
only requires passing a few additional options to the script that generates
the config, which doesn't seem to require too much effort.
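
As a rough sketch of the first script (purely illustrative; a real version
would also need to handle renames and deleted files):

    #!/bin/sh
    # Usage: modified-test-classes.sh changes.diff test/unit
    # Prints the qualified class name of every test class touched by a diff.
    diff_file="$1"
    test_dir="$2"
    grep '^+++ b/' "$diff_file" |
      sed 's|^+++ b/||' |
      grep "^$test_dir/.*\.java$" |
      sed "s|^$test_dir/||; s|\.java$||; s|/|.|g"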

On Wed, 10 Aug 2022 at 19:47, Brandon Williams  wrote:

> > Side note, Butler is reporting CASSANDRA-17348 as open (it's resolved as
> a duplicate).
>
> This is fixed.
>


Re: Inclusive/exclusive endpoints when compacting token ranges

2022-07-26 Thread Andrés de la Peña
I think that's right, using a closed range makes sense to consume the data
provided by "sstablemetadata", which also provides closed ranges.
Especially because with half-open ranges we couldn't compact an sstable with
a single big partition, of which we might only know the token but not the
partition key.

So probably we should just add documentation about both -st and -et being
inclusive, and live with a different meaning of -st in repair and compact.

Also, the reason why this is so confusing in the test that started the
discussion is that those closed token ranges are internally represented as
"Range" objects, which are half-open by definition. So we should
document those methods, and maybe do some minor changes to avoid the use of
"Range" to silently represent closed token ranges.

On Tue, 26 Jul 2022 at 16:27, Jeremiah D Jordan 
wrote:

> Reading the responses here and taking a step back, I think the current
> behavior of nodetool compact is probably the correct behavior.  The main
> use case I can see for using nodetool compact is someone wants to take some
> sstable and compact it with all the overlapping sstables.  So you run
> “sstablemetadata” on the sstable and get the min and max tokens, and then
> you pass those in to nodetool compact.  In that case you do want the closed
> range.
>
> This is different from running repair where you get the tokens from the
> nodes/nodetool ring, and at that level token range ownership is half
> open when going from “token owned by node a to token owned by node b”.
>
> So my initial thought/gut reaction that it should work like repair is
> misleading, because you don’t get the tokens from the same place you get
> them when running repair.
>
> Making the command line options more explicit and documented does seem
> like it could be useful.
>
> -Jeremiah Jordan
>
> On Jul 26, 2022, at 9:16 AM, Derek Chen-Becker 
> wrote:
>
> +1 to new flags. A released, albeit undocumented, behavior is still a
> contract with the end user. Flags (and documentation) seem like the right
> path to address the situation.
>
> Cheers,
>
> Derek
>
> On Tue, Jul 26, 2022 at 7:28 AM Benedict Elliott Smith <
> bened...@apache.org> wrote:
>
>>
>> I think a change like this could be dangerous for a lot of existing
>> automation built atop nodetool.
>>
>> I’m not sure this change is worthwhile. I think it would be better to
>> introduce e.g. -ste and -ete for “start token exclusive” and “end token
>> exclusive” so that users can opt-in to whichever scheme they prefer for
>> their tooling, without breaking existing users.
>>
>> > On 26 Jul 2022, at 14:22, Brandon Williams  wrote:
>> >
>> > +1, I think that makes the most sense.
>> >
>> > Kind Regards,
>> > Brandon
>> >
>> > On Tue, Jul 26, 2022 at 8:19 AM J. D. Jordan 
>> wrote:
>> >>
>> >> I like the third option, especially if it makes it consistent with
>> repair, which has supported ranges longer and I would guess most people
>> would think the compact ranges work the same as the repair ranges.
>> >>
>> >> -Jeremiah Jordan
>> >>
>> >>> On Jul 26, 2022, at 6:49 AM, Andrés de la Peña 
>> wrote:
>> >>>
>> >>> 
>> >>> Hi all,
>> >>>
>> >>> CASSANDRA-17575 has detected that token ranges in nodetool compact
>> are interpreted as closed on both sides. For example, the command "nodetool
>> compact -st 10 -et 50" will compact the tokens in [10, 50]. This way of
>> interpreting token ranges is unusual since token ranges are usually
>> half-open, and I think that in the previous example one would expect that
>> the compacted tokens would be in (10, 50]. That's for example the way
>> nodetool repair works, and indeed the class org.apache.cassandra.dht.Range
>> is always half-open.
>> >>>
>> >>> It's worth mentioning that, differently from nodetool repair, the
>> help and doc for nodetool compact doesn't specify whether the supplied
>> start/end tokens are inclusive or exclusive.
>> >>>
>> >>> I think that ideally nodetool compact should interpret the provided
>> token ranges as half-open, to be consistent with how token ranges are
>> usually interpreted. However, this would change the way the tool has worked
>> until now. This change might be problematic for existing users relying on
>> the old behaviour. That would be especially severe for the case where the
>> begin and end token are the same, because interpreting [x, x] we would
>> compact a single toke

Inclusive/exclusive endpoints when compacting token ranges

2022-07-26 Thread Andrés de la Peña
Hi all,

CASSANDRA-17575 has detected that token ranges in nodetool compact are
interpreted as closed on both sides. For example, the command "nodetool
compact -st 10 -et 50" will compact the tokens in [10, 50]. This way of
interpreting token ranges is unusual since token ranges are usually
half-open, and I think that in the previous example one would expect that
the compacted tokens would be in (10, 50]. That's for example the way
nodetool repair works, and indeed the class org.apache.cassandra.dht.Range
is always half-open.

It's worth mentioning that, differently from nodetool repair, the help and
doc for nodetool compact doesn't specify whether the supplied start/end
tokens are inclusive or exclusive.

I think that ideally nodetool compact should interpret the provided token
ranges as half-open, to be consistent with how token ranges are usually
interpreted. However, this would change the way the tool has worked until
now. This change might be problematic for existing users relying on the old
behaviour. That would be especially severe for the case where the begin and
end token are the same, because interpreting [x, x] would compact a
single token, whereas I think that interpreting (x, x] would compact all
the tokens. As for compacting ranges including multiple tokens, I think the
change wouldn't be so bad, since probably the supplied token ranges come
from tools that are already presenting the ranges as half-open. Also, if we
are splitting the full ring into smaller ranges, half-open intervals would
still work and would save us some repetitions.

So my question is: Should we change the behaviour of nodetool compact to
interpret the token ranges as half-open, aligning it with the usual
interpretation of ranges? Or should we just document the current odd
behaviour to prevent compatibility issues?

A third option would be changing to half-open ranges and also forbidding
ranges where the begin and end token are the same, to prevent the
accidental compaction of the entire ring. Note that nodetool repair also
forbids this type of token range.

What do you think?


Re: [VOTE] Release Apache Cassandra 4.1-alpha1

2022-05-24 Thread Andrés de la Peña
+1 nb

On Tue, 24 May 2022 at 16:10, Benjamin Lerer  wrote:

> +1
>
> Le mar. 24 mai 2022 à 16:19, Josh McKenzie  a
> écrit :
>
>> +1
>>
>> On Tue, May 24, 2022, at 10:13 AM, Brandon Williams wrote:
>>
>> +1
>>
>> On Tue, May 24, 2022 at 3:39 AM Mick Semb Wever  wrote:
>> >
>> > Proposing the test build of Cassandra 4.1-alpha1 for release.
>> >
>> > sha1: 6f05be447073925a7f3620ddbbd572aa9fcd10ed
>> > Git:
>> https://gitbox.apache.org/repos/asf?p=cassandra.git;a=shortlog;h=refs/tags/4.1-alpha1-tentative
>> > Maven Artifacts:
>> >
>> https://repository.apache.org/content/repositories/orgapachecassandra-1273/org/apache/cassandra/cassandra-all/4.1-alpha1/
>> >
>> > The Source and Build Artifacts, and the Debian and RPM packages and
>> > repositories, are available here:
>> > https://dist.apache.org/repos/dist/dev/cassandra/4.1-alpha1/
>> >
>> > The vote will be open for 72 hours (longer if needed). Everyone who
>> > has tested the build is invited to vote. Votes by PMC members are
>> > considered binding. A vote passes if there are at least three binding
>> > +1s and no -1's.
>> >
>> > [1]: CHANGES.txt:
>> >
>> https://gitbox.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=CHANGES.txt;hb=refs/tags/4.1-alpha1-tentative
>> > [2]: NEWS.txt:
>> https://gitbox.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=NEWS.txt;hb=refs/tags/4.1-alpha1-tentative
>>
>>


Re: [VOTE] Release Apache Cassandra 4.0.4 (take2)

2022-05-11 Thread Andrés de la Peña
+1 (nb)

On Wed, 11 May 2022 at 09:00, Sam Tunnicliffe  wrote:

> +1
>
> > On 7 May 2022, at 07:39, Mick Semb Wever  wrote:
> >
> > Proposing the test build of Cassandra 4.0.4 for release.
> > This is from the (take4) test artifact.
> >
> > sha1: 052125f2c6ed308f1473355dfe43470f0da44364
> > Git:
> https://gitbox.apache.org/repos/asf?p=cassandra.git;a=shortlog;h=refs/tags/4.0.4-tentative
> > Maven Artifacts:
> >
> https://repository.apache.org/content/repositories/orgapachecassandra-1270/org/apache/cassandra/cassandra-all/4.0.4/
> >
> > The Source and Build Artifacts, and the Debian and RPM packages and
> > repositories, are available here:
> > https://dist.apache.org/repos/dist/dev/cassandra/4.0.4/
> >
> > The vote will be open for 72 hours (longer if needed). Everyone who
> > has tested the build is invited to vote. Votes by PMC members are
> > considered binding. A vote passes if there are at least three binding
> > +1s and no -1's.
> >
> > [1]: CHANGES.txt:
> >
> https://gitbox.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=CHANGES.txt;hb=refs/tags/4.0.4-tentative
> > [2]: NEWS.txt:
> https://gitbox.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=NEWS.txt;hb=refs/tags/4.0.4-tentative
>
>


Re: [VOTE] Release Apache Cassandra 3.11.13

2022-05-11 Thread Andrés de la Peña
+1 (nb)

On Wed, 11 May 2022 at 09:00, Sam Tunnicliffe  wrote:

> +1
>
> > On 7 May 2022, at 07:38, Mick Semb Wever  wrote:
> >
> > Proposing the test build of Cassandra 3.11.13 for release.
> >
> > sha1: 836ab2802521a685efe84382cb48db56caf4478d
> > Git:
> https://gitbox.apache.org/repos/asf?p=cassandra.git;a=shortlog;h=refs/tags/3.11.13-tentative
> > Maven Artifacts:
> >
> https://repository.apache.org/content/repositories/orgapachecassandra-1268/org/apache/cassandra/cassandra-all/3.11.13/
> >
> > The Source and Build Artifacts, and the Debian and RPM packages and
> > repositories, are available here:
> > https://dist.apache.org/repos/dist/dev/cassandra/3.11.13/
> >
> > The vote will be open for 72 hours (longer if needed). Everyone who
> > has tested the build is invited to vote. Votes by PMC members are
> > considered binding. A vote passes if there are at least three binding
> > +1s and no -1's.
> >
> > [1]: CHANGES.txt:
> >
> https://gitbox.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=CHANGES.txt;hb=refs/tags/3.11.13-tentative
> > [2]: NEWS.txt:
> https://gitbox.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=NEWS.txt;hb=refs/tags/3.11.13-tentative
>
>


Re: [VOTE] Release Apache Cassandra 3.0.27

2022-05-11 Thread Andrés de la Peña
+1 (nb)

On Wed, 11 May 2022 at 08:59, Sam Tunnicliffe  wrote:

> +1
>
> > On 7 May 2022, at 07:37, Mick Semb Wever  wrote:
> >
> > Proposing the test build of Cassandra 3.0.27 for release.
> >
> > sha1: 205366131484967a3a8a749f1d1d841c952127e8
> > Git:
> https://gitbox.apache.org/repos/asf?p=cassandra.git;a=shortlog;h=refs/tags/3.0.27-tentative
> > Maven Artifacts:
> >
> https://repository.apache.org/content/repositories/orgapachecassandra-1267/org/apache/cassandra/cassandra-all/3.0.27/
> >
> > The Source and Build Artifacts, and the Debian and RPM packages and
> > repositories, are available here:
> > https://dist.apache.org/repos/dist/dev/cassandra/3.0.27/
> >
> > The vote will be open for 72 hours (longer if needed). Everyone who
> > has tested the build is invited to vote. Votes by PMC members are
> > considered binding. A vote passes if there are at least three binding
> > +1s and no -1's.
> >
> > [1]: CHANGES.txt:
> >
> https://gitbox.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=CHANGES.txt;hb=refs/tags/3.0.27-tentative
> > [2]: NEWS.txt:
> https://gitbox.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=NEWS.txt;hb=refs/tags/3.0.27-tentative
>
>


Re: Pluggability improvements in 4.1

2022-04-26 Thread Andrés de la Peña
Hi,

Although it's not yet officially supported and it might still change in a
minor release, guardrails config is also pluggable (CEP-3). It is possible
to provide a custom implementation supplying guardrail properties different
to those defined in cassandra.yaml, so the thresholds, flags, etc. can be
based on things like, for example, who is running the guarded query.

On Tue, 26 Apr 2022 at 21:12, Henrik Ingo  wrote:

> Hi all
>
> As one would expect, I've been involved in several discussions lately on
> what is going to make it into 4.1, versus what patches unfortunately won't.
>
> In particular debating this with Patrick McFaddin we realized that a big
> theme in 4.1 appears to be a huge number of pluggability improvements. So
> the intent of this email is to take an inventory of all new plugin APIs I'm
> aware of, and invite the community to add to the list where I'm not aware
> of some work.
>
>
> CEP-9
> 
Pluggable SSLContext. Allows storing SSL certs and secrets elsewhere than
in files. Supplies an example implementation for storing as a Kubernetes
> Secret.
>
>
> *CEP-10*
> 
> SimpleCondition, Semaphore, CountDownLatch, BlockingQueue, etc
> Executors, futures, starting threads, etc - including important
> improvements to consistency of approach in the codebase
> The use of currentTimeMillis and nanoTime
> The replacement of java.io.File with a wrapper on java.nio.files.Path
> providing an ergonomic API, and some improvements to consistency of file
> handling
> Support for alternative streaming implementations
> Improvements to the dtest API to support necessary functionality
>
> Commentary: Of the above at least the Path and alternative streaming
> implementations seem like significant APIs that can be used for much more
> than just fault injection. In fact, I believe java.nio.files.Path is what
> we use in Astra Serverless to send files to S3 instead of local filesystem.
>
>
> *CEP-11*
> 
> Pluggable memtable
>
> Commentary: While we won't have any new memtable implementations in 4.1,
> it is a goal to merge the memtable API this week. Notably, since this is
> designed to support also persistent memtables (ie memtable on persistent
> memory), this new API could essentially be seen as a full blown storage
> engine API.
>
>
> *CASSANDRA-17044* 
> Pluggable schema management
>
> I hear rumors someone may be working on a new schema management
> implementation?
>
>
> (Just for completeness, CASSANDRA-17058
>  pluggable cluster
> membership is not merged.)
>
> CEP-16
> 
> While client side, worth mentioning: Pluggable auth for CQLSH
>
>
>
> If there are more that I don't know about, please reply and add to the
> list.
>
> henrik
>
> --
>
> Henrik Ingo
>
> +358 40 569 7354 <358405697354>
>
>


Re: Call for Volunteers - Build Lead

2022-03-25 Thread Andrés de la Peña
Hi all,

9 people have already participated in the Build Lead rotation; now we need
a brave volunteer for the next week.

Thanks for your help,

On Wed, 2 Mar 2022 at 17:10, Ekaterina Dimitrova 
wrote:

> Hi everyone,
>
> It's been a month and a half since we started the Build Lead rotation.
> 6 people already participated. (Thank you!) Most failures in Butler have
> respective tickets linked to them, many tickets were even already closed.
>
> This email is to remind you about this initiative. Don't be shy to
> volunteer participation in the rotation. :-) Josh and Brandon did the heavy
> lifting with opening the most tickets during  the first two weeks so now
> primarily brand new failures need attention. Also, I should acknowledge
> that many people worked on failures, thank you for doing it!
>
> Feel free to add yourself to the rotation schedule here -
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=199527692
> We don't have any names populated after this week.
>
> Thank you in advance for all your help!
>
> Ekaterina Dimitrova
>
>
>


Re: Welcome Aleksandr Sorokoumov as Cassandra committer

2022-03-16 Thread Andrés de la Peña
Congrats, well deserved!

On Wed, 16 Mar 2022 at 14:01, J. D. Jordan 
wrote:

> Congratulations!
>
> On Mar 16, 2022, at 8:43 AM, Ekaterina Dimitrova 
> wrote:
>
> 
> Great news! Well deserved! Congrats and thank you for all your support!
>
> On Wed, 16 Mar 2022 at 9:41, Paulo Motta  wrote:
>
>> Congratulations Alex, well deserved! :-)
>>
>> Em qua., 16 de mar. de 2022 às 10:15, Benjamin Lerer 
>> escreveu:
>>
>>> The PMC members are pleased to announce that Aleksandr Sorokoumov has
>>> accepted
>>> the invitation to become committer.
>>>
>>> Thanks a lot, Aleksandr , for everything you have done for the project.
>>>
>>> Congratulations and welcome
>>>
>>> The Apache Cassandra PMC members
>>>
>>


Re: [FOR REVIEW] Blog post: An Interview with Project Contributor, Lorina Poland

2022-03-16 Thread Andrés de la Peña
+1

On Wed, 16 Mar 2022 at 11:55, Anthony Grasso 
wrote:

> +1
>
> On Wed, 16 Mar 2022 at 21:58, bened...@apache.org 
> wrote:
>
>> +1
>>
>>
>>
>> *From: *Erick Ramirez 
>> *Date: *Tuesday, 15 March 2022 at 22:08
>> *To: *dev@cassandra.apache.org 
>> *Subject: *Re: [FOR REVIEW] Blog post: An Interview with Project
>> Contributor, Lorina Poland
>>
>> Looks good to me! 
>>
>>
>>
>> On Wed, 16 Mar 2022 at 08:17, Chris Thornett  wrote:
>>
>> As requested, I'm posting content contributions for community review on
>> the ML for those that might not spot them on Slack.
>>
>>
>>
>> We're currently mid-review for our first contributor Q&A, which is with
>> Lorina Poland:
>>
>>
>> https://docs.google.com/document/d/1nnH4V1XvTcfTeeUdZ_mjSxlNlWTbSXFu_qKtJRQUFBk/edit.
>> Please add edits or suggestions as comments.
>>
>>
>>
>> Thanks!
>>
>> --
>>
>>
>>
>> Chris Thornett
>>
>> senior content strategist, Constantia.io
>>
>> ch...@constantia.io
>>
>>


Re: [DISCUSS] Should we deprecate / freeze python dtests

2022-03-14 Thread Andrés de la Peña
Last time I checked there wasn't support for vnodes in in-jvm dtests, which
seems like an important limitation.

On Mon, 14 Mar 2022 at 12:24, bened...@apache.org 
wrote:

> I am strongly in favour of deprecating python dtests in all cases where
> they are currently superseded by in-jvm dtests. They are environmentally
> more challenging to work with, causing many problems on local and remote
> machines. They are harder to debug, slower, flakier, and mostly less
> sophisticated.
>
>
>
> > all focus on getting the in-jvm framework robust enough to cover
> edge-cases
>
>
>
> Would be great to collect gaps. I think it’s just vnodes, which is by no
> means a fundamental limitation? There may also be some stuff to do
> startup/shutdown and environmental scripts, that may be a niche we retain
> something like python dtests for.
>
>
>
> > people aren’t familiar
>
>
>
> I would be interested to hear from these folk to understand their concerns
> or problems using in-jvm dtests, if there is a cohort holding off for this
> reason
>
>
>
> > This is going to require documentation work from some of the original
> authors
>
>
>
> I think a collection of template-like tests we can point people to would
> be a cheap initial effort. Cutting and pasting an existing test with the
> required functionality, then editing to suit, should get most people off to
> a quick start who aren’t familiar.
>
>
>
> > Labor and process around revving new releases of the in-jvm dtest API
>
>
>
> I think we need to revisit how we do this, as it is currently broken. We
> should consider either using ASF snapshots until we cut new releases of C*
> itself, or else using git subprojects. This will also become a problem for
> Accord’s integration over time, and perhaps other subprojects in future, so
> it is worth better solving this.
>
>
>
> I think this has been made worse than necessary by moving too many
> implementation details to the shared API project – some should be retained
> within the C* tree, with the API primarily serving as the shared API itself
> to ensure cross-version compatibility. However, this is far from a complete
> explanation of (or solution to) the problem.
>
>
>
>
>
>
>
> *From: *Josh McKenzie 
> *Date: *Monday, 14 March 2022 at 12:11
> *To: *dev@cassandra.apache.org 
> *Subject: *[DISCUSS] Should we deprecate / freeze python dtests
>
> I've been wrestling with the python dtests recently and that led to some
> discussions with other contributors about whether we as a project should be
> writing new tests in the python dtest framework or the in-jvm framework.
> This discussion has come up tangentially on some other topics, including
> the lack of documentation / expertise on the in-jvm framework
> dis-incentivizing some folks from authoring new tests there vs. the
> difficulty debugging and maintaining timer-based, sleep-based
> non-deterministic python dtests, etc.
>
>
>
> I don't know of a place where we've formally discussed this and made a
> project-wide call on where we expect new distributed tests to be written;
> if I've missed an email about this someone please link on the thread here
> (and stop reading! ;))
>
>
>
> At this time we don't specify a preference for where you write new
> multi-node distributed tests on our "development/testing" portion of the
> site and documentation:
> https://cassandra.apache.org/_/development/testing.html
>
>
>
> The primary tradeoffs as I understand them for moving from python-based
> multi-node testing to jdk-based are:
>
> Pros:
>
>1. Better debugging functionality (breakpoints, IDE integration, etc)
>2. Integration with simulator
>3. More deterministic runtime (anecdotally; python dtests _should_ be
>deterministic but in practice they prove to be very prone to environmental
>disruption)
>4. Test time visibility to internals of cassandra
>
> Cons:
>
>1. The framework is not as mature as the python dtest framework (some
>functionality missing)
>2. Labor and process around revving new releases of the in-jvm dtest
>API
>3. People aren't familiar with it yet and there's a learning curve
>
>
>
> So my bid here: I personally think we as a project should freeze writing
> new tests in the python dtest framework and all focus on getting the in-jvm
> framework robust enough to cover edge-cases that might still be causing new
> tests to be written in the python framework. This is going to require
> documentation work from some of the original authors of the in-jvm
> framework as well as folks currently familiar with it and effort from those
> of us not yet intimately familiar with the API to get to know it, however I
> believe the long-term benefits to the project will be well worth it.
>
>
>
> We could institute a pre-commit check that warns on a commit increasing
> our raw count of python dtests to help provide process-based visibility to
> this change in direction for the project's testing.
>
>
>
> So: what do we think?
>
>
>
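
For readers who haven't tried the in-jvm framework yet, a minimal test
looks roughly like this (the schema and class names are illustrative):

    import org.junit.Test;
    import org.apache.cassandra.distributed.Cluster;
    import org.apache.cassandra.distributed.api.ConsistencyLevel;

    public class ExampleInJvmTest
    {
        @Test
        public void writeThenReadAtAll() throws Throwable
        {
            // A two-node cluster, torn down by try-with-resources:
            try (Cluster cluster = Cluster.build(2).start())
            {
                cluster.schemaChange("CREATE KEYSPACE ks WITH replication = " +
                        "{'class': 'SimpleStrategy', 'replication_factor': 2}");
                cluster.schemaChange("CREATE TABLE ks.t (k int PRIMARY KEY, v int)");
                cluster.coordinator(1).execute("INSERT INTO ks.t (k, v) VALUES (1, 1)",
                        ConsistencyLevel.ALL);
                Object[][] rows = cluster.coordinator(2).execute(
                        "SELECT v FROM ks.t WHERE k = 1", ConsistencyLevel.ALL);
                // assert on rows with the usual JUnit assertions...
            }
        }
    }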


Re: [VOTE] CEP-19: Trie memtable implementation

2022-02-16 Thread Andrés de la Peña
+1nb

On Wed, 16 Feb 2022 at 15:57, C. Scott Andreas  wrote:

> +1nb
>
> On Feb 16, 2022, at 5:59 AM, Jeremy Hanna 
> wrote:
>
> +1 nb.  Thanks for all of the great work on this Branimir.  Excited to
> see this moving forward.
>
> On Feb 16, 2022, at 7:56 AM, J. D. Jordan 
> wrote:
>
> +1 nb
>
> On Feb 16, 2022, at 7:30 AM, Josh McKenzie  wrote:
>
> 
> +1
>
> On Wed, Feb 16, 2022, at 7:33 AM, Ekaterina Dimitrova wrote:
>
> +1nb
>
> On Wed, 16 Feb 2022 at 7:30, Brandon Williams  wrote:
>
> +1
>
> On Wed, Feb 16, 2022 at 3:00 AM Branimir Lambov 
> wrote:
> >
> > Hi everyone,
> >
> > I'd like to propose CEP-19 for approval.
> >
> > Proposal:
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-19%3A+Trie+memtable+implementation
> > Discussion:
> https://lists.apache.org/thread/fdvf1wmxwnv5jod59jznbnql23nqosty
> >
> > The vote will be open for 72 hours.
> > Votes by committers are considered binding.
> > A vote passes if there are at least three binding +1s and no binding
> vetoes.
> >
> > Thank you,
> > Branimir
>
>
>


Re: Have we considered static type checking for our python libs?

2022-01-26 Thread Andrés de la Peña
Last time I ported dtests during the 4.0 quality test epic there wasn't
support for virtual nodes in in-jvm dtests. We have many Python dtests
depending on vnodes that can't be fully ported while vnode support is
missing; I don't know if that's still the case.

On Wed, 26 Jan 2022 at 14:02, Joshua McKenzie  wrote:

> Could be a very fruitful source of LHF tickets to highlight in the
> biweekly email and would be pretty trivial to integrate this into the build
> lead role (getting an epic and jira tickets created to port tests over,
> etc).
>
> we can run our tests much more quickly, and debug failures much more
>> easily.
>
> Please Yes. If we can get away from python upgrade tests I think all our
> lives would be improved.
>
> I like it.
>
>
> On Wed, Jan 26, 2022 at 8:42 AM bened...@apache.org 
> wrote:
>
>> We could set this as a broad goal of the project, and like the build lead
>> role could each volunteer to adopt a test every X weeks. We would have
>> migrated in no time, I expect, with this kind of concerted effort, and
>> might not even notice a significant penalty to other ongoing work.
>>
>>
>>
>> Last time I ported a dtest it was a very easy thing to do.
>>
>>
>>
>> I might even venture to predict that it might payoff with lower
>> development overhead, as we can run our tests much more quickly, and debug
>> failures much more easily.
>>
>>
>>
>> *From: *Joshua McKenzie 
>> *Date: *Wednesday, 26 January 2022 at 13:40
>> *To: *dev 
>> *Subject: *Re: Have we considered static type checking for our python
>> libs?
>>
>> I have yet to encounter this class of problem in the dtests.
>>
>> It's more about development velocity and convenience than about
>> preventing defects in our case, since we're not abusing duck-typing
>> everywhere. Every time I have to work on python dtests (for instance, when
>> doing build lead work and looking at flaky tests) it's a little irritating
>> and I think of this.
>>
>>
>>
>>  I would hate to expend loads of effort modernising them when the same
>> effort could see them superseded by much better versions of the same test.
>>
>> I completely agree, however this is something someone would have to take
>> on as an effort and I don't believe I've seen anybody step up yet. At the
>> current rate we're going to be dragging along the python dtests into
>> perpetuity.
>>
>>
>>
>>
>>
>> On Wed, Jan 26, 2022 at 8:16 AM bened...@apache.org 
>> wrote:
>>
>> I was sort of hoping we would retire the python dtests before long, at
>> least in large part (probably not ever entirely, but 99%).
>>
>>
>>
>> I think many of them could be migrated to in-jvm dtests without much
>> effort. I would hate to expend loads of effort modernising them when the
>> same effort could see them superseded by much better versions of the same
>> test.
>>
>>
>>
>>
>>
>> *From: *Joshua McKenzie 
>> *Date: *Wednesday, 26 January 2022 at 12:59
>> *To: *dev 
>> *Subject: *Have we considered static type checking for our python libs?
>>
>> Relevant links:
>>
>> 1) Optional static typing for python:
>> https://docs.python.org/3/library/typing.html
>>
>> 2) Mypy static type checker for python: https://github.com/python/mypy
>>
>>
>>
>> So the question - has anyone given any serious thought to introducing
>> type hints and a static type checker in ccm and python dtests? A search on
>> dev ponymail doesn't turn up anything.
>>
>>
>>
>> I've used it pretty extensively in the past and found it incredibly
>> helpful combined with other linters in surfacing troublesome edge cases,
>> and also found it accelerated development quite a bit.
>>
>>
>>
>> Any thoughts on the topic for or against?
>>
>>
>>
>> ~Josh
>>
>>
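
To make the suggestion concrete, a tiny sketch of the kind of annotation
mypy would check (the helper below is hypothetical, not actual ccm code):

    from typing import Dict, Optional

    def node_address(nodes: Dict[str, str], name: str) -> Optional[str]:
        """Return the address registered for a node, or None if unknown."""
        return nodes.get(name)

    addr = node_address({"node1": "127.0.0.1"}, "node1")
    if addr is not None:  # mypy flags any use of addr that skips this None check
        print(addr.split("."))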


Re: [VOTE] Release dtest-api 0.0.12

2022-01-24 Thread Andrés de la Peña
+1

On Mon, 24 Jan 2022 at 12:29, Brandon Williams  wrote:

> +1
>
> On Thu, Jan 13, 2022 at 12:17 PM Mick Semb Wever  wrote:
> >
> > Proposing the test build of in-jvm dtest API 0.0.12 for release.
> >
> > Repository:
> >
> https://gitbox.apache.org/repos/asf?p=cassandra-in-jvm-dtest-api.git;a=shortlog;h=refs/tags/0.0.12
> >
> > Candidate SHA:
> >
> https://github.com/apache/cassandra-in-jvm-dtest-api/commit/207d6cee2d01552f794d322ec05a7577bcab08e0
> > tagged with 0.0.12
> >
> > Artifacts:
> >
> https://repository.apache.org/content/repositories/orgapachecassandra-1252/org/apache/cassandra/dtest-api/0.0.12/
> >
> > Key signature: A4C465FEA0C552561A392A61E91335D77E3E87CB
> >
> > Changes since last release:
> >   * CASSANDRA-17214:Add IInstance.isValid() with default true return
> value
> >
> >
> > The vote will be open for 24 hours. Everyone who has tested the build
> > is invited to vote. Votes by PMC members are considered binding. A
> > vote passes if there are at least three binding +1s.
>


Re: [VOTE] Formalizing our CI process

2022-01-12 Thread Andrés de la Peña
Still +1 with the amendment

On Wed, 12 Jan 2022 at 19:57, C. Scott Andreas  wrote:

> +1nb, with and without the amendment.
>
> Reason for mentioning without: I see the ability to cut a release to
> address an urgent security or data loss issue as one of the strongest
> arguments for maintaining green CI as a resting state so we are ready in
> the event of an emergency.
>
> Test results that we can trust help us ship urgent fixes safely. If I were
> a user and had an urgent need to ramp a new build (e.g., if Apache
> Cassandra were affected by log4j), I would be very concerned about a
> fleet-wide deploy of a distributed database release with failing tests.
>
> But in both cases, +1nb. :)
>
> – Scott
>
> On Jan 12, 2022, at 11:22 AM, David Capwell  wrote:
>
>
> +1
>
> On Jan 12, 2022, at 8:39 AM, Joseph Lynch  wrote:
>
> On Wed, Jan 12, 2022 at 3:25 AM Berenguer Blasi
>  wrote:
>
>
> jenkins CI was at 2/3 flakies consistently post 4.0 release.
>
>
> That is really impressive and I absolutely don't mean to downplay that
> achievement.
>
> Then things broke and we've been working hard to get back to the 2/3
> flakies. Most
> current failures imo are timeuuid C17133 or early termination of process
> C17140 related afaik. So getting back to the 2/3 'impossible' flakies
> should be doable and a reasonable target (famous last words...). My 2cts.
>
>
> I really appreciate all the work folks have been doing to get the
> project to green, and I support the parts of the proposal that try to
> formalize methods to try to keep us there. I am only objecting to #2
> in the proposal where we have a non-negotiable gate on tests before a
> release.
>
> -Joey
>
>
>

