Re: [DISCUSS] Cassandra 5.0 support for RHEL 7

2024-03-11 Thread J. D. Jordan


Re: [Discuss] CASSANDRA-16999 introduction of a column in system.peers_v2

2024-02-07 Thread J. D. Jordan
Correct. But that initial connection will work and the client will work; it just 
won’t have connections to multiple nodes.

I didn’t say it’s optimal, but this is the best way I can see that doesn’t 
break things more than they are now, and it does give an improvement because you 
can pick which port shows up in peers.

Deprecating helps nothing for existing releases. We can’t/shouldn’t remove the 
feature in existing releases.

> On Feb 7, 2024, at 8:02 AM, Abe Ratnofsky  wrote:
> 
> If dual-native-port is enabled, a client is connecting unencrypted to the 
> non-SSL port, and "advertise-native-port=ssl" (name pending) is enabled, then 
> when that client fetches peers it will get the SSL ports, right? If the 
> client doesn't support SSL, then those subsequent connections will fail. An 
> operator would have to set "advertise-native-port=ssl" and override the port 
> options in all clients, which isn't feasible.
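
For readers following along, the relevant cassandra.yaml settings look roughly like 
the sketch below. The first two options exist today; the third is the hypothetical 
setting being discussed in this thread (its name and values are not final):

    # Existing dual-port configuration
    native_transport_port: 9042          # plaintext CQL port
    native_transport_port_ssl: 9142      # encrypted CQL port

    # Hypothetical option under discussion (name pending): controls which of the
    # two ports a node advertises to drivers via system.peers / system.peers_v2
    # native_transport_advertised_port: ssl   # or: plain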


Re: [Discuss] CASSANDRA-16999 introduction of a column in system.peers_v2

2024-02-07 Thread J. D. Jordan
We should not introduce a new column in a patch release. From what I have seen, many 
drivers do a “select * from peers”; yes, it’s not a good idea, but we can’t control 
what all clients do, and an extra column coming back may break their processing of it.

For existing versions, what about having a “default ssl” or “default no SSL” yaml 
setting which decides what port is advertised?  Then someone could still connect on 
the other port by specifying it manually.  The new column can then be added with the 
new table in trunk.

-Jeremiah

On Feb 7, 2024, at 5:56 AM, Štefan Miklošovič wrote:

Honest to god, I do not know, Abe. If I see feedback where we reach consensus to 
deprecate dual port support, I will deprecate that.

On Wed, Feb 7, 2024 at 12:42 PM Abe Ratnofsky wrote:

CASSANDRA-9590 (Support for both encrypted and unencrypted native transport 
connections) was implemented before CASSANDRA-10559 (Support encrypted and plain 
traffic on the same port), but both have been available since 3.0.

On 9590, STARTTLS was considered, but rejected due to the changes that would be 
required to support it from all drivers. But the current server implementation 
doesn't require STARTTLS: the client is expected to send the first message over the 
connection, so the server can just check whether that message is encrypted, and then 
enable the Netty pipeline's SslHandler.

The implementation in 10559 is compatible with existing clients, and is already used 
widely. Are there any reasons for users to stick with dual-native-port rather than a 
single port that supports both encrypted and unencrypted traffic?
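
As a rough illustration of the single-port approach described above (CASSANDRA-10559), 
the server can peek at the first bytes of a new connection and only insert an 
SslHandler when they look like a TLS record. This is a minimal sketch assuming Netty 4, 
not the actual Cassandra implementation; the class and handler names are illustrative.

    import io.netty.buffer.ByteBuf;
    import io.netty.channel.ChannelHandlerContext;
    import io.netty.handler.codec.ByteToMessageDecoder;
    import io.netty.handler.ssl.SslContext;
    import io.netty.handler.ssl.SslHandler;
    import java.util.List;

    // Detects whether the first bytes of a connection look like TLS and adjusts the
    // pipeline accordingly, so one port can serve both encrypted and plaintext clients.
    class OptionalTlsDetector extends ByteToMessageDecoder
    {
        private final SslContext sslContext;

        OptionalTlsDetector(SslContext sslContext)
        {
            this.sslContext = sslContext;
        }

        @Override
        protected void decode(ChannelHandlerContext ctx, ByteBuf in, List<Object> out)
        {
            if (in.readableBytes() < 5)
                return; // wait for a full TLS record header before deciding

            if (SslHandler.isEncrypted(in))
                // Looks like TLS: put a real SslHandler in front of the CQL codec.
                ctx.pipeline().replace(this, "ssl", sslContext.newHandler(ctx.alloc()));
            else
                // Plaintext client: drop the detector and continue unencrypted.
                ctx.pipeline().remove(this);
        }
    }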


Re: [DISCUSS] Add subscription mangement instructions to user@, dev@ message footers

2024-01-22 Thread J. D. Jordan
I think we used to have this and removed them because it was breaking the 
encryption signature on messages or something which meant they were very likely 
to be treated as spam?

Not saying we can’t put it back on, but it was removed for good reasons from 
what I recall.

> On Jan 22, 2024, at 12:19 PM, Brandon Williams  wrote:
> 
> +1
> 
> Kind Regards,
> Brandon
> 
>> On Mon, Jan 22, 2024 at 12:10 PM C. Scott Andreas  
>> wrote:
>> 
>> Hi all,
>> 
>> I'd like to propose appending the following two footers to messages sent to 
>> the user@ and dev@ lists. The proposed postscript including line breaks is 
>> between the "X" blocks below.
>> 
>> User List Footer:
>> X
>> 
>> ---
>> Unsubscribe: Send a blank email to user-unsubscr...@cassandra.apache.org. Do 
>> not reply to this message.
>> Cassandra Community: Follow other mailing lists or join us in Slack: 
>> https://cassandra.apache.org/_/community.html
>> X
>> 
>> Dev List Footer:
>> X
>> 
>> ---
>> Unsubscribe: Send a blank email to dev-unsubscr...@cassandra.apache.org. Do 
>> not reply to this message.
>> Cassandra Community: Follow other mailing lists or join us in Slack: 
>> https://cassandra.apache.org/_/community.html
>> X
>> 
>> Offering this proposal for three reasons:
>> – Many users are sending "Unsubscribe" messages to the full mailing list 
>> which prompts others to wish to unsubscribe – a negative cascade that 
>> affects the size of our user community.
>> – Many users don't know where to go to figure out how to unsubscribe, 
>> especially if they'd joined many years ago.
>> – Nearly all mailing lists provide a one-click mechanism for unsubscribing 
>> or built-in mail client integration to do so via message headers. Including 
>> compact instructions on how to leave is valuable to subscribers.
>> 
>> #asfinfra indicates that such footers can be appended given project 
>> consensus and an INFRA- ticket: 
>> https://the-asf.slack.com/archives/CBX4TSBQ8/p1705939868631079
>> 
>> If we reach consensus on adding a message footer, I'll file an INFRA ticket 
>> with a link to this thread.
>> 
>> Thanks,
>> 
>> – Scott
>> 


Re: [DISCUSS] CEP-39: Cost Based Optimizer

2023-12-22 Thread J. D. Jordan
The CEP-29 “rejected alternatives” section mentions one such use case: being able to 
put NOT arbitrarily in a query.  Adding an OR operator is another thing we are likely 
to want to do in the near future that would benefit from this work; both benefit from 
the syntax tree and reordering parts of the proposal.

But I think we already have enough complexity available to us to justify a query 
optimizer in the face of multi-index queries today, especially when you have the new 
ANN OF operator in use combined with index queries.  Depending on what order you query 
the indexes in, it can dramatically change the performance of the query.  We are 
seeing and working through such issues in Astra today.

-Jeremiah

On Dec 21, 2023, at 12:00 PM, Josh McKenzie wrote:

> we are already late. We have several features running in production that we chose to 
> not open source yet because implementing phase 1 of the CEP would have heavily 
> simplified their designs. The cost of developing them was much higher than what it 
> would have been if the CEP had already been implemented. We are also currently 
> working on some SAI features that need cost based optimization.

Are there DISCUSS threads or CEPs for any of that work? For us to have a useful 
discussion about whether we're at a point in the project where a query optimizer is 
appropriate, this information would be vital.

On Thu, Dec 21, 2023, at 12:33 PM, Benjamin Lerer wrote:

Hey German,

To clarify things, we intend to push cardinalities across nodes, not costs. It will be 
up to the Cost Model to estimate cost based on those cardinalities. We will implement 
some functionality to collect costs on query execution to be able to provide them as 
the output of EXPLAIN ANALYZE.

We will provide more details on how we will collect and distribute cardinalities. We 
will probably not go into details on how we will estimate costs before the patch for 
it is ready, the main reason being that there are a lot of different parts that you 
need to account for and that it will require significant testing and experimentation.

Regarding multi-tenancy, even if you use query cost, do not forget that you will also 
have to account for background tasks such as compaction, repair, backup, ... which is 
not included in this CEP.

On Thu, Dec 21, 2023 at 00:18, German Eichberger via dev wrote:

All,

Very much agree with Scott's reasoning. It seems expedient, given the advent of ACCORD 
transactions, to be more like the other distributed SQL databases and just support 
SQL. But just because it's expedient doesn't make it right, and we should work out the 
relational features in more detail before we embark on tying us to some query planning 
design.

The main problem in this space is pushing cost across nodes based on data density. I 
understand that TCM will level out data density, but the cost based optimizer proposal 
does a lot of hand waving when it comes to collecting/estimating costs for each node. 
I would like to see more details on this since otherwise it will be fairly limiting.

I am less tied to ALLOW FILTERING - many of my customers find allowing filtering 
beneficial for their workloads so I think removing it makes sense to me (and yes we 
try to discourage them).

I am also intrigued by this proposal when I think about multi-tenancy and resource 
governance: we have heard from several operators who run multiple internal teams on 
the same Cassandra cluster just to optimize costs. Having a way to attribute those 
costs more fairly by adding up the costs the optimizer calculates might be hugely 
beneficial.  There could also be a way to have a "cost budget" on a keyspace to 
minimize the noisy neighbor problem and do more intelligent request throttling.

In summary I support the proposal with the caveats raised above.

Thanks,
German

From: C. Scott Andreas
Sent: Wednesday, December 20, 2023 8:15 AM
Subject: Re: [DISCUSS] CEP-39: Cost Based Optimizer

Thanks for this proposal and apologies for my delayed engagement during the Cassandra 
Summit last week. Benjamin, I appreciate your work on this and your engagement on this 
thread – I know it’s a lot of discussion to field.

On ALLOW FILTERING: I share Chris Lohfink’s experience in operating clusters that have 
made heavy use of ALLOW FILTERING. It is a valuable guardrail for the database to 
require users to specifically annotate queries that may cost 1000x+ that of a simple 
lookup for a primary key. For my own purposes, I’d actually like to go a step further 
and disable queries that require ALLOW FILTERING by default unless explicitly reviewed 
- but I haven’t taken the step of adding such a guardrail yet.

CBOs, CQL, and SQL: The CBO proposal cuts to the heart of one of the fundamental 
differences between SQL and CQL that I haven’t seen 
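
To make the index-ordering point above concrete, consider a hedged example like the 
following (the schema, index names, and values are made up). With SAI indexes on both 
category and price plus a vector column, the predicates can be evaluated in several 
orders, and which index is consulted first can dramatically change how much 
post-filtering work is done; choosing that order is exactly the kind of decision a 
cost based optimizer would make.

    -- Hypothetical table with SAI indexes on category and price and a vector index on embedding
    SELECT id, name
    FROM store.products
    WHERE category = 'boots'        -- SAI equality predicate
      AND price < 120               -- SAI range predicate
    ORDER BY embedding ANN OF [0.12, 0.87, 0.33] LIMIT 10;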

Re: [DISCUSS] CASSANDRA-18940 SAI post-filtering reads don't update local table latency metrics

2023-12-01 Thread J. D. Jordan
At the coordinator level SAI queries fall under Range metrics. I would either put them 
under the same at the lower level or in a new SAI metric.  It would be confusing to 
have the top level coordinator query metrics in Range and the lower level in Read.

On Dec 1, 2023, at 12:50 PM, Caleb Rackliffe wrote:

So the plan would be to have local "Read" and "Range" remain unchanged in 
TableMetrics, but have a third "SAIRead" (?) just for SAI post-filtering read 
SinglePartitionReadCommands? I won't complain too much if that's what we settle on, 
but it just depends on how much this is a metric for ReadCommand subclasses operating 
at the node-local level versus something we think we should link conceptually to a 
user query. SAI queries will produce a SinglePartitionReadCommand per matching primary 
key, so that definitely won't work for the latter.

@Mike On a related note, we now have "PartitionReads" and "RowsFiltered" in 
TableQueryMetrics. Should the former just be removed, given a.) it actually is rows 
now, not partitions, and b.) "RowsFiltered" seems like it'll be almost the same thing 
now? (I guess if we ever try batching row reads per partition, it would come in handy 
again...)

On Fri, Dec 1, 2023 at 12:30 PM J. D. Jordan <jeremiah.jor...@gmail.com> wrote:

I prefer option 2. It is much easier to understand and roll up two metrics than to do 
subtractive dashboards.  SAI reads are already “range reads” for the client level 
metrics, not regular reads, so grouping them into the regular read metrics at the 
lower level seems confusing to me in that sense as well.  As an operator I want to 
know how my SAI reads and normal reads are performing, latency wise, separately.

-Jeremiah

On Dec 1, 2023, at 11:15 AM, Caleb Rackliffe <calebrackli...@gmail.com> wrote:

Option 1 would be my preference. Seems both useful to have a single metric for read 
load against the table and a way to break out SAI reads specifically.

On Fri, Dec 1, 2023 at 11:00 AM Mike Adamson <madam...@datastax.com> wrote:

Hi,

We are looking at adding SAI post-filtering reads to the local table metrics and would 
like some feedback on the best approach.

We don't think that SAI reads are that special, so they can be included in the table 
latencies, but how do we handle the global counts and the SAI counts? Do we need to 
maintain a separate count of SAI reads? We feel the answer to this is yes, so how do 
we do the counting? There are two options (others welcome):

1. All reads go into the current global count and we have a separate count for SAI 
specific reads. So non-SAI reads = global count - SAI count
2. We exclude the SAI reads from the current global count, so total reads = global 
count + SAI count

Our preference is for option 1 above. Does anyone have any strong views / opinions on 
this?

-- 
Mike Adamson
Engineering
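
For readers skimming, a toy sketch of the two counting schemes discussed above (the 
class and field names are illustrative, not the real TableMetrics members):

    // Illustrative only; not the actual TableMetrics code.
    class TableReadCounts
    {
        long globalReads; // existing per-table read count
        long saiReads;    // proposed count of SAI post-filtering reads

        // Option 1: SAI reads are also counted in the global count,
        // so non-SAI reads = globalReads - saiReads (subtractive dashboards).
        void recordOption1(boolean isSaiRead)
        {
            globalReads++;
            if (isSaiRead)
                saiReads++;
        }

        // Option 2: SAI reads are kept out of the global count,
        // so total reads = globalReads + saiReads (additive dashboards).
        void recordOption2(boolean isSaiRead)
        {
            if (isSaiRead)
                saiReads++;
            else
                globalReads++;
        }
    }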




Re: [DISCUSS] CASSANDRA-18940 SAI post-filtering reads don't update local table latency metrics

2023-12-01 Thread J. D. Jordan
I prefer option 2. It is much easier to understand and roll up two metrics than to do 
subtractive dashboards.

SAI reads are already “range reads” for the client level metrics, not regular reads, 
so grouping them into the regular read metrics at the lower level seems confusing to 
me in that sense as well.  As an operator I want to know how my SAI reads and normal 
reads are performing, latency wise, separately.

-Jeremiah

On Dec 1, 2023, at 11:15 AM, Caleb Rackliffe wrote:

Option 1 would be my preference. Seems both useful to have a single metric for read 
load against the table and a way to break out SAI reads specifically.

On Fri, Dec 1, 2023 at 11:00 AM Mike Adamson wrote:

Hi,

We are looking at adding SAI post-filtering reads to the local table metrics and would 
like some feedback on the best approach.

We don't think that SAI reads are that special, so they can be included in the table 
latencies, but how do we handle the global counts and the SAI counts? Do we need to 
maintain a separate count of SAI reads? We feel the answer to this is yes, so how do 
we do the counting? There are two options (others welcome):

1. All reads go into the current global count and we have a separate count for SAI 
specific reads. So non-SAI reads = global count - SAI count
2. We exclude the SAI reads from the current global count, so total reads = global 
count + SAI count

Our preference is for option 1 above. Does anyone have any strong views / opinions on 
this?

-- 
Mike Adamson
Engineering



Re: Welcome Francisco Guerrero Hernandez as Cassandra Committer

2023-11-28 Thread J. D. Jordan
Congrats!

> On Nov 28, 2023, at 12:57 PM, C. Scott Andreas  wrote:
> 
> Congratulations, Francisco!
> 
> - Scott
> 
>> On Nov 28, 2023, at 10:53 AM, Dinesh Joshi  wrote:
>> 
>> The PMC members are pleased to announce that Francisco Guerrero Hernandez 
>> has accepted
>> the invitation to become committer today.
>> 
>> Congratulations and welcome!
>> 
>> The Apache Cassandra PMC members


Re: [VOTE] Release Apache Cassandra 5.0-beta1

2023-11-28 Thread J. D. Jordan
That said, this is clearly better than, and has many fixes over, the alpha.  Would 
people be more comfortable if this cut was released as another alpha and we do beta1 
once the known fixes land?

On Nov 28, 2023, at 12:21 PM, J. D. Jordan wrote:

-0 (NB) on this cut. Given the concerns expressed so far in the thread I would think 
we should re-cut beta1 at the end of the week.

On Nov 28, 2023, at 12:06 PM, Patrick McFadin wrote:

I'm a +1 on a beta now vs maybe later. Beta doesn't imply perfect, especially if there 
are declared known issues. We need people outside of this tight group using it and 
finding issues. I know how this rolls. Very few people touch an alpha release. Beta is 
when the engine starts, and we need to get it started asap. Otherwise we are telling 
ourselves we have the perfect testing apparatus and don't need more users testing. I 
don't think that is the case.

Scott, Ekaterina, and I are going to be on stage in 2 weeks talking about Cassandra 5 
in the keynotes. In that time, our call to action is going to be to test the beta.

Patrick

On Tue, Nov 28, 2023 at 9:41 AM Mick Semb Wever <m...@apache.org> wrote:

The vote will be open for 72 hours (longer if needed). Everyone who has tested the 
build is invited to vote. Votes by PMC members are considered binding. A vote passes 
if there are at least three binding +1s and no -1's.

+1

Checked
- signing correct
- checksums are correct
- source artefact builds (JDK 11+17)
- binary artefact runs (JDK 11+17)
- debian package runs (JDK 11+17)
- debian repo runs (JDK 11+17)
- redhat* package runs (JDK 11+17)
- redhat* repo runs (JDK 11+17)

With the disclaimer: there are a few known bugs in SAI, e.g. 19011, with fixes to be 
available soon in 5.0-beta2.



Re: [VOTE] Release Apache Cassandra 5.0-beta1

2023-11-28 Thread J. D. Jordan
-0 (NB) on this cut. Given the concerns expressed so far in the thread I would think 
we should re-cut beta1 at the end of the week.

On Nov 28, 2023, at 12:06 PM, Patrick McFadin wrote:

I'm a +1 on a beta now vs maybe later. Beta doesn't imply perfect, especially if there 
are declared known issues. We need people outside of this tight group using it and 
finding issues. I know how this rolls. Very few people touch an alpha release. Beta is 
when the engine starts, and we need to get it started asap. Otherwise we are telling 
ourselves we have the perfect testing apparatus and don't need more users testing. I 
don't think that is the case.

Scott, Ekaterina, and I are going to be on stage in 2 weeks talking about Cassandra 5 
in the keynotes. In that time, our call to action is going to be to test the beta.

Patrick

On Tue, Nov 28, 2023 at 9:41 AM Mick Semb Wever wrote:

The vote will be open for 72 hours (longer if needed). Everyone who has tested the 
build is invited to vote. Votes by PMC members are considered binding. A vote passes 
if there are at least three binding +1s and no -1's.

+1

Checked
- signing correct
- checksums are correct
- source artefact builds (JDK 11+17)
- binary artefact runs (JDK 11+17)
- debian package runs (JDK 11+17)
- debian repo runs (JDK 11+17)
- redhat* package runs (JDK 11+17)
- redhat* repo runs (JDK 11+17)

With the disclaimer: there are a few known bugs in SAI, e.g. 19011, with fixes to be 
available soon in 5.0-beta2.



Re: Road to 5.0-GA (was: [VOTE] Release Apache Cassandra 5.0-alpha2)

2023-11-04 Thread J. D. Jordan
Sounds like 18993 is not a regression in 5.0? But present in 4.1 as well?  So I 
would say we should fix it with the highest priority and get a new 4.1.x 
released. Blocking 5.0 beta voting is a secondary issue to me if we have a 
“data not being returned” issue in an existing release?

> On Nov 4, 2023, at 11:09 AM, Benedict  wrote:
> 
> I think before we cut a beta we need to have diagnosed and fixed 18993 
> (assuming it is a bug).
> 
>> On 4 Nov 2023, at 16:04, Mick Semb Wever  wrote:
>> 
>> 
>>> 
>>> With the publication of this release I would like to switch the
>>> default 'latest' docs on the website from 4.1 to 5.0.  Are there any
>>> objections to this ?
>> 
>> 
>> I would also like to propose the next 5.0 release to be 5.0-beta1
>> 
>> With the aim of reaching GA for the Summit, I would like to suggest we
>> work towards the best-case scenario of 5.0-beta1 in two weeks and
>> 5.0-rc1 first week Dec.
>> 
>> I know this is a huge ask with lots of unknowns we can't actually
>> commit to.  But I believe it is a worthy goal, and possible if nothing
>> sideswipes us – but we'll need all the help we can get this month to
>> make it happen.
> 


Re: Push TCM (CEP-21) and Accord (CEP-15) to 5.1 (and cut an immediate 5.1-alpha1)

2023-10-30 Thread J. D. Jordan
That is my understanding as well. If the TCM branch, and the Accord-based-on-TCM 
branch, are ready to commit by ~12/1, we can cut a 5.1 branch and then a 5.1-alpha 
release.  Where “ready to commit” means our usual things of two committer +1s and 
green CI, etc.

If we are not ready to commit, then I propose that as long as everything in the 
accord+tcm Apache repo branch has had two committer +1’s, but maybe people are still 
working on fixes for getting CI green or similar, we cut a 5.1-preview build from the 
feature branch to vote on, with known issues documented.  This would not be the 
preferred path, but would be a way to have a voted-on release for the Summit.

-Jeremiah

On Oct 30, 2023, at 5:59 PM, Mick Semb Wever wrote:

Hoping we can get clarity on this.

The proposal was, once TCM and Accord merge to trunk, then immediately branch 
cassandra-5.1 and cut an immediate 5.1-alpha1 release.  This was to focus on 
stabilising TCM and Accord as soon as it lands, hence the immediate branching.  And 
the alpha release as that is what our Release Lifecycle states it to be.
https://cwiki.apache.org/confluence/display/CASSANDRA/Release+Lifecycle

My understanding is that there was no squeezing of extra features into 5.1 after 
TCM+Accord lands, and there's no need for a "preview" release – we move straight to 
the alpha, as our lifecycle states.  And we will describe all usability shortcomings 
and bugs with the alpha; our lifecycle docs permit this, if we feel the need to.

All this said, if TCM does not merge before the Summit, and we want to get a release 
into user hands, it has been suggested we cut a preview release 5.1-preview1 off the 
feature branch.  This is a different scenario, and only a mitigation plan.

On Thu, 26 Oct 2023 at 14:20, Benedict wrote:

The time to stabilise is orthogonal to the time we branch. Once we branch we stop 
accepting new features for the branch, and work to stabilise.

My understanding is we will branch as soon as we have a viable alpha containing TCM 
and Accord. That means pretty soon after they land in the project, which we expect to 
be around the Summit.

If this isn’t the expectation we should make that clear, as it will affect how this 
decision is made.

On 26 Oct 2023, at 10:14, Benjamin Lerer wrote:

> Regarding the release of 5.1, I understood the proposal to be that we cut an actual 
> alpha, thereby sealing the 5.1 release from new features. Only features merged 
> before we cut the alpha would be permitted, and the alpha should be cut as soon as 
> practicable. What exactly would we be waiting for?

The problem I believe is about expectations. It seems that your expectation is that a 
release with only TCM and Accord will reach GA quickly. Based on the time it took us 
to release 4.1, I am simply expecting more delays (a GA around end of May, June). In 
which case it seems to me that we could be interested in shipping more stuff in the 
meantime (thinking of CASSANDRA-15254 or CEP-29 for example).

I do not have a strong opinion, I just want to make sure that we all share the same 
understanding and fully understand what we agree upon.

On Thu, 26 Oct 2023 at 10:59, Benjamin Lerer wrote:

> I am surprised this needs to be said, but - especially for long-running CEPs - you 
> must involve yourself early, and certainly within some reasonable time of being 
> notified the work is ready for broader input and review. In this case, more than 
> six months ago.

It is unfortunately more complicated than that because six months ago Ekaterina and I 
were working on supporting Java 17 and dropping Java 8, which was needed by different 
ongoing works. We both missed the announcement that TCM was ready for review and 
anyway would not have been available at that time. Maxim asked me ages ago for a 
review of CASSANDRA-15254, more than 6 months ago, and I have not been able to help 
him so far. We all have limited bandwidth and can miss some announcements.

The project has grown and a lot of things are going on in parallel. There are also 
more interdependencies between the different projects. In my opinion what we are 
lacking is a global overview of the different things going on in the project and some 
rough idea of the status of the different significant pieces. It would allow us to 
better organize ourselves.

On Thu, 26 Oct 2023 at 00:26, Benedict wrote:

I have spoken privately with Ekaterina, and to clear up some possible ambiguity: I 
realise nobody has demanded a delay to this work to conduct additional reviews; a 
couple of folk have however said they would prefer one.

My point is that, as a community, we need to work on ensuring folk that care about a 
CEP participate at an appropriate time. If they aren’t able to, the consequences of 
that are for them to bear. We should be working to avoid surprises as CEPs start to 
land. To this end, I think we should work on some additional paragraphs for the 
governance doc covering expectations around the 
Re: [VOTE] Accept java-driver

2023-10-06 Thread J. D. Jordan
The software grant agreement covers all donated code.  The ASF does not need any 
historical agreements; the agreement giving the ASF copyright etc. is the Software 
Grant Agreement.  Yes, any future work done after donation needs to be covered by ASF 
CLAs.

But happy to see someone ask legal@ to confirm this so we can move forward.

On Oct 6, 2023, at 3:33 AM, Benedict wrote:

Are we certain about that? It’s unclear to me from the published guidance; would be 
nice to get legal to weigh in to confirm to make sure we aren’t skipping any steps, as 
we haven’t been involved until now so haven’t the visibility. At the very least it 
reads to me that anyone expected to be maintaining the software going forwards should 
have a CLA on file with ASF, but I’d have expected the ASF to also want a record of 
the historic CLAs.

On 6 Oct 2023, at 09:28, Mick Semb Wever wrote:

On Thu, 5 Oct 2023 at 17:50, Jeremiah Jordan wrote:

> I think this is covered by the grant agreement?
> https://www.apache.org/licenses/software-grant-template.pdf
> 
> "2. Licensor represents that, to Licensor's knowledge, Licensor is legally entitled 
> to grant the above license. Licensor agrees to notify the Foundation of any facts or 
> circumstances of which Licensor becomes aware and which makes or would make 
> Licensor's representations in this License Agreement inaccurate in any respect."
> 
> On Oct 5, 2023 at 4:35:08 AM, Benedict wrote:
> 
>> Surely it needs to be shared with the foundation and the PMC so we can verify? Or 
>> at least have ASF legal confirm they have received and are satisfied with the 
>> tarball? It certainly can’t be kept private to DS, AFAICT.  Of course it shouldn’t 
>> be shared publicly, but not sure how the PMC can fulfil its verification function 
>> here without it.

Correct, thanks JD.  These are CLAs that were submitted to DS, not to ASF.  It is DS's 
legal responsibility to ensure that what they are donating they have the right to 
(i.e. have the copyright) when submitting the SGA.  It's not on the ASF or the PMC to 
verify this.  Here we're simply demonstrating that we (DS) have done that due 
diligence, and are keeping record of it.


Re: [DISCUSS] Add JVector as a dependency for CEP-30

2023-09-22 Thread J. D. Jordan
This GenAI-generated-code discussion should probably be its own mailing list DISCUSS 
thread?  It applies to all source code we take in, and accept copyright assignment of, 
not to jars we depend on, and not only to vector related code contributions.

On Sep 22, 2023, at 7:29 AM, Josh McKenzie wrote:

So if we're going to chat about GenAI on this thread here, 2 things:

A dependency we pull in != a code contribution (I am not a lawyer, but my 
understanding is that with the former the liability rests on the provider of the lib 
to ensure it's in compliance with their claims to copyright, and it's not sticky). 
Easier to transition to a different dep if there's something API compatible or 
similar.

With code contributions we take in, we take on some exposure in terms of copyright and 
infringement. git revert can be painful.

For this thread, here's an excerpt from the ASF policy:

> a recommended practice when using generative AI tooling is to use tools with 
> features that identify any included content that is similar to parts of the tool’s 
> training data, as well as the license of that content.
> 
> Given the above, code generated in whole or in part using AI can be contributed if 
> the contributor ensures that:
> 1. The terms and conditions of the generative AI tool do not place any restrictions 
>    on use of the output that would be inconsistent with the Open Source Definition 
>    (e.g., ChatGPT’s terms are inconsistent).
> 2. At least one of the following conditions is met:
>    2.1 The output is not copyrightable subject matter (and would not be even if 
>        produced by a human)
>    2.2 No third party materials are included in the output
>    2.3 Any third party materials that are included in the output are being used with 
>        permission (e.g., under a compatible open source license) of the third party 
>        copyright holders and in compliance with the applicable license terms
> 3. A contributor obtains reasonable certainty that conditions 2.2 or 2.3 are met if 
>    the AI tool itself provides sufficient information about materials that may have 
>    been copied, or from code scanning results.  E.g. AWS CodeWhisperer recently 
>    added a feature that provides notice and attribution.
> 
> When providing contributions authored using generative AI tooling, a recommended 
> practice is for contributors to indicate the tooling used to create the 
> contribution. This should be included as a token in the source control commit 
> message, for example including the phrase “Generated-by”.

I think the real challenge right now is ensuring that the output from an LLM doesn't 
include a string of tokens that's identical to something in its input training dataset 
if it's trained on non-permissively licensed inputs. That plus the risk of, at least 
in the US, the courts landing on the side of saying that not only is the output of 
generative AI not copyrightable, but that there's legal liability on either the users 
of the tools or the creators of the models for some kind of copyright infringement. 
That can be sticky; if we take PRs that end up with that liability exposure, we end up 
in a place where either the foundation could be legally exposed and/or we'd need to 
revert some pretty invasive code / changes.

For example, Microsoft and OpenAI have publicly committed to paying legal fees for 
people sued for copyright infringement for using their tools: 
https://www.verdict.co.uk/microsoft-to-pay-legal-fees-for-customers-sued-while-using-its-ai-products/?cf-view
Pretty interesting, and not a step a provider would take in an environment where 
things were legally clear and settled.

So while the usage of these things is apparently incredibly pervasive right now, 
"everybody is doing it" is a pretty high risk legal defense. :)

On Fri, Sep 22, 2023, at 8:04 AM, Mick Semb Wever wrote:

On Thu, 21 Sept 2023 at 10:41, Benedict wrote:

> At some point we have to discuss this, and here’s as good a place as any. There’s a 
> great news article published talking about how generative AI was used to assist in 
> developing the new vector search feature, which is itself really cool. Unfortunately 
> it *sounds* like it runs afoul of the ASF legal policy on use for contributions to 
> the project. This proposal is to include a dependency, but I’m not sure if that 
> avoids the issue, and I’m equally uncertain how much this issue is isolated to the 
> dependency (or affects it at all?)
> 
> Anyway, this is an annoying discussion we need to have at some point, so raising it 
> here now so we can figure it out.
> 
> [1] https://thenewstack.io/how-ai-helped-us-add-vector-search-to-cassandra-in-6-weeks/
> [2] https://www.apache.org/legal/generative-tooling.html

My reading of the ASF's GenAI policy is that any generated work in the jvector library 
(and cep-30 ?) is not copyrightable, and that makes it ok for us to include.

If there was a trace to copyrighted work, or the tooling imposed a copyright or 
restrictions, we would then have to take considerations.
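
As a small, purely illustrative example of the commit-message practice the policy 
excerpt above recommends (the summary line and placeholder tool name here are made 
up):

    Add analyzer options parsing for SAI

    Generated-by: <name of the generative AI tool used for part of this change>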

Re: [DISCUSS] Add JVector as a dependency for CEP-30

2023-09-21 Thread J. D. Jordan
Mick,

I am confused by your +1 here. You are +1 on including it, but only if the copyright 
were different?  Given DataStax wrote the library, I don’t see how that will change?

On Sep 21, 2023, at 3:05 AM, Mick Semb Wever wrote:

On Wed, 20 Sept 2023 at 18:31, Mike Adamson wrote:

> The original patch for CEP-30 brought several modified Lucene classes in-tree to 
> implement the concurrent HNSW graph used by the vector index.
> 
> These classes are now being replaced with the io.github.jbellis.jvector library, 
> which contains an improved diskANN implementation for the on-disk graph format. The 
> repo for this library is here: https://github.com/jbellis/jvector.
> 
> The library does not replace any code used by SAI or other parts of the codebase and 
> is used solely by the vector index.
> 
> I would welcome any feedback on this change.

+1

but to nit-pick on legalities… it would be nice to avoid including a library 
copyrighted to DataStax (for historical reasons).

The Jamm library is in a similar state, in that it has a license that refers to the 
copyright owner but does not state the copyright owner anywhere.

Can we get a copyright on Jamm, and can both not be DataStax (pls)?


Re: [DISCUSS] Add JVector as a dependency for CEP-30

2023-09-20 Thread J. D. Jordan
+1 for jvector rather than forked lucene classes.

On Sep 20, 2023, at 5:14 PM, German Eichberger via dev wrote:

+1

I am biased because DiskANN is from Microsoft Research, but it's a good 
library/algorithm.

From: Mike Adamson
Sent: Wednesday, September 20, 2023 8:58 AM
To: dev
Subject: [EXTERNAL] [DISCUSS] Add JVector as a dependency for CEP-30

The original patch for CEP-30 brought several modified Lucene classes in-tree to 
implement the concurrent HNSW graph used by the vector index.

These classes are now being replaced with the io.github.jbellis.jvector library, which 
contains an improved diskANN implementation for the on-disk graph format.

The repo for this library is here: https://github.com/jbellis/jvector.

The library does not replace any code used by SAI or other parts of the codebase and 
is used solely by the vector index.

I would welcome any feedback on this change.

-- 
Mike Adamson
Engineering


Re: [DISCUSS] Vector type and empty value

2023-09-19 Thread J. D. Jordan
When does empty mean null?  My understanding was that empty is a valid value 
for the types that support it, separate from null (aka a tombstone). Do we have 
types where writing an empty value creates a tombstone?

I agree with David that my preference would be for only blob and string like 
types to support empty. It’s too late for the existing types, but we should 
hold to this going forward. Which is what I think the idea was in 
https://issues.apache.org/jira/browse/CASSANDRA-8951 as well?  That it was sad 
the existing numerics were emptiable, but too late to change, and we could 
correct it for newer types.

> On Sep 19, 2023, at 12:12 PM, David Capwell  wrote:
> 
> 
>> 
>> When we introduced TINYINT and SMALLINT (CASSANDRA-8951) we started making 
>> types non -emptiable. This approach makes more sense to me as having to deal 
>> with empty value is error prone in my opinion.
> 
> I agree it’s confusing, and in the patch I found that different code paths 
> didn’t handle things correctly as we have some times (most) that support 
> empty bytes, and some that do not…. Empty also has different meaning in 
> different code paths; for most it means “null”, and for some other types it 
> means “empty”…. To try to make things more clear I added 
> org.apache.cassandra.db.marshal.AbstractType#isNull(V, 
> org.apache.cassandra.db.marshal.ValueAccessor) to the type system so each 
> type can define if empty is null or not.
> 
>> I also think that it would be good to standardize on one approach to avoid 
>> confusion.
> 
> I agree, but also don’t feel it’s a perfect one-size-fits-all thing…. Let’s 
> say I have a “blob” type and I write an empty byte… what does this mean?  
> What does it mean for "text" type?  The fact I get back a null in both those 
> cases was very confusing to me… I do feel that some types should support 
> empty, and the common code of empty == null I think is very brittle 
> (blob/text was not correct in different places due to this...)… so I am cool 
> with removing that relationship, but don’t think we should have a rule 
> blocking empty for all current / future types as it some times does make 
> sense.
> 
>> empty vector (I presume) for the vector type?
> 
> Empty vectors (vector[0]) are blocked at the type level, the smallest vector 
> is vector[1]
> 
>> as types that can never be null
> 
> One pro here is that “null” is cheaper (in some regards) than delete (though 
> we can never purge), but having 2 similar behaviors (write null, do a delete) 
> at the type level is a bit confusing… Right now I am allowed to do the 
> following (the below isn’t valid CQL, its a hybrid of CQL + Java code…)
> 
> CREATE TABLE fluffykittens (pk int primary key, cuteness int);
> INSERT INTO fluffykittens (pk, cuteness) VALUES (0, new byte[0])
> 
> CREATE TABLE typesarehard (pk1 int, pk2 int, cuteness int, PRIMARY KEY ((pk1, 
> pk2));
> INSERT INTO typesarehard (pk1, pk2, cuteness) VALUES (new byte[0], new 
> byte[0], new byte[0]) — valid as the partition key is not empty as its a 
> composite of 2 empty values, this is the same as new byte[2]
> 
> The first time I ever found out that empty bytes was valid was when a user 
> was trying to abuse this in collections (also the fact collections support 
> null in some cases and not others is fun…)…. It was blowing up in random 
> places… good times!
> 
> I am personally not in favor of allowing empty bytes (other than for blob / 
> text as that is actually valid for the domain), but having similar types 
> having different semantics I feel is more problematic...
> 
>>> On Sep 19, 2023, at 8:56 AM, Josh McKenzie  wrote:
>>> 
>>> I am strongly in favour of permitting the table definition forbidding nulls 
>>> - and perhaps even defaulting to this behaviour. But I don’t think we 
>>> should have types that are inherently incapable of being null.
>> I'm with Benedict. Seems like this could help prevent whatever "nulls in 
>> primary key columns" problems Aleksey was alluding to on those tickets back 
>> in the day that pushed us towards making the new types non-emptiable as well 
>> (i.e. primary keys are non-null in table definition).
>> 
>> Furthering Alex' question, having a default value for unset fields in any 
>> non-collection context seems... quite surprising to me in a database. I 
>> could see the argument for making container / collection types non-nullable, 
>> maybe, but that just keeps us in a potential straddle case (some types 
>> nullable, some not).
>> 
>>> On Tue, Sep 19, 2023, at 8:22 AM, Benedict wrote:
>>> 
>>> If I understand this suggestion correctly it is a whole can of worms, as 
>>> types that can never be null prevent us ever supporting outer joins that 
>>> return these types.
>>> 
>>> I am strongly in favour of permitting the table definition forbidding nulls 
>>> - and perhaps even defaulting to this behaviour. But I don’t think we 
>>> should have types that are inherently incapable of being null. I also 
>>> certainly 
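
As a rough sketch of the per-type isNull hook David Capwell describes earlier in this 
thread (paraphrased for illustration; not the exact signatures or code in 
org.apache.cassandra.db.marshal), each type can decide for itself whether an empty 
buffer means “null” or a legitimate empty value:

    // Legacy "emptiable" numeric-style types: an empty buffer has historically meant null.
    public <V> boolean isNull(V value, ValueAccessor<V> accessor)
    {
        return value == null || accessor.isEmpty(value);
    }

    // Types where empty is a real value in the domain (e.g. text/blob):
    // only an absent value is null; an empty buffer is just an empty value.
    public <V> boolean isNull(V value, ValueAccessor<V> accessor)
    {
        return value == null;
    }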

Re: [DISCUSS] Addition of smile-nlp test dependency for CEP-30

2023-09-13 Thread J. D. Jordan
Reading through the smile license again, it is licensed pure GPL 3, not GPL with 
classpath exception. So I think that kills all debate here.

-1 on inclusion

On Sep 13, 2023, at 2:30 PM, Jeremiah Jordan wrote:

I wonder if it can easily be replaced with Apache OpenNLP?  It also provides an 
implementation of GloVe.
https://opennlp.apache.org/docs/2.3.0/apidocs/opennlp-tools/opennlp/tools/util/wordvector/Glove.html

On Sep 13, 2023 at 1:17:46 PM, Benedict wrote:

There’s a distinction for spotbugs and other build related tools, where they can be 
downloaded and used during the build so long as they’re not critical to the build 
process.  They have to be downloaded dynamically in binary form I believe, though; 
they cannot be included in the release.

So it’s not really in conflict with what Jeff is saying, and my recollection accords 
with Jeff’s.

On 13 Sep 2023, at 17:42, Brandon Williams wrote:

On Wed, Sep 13, 2023 at 11:37 AM Jeff Jirsa wrote:

> You can open a legal JIRA to confirm, but based on my understanding (and 
> re-confirming reading https://www.apache.org/legal/resolved.html#category-a ):

We should probably get clarification here regardless, iirc this came up when we were 
considering SpotBugs too.





Re: [VOTE] Release Apache Cassandra 5.0-alpha1

2023-08-30 Thread J. D. Jordan
These are not compiled code. They are serialized dumps of bloom filter data.

> On Aug 28, 2023, at 9:58 PM, Justin Mclean  wrote:
> 
> 1../test/data/serialization/3.0/utils.BloomFilter1000.bin
> 2. ./test/data/serialization/4.0/utils.BloomFilter1000.bin


Re: [VOTE] Release Apache Cassandra 5.0-alpha1

2023-08-25 Thread J. D. Jordan
+1 nb.

I think it’s good to get an alpha out there for people to start trying out the 
features which are done.

On Aug 25, 2023, at 4:03 PM, Mick Semb Wever wrote:

There was lazy consensus on this thread: 
https://lists.apache.org/thread/mzj3dq8b7mzf60k6mkby88b9n9ywmsgw and also the 
announcement about the staged release was out for a longer period of time to raise 
objections.  Not that it's too late to object, just that it's been raised for 
attention twice now with no objections :-)

For me, I'm keen to get it out simply because there are so many features coming in 5.0 
it helps bring a bit more awareness to each.  There's also a bit of a wait for TCM 
(from what I've heard off-thread) and we want all the testing we can get, and it takes 
time to get people to come around to doing it.

On Fri, 25 Aug 2023 at 10:36, German Eichberger via dev wrote:

I concur. Those are major features...

From: C. Scott Andreas
Sent: Friday, August 25, 2023 9:06 AM
To: dev@cassandra.apache.org
Subject: Re: [VOTE] Release Apache Cassandra 5.0-alpha1

A snapshot artifact seems more appropriate for early testing to me, rather than a voted / released build issued by the project given how much has yet to land.



- Scott

On Aug 25, 2023, at 8:46 AM, Ekaterina Dimitrova  wrote:





+1


On Fri, 25 Aug 2023 at 11:14, Mick Semb Wever  wrote:




Proposing the test build of Cassandra 5.0-alpha1 for release.


DISCLAIMER, this alpha release does not contain the expected 5.0
features: Vector Search (CEP-30), Transactional Cluster Metadata
(CEP-21) and Accord Transactions (CEP-15).  These features will land
in a later alpha release.


Please also note that this is an alpha release and what that means, further info at https://cwiki.apache.org/confluence/display/CASSANDRA/Release+Lifecycle 

sha1: 62cb03cc7311384db6619a102d1da6a024653fa6
Git:

https://github.com/apache/cassandra/tree/5.0-alpha1-tentative
Maven Artifacts:

https://repository.apache.org/content/repositories/orgapachecassandra-1314/org/apache/cassandra/cassandra-all/5.0-alpha1/

The Source and Build Artifacts, and the Debian and RPM packages and repositories, are available here:

https://dist.apache.org/repos/dist/dev/cassandra/5.0-alpha1/

The vote will be open for 72 hours (longer if needed). Everyone who has tested the build is invited to vote. Votes by PMC members are considered binding. A vote passes if
 there are at least three binding +1s and no -1's.

[1]: CHANGES.txt:

https://github.com/apache/cassandra/blob/5.0-alpha1-tentative/CHANGES.txt
[2]: NEWS.txt:

https://github.com/apache/cassandra/blob/5.0-alpha1-tentative/NEWS.txt
















Re: Tokenization and SAI query syntax

2023-08-07 Thread J. D. Jordan
I am also -1 on directly exposing lucene-like syntax here. Besides being ugly, SAI is 
not lucene, and I do not think we should start using lucene syntax for it; it will 
make people think they can do everything else lucene allows.

On Aug 7, 2023, at 5:13 AM, Benedict wrote:

I’m strongly opposed to ':'.  It is very dissimilar to our current operators. CQL is 
already not the prettiest language, but let’s not make it a total mish mash.

On 7 Aug 2023, at 10:59, Mike Adamson wrote:

I am also in agreement with 'column : token' in that 'I don't hate it', but I'd like 
to offer an alternative to this in 'column HAS token'. HAS is currently not a keyword 
that we use so wouldn't cause any brain conflicts.

While I don't hate ':' I have a particular dislike of the lucene search syntax because 
of its terseness and lack of easy readability. Saying that, I'm happy to go with ':' 
if that is the decision.

On Fri, 4 Aug 2023 at 00:23, Jon Haddad <rustyrazorbl...@apache.org> wrote:

Assuming SAI is a superset of SASI, and we were to set up something so that SASI 
indexes auto convert to SAI, this gives even more weight to my point regarding how 
differing behavior for the same syntax can lead to issues.  Imo the best case scenario 
results in the user not even noticing their indexes have changed.

A (maybe better?) alternative is to add a flag to the index configuration for 
"compatibility mode", which might address the concerns around using an equality 
operator when it actually is a partial match.
For what it's worth, I'm in agreement that = should mean full equality and not token match.

On 2023/08/03 03:56:23 Caleb Rackliffe wrote:
> For what it's worth, I'd very much like to completely remove SASI from the
> codebase for 6.0. The only remaining functionality gaps at the moment are
> LIKE (prefix/suffix) queries and its limited tokenization
> capabilities, both of which already have SAI Phase 2 Jiras.
> 
> On Wed, Aug 2, 2023 at 7:20 PM Jeremiah Jordan <jerem...@datastax.com>
> wrote:
> 
> > SASI just uses “=“ for the tokenized equality matching, which is the exact
> > thing this discussion is about changing/not liking.
> >
> > > On Aug 2, 2023, at 7:18 PM, J. D. Jordan <jeremiah.jor...@gmail.com>
> > wrote:
> > >
> > > I do not think LIKE actually applies here. LIKE is used for prefix,
> > contains, or suffix searches in SASI depending on the index type.
> > >
> > > This is about exact matching of tokens.
> > >
> > >> On Aug 2, 2023, at 5:53 PM, Jon Haddad <rustyrazorbl...@apache.org>
> > wrote:
> > >>
> > >> Certain bits of functionality also already exist on the SASI side of
> > things, but I'm not sure how much overlap there is.  Currently, there's a
> > LIKE keyword that handles token matching, although it seems to have some
> > differences from the feature set in SAI.
> > >>
> > >> That said, there seems to be enough of an overlap that it would make
> > sense to consider using LIKE in the same manner, doesn't it?  I think it
> > would be a little odd if we have different syntax for different indexes.
> > >>
> > >> https://github.com/apache/cassandra/blob/trunk/doc/SASI.md
> > >>
> > >> I think one complication here is that there seems to be a desire, that
> > I very much agree with, to expose as much of the underlying flexibility of
> > Lucene as much as possible.  If it means we use Caleb's suggestion, I'd ask
> > that the queries that SASI and SAI both support use the same syntax, even
> > if it means there's two ways of writing the same query.  To use Caleb's
> > example, this would mean supporting both LIKE and the `expr` column.
> > >>
> > >> Jon
> > >>
> > >>>> On 2023/08/01 19:17:11 Caleb Rackliffe wrote:
> > >>> Here are some additional bits of prior art, if anyone finds them
> > useful:
> > >>>
> > >>>
> > >>> The Stratio Lucene Index -
> > >>> https://github.com/Stratio/cassandra-lucene-index#examples
> > >>>
> > >>> Stratio was the reason C* added the "expr" functionality. They embedded
> > >>> something similar to ElasticSearch JSON, which probably isn't my
> > favorite
> > >>> choice, but it's there.
> > >>>
> > >>>
> > >>> The ElasticSearch match query syntax -
> > >>>
> > https://urldefense.com/v3/__https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html__;!!PbtH5S7Ebw!ZHwYJ2xkivwTzYgjkp5QFAzALXCWPqkga6GBD-m2aK3j06ioSCRPsdZD0CIe50VpRrtW-1rY_m6lrSpp7zVlAf0MsxZ9$
> > >>>
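
For readers skimming these threads, the candidate spellings being debated look roughly 
like the hedged sketch below. The table, index name, and term are made up, and none of 
these forms had been adopted at the time of this discussion; they are shown purely to 
contrast the options mentioned above.

    -- Hypothetical table with an analyzed SAI index on the "body" column
    SELECT id FROM posts WHERE body : 'token';                 -- Lucene-inspired ':' operator
    SELECT id FROM posts WHERE body CONTAINS 'token';          -- CONTAINS-style keyword
    SELECT id FROM posts WHERE body LIKE '%token%';            -- SASI-style LIKE matching
    SELECT id FROM posts WHERE expr(body_idx, 'body:token');   -- existing expr() escape hatch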

Re: Tokenization and SAI query syntax

2023-08-02 Thread J. D. Jordan
I do not think LIKE actually applies here. LIKE is used for prefix, contains, 
or suffix searches in SASI depending on the index type.

This is about exact matching of tokens.

> On Aug 2, 2023, at 5:53 PM, Jon Haddad  wrote:
> 
> Certain bits of functionality also already exist on the SASI side of things, 
> but I'm not sure how much overlap there is.  Currently, there's a LIKE 
> keyword that handles token matching, although it seems to have some 
> differences from the feature set in SAI.  
> 
> That said, there seems to be enough of an overlap that it would make sense to 
> consider using LIKE in the same manner, doesn't it?  I think it would be a 
> little odd if we have different syntax for different indexes.  
> 
> https://github.com/apache/cassandra/blob/trunk/doc/SASI.md
> 
> I think one complication here is that there seems to be a desire, that I very 
> much agree with, to expose as much of the underlying flexibility of Lucene as 
> much as possible.  If it means we use Caleb's suggestion, I'd ask that the 
> queries that SASI and SAI both support use the same syntax, even if it means 
> there's two ways of writing the same query.  To use Caleb's example, this 
> would mean supporting both LIKE and the `expr` column.  
> 
> Jon
> 
>> On 2023/08/01 19:17:11 Caleb Rackliffe wrote:
>> Here are some additional bits of prior art, if anyone finds them useful:
>> 
>> 
>> The Stratio Lucene Index -
>> https://github.com/Stratio/cassandra-lucene-index#examples
>> 
>> Stratio was the reason C* added the "expr" functionality. They embedded
>> something similar to ElasticSearch JSON, which probably isn't my favorite
>> choice, but it's there.
>> 
>> 
>> The ElasticSearch match query syntax -
>> https://urldefense.com/v3/__https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html__;!!PbtH5S7Ebw!ZHwYJ2xkivwTzYgjkp5QFAzALXCWPqkga6GBD-m2aK3j06ioSCRPsdZD0CIe50VpRrtW-1rY_m6lrSpp7zVlAf0MsxZ9$
>>  
>> 
>> Again, not my favorite. It's verbose, and probably too powerful for us.
>> 
>> 
>> ElasticSearch's documentation for the basic Lucene query syntax -
>> https://urldefense.com/v3/__https://www.elastic.co/guide/en/elasticsearch/reference/8.9/query-dsl-query-string-query.html*query-string-syntax__;Iw!!PbtH5S7Ebw!ZHwYJ2xkivwTzYgjkp5QFAzALXCWPqkga6GBD-m2aK3j06ioSCRPsdZD0CIe50VpRrtW-1rY_m6lrSpp7zVlAXEPP1sK$
>>  
>> 
>> One idea is to take the basic Lucene index, which it seems we already have
>> some support for, and feed it to "expr". This is nice for two reasons:
>> 
>> 1.) People can just write Lucene queries if they already know how.
>> 2.) No changes to the grammar.
>> 
>> Lucene has distinct concepts of filtering and querying, and this is kind of
>> the latter. I'm not sure how, for example, we would want "expr" to interact
>> w/ filters on other column indexes in vanilla CQL space...
>> 
>> 
>>> On Mon, Jul 24, 2023 at 9:37 AM Josh McKenzie  wrote:
>>> 
>>> `column CONTAINS term`. Contains is used by both Java and Python for
>>> substring searches, so at least some users will be surprised by term-based
>>> behavior.
>>> 
>>> I wonder whether users are in their "programming language" headspace or in
>>> their "querying a database" headspace when interacting with CQL? i.e. this
>>> would only present confusion if we expected users to be thinking in the
>>> idioms of their respective programming languages. If they're thinking in
>>> terms of SQL, MATCHES would probably end up confusing them a bit since it
>>> doesn't match the general structure of the MATCH operator.
>>> 
>>> That said, I also think CONTAINS loses something important that you allude
>>> to here Jonathan:
>>> 
>>> with corresponding query-time tokenization and analysis.  This means that
>>> the query term is not always a substring of the original string!  Besides
>>> obvious transformations like lowercasing, you have things like
>>> PhoneticFilter available as well.
>>> 
>>> So to me, neither MATCHES nor CONTAINS are particularly great candidates.
>>> 
>>> So +1 to the "I don't actually hate it" sentiment on:
>>> 
>>> column : term`. Inspired by Lucene’s syntax
>>> 
>>> 
 On Mon, Jul 24, 2023, at 8:35 AM, Benedict wrote:
>>> 
>>> 
>>> I have a strong preference not to use the name of an SQL operator, since
>>> it precludes us later providing the SQL standard operator to users.
>>> 
>>> What about CONTAINS TOKEN term? Or CONTAINS TERM term?
>>> 
>>> 
 On 24 Jul 2023, at 13:34, Andrés de la Peña  wrote:
>>> 
>>> 
>>> `column = term` is definitively problematic because it creates an
>>> ambiguity when the queried column belongs to the primary key. For some
>>> queries we wouldn't know whether the user wants a primary key query using
>>> regular equality or an index query using the analyzer.
>>> 
>>> `term_matches(column, term)` seems quite clear and hard to misinterpret,
>>> but it's quite long to write and its implementation will be challenging
>>> since we would need a bunch of special casing around 

Re: August 5.0 Freeze (with waivers…) and a 5.0-alpha1

2023-07-26 Thread J. D. Jordan
I think this plan seems reasonable to me. +1

-Jeremiah

> On Jul 26, 2023, at 5:28 PM, Mick Semb Wever  wrote:
> 
> 
> 
> The previous thread¹ on when to freeze 5.0 landed on freezing the first week 
> of August, with a waiver in place for TCM and Accord to land later (but 
> before October).
> 
> With JDK8 now dropped and SAI and UCS merged, the only expected 5.0 work that 
> hasn't landed is Vector search (CEP-30).  
> 
> Are there any objections to a waiver on Vector search?  All the groundwork: 
> SAI and the vector type; has been merged, with all remaining work expected to 
> land in August.
> 
> I'm keen to freeze and see us shift gears – there's already SO MUCH in 5.0 
> and a long list of flakies.  It takes time and patience to triage and 
> identify the bugs that hit us before GA.  The freeze is about being "mostly 
> feature complete",  so we have room for things before our first beta 
> (precedence is to ask).   If we hope for a GA by December, account for the 6 
> weeks turnaround time for cutting and voting on one alpha, one beta, and one 
> rc release, and the quiet period that August is, we really only have 
> September and October left.  
> 
> I already feel this is asking a bit of a miracle from us given how 4.1 went 
> (and I'm hoping I will be proven wrong). 
> 
> In addition, are there any objections to cutting an 5.0-alpha1 release as 
> soon as we freeze?  
> 
> This is on the understanding vector, tcm and accord will become available in 
> later alphas.  Originally the discussion¹ was waiting for Accord for alpha1, 
> but a number of folk off-list have requested earlier alphas to help with 
> testing.
> 
> 
> ¹) https://lists.apache.org/thread/9c5cnn57c7oqw8wzo3zs0dkrm4f17lm3


Re: Status Update on CEP-7 Storage Attached Indexes (SAI)

2023-07-26 Thread J. D. Jordan
Thanks for all the work here!

On Jul 26, 2023, at 1:57 PM, Caleb Rackliffe wrote:

Alright, the cep-7-sai branch is now merged to trunk!

Now we move to addressing the most urgent items from "Phase 2" (CASSANDRA-18473) 
before (and in the case of some testing, after) the 5.0 freeze...

On Wed, Jul 26, 2023 at 6:07 AM Jeremy Hanna wrote:

Thanks Caleb and Mike and Zhao and Andres and Piotr and everyone else involved with 
the SAI implementation!

On Jul 25, 2023, at 3:01 PM, Caleb Rackliffe wrote:

Just a quick update... With CASSANDRA-18670 complete, and all remaining items in the 
category of performance optimizations and further testing, the process of merging to 
trunk will likely start today, beginning with a final rebase on the current trunk and 
J11 and J17 test runs.

On Tue, Jul 18, 2023 at 3:47 PM Caleb Rackliffe wrote:

Hello there!

After much toil, the first phase of CEP-7 is nearing completion (see CASSANDRA-16052). 
There are presently two issues to resolve before we'd like to merge the cep-7-sai 
feature branch and all its goodness to trunk:

- CASSANDRA-18670 - Importer should build SSTable indexes successfully before making 
  new SSTables readable (in review)
- CASSANDRA-18673 - Reduce size of per-SSTable index components (in progress)

(We've been getting clean CircleCI runs for a while now, and have been using the 
multiplexer to sniff out as much flakiness as possible up front.)

Once merged to trunk, the next steps are:

1.) Finish a Harry model that we can use to further fuzz test SAI before 5.0 releases 
(see CASSANDRA-18275). We've done a fair amount of fuzz/randomized testing at the 
component level, but I'd still consider Harry (at least around single-partition query 
use-cases) a critical item for us to have confidence before release.
2.) Start pursuing Phase 2 items as time and our needs allow (see CASSANDRA-18473).

A reminder: SAI is a secondary index, and therefore is by definition an opt-in 
feature, and has no explicit "feature flag". However, its availability to users is 
still subject to the secondary_indexes_enabled guardrail, which currently defaults to 
allowing creation.

Any thoughts, questions, or comments on the pre-merge plan here?

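As context for anyone trying SAI out once this lands: it is opt-in per column, so nothing changes for a table until an index is created on it. A minimal sketch using the custom-index form (keyspace, table, and column names here are illustrative):

    CREATE TABLE store.products (
        id uuid PRIMARY KEY,
        category text,
        price decimal
    );

    -- Opt in to SAI on one column; still subject to the secondary_indexes_enabled guardrail
    CREATE CUSTOM INDEX products_category_idx ON store.products (category)
    USING 'StorageAttachedIndex';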



Re: [DISCUSS] Using ACCP or tc-native by default

2023-07-26 Thread J. D. Jordan
Enabling SSL for the upgrade dtests would cover this use case. If those don’t 
currently exist, I see no reason it won’t work, so I would be fine with someone 
figuring it out post-merge if there is a concern.  Which JCE provider you use 
should have no upgrade concerns.

-Jeremiah

> On Jul 26, 2023, at 1:07 PM, Miklosovic, Stefan 
>  wrote:
> 
> Am I understanding it correctly that the tests you are talking about are only 
> required in case we make ACCP the default provider?
> 
> I can live with not making it the default and still deliver it if tests are not 
> required. I do not think that these kinds of tests were required a couple of 
> mails ago when opt-in was on the table.
> 
> While I tend to agree with people here who seem to consider testing this 
> scenario an unnecessary exercise, I am afraid that I will not be able to 
> deliver that, as testing something like this is quite a complicated matter. 
> There are a lot of aspects which could be tested that I can not even enumerate 
> right now ... so I am trying to meet you somewhere in the middle.
> 
> 
> From: Mick Semb Wever 
> Sent: Wednesday, July 26, 2023 17:34
> To: dev@cassandra.apache.org
> Subject: Re: [DISCUSS] Using ACCP or tc-native by default
> 
> NetApp Security WARNING: This is an external email. Do not click links or 
> open attachments unless you recognize the sender and know the content is safe.
> 
> 
> 
> 
> 
> Can you say more about the shape of your concern?
> 
> 
> Integration testing where some nodes are running JCE and others accp, and 
> various configurations that are and are not accp compatible/native.
> 
> I'm not referring to (re-) unit testing accp or jce themselves, or matrix 
> testing over them, but our commitment to always-on upgrades against all 
> possible configurations that integrate.  We've history with config changes 
> breaking upgrades, for as simple as they are.


Re: [DISCUSS] Using ACCP or tc-native by default

2023-07-26 Thread J. D. Jordan
I thought the crypto providers were supposed to “ask the next one down the 
line” if something is not supported?  Have you tried some unsupported thing and 
seen it break?  My understanding of the providers being an ordered list is that 
this isn’t supposed to happen.

-Jeremiah
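For reference, JCA providers are consulted in preference order when no provider is named explicitly, so a provider registered at the head of the list only shadows the algorithms it actually implements; anything it does not support falls through to the next provider. A rough sketch of that behaviour (not the Cassandra integration itself; assumes ACCP is on the class path):

    import com.amazon.corretto.crypto.provider.AmazonCorrettoCryptoProvider;
    import javax.crypto.Cipher;
    import java.security.Security;

    public class ProviderOrderingSketch
    {
        public static void main(String[] args) throws Exception
        {
            // Put ACCP first; the JRE's default providers remain after it in the list.
            Security.insertProviderAt(AmazonCorrettoCryptoProvider.INSTANCE, 1);

            // Served by ACCP if it implements this transformation; otherwise the JCA
            // lookup continues down the ordered provider list (e.g. SunJCE).
            Cipher aes = Cipher.getInstance("AES/GCM/NoPadding");
            System.out.println(aes.getProvider().getName());
        }
    }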

> On Jul 26, 2023, at 3:23 AM, Mick Semb Wever  wrote:
> 
> 
> 
> 
>  
> 
>> That means that if somebody is on 4.0 and they upgrade to 5.0, if they use 
>> some ciphers / protocols / algorithms which are not in Corretto, it might 
>> break their upgrade.
> 
> 
> 
> If there's any risk of breaking upgrades we have to go with (2).  We support 
> a variation of JCE configurations, and I don't see we have the test coverage 
> in place to de-risk it other than going with (2).  
> 
> Once the yaml configuration is in place we can then change the default in the 
> next major version 6.0.
> 
> 


Re: [DISCUSS] Using ACCP or tc-native by default

2023-07-20 Thread J. D. Jordan
Maybe we could start providing Dockerfiles and/or make arch-specific rpm/deb 
packages that have everything set up correctly per architecture?
We could also download them all and have the startup scripts put things in the 
right places depending on the arch of the machine running them?
I feel like there are probably multiple ways we could solve this without 
requiring users to jump through a bunch of hoops.
But I do agree we can’t make the project x86-only.

-Jeremiah

> On Jul 20, 2023, at 2:01 AM, Miklosovic, Stefan 
>  wrote:
> 
> Hi,
> 
> as I was reviewing the patch for this feature (1), we realized that it is not 
> quite easy to bundle this directly into Cassandra.
> 
> The problem is that this was supposed to be introduced as a new dependency:
> 
> <dependency>
>     <groupId>software.amazon.cryptools</groupId>
>     <artifactId>AmazonCorrettoCryptoProvider</artifactId>
>     <version>2.2.0</version>
>     <classifier>linux-x86_64</classifier>
> </dependency>
> 
> Notice the "classifier". That means that if we introduced this dependency into 
> the project, what about ARM users? (There is a corresponding aarch classifier 
> as well.) ACCP is platform-specific, but we have to ship Cassandra 
> platform-agnostic. It just needs to run OOTB everywhere. If we shipped it 
> with x86 and a user runs Cassandra on ARM, I guess that would break things, 
> right?
> 
> We also can not just add both dependencies (both x86 and aarch) because how 
> would we differentiate between them at runtime? That is all just too tricky / 
> error prone.
> 
> So, the approach we want to take is this:
> 
> 1) nothing will be bundled in Cassandra by default
> 2) a user is supposed to download the library and put it on the class path
> 3) a user is supposed to put an implementation of the ICryptoProvider interface 
> Cassandra exposes on the class path
> 4) a user is supposed to configure cassandra.yaml and its section 
> "crypto_provider" to reference the implementation they want
> 
> That way, we avoid the situation when somebody runs x86 lib on ARM or vice 
> versa.
> 
> By default, NoOpProvider will be used, which means that the default crypto 
> provider from the JRE will be used.
> 
> It can seem like we have not made much progress here, but hey ... we have 
> opened the project up to custom implementations of crypto providers that the 
> community can create, e.g. as 3rd party extensions etc ...
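As a rough illustration of step 4 in the quoted plan, the yaml might end up looking something like this (the exact shape and class name are assumptions here, not the final CASSANDRA-18624 format; the class is a user-supplied ICryptoProvider implementation placed on the class path):

    # hypothetical sketch only
    crypto_provider:
      - class_name: com.example.crypto.CorrettoCryptoProvider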
> 
> I want to be sure that everybody is aware of this change (that we plan to do 
> that in such a way that it will not be "bundled") and that everybody is on 
> board with this. Otherwise I am all ears about how to do that differently.
> 
> (1) https://issues.apache.org/jira/browse/CASSANDRA-18624
> 
> 
> From: German Eichberger via dev 
> Sent: Friday, June 23, 2023 22:43
> To: dev
> Subject: Re: [DISCUSS] Using ACCP or tc-native by default
> 
> NetApp Security WARNING: This is an external email. Do not click links or 
> open attachments unless you recognize the sender and know the content is safe.
> 
> 
> 
> +1 to ACCP - we love performance.
> 
> From: David Capwell 
> Sent: Thursday, June 22, 2023 4:21 PM
> To: dev 
> Subject: [EXTERNAL] Re: [DISCUSS] Using ACCP or tc-native by default
> 
> +1 to ACCP
> 
> On Jun 22, 2023, at 3:05 PM, C. Scott Andreas  wrote:
> 
> +1 for ACCP and can attest to its results. ACCP also optimizes for a range of 
> hash functions and other cryptographic primitives beyond TLS acceleration for 
> Netty.
> 
> On Jun 22, 2023, at 2:07 PM, Jeff Jirsa  wrote:
> 
> 
> Either would be better than today.
> 
> On Thu, Jun 22, 2023 at 1:57 PM Jordan West 
> mailto:jw...@apache.org>> wrote:
> Hi,
> 
> I’m wondering if there is appetite to change the default SSL provider for 
> Cassandra going forward to either ACCP [1] or tc-native in Netty? Our 
> deployment as well as others I’m aware of make this change in their fork and 
> it can lead to significant performance improvement. When recently qualifying 
> 4.1 without using ACCP (by accident) we noticed p99 latencies were 2x higher 
> than 3.0 w/ ACCP. Wiring up ACCP can be a bit of a pain and also requires 
> some amount of customization. I think it could be great for the wider 
> community to adopt it.
> 
> The biggest hurdle I foresee is licensing but ACCP is Apache 2.0 licensed. 
> Anything else I am missing before opening a JIRA and submitting a patch?
> 
> Jordan
> 
> 
> [1]
> 

Re: [VOTE] CEP-30 ANN Vector Search

2023-05-25 Thread J. D. Jordan
+1 nb

On May 25, 2023, at 7:47 PM, Jasonstack Zhao Yang  wrote:

+1

On Fri, 26 May 2023 at 8:44 AM, Yifan Cai  wrote:






+1






From: Josh McKenzie 
Sent: Thursday, May 25, 2023 5:37:02 PM
To: dev 
Subject: Re: [VOTE] CEP-30 ANN Vector Search
 



+1


On Thu, May 25, 2023, at 8:33 PM, Jake Luciani wrote:


+1





On Thu, May 25, 2023 at 11:45 AM Jonathan Ellis  wrote:



Let's make this official.




CEP: 
https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-30%3A+Approximate+Nearest+Neighbor%28ANN%29+Vector+Search+via+Storage-Attached+Indexes




POC that demonstrates all the big rocks, including distributed queries: 
https://github.com/datastax/cassandra/tree/cep-vsearch





--




Jonathan Ellis

co-founder, http://www.datastax.com

@spyced








--

http://twitter.com/tjake








[DISCUSS] Feature branch version hygiene

2023-05-16 Thread J. D. Jordan
Process question/discussion. Should tickets that are merged to CEP feature 
branches, like https://issues.apache.org/jira/browse/CASSANDRA-18204, have a 
fixver of 5.0 on them after merging to the feature branch?

For the SAI CEP which is also using the feature branch method the "reviewed and 
merged to feature branch" tickets seem to be given a version of NA.

Not sure that's the best “waiting for cep to merge” version either?  But it 
seems better than putting 5.0 on them to me.

The reason I’m not keen on 5.0 is that if we cut the release today, those 
tickets would not be in it.

What do other people think?  Is there a better version designation we can use?

On a different project I have in the past made a “version number” in JIRA for 
each long running feature branch. Tickets merged to the feature branch got the 
epic ticket number as their version, and then it got updated to the “real” 
version when the feature branch was merged to trunk.

-Jeremiah


Re: [DISCUSS] New data type for vector search

2023-05-01 Thread J. D. Jordan
Yes. Plugging in a new type server side is very easy. Adding that type to every client is not.Cassandra already supports plugging in custom types through a jar.  What a given client does when encountering a custom type it doesn’t know about depends on the client.I was recently looking at this for DynamicCompositeType, which is a type shipped in C* but exposed through the custom type machinery.  Very few drivers implement support for it. I saw one driver just crash upon encountering it in the schema at startup. One crashed if you included such a column in a query. And one threw warnings and treated it as a binary blob.So as David said, the client side is per driver. Also I would recommend thinking about using the existing custom type stuff if possible so that we don’t have to roll the native protocol version to add new type enums and even though unknown customs types act different in each driver, they do mostly allow someone to plug-in an implementation for them.On May 1, 2023, at 5:12 PM, David Capwell  wrote:A data type plug-in is actually really easy today, I think? Sadly not, the client reads the class from our schema tables and has to have duplicate logic to serialize/deserialize results… types are easy to add if you are ok with client not understanding them (and will some clients fail due to every language having its own logic?)On May 1, 2023, at 2:26 PM, Benedict  wrote:A data type plug-in is actually really easy today, I think? But, developing further hooks should probably be thought through as they’re necessary. I think in this case it would be simpler to deliver a general purpose type, which is why I’m trying to propose types that would be acceptable.I also think we’re pretty close to agreement, really?But if not, let’s flesh out potential plug-in requirements.On 1 May 2023, at 21:58, Josh McKenzie  wrote:If we want to make an ML-specific data type, it should be in an ML plug-in.How can we encourage a healthier plug-in ecosystem? As far as I know it's been pretty anemic historically:cassandra: https://cassandra.apache.org/doc/latest/cassandra/plugins/index.htmlpostgres: https://www.postgresql.org/docs/current/contrib.htmlI'm really interested to hear if there's more in the ecosystem I'm not aware of or if there's been strides made in this regard; users in the ecosystem being able to write durable extensions to Cassandra that they can then distribute and gain momentum could potentially be a great incubator for new features or functionality in the ecosystem.If our support for extensions remains as bare as I believe it to be, I wouldn't recommend anyone go that route.On Mon, May 1, 2023, at 4:17 PM, Benedict wrote:I have explained repeatedly why I am opposed to ML-specific data types. If we want to make an ML-specific data type, it should be in an ML plug-in. We should not pollute the general purpose language with hastily-considered features that target specific bandwagons - at best partially - no matter how exciting the bandwagon.I think a simple and easy case can be made for fixed length array types that do not seem to create random bits of cruft in the language that dangle by themselves should this play not pan out. This is an easy way for this effort to make progress without negatively impacting the language.That is, unless we want to start supporting totally random types for every use case at the top level language layer. 
I don’t think this is a good idea, personally, and I’m quite confident we would now be regretting this approach had it been taken for earlier bandwagons.Nor do I think anyone’s priors about how successful this effort will be should matter. As a matter of principle, we should simply never deliver a specialist functionality as a high level CQL language feature without at least baking it for several years as a plug-in.On 1 May 2023, at 21:03, Mick Semb Wever  wrote:Yes!  What you (David) and Benedict write beautifully supports `VECTOR FLOAT[n]` imho.You are definitely bringing up valid implementation details, and that can be dealt with during patch review. This thread is about the CQL API addition.  No matter which way the technical review goes with the implementation details, `VECTOR FLOAT[n]` does not limit it, and gives us the most ML idiomatic approach and the best long-term CQL API.  It's a win-win situation – no matter how you look at it imho it is the best solution api wise.  Unless the suggestion is that an ideal implementation can give us a better CQL API – but I don't see what that could be.   Maybe the suggestion is we deny the possibility of using the VECTOR keyword and bring us back to something like `NON-NULL FROZEN`.   This is odd to me because `VECTOR` here can be just an alias for `NON-NULL FROZEN` while meeting the patch's audience and their idioms.  I have no problems with introducing such an alias to meet the ML crowd.Another way I think of this is `VECTOR FLOAT[n]` is the porcelain ML cql api, `NON-NULL FROZEN` and `FROZEN` and `FLOAT[n]` are the 
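For background on the custom-type machinery mentioned above: CQL already accepts a fully-qualified server-side type class as a column type, provided that class is on the server's class path. A minimal sketch (the class name here is hypothetical):

    -- 'com.example.VectorType' is a hypothetical AbstractType implementation
    -- shipped in a jar on the server's class path
    CREATE TABLE ks.embeddings (
        id uuid PRIMARY KEY,
        vec 'com.example.VectorType'
    );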

Re: Adding vector search to SAI with heirarchical navigable small world graph index

2023-04-26 Thread J. D. Jordan
If we look at PostgreSQL, it allows defining arrays using FLOAT[N] or FLOAT 
ARRAY[N].

So that is an extra point for me in favor of just using FLOAT[N].

From my quick search, neither Oracle* nor MySQL directly supports arrays in 
columns.

* oracle supports declaring a custom type using VARRAY and then using that type 
for a column.
CREATE TYPE float_array AS VARRAY(100) OF FLOAT;

> On Apr 26, 2023, at 12:17 PM, David Capwell  wrote:
> 
> 
>> 
>> DENSE seems to just be an array? So very similar to a frozen list, but with 
>> a fixed size?
> 
> How I read the doc, DENSE = ARRAY, but I knew that couldn’t be the case, so 
> when I read the code it’s a fixed-size array…. So the real syntax was “DENSE 
> FLOAT32[42]”
> 
> Not a fan of the type naming, and feel that a fixed size array could be 
> useful for other cases as well, so think we can improve here (personally 
> prefer float[42], text[42], etc… vector maybe closer to our 
> existing syntax but not a fan).
> 
>> I guess this is an excellent example to explore the minima of what 
>> constitutes a CEP
> 
> The ANN change itself feels like a CEP makes sense.  Are we going to depend 
> on Lucene’s HNSW or build our own?  How do we validate this for correctness?  
> What does correctness mean in a distributed context?  Is this going to be 
> pluggable (big push recently to offer plugability)?
> 
> 
>> On Apr 26, 2023, at 7:37 AM, Patrick McFadin  wrote:
>> 
>> I guess this is an excellent example to explore the minima of what 
>> constitutes a CEP. So far, CEPs have been some large changes, so where does 
>> something like this fit? (Wait. Did I beat Benedict to a Bike Shed? I think 
>> I did.)
>> 
>> This is a list of everything needed for a CEP:
>> 
>> Status
>> Scope
>> Goals
>> Approach
>> Timeline
>> Mailing list / Slack channels
>> Related JIRA tickets
>> Motivation
>> Audience
>> Proposed Changes
>> New or Changed Public Interfaces
>> Compatibility, Deprecation, and Migration Plan
>> Test Plan
>> Rejected Alternatives
>> 
>> This is a big enough change to provide information for each element. Going 
>> back to the spirit of why we started CEPs, we wanted to avoid a mega-commit 
>> without some shaping and agreement before code goes into trunk. I don't have 
>> a clear indication of where that line lies. From our own wiki: "It is highly 
>> recommended to pursue a CEP for significant user-facing or changes that cut 
>> across multiple subsystems." That seems to fit here. Part of my motivation 
>> is being clear with potential new contributors by example and encouraging 
>> more awesomeness.  
>> 
>> The changes for operators:
>> - New drivers
>> - New gaurdrails?
>> - Indexing == storage requirements
>> 
>> Patrick
>> 
>> On Tue, Apr 25, 2023 at 10:53 PM Mick Semb Wever  wrote:
>> I was soo happy when I saw this, I know many users are going to be 
>> thrilled about it.
>> 
>> 
>> On Wed, 26 Apr 2023 at 05:15, Patrick McFadin  wrote:
>> Not sure if this is what you are saying, Josh, but I believe this needs to 
>> be its own CEP. It's a change in CQL syntax and changes how clusters 
>> operate. The change needs to be documented and voted on. Jonathan, you know 
>> how to find me if you want me to help write it. :) 
>> 
>> I'd be fine with just a DISCUSS thread to agree to the CQL change, since it: 
>> `DENSE FLOAT32` appears to be a minimal,  and the overall patch building on 
>> SAI. As Henrik mentioned there's other SAI extensions being added too 
>> without CEPs.  Can you elaborate on how you see this changing how the 
>> cluster operates?
>> 
>> This will be easier to decide once we have a patch to look at, but that 
>> depends on a CEP-7 base (e.g. no feature branch exists). If we do want a CEP 
>> we need to allow a few weeks to get it through, but that can happen in 
>> parallel and maybe drafting up something now will be valuable anyway for an 
>> eventual CEP that proposes the more complete features (e.g. 
>> cosine_similarity(…)). 
>> 
> 


Re: [DISCUSS] Next release date

2023-04-18 Thread J. D. Jordan
That said I’m not opposed to Mick’s proposal. In Apache terms I am -0 on the proposal. So no need to try and convince me.  If others think it is the way forward let’s go with it.On Apr 18, 2023, at 1:48 PM, J. D. Jordan  wrote:I also don’t really see the value in “freezing with exceptions for two giant changes to come after the freeze”.-JeremiahOn Apr 18, 2023, at 1:08 PM, Caleb Rackliffe  wrote:> Caleb, you appear to be the only one objecting, and it does not appear that you have made any compromises in this thread.All I'm really objecting to is making special exceptions for particular CEPs in relation to our freeze date. In other words, let's not have a pseudo-freeze date and a "real" freeze date, when the thing that makes the latter supposedly necessary is a very invasive change to the database that risks our desired GA date. Also, again, I don't understand how cutting a 5.0 branch makes anything substantially easier to start testing. Perhaps I'm the only one who thinks this. If so, I'm not going to make further noise about it.On Tue, Apr 18, 2023 at 7:26 AM Henrik Ingo <henrik.i...@datastax.com> wrote:I forgot one last night:From Benjamin we have a question that I think went unanswered?> Should it not facilitate the work if the branch stops changing heavily?This is IMO a good perspective. To me it seems weird to be too hung up on a "hard limit" on a specific day, when we are talking about merges where a single merge / rebase takes more than one day. We will have to stop merging smaller work to trunk anyway, when CEP-21 is being merged. No?henrikOn Tue, Apr 18, 2023 at 3:24 AM Henrik Ingo <henrik.i...@datastax.com> wrote:Trying to collect a few loose ends from across this thread> I'm receptive to another definition of "stabilize", I think the stabilization period implies more than just CI, which is mostly a function of unit tests working correctly. For example, at Datastax we have run a "large scale" test with >100 nodes, over several weeks, both for 4.0 and 4.1. For obvious reasons such tests can't run in nightly CI builds.Also it is not unusual that during the testing phase developers or specialized QA engineers can develop new tests (which are possibly added to CI) to improve coverage for and especially targeting new features in the release. For example the fixes to Paxos v2 were found by such work before 4.1.Finally, maybe it's a special case relevant only for  this release, but as a significant part of the Datastax team has been focused on porting these large existing features from DSE, and to get them merged before the original May date, we also have tens of bug fixes waiting to be upstreamed too. (It used to be an even 100, but I'm unsure what the count is today.)In fact! If you are worried about how to occupy yourself between a May "soft freeze" and September'ish hard freeze, you are welcome to chug on that backlog. The bug fixes are already public and ASL licensed, in the 4.0 based branch here.Failed with an unknown error.> 3a. If we allow merge of CEP-15 / CEP-21 after branch, we risk invalidating stabilization and risk our 2023 GA dateI think this is the assumption that I personally disagree with. If this is true, why do we even bother running any CI before the CEP-21 merge? It will all be invalidated anyway, right?In my experience, it is beneficial to test as early as possible, and at different checkpoints during development. 
If we wouldn't  do it, and we find some issue in late November, then the window to search for the commit that introduced the regression is all the way back to the 4.1 GA. If on the other hand the same test was already rune during the soft freeze, then we can know that we may focus our search onto CEP-15 and CEP-21.> get comfortable with cutting feature previews or snapshot alphas like we agreed to for earlier access to new stuffSnapshots are in fact a valid compromise proposal: A snapshot would provide a constant version / point in time to focus testing on, but on the other hand would allow trunk (or the 5.0 branch, in other proposals) to remain open to new commits. Somewhat "invalidating" the testing work, but presumably the branch will be relatively calm anyway. Which leads me to 2 important questions:WHO would be actively merging things into 5.0 during June-August? By my count at that point I expect most contributors to either furiously work on Acccord and TCM, or work on stabilization (tests, fixes).Also, if someone did contribute new feature code during this time, they might find it hard to get priority for reviews, if others are focused on the above tasks.Finally, I expect most Europeans to be on vacation 33% of that time. Non-Europeans may want to try it too!WHAT do we expect to get merged during June-August?Compared to the tens of thousands of lines of code being merged by Accord, SAI, UCS and Tries... I imagine even the worst case during a non-freeze in June-August wo

Re: [DISCUSS] Next release date

2023-04-18 Thread J. D. Jordan
I also don’t really see the value in “freezing with exceptions for two giant changes to come after the freeze”.-JeremiahOn Apr 18, 2023, at 1:08 PM, Caleb Rackliffe  wrote:> Caleb, you appear to be the only one objecting, and it does not appear that you have made any compromises in this thread.All I'm really objecting to is making special exceptions for particular CEPs in relation to our freeze date. In other words, let's not have a pseudo-freeze date and a "real" freeze date, when the thing that makes the latter supposedly necessary is a very invasive change to the database that risks our desired GA date. Also, again, I don't understand how cutting a 5.0 branch makes anything substantially easier to start testing. Perhaps I'm the only one who thinks this. If so, I'm not going to make further noise about it.On Tue, Apr 18, 2023 at 7:26 AM Henrik Ingo  wrote:I forgot one last night:From Benjamin we have a question that I think went unanswered?> Should it not facilitate the work if the branch stops changing heavily?This is IMO a good perspective. To me it seems weird to be too hung up on a "hard limit" on a specific day, when we are talking about merges where a single merge / rebase takes more than one day. We will have to stop merging smaller work to trunk anyway, when CEP-21 is being merged. No?henrikOn Tue, Apr 18, 2023 at 3:24 AM Henrik Ingo  wrote:Trying to collect a few loose ends from across this thread> I'm receptive to another definition of "stabilize", I think the stabilization period implies more than just CI, which is mostly a function of unit tests working correctly. For example, at Datastax we have run a "large scale" test with >100 nodes, over several weeks, both for 4.0 and 4.1. For obvious reasons such tests can't run in nightly CI builds.Also it is not unusual that during the testing phase developers or specialized QA engineers can develop new tests (which are possibly added to CI) to improve coverage for and especially targeting new features in the release. For example the fixes to Paxos v2 were found by such work before 4.1.Finally, maybe it's a special case relevant only for  this release, but as a significant part of the Datastax team has been focused on porting these large existing features from DSE, and to get them merged before the original May date, we also have tens of bug fixes waiting to be upstreamed too. (It used to be an even 100, but I'm unsure what the count is today.)In fact! If you are worried about how to occupy yourself between a May "soft freeze" and September'ish hard freeze, you are welcome to chug on that backlog. The bug fixes are already public and ASL licensed, in the 4.0 based branch here.Failed with an unknown error.> 3a. If we allow merge of CEP-15 / CEP-21 after branch, we risk invalidating stabilization and risk our 2023 GA dateI think this is the assumption that I personally disagree with. If this is true, why do we even bother running any CI before the CEP-21 merge? It will all be invalidated anyway, right?In my experience, it is beneficial to test as early as possible, and at different checkpoints during development. If we wouldn't  do it, and we find some issue in late November, then the window to search for the commit that introduced the regression is all the way back to the 4.1 GA. 
If on the other hand the same test was already rune during the soft freeze, then we can know that we may focus our search onto CEP-15 and CEP-21.> get comfortable with cutting feature previews or snapshot alphas like we agreed to for earlier access to new stuffSnapshots are in fact a valid compromise proposal: A snapshot would provide a constant version / point in time to focus testing on, but on the other hand would allow trunk (or the 5.0 branch, in other proposals) to remain open to new commits. Somewhat "invalidating" the testing work, but presumably the branch will be relatively calm anyway. Which leads me to 2 important questions:WHO would be actively merging things into 5.0 during June-August? By my count at that point I expect most contributors to either furiously work on Acccord and TCM, or work on stabilization (tests, fixes).Also, if someone did contribute new feature code during this time, they might find it hard to get priority for reviews, if others are focused on the above tasks.Finally, I expect most Europeans to be on vacation 33% of that time. Non-Europeans may want to try it too!WHAT do we expect to get merged during June-August?Compared to the tens of thousands of lines of code being merged by Accord, SAI, UCS and Tries... I imagine even the worst case during a non-freeze in June-August would be just a tiny percentage of the large CEPs.In this thread I only see Paulo announcing an intent to commit against trunk during a soft freeze, and even he agrees with a 5.0 branch freeze.This last question is basically a form of saying I hope we aren't discussing a problem that doesn't even exist?henrik-- Henrik 

Re: [DISCUSS] CEP-29 CQL NOT Operator

2023-04-13 Thread J. D. Jordan
The documentation is wrong. ALLOW FILTERING has always meant that "rows will need to be materialized in memory and accepted or rejected by a column filter", i.e. the full primary key was not specified and some other column was. It has never been about multiple partitions. Basically: "will the server need to read from disk more data (possibly a lot more) than will be returned to the client".

Should we change how that works? Maybe. But let's move such discussions to a new thread and keep this one about the CEP proposal.

On Apr 13, 2023, at 6:00 AM, Andrés de la Peña  wrote:

Indeed requiring AF for "select * from ks.tb where p1 = 1 and c1 = 2 and col2 = 1", where p1 and c1 are all the columns in the primary key, sounds like a bug. I think the criterion in the code is that we require AF if there is any column restriction that cannot be processed by the primary key or a secondary index. The error message indeed seems to reject any kind of filtering, independently of primary key filters. We can see this even without defined clustering keys:

CREATE TABLE t (k int PRIMARY KEY, v int);
SELECT * FROM t WHERE k = 1 AND v = 1; # requires AF

That clashes with the documentation, where it's said that AF is required for filters that require scanning all partitions. If we were to adapt the code to the behaviour described in the documentation, we shouldn't require AF if there are restrictions specifying a partition key. Or possibly a group of partition keys, if an IN restriction is used. So both within-row and within-partition filtering wouldn't require AF.

Regarding adding a new ALLOW FILTERING WITHIN PARTITION, I think we could just add a guardrail to directly disallow those queries, without needing to add the WITHIN PARTITION clause to the CQL grammar.

On Thu, 13 Apr 2023 at 11:11, Henrik Ingo  wrote:
On Thu, Apr 13, 2023 at 10:20 AM Miklosovic, Stefan  wrote:

Somebody correct me if I am wrong but "partition key" itself is not enough (primary keys = partition keys + clustering columns). It will require ALLOW FILTERING when clustering columns are not specified either.

create table ks.tb (p1 int, c1 int, col1 int, col2 int, primary key (p1, c1));
select * from ks.tb where p1 = 1 and col1 = 2;     // this will require allow filtering

The documentation seems to omit this fact.

It does seem so. That said, personally I was assuming, and would still argue it's the optimal choice, that the documentation was right and reality is wrong. If there is a partition key, then the query can avoid scanning the entire table, across all nodes, potentially petabytes. If a query specifies a partition key but not the full clustering key, of course there will be some scanning needed, but this is marginal compared to the need to scan the entire table. Even in the worst case, a partition with 2 billion cells, we are talking about seconds to filter the result from the single partition.

> Aha I get what you all mean:

No, I actually think both are unnecessary. But yeah, certainly this latter case is a bug?

henrik

--
Henrik Ingo
c. +358 40 569 7354
w. www.datastax.com

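Pulling the examples in this thread together, the behaviour being debated looks roughly like this today (schema taken from the messages above):

    CREATE TABLE ks.tb (p1 int, c1 int, col1 int, col2 int, PRIMARY KEY (p1, c1));

    -- fully restricted primary key: no ALLOW FILTERING needed
    SELECT * FROM ks.tb WHERE p1 = 1 AND c1 = 2;

    -- partition key given, but col1 must be filtered within that partition:
    -- today this still requires ALLOW FILTERING
    SELECT * FROM ks.tb WHERE p1 = 1 AND col1 = 2 ALLOW FILTERING;

    -- no partition key at all: a cluster-wide scan, always requires ALLOW FILTERING
    SELECT * FROM ks.tb WHERE col2 = 1 ALLOW FILTERING;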


Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-04-11 Thread J. D. Jordan
Thanks for those. They are very helpful.I think the CEP needs to call out all of the classes/interfaces from the cassandra-all jar that the “Spark driver” is using.Given this CEP is exposing “sstables as an external API” I would think all the interfaces and code associated with using those would need to be treated as user API now?For example the spark driver is actually calling the compaction classes and using the internal C* objects to process the data. I don’t think any of those classes have previously been considered “public” in anyway.Is said spark driver also being donated as part of the CEP?  Or just the code to implement the interfaces in the side car?-JeremiahOn Apr 10, 2023, at 5:37 PM, Doug Rohrer  wrote:I’ve updated the CEP with two overview diagrams of the interactions between Sidecar, Cassandra, and the Bulk Analytics library.  Hope this helps folks better understand how things work, and thanks for the patience as it took a bit longer than expected for me to find the time for this.DougOn Apr 5, 2023, at 11:18 AM, Doug Rohrer  wrote:Sorry for the delay in responding here - yes, we can add some diagrams to the CEP - I’ll try to get that done by end-of-week.Thanks,DougOn Mar 28, 2023, at 1:14 PM, J. D. Jordan  wrote:Maybe some data flow diagrams could be added to the cep showing some example operations for read/write?On Mar 28, 2023, at 11:35 AM, Yifan Cai  wrote:A lot of great discussions! On the sidecar front, especially what the role sidecar plays in terms of this CEP, I feel there might be some confusion. Once the code is published, we should have clarity.Sidecar does not read sstables nor do any coordination for analytics queries. It is local to the companion Cassandra instance. For bulk read, it takes snapshots and streams sstables to spark workers to read. For bulk write, it imports the sstables uploaded from spark workers. All commands are existing jmx/nodetool functionalities from Cassandra. Sidecar adds the http interface to them. It might be an over simplified description. The complex computation is performed in spark clusters only.In the long run, Cassandra might evolve into a database that does both OLTP and OLAP. (Not what this thread aims for) At the current stage, Spark is very suited for analytic purposes. On Tue, Mar 28, 2023 at 9:06 AM Benedict <bened...@apache.org> wrote:I disagree with the first claim, as the process has all the information it chooses to utilise about which resources it’s using and what it’s using those resources for.The inability to isolate GC domains is something we cannot address, but also probably not a problem if we were doing everything with memory management as well as we could be.But, not worth detailing this thread for. Today we do very little well on this front within the process, and a separate process is well justified given the state of play.On 28 Mar 2023, at 16:38, Derek Chen-Becker <de...@chen-becker.org> wrote:On Tue, Mar 28, 2023 at 9:03 AM Joseph Lynch <joe.e.ly...@gmail.com> wrote:...
I think we might be underselling how valuable JVM isolation is,
especially for analytics queries that are going to pass the entire
dataset through heap somewhat constantly. Big +1 here. The JVM simply does not have significant granularity of control for resource utilization, but this is explicitly a feature of separate processes. Add in being able to separate GC domains and you can avoid a lot of noisy neighbor in-VM behavior for the disparate workloads.Cheers,Derek-- +---+| Derek Chen-Becker                                             || GPG Key available at https://keybase.io/dchenbecker and       || https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org || Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |+---+



Re: [VOTE] CEP-26: Unified Compaction Strategy

2023-04-04 Thread J. D. Jordan
+1On Apr 4, 2023, at 7:29 AM, Brandon Williams  wrote:+1On Tue, Apr 4, 2023, 7:24 AM Branimir Lambov  wrote:Hi everyone,I would like to put CEP-26 to a vote.Proposal:https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-26%3A+Unified+Compaction+StrategyJIRA and draft implementation:https://issues.apache.org/jira/browse/CASSANDRA-18397Up-to-date documentation:https://github.com/blambov/cassandra/blob/CASSANDRA-18397/src/java/org/apache/cassandra/db/compaction/UnifiedCompactionStrategy.mdDiscussion:https://lists.apache.org/thread/8xf5245tclf1mb18055px47b982rdg4bThe vote will be open for 72 hours.A vote passes if there are at least three binding +1s and no binding vetoes.Thanks,Branimir



Re: [EXTERNAL] [DISCUSS] Next release date

2023-03-30 Thread J. D. Jordan
That was my understanding as well.On Mar 30, 2023, at 11:21 AM, Josh McKenzie  wrote:So to confirm, let's make sure we all agree on the definition of "stabilize".Using the definition as "green run of all tests on circleci, no regressions on ASF CI" that we used to get 4.1 out the door, and combined with the metric of "feature branches don't merge until their CI is green on at least CircleCI and don't regress on ASF CI"... that boils down to:a) do we have test failures on circle on trunk right now, andb) do we have regressions on trunk on ASF CI compared to 4.1Whether or not new features land near the cutoff date or not shouldn't impact the above right?I'm receptive to another definition of "stabilize", having a time-boxed on calendar window for people to run a beta or RC, or whatever. But my understanding is that the above was our general consensus in the 4.1 window.I definitely could be wrong. :)On Thu, Mar 30, 2023, at 5:22 AM, Benjamin Lerer wrote:but otherwise I don't recall anything that we could take as an indicator
 that a next release would take a comparable amount of time to 4.1?Do we have any indicator that proves that it will take less time? We never managed to do a release in 2 or 3 months so far. Until we have actually proven that we could do it, I simply prefer assuming that we cannot and plan for the worst.We have a lot of significant features that have or will land soon and our experience suggests that those merges usually bring their set of instabilities. The goal of the proposal was to make sure that we get rid of them before TCM and Accord land to allow us to more easily identify the root causes of problems. Gaining time on the overall stabilization process. I am fine with people not liking the proposal. Nevertheless, simply hoping that it will take us 2 months to stabilize the release seems pretty optimistic to me. Do people have another plan in mind for ensuring a short stabilization period?Le lun. 27 mars 2023 à 09:20, Henrik Ingo  a écrit :Not so fast...There's certainly value in spending that time stabilizing the already done features. It's valuable triaging information to say this used to work before CEP-21 and only broke after it.That said, having a very long freeze of trunk, or alternatively having a very long lived 5.0 branch that is waiting for Accord and diverging with a trunk that is not frozen... are both undesirable options. (A month or two could IMO be discussed though.) So I agree with the concern from that point of view, I just don't agree that having one batch of big features in stabilization period is zero value.henrikOn Fri, Mar 24, 2023 at 5:23 PM Jeremiah D Jordan  wrote:Given the fundamental change to how cluster operations work coming from CEP-21, I’m not sure what freezing early for “extra QA time” really buys us?  I wouldn’t trust any multi-node QA done pre commit.What “stabilizing” do we expect to be doing during this time?  How much of it do we just have to do again after those things merge?  I for one do not like to have release branches cut months before their expected release.  It just adds extra merge forward and “where should this go” questions/overhead.  It could make sense to me to branch branch when CEP-21 merges and only let in CEP-15 after that.  CEP-15 is mostly “net new stuff” and not “changes to existing stuff” from my understanding?  So no QA effort wasted if it is done before it merges.-JeremiahOn Mar 24, 2023, at 9:38 AM, Josh McKenzie  wrote:I would like to propose a partial freeze of 5.0 in JuneMy .02:+1 to:* partial freeze on an agreed upon date w/agreed upon other things that can optionally go in after* setting a hard limit on when we ship from that frozen branch regardless of whether the features land or not-1 to:* ever feature freezing trunk again. :)I worry about the labor involved with having very large work like this target a frozen branch and then also needing to pull it up to trunk. That doesn't sound fun.If we resurrected the discussion about cutting alpha snapshots from trunk, would that change people's perspectives on the weight of this current decision? 
We'd probably also have to re-open pandora's box talking about the solidity of our API's on trunk as well if we positioned those alphas as being stable enough to start prototyping and/or building future applications against.On Fri, Mar 24, 2023, at 9:59 AM, Brandon Williams wrote:I am +1 on a 5.0 branch freeze.Kind Regards,BrandonOn Fri, Mar 24, 2023 at 8:54 AM Benjamin Lerer  wrote: Would that be a trunk freeze, or freeze of a cassandra-5.0 branch?>>> I was thinking of a cassandra-5.0 branch freeze. So branching 5.0 and allowing only CEP-15 and 21 + bug fixes there.> Le ven. 24 mars 2023 à 13:55, Paulo Motta  a écrit : >  I would like to propose a partial freeze of 5.0 in June. Would that be a trunk freeze, or freeze of a cassandra-5.0 branch? I 

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-28 Thread J. D. Jordan
Maybe some data flow diagrams could be added to the cep showing some example operations for read/write?On Mar 28, 2023, at 11:35 AM, Yifan Cai  wrote:A lot of great discussions! On the sidecar front, especially what the role sidecar plays in terms of this CEP, I feel there might be some confusion. Once the code is published, we should have clarity.Sidecar does not read sstables nor do any coordination for analytics queries. It is local to the companion Cassandra instance. For bulk read, it takes snapshots and streams sstables to spark workers to read. For bulk write, it imports the sstables uploaded from spark workers. All commands are existing jmx/nodetool functionalities from Cassandra. Sidecar adds the http interface to them. It might be an over simplified description. The complex computation is performed in spark clusters only.In the long run, Cassandra might evolve into a database that does both OLTP and OLAP. (Not what this thread aims for) At the current stage, Spark is very suited for analytic purposes. On Tue, Mar 28, 2023 at 9:06 AM Benedict  wrote:I disagree with the first claim, as the process has all the information it chooses to utilise about which resources it’s using and what it’s using those resources for.The inability to isolate GC domains is something we cannot address, but also probably not a problem if we were doing everything with memory management as well as we could be.But, not worth detailing this thread for. Today we do very little well on this front within the process, and a separate process is well justified given the state of play.On 28 Mar 2023, at 16:38, Derek Chen-Becker  wrote:On Tue, Mar 28, 2023 at 9:03 AM Joseph Lynch  wrote:...
I think we might be underselling how valuable JVM isolation is,
especially for analytics queries that are going to pass the entire
dataset through heap somewhat constantly. Big +1 here. The JVM simply does not have significant granularity of control for resource utilization, but this is explicitly a feature of separate processes. Add in being able to separate GC domains and you can avoid a lot of noisy neighbor in-VM behavior for the disparate workloads.Cheers,Derek-- +---+| Derek Chen-Becker                                             || GPG Key available at https://keybase.io/dchenbecker and       || https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org || Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |+---+



Re: Welcome our next PMC Chair Josh McKenzie

2023-03-23 Thread J. D. Jordan
Congrats Josh!And thanks Mick for your time spent as Chair!On Mar 23, 2023, at 8:21 AM, Aaron Ploetz  wrote:Congratulations, Josh!And of course, thank you Mick for all you've done for the project while in the PMC Chair role!On Thu, Mar 23, 2023 at 7:44 AM Derek Chen-Becker  wrote:Congratulations, Josh!On Thu, Mar 23, 2023, 4:23 AM Mick Semb Wever  wrote:It is time to pass the baton on, and on behalf of the Apache Cassandra Project Management Committee (PMC) I would like to welcome and congratulate our next PMC Chair Josh McKenzie (jmckenzie).Most of you already know Josh, especially through his regular and valuable project oversight and status emails, always presenting a balance and understanding to the various views and concerns incoming. Repeating Paulo's words from last year: The chair is an administrative position that interfaces with the Apache Software Foundation Board, by submitting regular reports about project status and health. Read more about the PMC chair role on Apache projects:- https://www.apache.org/foundation/how-it-works.html#pmc- https://www.apache.org/foundation/how-it-works.html#pmc-chair- https://www.apache.org/foundation/faq.html#why-are-PMC-chairs-officersThe PMC as a whole is the entity that oversees and leads the project and any PMC member can be approached as a representative of the committee. A list of Apache Cassandra PMC members can be found on: https://cassandra.apache.org/_/community.html




Re: [DISCUSS] Drop support for sstable formats m* (in trunk)

2023-03-14 Thread J. D. Jordan
Agreed. I also think it is worthwhile to keep that code around. Given how 
widespread C* 3.x use is, I do not think it is worthwhile dropping support for 
those sstable formats at this time.

-Jeremiah

> On Mar 14, 2023, at 9:36 AM, C. Scott Andreas  wrote:
> 
> 
> I agree with Aleksey's view here.
> 
> To expand on the final point he makes re: requiring SSTables be fully 
> rewritten prior to rev'ing from 4.x to 5.x (if the cluster previously ran 
> 3.x) –
> 
> This would also invalidate incremental backups. Operators would either be 
> required to perform a full snapshot backup of each cluster to object storage 
> prior to upgrading from 4.x to 5.x; or to enumerate the contents of all 
> snapshots from an incremental backup series to ensure that no m*-series 
> SSTables were present prior to upgrading.
> 
> If one failed to take on the work to do so, incremental backup snapshots 
> would not be restorable to a 5.x cluster if an m*-series SSTable were present.
> 
> – Scott
> 
>> On Mar 14, 2023, at 4:38 AM, Aleksey Yeshchenko  wrote:
>> 
>> 
>> Raising messaging service minimum, I have a less strong opinion on, but on 
>> dropping m* sstable code I’m strongly -1.
>> 
>> 1. This is code on a rarely touched path
>> 2. It’s very stable and battle tested at this point
>> 3. Removing it doesn’t reduce much complexity at all, just a few branches 
>> are affected
>> 4. Removing code comes with risk
>> 5. There are third-party tools that I know of which benefit from a single C* 
>> jar that can read all relevant stable versions, and relevant here includes 
>> 3.0 ones
>> 
>> Removing a little of battle-tested reliable code and a tinier amount of 
>> complexity is not, to me, a benefit enough to justify intentionally breaking 
>> perfectly good and useful functionality.
>> 
>> Oh, to add to that - if an operator wishes to upgrade from 3.0 to 5.0, and 
>> we don’t support it directly, I think most of us are fine with the 
>> requirement to go through a 4.X release first. But it’s one thing to require 
>> a two rolling restarts (3.0 to 4.0, 4.0 to 5.0), it’s another to require the 
>> operator to upgrade every single m* sstable to n*. Especially when we have 
>> perfectly working code to read those. That’s incredibly wasteful.
>> 
>> AY
>> 
>>> On 13 Mar 2023, at 22:54, Mick Semb Wever  wrote:
>>> 
>>> If we do not recommend and do not test direct upgrades from 3.x to
>>> 5.x, we can clean up a fair bit by removing code related to sstable
>>> formats m*, as Cassandra versions 4.x and 5.0 are all on sstable
>>> formats n*.
>>> 
>>> We don't allow mixed-version streaming, so it's not possible today to
>>> stream any such older sstable format between nodes. This
>>> compatibility-break impacts only node-local and/or offline.
>>> 
>>> Some arguments raised to keep m* sstable formats are:
>>> - offline cluster upgrade, e.g. direct from 3.x to 5.0,
>>> - single-invocation sstableupgrade usage
>>> - third-party tools based on the above
>>> 
>>> Personally I am not in favour of keeping, or recommending users use,
>>> code we don't test.
>>> 
>>> An _example_ of the code that can be cleaned up is in the patch
>>> attached to the ticket:
>>> CASSANDRA-18312 – Drop support for sstable formats before `na`
>>> 
>>> What do you think?
> 
> 
> 
> 
> 


Re: [DISCUSS] New dependencies with Chronicle-Queue update

2023-03-13 Thread J. D. Jordan
Yes exactly. If we are updating a library for some reason, we should update it to the latest one that makes sense.On Mar 13, 2023, at 1:17 PM, Josh McKenzie  wrote:I think we should we use the most recent versions of all libraries where possible?”To clarify, are we talking "most recent versions of all libraries when we have to update them anyway for a dependency"? Not all libraries all libraries...If the former, I agree. If the latter, here be dragons. :)On Mon, Mar 13, 2023, at 1:13 PM, Ekaterina Dimitrova wrote:“ > Given we need to upgrade to support JDK17 it seems fine to me.  The only concern I have is that some of those libraries are already pretty old, for example the most recent jna-platform is 5.13.0 and 5.5.0 is almost 4 years old.  I think we should we use the most recent versions of all libraries where possible?”+1On Mon, 13 Mar 2023 at 12:10, Brandon Williams  wrote:I know it was just an example but we upgraded JNA to 5.13 in CASSANDRA-18050 as part of the JDK17 effort, so at least that is taken care of.  Kind Regards, Brandon  On Mon, Mar 13, 2023 at 10:39 AM Jeremiah D Jordan  wrote: > > Given we need to upgrade to support JDK17 it seems fine to me.  The only concern I have is that some of those libraries are already pretty old, for example the most recent jna-platform is 5.13.0 and 5.5.0 is almost 4 years old.  I think we should we use the most recent versions of all libraries where possible? > > > On Mar 13, 2023, at 7:42 AM, Mick Semb Wever  wrote: > > > > JDK17 requires us to update our chronicle-queue dependency: CASSANDRA-18049 > > > > We use chronicle-queue for both audit logging and fql. > > > > This update pulls in a number of new transitive dependencies. > > > > affinity-3.23ea1.jar > > asm-analysis-9.2.jar > > asm-commons-9.2.jar > > asm-tree-9.2.jar > > asm-util-9.2.jar > > jffi-1.3.9.jar > > jna-platform-5.5.0.jar > > jnr-a64asm-1.0.0.jar > > jnr-constants-0.10.3.jar > > jnr-ffi-2.2.11.jar > > jnr-x86asm-1.0.2.jar > > posix-2.24ea4.jar > > > > > > More info here: > > https://issues.apache.org/jira/browse/CASSANDRA-18049?focusedCommentId=17699393=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17699393 > > > > > > Objections? >

Re: Role of Hadoop code in Cassandra 5.0

2023-03-09 Thread J. D. Jordan
+1 from me to deprecate in 4.x and remove in 5.0.

-Jeremiah

> On Mar 9, 2023, at 11:53 AM, Brandon Williams  wrote:
> 
> I think if we reach consensus here that decides it. I too vote to
> deprecate in 4.1.x.  This means we would remove it in 5.0.
> 
> Kind Regards,
> Brandon
> 
>> On Thu, Mar 9, 2023 at 11:32 AM Ekaterina Dimitrova
>>  wrote:
>> 
>> Deprecation sounds good to me, but I am not completely sure in which version 
>> we can do it. If it is possible to add a deprecation warning in the 4.x 
>> series or at least 4.1.x - I vote for that.
>> 
>>> On Thu, 9 Mar 2023 at 12:14, Jacek Lewandowski 
>>>  wrote:
>>> 
>>> Is it possible to deprecate it in the 4.1.x patch release? :)
>>> 
>>> 
>>> - - -- --- -  -
>>> Jacek Lewandowski
>>> 
>>> 
>>> czw., 9 mar 2023 o 18:11 Brandon Williams  napisał(a):
 
 This is my feeling too, but I think we should accomplish this by
 deprecating it first.  I don't expect anything will change after the
 deprecation period.
 
 Kind Regards,
 Brandon
 
 On Thu, Mar 9, 2023 at 11:09 AM Jacek Lewandowski
  wrote:
> 
> I vote for removing it entirely.
> 
> thanks
> - - -- --- -  -
> Jacek Lewandowski
> 
> 
> czw., 9 mar 2023 o 18:07 Miklosovic, Stefan 
>  napisał(a):
>> 
>> Derek,
>> 
>> I have couple more points ... I do not think that extracting it to a 
>> separate repository is "win". That code is on Hadoop 1.0.3. We would be 
>> spending a lot of work on extracting it just to extract 10 years old 
>> code with occasional updates (in my humble opinion just to make it 
>> compilable again if the code around changes). What good is in that? We 
>> would have one more place to take care of ... Now we at least have it 
>> all in one place.
>> 
>> I believe we have four options:
>> 
>> 1) leave it there so it will be like this is for next years with 
>> questionable and diminishing usage
>> 2) update it to Hadoop 3.3 (I wonder who is going to do that)
>> 3) 2) and extract it to a separate repository but if we do 2) we can 
>> just leave it there
>> 4) remove it
>> 
>> 
>> From: Derek Chen-Becker 
>> Sent: Thursday, March 9, 2023 15:55
>> To: dev@cassandra.apache.org
>> Subject: Re: Role of Hadoop code in Cassandra 5.0
>> 
>> NetApp Security WARNING: This is an external email. Do not click links 
>> or open attachments unless you recognize the sender and know the content 
>> is safe.
>> 
>> 
>> 
>> I think the question isn't "Who ... is still using that?" but more "are 
>> we actually going to support it?" If we're on a version that old it 
>> would appear that we've basically abandoned it, although there do appear 
>> to have been refactoring (for other things) commits in the last couple 
>> of years. I would be in favor of removal from 5.0, but at the very 
>> least, could it be moved into a separate repo/package so that it's not 
>> pulling a relatively large dependency subtree from Hadoop into our main 
>> codebase?
>> 
>> Cheers,
>> 
>> Derek
>> 
>> On Thu, Mar 9, 2023 at 6:44 AM Miklosovic, Stefan 
>> mailto:stefan.mikloso...@netapp.com>> 
>> wrote:
>> Hi list,
>> 
>> I stumbled upon Hadoop package again. I think there was some discussion 
>> about the relevancy of Hadoop code some time ago but I would like to ask 
>> this again.
>> 
>> Do you think Hadoop code (1) is still relevant in 5.0? Who in the 
>> industry is still using that?
>> 
>> We might drop a lot of code and some Hadoop dependencies too (3) (even 
>> their scope is "provided"). The version of Hadoop we build upon is 1.0.3 
>> which was released 10 years ago. This code does not have any tests nor 
>> documentation on the website.
>> 
>> There seems to be issues like this (2) and it seems like the solution is 
>> to, basically, use Spark Cassandra connector instead which I would say 
>> is quite reasonable.
>> 
>> Regards
>> 
>> (1) 
>> https://github.com/apache/cassandra/tree/trunk/src/java/org/apache/cassandra/hadoop
>> (2) https://lists.apache.org/thread/jdy5hdc2l7l29h04dqol5ylroqos1y2p
>> (3) 
>> https://github.com/apache/cassandra/blob/trunk/.build/parent-pom-template.xml#L507-L589
>> 
>> 
>> --
>> +---+
>> | Derek Chen-Becker |
>> | GPG Key available at 
>> https://urldefense.com/v3/__https://keybase.io/dchenbecker__;!!PbtH5S7Ebw!YbHPCIGqxJHtAbvxPSXFEvnZgLrmvIE2AQ3Aw3BAgvCksv9ALniyHYVvU42wxrAGSNybhgjhwoAeyss$
>>   and   |
>> | 
>> 

Re: [DISCUSS] Next release date

2023-03-01 Thread J. D. Jordan
We have been talking a lot about the branch cutting date, but I agree with Benedict here: I think we should actually be talking about the expected release date. If we truly believe that we can release within 1-2 months of cutting the branch, and many people I have talked to think that is possible, then a May branch cut means we release by July. That would only be 7 months post the 4.1 release, which seems a little fast to me. IIRC, the last time we had release cadence discussions most people were for keeping to a release cadence of around 12 months, and many were against a 6 month cadence.

So if we want to have a goal of “around 12 months” and also a goal of “release before the summit in December”, I would suggest we put our release date goal in October, to give some runway for being late and still getting out by December.

If the release date goal is October, we can also hedge with the longer 2 month estimate on “time after branching” to again make sure we make our goals. This would put the branching in August. If we do release in October, that gives us 10 months since 4.1, which, while still shorter than 12, is much closer than only 7 months would be.

If people feel 1 month post branch cut is feasible, we could cut the branch in September.

-Jeremiah

On Mar 1, 2023, at 10:34 AM, Henrik Ingo wrote:

Hi, those are great questions Mick. It's good to recognize this discussion impacts a broad range of contributors and users, and not all of them might be aware of the discussion in the first place.

More generally I would say that your questions brought to mind two fundamental principles with a "train model" release schedule:

1. If a feature isn't ready by the cut-off date, there's no reason to delay the release, because the next release is guaranteed to be just around the corner.

2. If there is a really important feature that won't make it, rather than delaying the planned release, we should (also) consider the opposite: we can do the next release earlier if there is a compelling feature ready to go. (Answers question 2b from Mick.)

I have arguments both for and against moving the release date:

The argument to stick with the current plan is that we have a lot of big features now landing in trunk. If we delay the release for one feature, it will delay the GA of all the other features that were ready by May. For example, while the SAI code is still being massaged based on review comments, we fully expect it to merge before May. Same for the work on tries, which is on its final stretch. Arguably Java 17 support can't come soon enough either. And so on... For some users it can be a simple feature, like just one specific guardrail, that they are waiting on. So just as we are excited to wait for that one feature to be ready and make it, I'm just as unexcited about the prospect of delaying the GA of several other features. If we had just one big feature that everyone was working on, this would be easier to decide...

Note also that postponing the release for a single feature that is still in development is a risky bet, because you never know what unknowns are still ahead once the work is code complete and put to more serious testing. At first it might sound reasonable to delay 1-3 months, but what if on that 3rd month some unforeseen work is discovered, and now we need to discuss delaying another 3 months? Such a risk is inherent to any software project, and we should anticipate it.

Scott's re-telling of CASSANDRA-18110 is a great example: these delays can happen due to a single issue, and it can be hard to speed things up by e.g. assigning more engineers to the work. So, when we say that we'd like to move the branching date from May to August, specifically in order for some feature to be ready by then, what do we do if it's not ready in August? It's presumably closer to being ready at that point, so the temptation to wait just a little bit more is always there. (And this is also my answer to Mick's question 1b.)

Now, let me switch to arguing the opposite opinion:

My instinct here would be to stick to early May as the cut-off date, but also allow for exceptions. I'm curious to hear how this proposal is received. If this was a startup, there could be a CEO or, let's say, a build manager, who could make these kinds of case-by-case decisions expediently. But I realize that in a consensus based open source project like Cassandra, we may also have to consider issues like fairness: why would some feature be allowed a later date than others? How do we choose which work gets such exceptions?

Anyway, the fact is that we have several huge bodies of work in flight now. The Accord patch was about 28k lines of code when I looked at it, and note that this doesn't even include "accord itself", which is in a different repository. SAI, tries (independently for memtable and sstables) and UCS are all in the 10k range too. And I presume the Java 17 support and transactional metadata are the same. Each of these pieces of code alone represents years of

Re: [DISCUSS] Next release date

2023-02-28 Thread J. D. Jordan
I think it makes sense to plan on cutting the branch later given when 4.1 
actually released. I would suggest either August or September as a good time to 
cut the branch, at the end of the summer.

-Jeremiah

> On Feb 28, 2023, at 7:42 AM, Benjamin Lerer  wrote:
> 
> 
> Hi,
> 
> We forked the 4.0 and 4.1 branches beginning of May. Unfortunately, for 4.1 
> we were only able to release GA in December which impacted how much time we 
> could spend focussing on the next release and the progress that we could do. 
> By consequence, I am wondering if it makes sense for us to branch 5.0 in May 
> or if we should postpone that date.
> 
> What is your opinion?
> 
> Benjamin


Re: Welcome Patrick McFadin as Cassandra Committer

2023-02-02 Thread J. D. Jordan
Congrats!

On Feb 2, 2023, at 12:47 PM, Christopher Bradford wrote:

Congrats Patrick! Well done.

On Thu, Feb 2, 2023 at 10:44 AM Aaron Ploetz wrote:

Patrick FTW!!!

On Thu, Feb 2, 2023 at 12:32 PM Joseph Lynch wrote:

W! Congratulations Patrick!!

-Joey

On Thu, Feb 2, 2023 at 9:58 AM Benjamin Lerer wrote:

The PMC members are pleased to announce that Patrick McFadin has accepted
the invitation to become committer today.

Thanks a lot, Patrick, for everything you have done for this project and its community through the years.

Congratulations and welcome!

The Apache Cassandra PMC members




-- Christopher Bradford


Re: Merging CEP-15 to trunk

2023-01-16 Thread J. D. Jordan
My only concern with merging (given all normal requirements are met) would be if there was a possibility that the feature would never be finished. Given all of the excitement and activity around accord, I do not think that is a concern here. So I see no reason not to merge incremental progress behind a feature flag.

-Jeremiah

On Jan 16, 2023, at 10:30 AM, Josh McKenzie wrote:

Did we document this or is it in an email thread somewhere? I don't see it on the confluence wiki nor does a cursory search of ponymail turn it up.

What was it for something flagged experimental?
1. Same tests pass on the branch as on the root it's merging back to
2. 2 committers' eyes on (author + reviewer or 2 reviewers, etc)
3. Disabled by default w/flag to enable

So really only the 3rd thing is different, right? Probably ought to add an informal step 4, which Benedict is doing here, which is "hit the dev ML w/a DISCUSS thread about the upcoming merge so it's on people's radar and they can coordinate".

On Mon, Jan 16, 2023, at 11:08 AM, Benedict wrote:

My goal isn’t to ask if others believe we have the right to merge, only to invite feedback if there are any specific concerns. Large pieces of work like this cause headaches and concerns for other contributors, and so it’s only polite to provide notice of our intention, since probably many haven’t even noticed the feature branch developing.

The relevant standard for merging a feature branch, if we want to rehash that, is that it is feature- and bug-neutral by default, i.e. that a release could be cut afterwards while maintaining our usual quality standards, and that the feature is disabled by default, yes. It is not however feature-complete or production ready as a feature; that would prevent any incremental merging of feature development.

> On 16 Jan 2023, at 15:57, J. D. Jordan <jeremiah.jor...@gmail.com> wrote:
> 
> I haven’t been following the progress of the feature branch, but I would think the requirements for merging it into master would be the same as any other merge.
> 
> A subset of those requirements being:
> Is the code to be merged in releasable quality? Is it disabled by a feature flag by default if not?
> Do all the tests pass?
> Has there been review and +1 by two committers?
> 
> If the code in the feature branch meets all of the merging criteria of the project then I see no reason to keep it in a feature branch for ever.
> 
> -Jeremiah
> 
> 
>> On Jan 16, 2023, at 3:21 AM, Benedict <bened...@apache.org> wrote:
>> 
>> Hi Everyone, I hope you all had a lovely holiday period. 
>> 
>> Those who have been following along will have seen a steady drip of progress into the cep-15-accord feature branch over the past year. We originally discussed that feature branches would merge periodically into trunk, and we are long overdue. With the release of 4.1, it’s time to rectify that. 
>> 
>> Barring complaints, I hope to merge the current state to trunk within a couple of weeks. This remains a work in progress, but will permit users to experiment with the alpha version of Accord and provide feedback, as well as phase the changes to trunk.

Re: Merging CEP-15 to trunk

2023-01-16 Thread J. D. Jordan
I haven’t been following the progress of the feature branch, but I would think 
the requirements for merging it into master would be the same as any other 
merge.

A subset of those requirements being:
Is the code to be merged in releasable quality? Is it disabled by a feature 
flag by default if not?
Do all the tests pass?
Has there been review and +1 by two committers?

If the code in the feature branch meets all of the merging criteria of the 
project then I see no reason to keep it in a feature branch for ever.

-Jeremiah


> On Jan 16, 2023, at 3:21 AM, Benedict  wrote:
> 
> Hi Everyone, I hope you all had a lovely holiday period. 
> 
> Those who have been following along will have seen a steady drip of progress 
> into the cep-15-accord feature branch over the past year. We originally 
> discussed that feature branches would merge periodically into trunk, and we 
> are long overdue. With the release of 4.1, it’s time to rectify that. 
> 
> Barring complaints, I hope to merge the current state to trunk within a 
> couple of weeks. This remains a work in progress, but will permit users to 
> experiment with the alpha version of Accord and provide feedback, as well as 
> phase the changes to trunk.


Re: [VOTE] CEP-25: Trie-indexed SSTable format

2022-12-19 Thread J. D. Jordan
+1 nb

> On Dec 19, 2022, at 7:07 AM, Brandon Williams  wrote:
> 
> +1
> 
> Kind Regards,
> Brandon
> 
>> On Mon, Dec 19, 2022 at 6:59 AM Branimir Lambov  wrote:
>> 
>> Hi everyone,
>> 
>> I'd like to propose CEP-25 for approval.
>> 
>> Proposal: 
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-25%3A+Trie-indexed+SSTable+format
>> Discussion: https://lists.apache.org/thread/3dpdg6dgm3rqxj96cyhn58b50g415dyh
>> 
>> The vote will be open for 72 hours.
>> Votes by committers are considered binding.
>> A vote passes if there are at least three binding +1s and no binding vetoes.
>> 
>> Thank you,
>> Branimir


Re: Aggregate functions on collections, collection functions and MAXWRITETIME

2022-12-08 Thread J. D. Jordan
reject collections, we should also reject other data types such as, at least, blobs. That would require a deprecation plan.

Also, it's not that the comparator used by MIN/MAX is an internal obscure thing. The action of that comparator is very visible when any of those data types is used in a clustering column, and it's used as the basis for "ORDER BY" clauses. Should we also reject blobs, collections, tuples and UDTs on "ORDER BY"? I don't think so.

I rather think that basing MIN/MAX on the regular order of the column data type is consistent, easy to do and easy to understand. I don't see the need to add rules explicitly forbidding some data types on MIN/MAX functions just because we can't easily figure out a use case for their ordering. Especially when we are exposing that same ordering on clusterings and "ORDER BY".

On Tue, 6 Dec 2022 at 18:56, J. D. Jordan <jeremiah.jor...@gmail.com> wrote:

If the functionality truly has never actually worked, then throwing an error that MAX is not supported for collections seems reasonable.

But we should throw an error, I do not think we should have functions that aggregate across rows and functions that operate within a row use the same name.

My expectation as a user would be that MAX either always aggregates across rows, so results in a single row of output or always operates within a row, so returns the full set of rows matching the query.

So if we want a max that aggregates across rows that works for collections we could change it to return the aggregated max across all rows. Or we just leave it as an error and if someone wants the max across all rows they would ask for MAX(COLLECTION_MAX(column)). Yes I still agree COLLECTION_MAX may be a bad name.

> On Dec 6, 2022, at 11:55 AM, Benedict <bened...@apache.org> wrote:
> 
> As far as I am aware it has never worked in a release, and so deprecating it is probably not as challenging as you think. Only folk that have been able to parse the raw bytes of the collection in storage format would be affected - which we can probably treat as zero.
> 
> 
>> On 6 Dec 2022, at 17:31, Jeremiah D Jordan <jeremiah.jor...@gmail.com> wrote:
>> 
>> 
>>> 
>>> 1. I think it is a mistake to offer a function MAX that operates over rows containing collections, returning the collection with the most elements. This is just a nonsensical operation to support IMO. We should decide as a community whether we “fix” this aggregation, or remove it.
>> 
>> The current MAX function does not work this way afaik?  It returns the row with the column that has the highest value in clustering order sense, like if the collection was used as a clustering key.  While that also may have limited use, I don’t think it worthwhile to deprecate such use and all the headache that comes with doing so.
>> 
>>> 2. I think “collection_" prefixed methods are non-intuitive for discovery, and all-else equal it would be better to use MAX,MIN, etc, same as for aggregations.
>> 
>> If we actually wanted to move towards using the existing names with new meanings, then I think that would take us multiple major releases.  First deprecate existing use in current releases.  Then make it an error in the next major release X.  Then change the behavior in major release X+1.  Just switching the behavior without having a major where such queries error out would make a bunch of user queries start returning “wrong” data.
>> Also I don’t think those functions being cross row aggregations for some column types, but within row collection operations for other types, is any more intuitive, and actually would be more confusing.  So I am -1 on using the same names.
>> 
>>> 3. I think it is peculiar to permit methods named collection_ to operate over non-collection types when they are explicitly collection variants.
>> 
>> While I could see some point to this, I do not think it would be confusing for something named collection_XXX to treat a non-collection as a collection of 1.  But maybe there is a better name for these functions.  Rather than seeing them as collection variants, we should see them as variants that operate on the data in a single row, rather than aggregating across multiple rows.  But even with that perspective I don’t know what the best name would be.
>> 
>>>> On Dec 6, 2022, at 7:30 AM, Benedict <bened...@apache.org> wrote:
>>> 
>>> Thanks Andres, I think community input on direction here will be invaluable. There’s a bunch of interrelated tickets, and my opinions are as follows:
>>> 
>>> 1. I think it is a mistake to offer a function MAX that operates over rows containing collections, returning the collection with the most elements. This is just a nonsensical operation to support IMO. We should decide as a community whether we “fix” this aggregat

Re: Aggregate functions on collections, collection functions and MAXWRITETIME

2022-12-06 Thread J. D. Jordan
If the functionality truly has never actually worked, then throwing an error 
that MAX is not supported for collections seems reasonable.

But we should throw an error, I do not think we should have functions that 
aggregate across rows and functions that operate within a row use the same name.

My expectation as a user would be that MAX either always aggregates across 
rows, so results in a single row of output or always operates within a row, so 
returns the full set of rows matching the query.

So if we want a max that aggregates across rows that works for collections we 
could change it to return the aggregated max across all rows. Or we just leave 
it as an error and if someone wants the max across all rows they would ask for 
MAX(COLLECTION_MAX(column)). Yes I still agree COLLECTION_MAX may be a bad name.
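
To make the distinction concrete, here is a quick CQL sketch. COLLECTION_MAX is only the placeholder name from this thread, not an existing function, and the table is made up for illustration:

-- hypothetical sketch; COLLECTION_MAX is a placeholder name, not real CQL today
CREATE TABLE readings (
    sensor text,
    day date,
    temps list<int>,
    PRIMARY KEY (sensor, day)
);

-- MAX aggregates across rows: a single row of output, picking the "largest" list
-- by the type's comparison order.
SELECT MAX(temps) FROM readings WHERE sensor = 's1';

-- COLLECTION_MAX would operate within each row: one result per matching row,
-- the largest element of that row's list.
SELECT COLLECTION_MAX(temps) FROM readings WHERE sensor = 's1';

-- Combining the two gives the largest element across all matching rows.
SELECT MAX(COLLECTION_MAX(temps)) FROM readings WHERE sensor = 's1';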

> On Dec 6, 2022, at 11:55 AM, Benedict  wrote:
> 
> As far as I am aware it has never worked in a release, and so deprecating it 
> is probably not as challenging as you think. Only folk that have been able to 
> parse the raw bytes of the collection in storage format would be affected - 
> which we can probably treat as zero.
> 
> 
>> On 6 Dec 2022, at 17:31, Jeremiah D Jordan  wrote:
>> 
>> 
>>> 
>>> 1. I think it is a mistake to offer a function MAX that operates over rows 
>>> containing collections, returning the collection with the most elements. 
>>> This is just a nonsensical operation to support IMO. We should decide as a 
>>> community whether we “fix” this aggregation, or remove it.
>> 
>> The current MAX function does not work this way afaik?  It returns the row 
>> with the column that has the highest value in clustering order sense, like 
>> if the collection was used as a clustering key.  While that also may have 
>> limited use, I don’t think it worthwhile to deprecate such use and all the 
>> headache that comes with doing so.
>> 
>>> 2. I think “collection_" prefixed methods are non-intuitive for discovery, 
>>> and all-else equal it would be better to use MAX,MIN, etc, same as for 
>>> aggregations.
>> 
>> If we actually wanted to move towards using the existing names with new 
>> meanings, then I think that would take us multiple major releases.  First 
>> deprecate existing use in current releases.  Then make it an error in the 
>> next major release X.  Then change the behavior in major release X+1.  Just 
>> switching the behavior without having a major where such queries error out 
>> would make a bunch of user queries start returning “wrong” data.
>> Also I don’t think those functions being cross row aggregations for some 
>> column types, but within row collection operations for other types, is any 
>> more intuitive, and actually would be more confusing.  So I am -1 on using 
>> the same names.
>> 
>>> 3. I think it is peculiar to permit methods named collection_ to operate 
>>> over non-collection types when they are explicitly collection variants.
>> 
>> While I could see some point to this, I do not think it would be confusing 
>> for something named collection_XXX to treat a non-collection as a collection 
>> of 1.  But maybe there is a better name for these functions.  Rather than 
>> seeing them as collection variants, we should see them as variants that 
>> operate on the data in a single row, rather than aggregating across multiple 
>> rows.  But even with that perspective I don’t know what the best name would 
>> be.
>> 
 On Dec 6, 2022, at 7:30 AM, Benedict  wrote:
>>> 
>>> Thanks Andres, I think community input on direction here will be 
>>> invaluable. There’s a bunch of interrelated tickets, and my opinions are as 
>>> follows:
>>> 
>>> 1. I think it is a mistake to offer a function MAX that operates over rows 
>>> containing collections, returning the collection with the most elements. 
>>> This is just a nonsensical operation to support IMO. We should decide as a 
>>> community whether we “fix” this aggregation, or remove it.
>>> 2. I think “collection_" prefixed methods are non-intuitive for discovery, 
>>> and all-else equal it would be better to use MAX,MIN, etc, same as for 
>>> aggregations.
>>> 3. I think it is peculiar to permit methods named collection_ to operate 
>>> over non-collection types when they are explicitly collection variants.
>>> 
>>> Given (1), (2) becomes simple except for COUNT which remains ambiguous, but 
>>> this could be solved by either providing a separate method for collections 
>>> (e.g. SIZE) which seems fine to me, or by offering a precedence order for 
>>> matching and a keyword for overriding the precedence order (e.g. 
>>> COUNT(collection AS COLLECTION)).
>>> 
>>> Given (2), (3) is a little more difficult. However, I think this can be 
>>> solved several ways. 
>>> - We could permit explicit casts to collection types, that for a collection 
>>> type would be a no-op, and for a single value would create a collection
>>> - With precedence orders, by always selecting the scalar function last
>>> - By permitting 

Re: [DISCUSS] Adding dependency to the Big-Math library from eobermuhlner

2022-11-28 Thread J. D. Jordan
Seems well maintained and MIT licensed. +1 from me.
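
For reference, the kind of CQL this unlocks on decimal columns would look roughly like the following (table and column names are invented for illustration):

SELECT abs(price), exp(price), log(price), log10(price), round(price)
FROM products
WHERE id = 1;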

> On Nov 28, 2022, at 6:35 PM, Derek Chen-Becker  wrote:
> 
> Overall the library appears to be high quality, even going so far as
> to have regression tests between versions. I do, however, think that
> the long-term maintenance risk needs to be acknowledged here. I think
> the *absolute* worst case scenario would be corruption or deletion of
> source and artifacts, requiring the Cassandra community to either
> re-implement the functions or remove them from CQL support. More
> likely (having seen it happen once or twice) would be abandonment by
> the author, requiring a fork and maintenance along with some
> dependency modification (assuming the fork would have to be published
> under a different group/artifact ID). I'm +1 for the addition of the
> dependency on the basis of the apparent stability of the project, and
> with the understanding that the functionality offered here is
> ancillary to core Cassandra functionality.
> 
> Cheers,
> 
> Derek
> 
>> On Mon, Nov 28, 2022 at 5:30 AM Josh McKenzie  wrote:
>> 
>> I'm pleased with the rigor he shows on his explanations of implementation 
>> and performance: http://obermuhlner.ch/wordpress/2016/06/02/bigdecimalmath/
>> 
>> Seems like it's probably stable given the infrequency of changes to it and 
>> he's still actively merging patches submit by others: 
>> https://github.com/eobermuhlner/big-math/commits/master as of 8 days ago. 
>> Only 4 issues open on the repo at this time as well for a reasonably starred 
>> / forked library.
>> 
>> I guess my one concern: this appears to be a library maintained primarily by 
>> 1 person; that's a worst-case bus factor. Should he abandon the project is 
>> it something we'd plan to fork and bring into tree and maintain ourselves? 
>> Given how mature and stable it is I wouldn't be too worried, but worth 
>> considering the worst-case.
>> 
>> 
>> On Mon, Nov 28, 2022, at 3:48 AM, Benjamin Lerer wrote:
>> 
>> Hi everybody,
>> 
>> I wanted to discuss the addition of the Big-Math 
>> library(http://eobermuhlner.github.io/big-math/)  as a dependency by  
>> CASSANDRA-17221 which add support for abs, exp, log, log10, and round Math 
>> function. The library was added for providing those functions for the 
>> Cassandra decimal type (java BigDecimal).
>> 
>> This patch has been started a long time ago and went through multiple rounds 
>> of reviews and rework. In my enthusiasm to finally commit this patch I 
>> forgot to raise the discussion to the mailing list about the dependency. I 
>> apologize for that.
>> 
>> Does anybody have some concerns with the addition of that Library as a 
>> dependency?
>> 
>> 
> 
> 
> -- 
> +---+
> | Derek Chen-Becker |
> | GPG Key available at https://keybase.io/dchenbecker and   |
> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
> +---+


Re: [VOTE] Release Apache Cassandra 4.1-rc1

2022-11-18 Thread J. D. Jordan
+1 nb

On Nov 18, 2022, at 10:55 AM, Benjamin Lerer wrote:

+1

On Fri, 18 Nov 2022 at 16:50, Mick Semb Wever wrote:

+1

Checked
- signing correct
- checksums are correct
- source artefact builds (JDK 8+11)
- binary artefact runs (JDK 8+11)
- debian package runs (JDK 8+11)
- debian repo runs (JDK 8+11)
- redhat* package runs (JDK 8+11)
- redhat* repo runs (JDK 8+11)

On Fri, 18 Nov 2022 at 14:27, Berenguer Blasi wrote:

+1

On 18/11/22 13:37, Aleksey Yeshchenko wrote:

+1

On 18 Nov 2022, at 12:10, Mick Semb Wever wrote:

Proposing the test build of Cassandra 4.1-rc1 for release.

sha1: d6822c45ae3d476bc2ff674cedf7d4107b8ca2d0
Git: https://gitbox.apache.org/repos/asf?p=cassandra.git;a=shortlog;h=refs/tags/4.1-rc1-tentative
Maven Artifacts: https://repository.apache.org/content/repositories/orgapachecassandra-1280/org/apache/cassandra/cassandra-all/4.1-rc1/

The Source and Build Artifacts, and the Debian and RPM packages and repositories, are available here: https://dist.apache.org/repos/dist/dev/cassandra/4.1-rc1/

The vote will be open for 72 hours (longer if needed). Everyone who has tested the build is invited to vote. Votes by PMC members are considered binding. A vote passes if there are at least three binding +1s and no -1's.

[1]: CHANGES.txt: https://gitbox.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=CHANGES.txt;hb=refs/tags/4.1-rc1-tentative
[2]: NEWS.txt: https://gitbox.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=NEWS.txt;hb=refs/tags/4.1-rc1-tentative

Re: Should we change 4.1 to G1 and offheap_objects ?

2022-11-17 Thread J. D. Jordan
-1 on providing a bunch of choices and forcing users to pick one. We should 
have a default and it should be “good enough” for most people. The people who 
want to dig in and try other gc settings can still do it, and we could provide 
them some profiles to start from, but there needs to be a default.  We need to 
be asking new operators less questions on install, not more.

Re:experience with Shenandoah under high load, I have in the past seen the 
exact same thing for both Shenandoah and ZGC. Both of them have issues at high 
loads while performing great at moderate loads. I have not seen G1 ever have 
such issues. So I would not be fine with a switch to Shenandoah or ZGC as the 
default without extensive testing on current JVM versions that have hopefully 
improved the behavior under load.

> On Nov 17, 2022, at 9:39 AM, Joseph Lynch  wrote:
> It seems like this is a choice most users might not know how to make?
> 
> On Thu, Nov 17, 2022 at 7:06 AM Josh McKenzie  wrote:
>> 
>> Have we ever discussed including multiple profiles that are simple to swap 
>> between and documented for their tested / intended use cases?
>> 
>> Then the burden of having a “sane” default for the wild variance of 
>> workloads people use it for would be somewhat mitigated. Sure, there’s 
>> always going to be folks that run the default and never think to change it 
>> but the UX could be as simple as a one line config change to swap between GC 
>> profiles and we could add and deprecate / remove over time.
>> 
>> Concretely, having config files such as:
>> 
>> jvm11-CMS-write.options
>> jvm11-CMS-mixed.options
>> jvm11-CMS-read.options
>> jvm11-G1.options
>> jvm11-ZGC.options
>> jvm11-Shen.options
>> 
>> 
>> Arguably we could take it a step further and not actually allow a C* node to 
>> startup without pointing to one of the config files from your primary 
>> config, and provide a clean mechanism to integrate that selection on 
>> headless installs.
>> 
>> Notably, this could be a terrible idea. But it does seem like we keep 
>> butting up against the complexity and mixed pressures of having the One True 
>> Way to GC via the default config and the lift to change that.
>> 
>> On Wed, Nov 16, 2022, at 9:49 PM, Derek Chen-Becker wrote:
>> 
>> I'm fine with not including G1 in 4.1, but would we consider inclusion
>> for 4.1.X down the road once validation has been done?
>> 
>> Derek
>> 
>> 
>> On Wed, Nov 16, 2022 at 4:39 PM David Capwell  wrote:
>>> Getting poked in Slack to be more explicit in this thread…
>>> Switching to G1 on trunk, +1
>>> Switching to G1 on 4.1, -1.  4.1 is about to be released and this isn’t a 
>>> bug fix but a perf improvement ticket and as such should go through 
>>> validation that the perf improvements are seen, there is not enough time 
>>> left for that added performance work burden so strongly feel it should be 
>>> pushed to 4.2/5.0 where it has plenty of time to be validated against.  The 
>>> ticket even asks to avoid validating the claims; saying 'Hoping we can skip 
>>> due diligence on this ticket because the data is "in the past” already”'.  
>>> Others have attempted both shenandoah and ZGC and found mixed results, so 
>>> nothing leads me to believe that won’t be true here either.
>>>> On Nov 16, 2022, at 9:15 AM, J. D. Jordan  
>>>> wrote:
>>>> Heap -
>>>> +1 for G1 in trunk
>>>> +0 for G1 in 4.1 - I think it’s worthwhile and fairly well tested but I 
>>>> understand pushback against changing this so late in the game.
>>>> Memtable -
>>>> -1 for off heap in 4.1. I think this needs more testing and isn’t 
>>>> something to change at the last minute.
>>>> +1 for running performance/fuzz tests against the alternate memtable 
>>>> choices in trunk and switching if they don’t show regressions.
>>>>> On Nov 16, 2022, at 10:48 AM, Josh McKenzie  wrote:
>>>>> 
>>>>> To clarify: -0 here on G1 as default for 4.1 as well; I'd like us to 
>>>>> prioritize digging into G1's behavior on small heaps vs. CMS w/our 
>>>>> default tuning sooner rather than later. With that info I'd likely be a 
>>>>> strong +1 on the shift.
>>>>> -1 on switching to offheap_objects for 4.1 RC; again, think this is just 
>>>>> a small step away from being a +1 w/some more rigor around seeing the 
>>>>> current state of the technology's intersections.
>>>>> On Wed, Nov 16, 2022, at 7:47 AM, Aleksey Yeshche

Re: Should we change 4.1 to G1 and offheap_objects ?

2022-11-16 Thread J. D. Jordan
Heap -
+1 for G1 in trunk
+0 for G1 in 4.1 - I think it’s worthwhile and fairly well tested but I understand pushback against changing this so late in the game.

Memtable -
-1 for off heap in 4.1. I think this needs more testing and isn’t something to change at the last minute.
+1 for running performance/fuzz tests against the alternate memtable choices in trunk and switching if they don’t show regressions.

On Nov 16, 2022, at 10:48 AM, Josh McKenzie wrote:

To clarify: -0 here on G1 as default for 4.1 as well; I'd like us to prioritize digging into G1's behavior on small heaps vs. CMS w/our default tuning sooner rather than later. With that info I'd likely be a strong +1 on the shift.

-1 on switching to offheap_objects for 4.1 RC; again, think this is just a small step away from being a +1 w/some more rigor around seeing the current state of the technology's intersections.

On Wed, Nov 16, 2022, at 7:47 AM, Aleksey Yeshchenko wrote:

All right. I’ll clarify then.

-0 on switching the default to G1 *this late* just before RC1.
-1 on switching the default to offheap_objects *for 4.1 RC1*, but all for it in principle, for 4.2, after we run some more tests and resolve the concerns raised by Jeff.

Let’s please try to avoid this kind of super late defaults switch going forward?

—AY

> On 16 Nov 2022, at 03:27, Derek Chen-Becker wrote:
> 
> For the record, I'm +100 on G1. Take it with whatever sized grain of salt you think appropriate for a relative newcomer to the list, but I've spent my last 7-8 years dealing with the intersection of high-throughput, low latency systems and their interaction with GC, and in my personal experience G1 outperforms CMS in all cases and with significantly less work (zero work, in many cases). The only things I've seen perform better *with a similar heap footprint* are GenShen (currently experimental) and Rust (beyond the scope of this topic).
> 
> Derek
> 
> On Tue, Nov 15, 2022 at 4:51 PM Jon Haddad wrote:
>> 
>> I'm curious what it would take for folks to be OK with merging this into 4.1?  How much additional time would you want to feel comfortable?
>> 
>> I should probably have been a little more vigorous in my +1 of Mick's PR.  For a little background - I worked on several hundred clusters while at TLP, mostly dealing with stability and performance issues.  A lot of them stemmed partially or wholly from the GC settings we ship in the project. ParNew with CMS and a small new gen results in a lot of premature promotion, leading to high pause times into the hundreds of ms, which pushes p99 latency through the roof.
>> 
>> I'm a big +1 in favor of G1 because it's not just better for most people but it's better for _every_ new Cassandra user.  The first experience that people have with the project is important, and our current GC settings are quite bad - so bad they lead to problems with stability in production.  The G1 settings are mostly hands off, result in shorter pause times and are a big improvement over the status quo.
>> 
>> Most folks don't do GC tuning, they use what we supply, and what we currently supply leads to a poor initial experience with the database.  I think we owe the community our best effort even if it means pushing the release back a little bit.
>> 
>> Just for some additional context, we're (Netflix) running 25K nodes on G1 across a variety of hardware in AWS with wildly varying workloads, and I haven't seen G1 be the root cause of a problem even once.  The settings that Mick is proposing are almost identical to what we use (we use half of heap up to 30GB).
>> 
>> I'd really appreciate it if we took a second to consider the community effect of another release that ships settings that cause significant pain for our users.
>> 
>> Jon
>> 
>> On 2022/11/10 21:49:36 Mick Semb Wever wrote:
>>> > In case of GC, reasonably extensive performance testing should be the expectation. Potentially revisiting some of the G1 params for the 4.1 reality - quite a lot has changed since those optional defaults were picked.
>>> 
>>> I've put our battle-tested g1 opts (from consultants at TLP and DataStax) in the patch for CASSANDRA-18027
>>> 
>>> In reality it is really not much of a change, g1 does make it simple. Picking the correct ParallelGCThreads and ConcGCThreads and the floor to the new heap (XX:NewSize) is still required, though we could do a much better job of dynamic defaults for them.
>>> 
>>> Alex Dejanovski's blog is a starting point: https://thelastpickle.com/blog/2020/06/29/cassandra_4-0_garbage_collectors_performance_benchmarks.html where this gc opt set was used (though it doesn't prove why those options are chosen)
>>> 
>>> The bar for objection to sneaking these into 4.1 was intended to be low, and I stand by those that raise concerns.
> 
> -- 
> +---+
> | Derek Chen-Becker |
> | GPG Key available at

Re: CEP-23: Enhancement for Sparse Data Serialization

2022-10-27 Thread J. D. Jordan
No vote required. Just add a comment on it.

On Oct 25, 2022, at 10:51 AM, Claude Warren, Jr via dev wrote:

I see that there is one proposal that was discarded.  I wonder how that got there.

On Tue, Oct 25, 2022 at 2:52 PM Josh McKenzie wrote:

... I don't know that we've navigated that question before. My immediate reaction is as the proposer you should be able to close it down unless it's gone to a vote and/or a vote has passed.

If someone else wants to pick it up later that's fine.

On Tue, Oct 25, 2022, at 7:35 AM, Claude Warren, Jr via dev wrote:

I would like to discard CEP-23.  As I am the proposer, is a vote required?  What is the process?

Claude


Re: [VOTE] CEP-20: Dynamic Data Masking

2022-09-19 Thread J. D. Jordan
+1 nb

> On Sep 19, 2022, at 6:50 AM, Berenguer Blasi  wrote:
> 
> +1
> 
>> On 19/9/22 13:39, Brandon Williams wrote:
>> +1
>> 
>> Kind Regards,
>> Brandon
>> 
>>> On Mon, Sep 19, 2022 at 6:39 AM Andrés de la Peña  
>>> wrote:
>>> Hi everyone,
>>> 
>>> I'd like to propose CEP-20 for approval.
>>> 
>>> Proposal: 
>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-20%3A+Dynamic+Data+Masking
>>> Discussion: https://lists.apache.org/thread/qsmxsymozymy6dy9tp5xw9gn5fhz9nt4
>>> 
>>> The vote will be open for 72 hours.
>>> Votes by committers are considered binding.
>>> A vote passes if there are at least three binding +1s and no binding vetoes.
>>> 
>>> Thank you,


Re: [DISCUSS] Removing support for java 8

2022-08-29 Thread J. D. Jordan
+1 for removing on trunk. Pretty sure we already discussed that in the Java 17 
thread?  That trunk will move to 11+17?

> On Aug 29, 2022, at 3:40 PM, Blake Eggleston  wrote:
> 
> Sorry, I meant trunk, not 4.1 :)
> 
>> On Aug 29, 2022, at 1:09 PM, Blake Eggleston  wrote:
>> 
>> Hi all, I wanted to propose removing jdk8 support for 4.1. Active support 
>> ended back in March of this year, and I believe the community has built 
>> enough confidence in java 11 to make it an uncontroversial change for our 
>> next major release. Let me know what you think.
>> 
>> Thanks,
>> 
>> Blake
> 


Re: Inclusive/exclusive endpoints when compacting token ranges

2022-07-26 Thread J. D. Jordan
I like the third option, especially if it makes it consistent with repair, 
which has supported ranges longer and I would guess most people would think the 
compact ranges work the same as the repair ranges.

-Jeremiah Jordan

> On Jul 26, 2022, at 6:49 AM, Andrés de la Peña  wrote:
> 
> 
> Hi all,
> 
> CASSANDRA-17575 has detected that token ranges in nodetool compact are 
> interpreted as closed on both sides. For example, the command "nodetool 
> compact -st 10 -et 50" will compact the tokens in [10, 50]. This way of 
> interpreting token ranges is unusual since token ranges are usually 
> half-open, and I think that in the previous example one would expect that the 
> compacted tokens would be in (10, 50]. That's for example the way nodetool 
> repair works, and indeed the class org.apache.cassandra.dht.Range is always 
> half-open.
> 
> It's worth mentioning that, differently from nodetool repair, the help and 
> doc for nodetool compact doesn't specify whether the supplied start/end 
> tokens are inclusive or exclusive.
> 
> I think that ideally nodetool compact should interpret the provided token 
> ranges as half-open, to be consistent with how token ranges are usually 
> interpreted. However, this would change the way the tool has worked until 
> now. This change might be problematic for existing users relying on the old 
> behaviour. That would be especially severe for the case where the begin and 
> end token are the same, because interpreting [x, x] we would compact a single 
> token, whereas I think that interpreting (x, x] would compact all the tokens. 
> As for compacting ranges including multiple tokens, I think the change 
> wouldn't be so bad, since probably the supplied token ranges come from tools 
> that are already presenting the ranges as half-open. Also, if we are 
> splitting the full ring into smaller ranges, half-open intervals would still 
> work and would save us some repetitions.
> 
> So my question is: Should we change the behaviour of nodetool compact to 
> interpret the token ranges as half-opened, aligning it with the usual 
> interpretation of ranges? Or should we just document the current odd 
> behaviour to prevent compatibility issues?
> 
> A third option would be changing to half-opened ranges and also forbidding 
> ranges where the begin and end token are the same, to prevent the accidental 
> compaction of the entire ring. Note that nodetool repair also forbids this 
> type of token ranges.
> 
> What do you think?


Re: Welcome Jacek Lewandowski as Cassandra committer

2022-07-06 Thread J. D. Jordan
Congrats!

> On Jul 6, 2022, at 7:20 AM, Berenguer Blasi  wrote:
> 
> Congrats Sir! :-)
> 
>> On 6/7/22 14:00, Benjamin Lerer wrote:
>> The PMC members are pleased to announce that  Jacek Lewandowski has accepted
>> the invitation to become committer.
>> 
>> Thanks a lot, Jacek,  for everything you have done!
>> 
>> Congratulations and welcome
>> 
>> The Apache Cassandra PMC members


Re: How we flag tickets as blockers during freeze

2022-05-10 Thread J. D. Jordan
+1 from me.

> On May 10, 2022, at 9:17 AM, Josh McKenzie  wrote:
> 
> 
>> 
>> at some later point it needs to be "easy" for
>> someone else to correct it.
> I don't want to optimize for cleaning up later; I want to optimize for our 
> ability to know our workload blocking our next release and encouraging 
> contributors to focus their efforts if they're so inclined.
> 
> That said, I'm in favor now of adding the unreleased versions for -alpha, 
> -beta, and -rc, and flipping to the major/minor on resolution. We should also 
> codify this in our release lifecycle wiki article so we don't have to revisit 
> the topic.
> 
> I think this solution is compatible with what everyone on the thread has said 
> thus far, so if nobody has any major concerns, later today I will:
> 
> 1. Add a 4.1-alpha, 4.1-beta, and 4.1-rc FixVersion (unreleased)
> 2. Update fixversion on tickets that are blocking each release respectively 
> based on our lifecycle process
> 3. Update our kanban board to have swimlanes for each phase of the release
> 4. Update the lifecycle cwiki w/this process for future releases
> 
> ~Josh
> 
>> On Tue, May 10, 2022, at 2:23 AM, Mick Semb Wever wrote:
>> > Why do you need to change anything post release?  The whole point is to 
>> > set the version to the release the ticket blocks. So you don’t need to 
>> > change anything.
>> >
>> 
>> 
>> There's always many issues left with the wrong fixVersion. And we
>> can't police that. So at some later point it needs to be "easy" for
>> someone else to correct it.
>> 
> 


Re: How we flag tickets as blockers during freeze

2022-05-09 Thread J. D. Jordan
Why do you need to change anything post release?  The whole point is to set the 
version to the release the ticket blocks. So you don’t need to change anything.

> On May 9, 2022, at 8:03 PM, Mick Semb Wever  wrote:
> 
> Jeremiah, around when was this? I can see that it makes sense (works in 
> theory), but trying to correct fixVersions in jira post release can be quite 
> the headache, without having to reach out to people to understand if 
> something is intentional or a mistake. So long as there's a way to bulk 
> change issues after a release I am happy.


Re: How we flag tickets as blockers during freeze

2022-05-09 Thread J. D. Jordan
I would vote for option 1. We have done similar in the past and if something is 
a blocker it means it will be in that version before it is released. So there 
should not be any confusion of things getting bumped forward to another patch 
number because they were not committed in time, which is where confusion 
usually arises from.

> On May 9, 2022, at 4:07 PM, Mick Semb Wever  wrote:
> 
> 
>> Any other opinions or ideas out there? Would like to tidy our tickets up as 
>> build lead and scope out remaining work for 4.1.
> 
>  
> My request is that we don't overload fixVersions. That is, a fixVersion is 
> either for resolved tickets, or a placeholder for unresolved, but never both.
> This makes it easier with jira hygiene post release, ensuring issues do get 
> properly assigned their correct fixVersion. (This work can be many tickets 
> and already quite cumbersome, but it is valued by users.)
> 
> It would also be nice to try keep what is a placeholder fixVersion as 
> intuitively as possible. The easiest way I see us doing this is to avoid 
> using patch numbers. This rules out Option 1. 
> 
> While the use of 4.0 and 4.1 as resolved fixVersions kinda breaks the above 
> notion of "if it doesn't have a patch version then it's a placeholder". The 
> precedence here is that all resolved tickets before the first .0 of a major 
> gets this short-hand version (and often in addition to the alpha1, beta1, rc1 
> fixVersions).
> 
> 
> 


Re: Adding a security role to grant/revoke with no access to the data itself

2022-03-30 Thread J. D. Jordan
I think these are very interesting ideas for another new feature. Would one of 
you like to write it up as a JIRA and start a new thread to discuss details?  I 
think it would be good to keep this thread about the simpler proposal from 
CASSANDRA-17501 unless you all are against implementing that without the new 
abilities you are proposing?  This “requires N grants” idea seems to me to be 
orthogonal to the original ticket.


> On Mar 30, 2022, at 10:00 AM, Stefan Miklosovic 
>  wrote:
> 
> btw there is also an opposite problem, you HAVE TO have two guys (out
> of two) to grant access. What if one of them is not available because
> he went on holiday? So it might be wise to say "if three out of five
> admins grant access that is enough", how would you implement it?
> 
>> On Wed, 30 Mar 2022 at 16:56, Stefan Miklosovic
>>  wrote:
>> 
>> Why not N guys instead of two? Where does this stop? "2" seems to be
>> an arbitrary number. This starts to remind me of Shamir's shared
>> secrets.
>> 
>> https://en.wikipedia.org/wiki/Shamir%27s_Secret_Sharing
>> 
>>> On Wed, 30 Mar 2022 at 16:36, Tibor Répási  wrote:
>>> 
>>> … TWO_MAN_RULE could probably be poor naming and a boolean option not 
>>> flexible enough, let’s change that to an integer option like GRANTORS 
>>> defaulting 1 and could be any higher defining the number of grantors needed 
>>> for the role to become active.
>>> 
 On 30. Mar 2022, at 16:11, Tibor Répási  wrote:
>>> 
>>> Having two-man rules in place for authorizing access to highly sensitive 
>>> data is not uncommon. I think about something like:
>>> 
>>> As superuser:
>>> 
>>> CREATE KEYSPACE patientdata …;
>>> 
>>> CREATE ROLE patientdata_access WITH TWO_MAN_RULE=true;
>>> 
>>> GRANT SELECT, MODIFY ON patientdata TO patientdata_access;
>>> 
>>> CREATE ROLE security_admin;
>>> GRANT AUTHORIZE patientdata_access TO security_admin;
>>> 
>>> GRANT security_admin TO admin_guy1;
>>> 
>>> GRANT security_admin TO admin_guy2;
>>> 
>>> As admin_guy1:
>>> 
>>> GRANT patientdata_access TO doctor_house;
>>> 
>>> at this point doctor_house doesn’t have access to patientdata, it needs 
>>> admin_guy2 to:
>>> 
>>> GRANT patientdata_access TO doctor_house;
>>> 
>>> 
>>> 
>>> 
 On 30. Mar 2022, at 15:13, Benjamin Lerer  wrote:
>>> 
 What would prevent the security_admin from self-authorizing himself?
>>> 
>>> 
>>> It is a valid point. :-) The idea is to have some mechanisms in place to 
>>> prevent that kind of behavior.
>>> Of course people might still be able to collaborate to get access to some 
>>> data but a single person should not be able to do that all by himself.
>>> 
>>> 
>>> On Wed, 30 Mar 2022 at 14:52, Tibor Répási wrote:
 
 I like the idea of separation of duties. But, wouldn’t a security_admin 
 role be just a select and modify permission on system_auth? What would 
 prevent the security_admin from self-authorizing himself?
 
 Would it be possible to add some sort of two-man rule?
 
 On 30. Mar 2022, at 10:44, Berenguer Blasi  
 wrote:
 
 Hi all,
 
 I would like to propose to add support for a sort of a security role that 
 can grant/revoke
 permissions to a user to a resource (KS, table,...) but _not_ access the 
 data in that resource itself. Data may be sensitive,
 have legal constrains, etc but this separation of duties should enable 
 that. Think of a hospital where
 IT can grant/revoke permissions to doctors but IT should _not_ have access 
 to the data itself.
 
 I have created https://issues.apache.org/jira/browse/CASSANDRA-17501 with 
 more details. If anybody has
 any concerns or questions with this functionality I will be happy to 
 discuss them.
 
 Thx in advance.
 
 
>>> 
>>> 


Re: Adding a security role to grant/revoke with no access to the data itself

2022-03-30 Thread J. D. Jordan
I think this is an important step in the authorization model of C*.  It brings 
parity with many other databases.

While further restrictions might make such a policy harder to work around, in 
most places I have heard of, audit logging of user management statements is how 
you prevent that.  With this type of restriction plus audit logs of all user 
management, you can show that an admin has not accessed data through their admin 
account.

The ability to have an even more restrictive mode would be a nice future add on.
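
As a rough sketch of the separation being proposed (role, keyspace and table names are invented here, and the exact semantics are what CASSANDRA-17501 would define, i.e. holding AUTHORIZE without SELECT/MODIFY being enough to manage access):

-- hypothetical sketch; it_admin can manage access but cannot read the data
CREATE ROLE it_admin WITH LOGIN = true AND PASSWORD = 'xxx';
GRANT AUTHORIZE ON KEYSPACE patientdata TO it_admin;

-- run as it_admin: allowed, even though it_admin holds no SELECT/MODIFY itself
GRANT SELECT ON KEYSPACE patientdata TO doctor_house;

-- run as it_admin: still rejected, since it_admin has no SELECT on the data
SELECT * FROM patientdata.records;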

-Jeremiah

> On Mar 30, 2022, at 8:13 AM, Benjamin Lerer  wrote:
> 
> 
>> What would prevent the security_admin from self-authorizing himself?
> 
> It is a valid point. :-) The idea is to have some mechanisms in place to 
> prevent that kind of behavior.
> Of course people might still be able to collaborate to get access to some 
> data but a single person should not be able to do that all by himself. 
> 
> 
>> Le mer. 30 mars 2022 à 14:52, Tibor Répási  a écrit :
>> I like the idea of separation of duties. But, wouldn’t a security_admin 
>> role be just a select and modify permission on system_auth? What would 
>> prevent the security_admin from self-authorizing himself?
>> 
>> Would it be possible to add some sort of two-man rule?
>> 
>>> On 30. Mar 2022, at 10:44, Berenguer Blasi  wrote:
>>> 
>>> Hi all,
>>> 
>>> I would like to propose to add support for a sort of a security role that 
>>> can grant/revoke 
>>> permissions to a user to a resource (KS, table,...) but _not_ access the 
>>> data in that resource itself. Data may be sensitive,
>>> have legal constrains, etc but this separation of duties should enable 
>>> that. Think of a hospital where
>>> IT can grant/revoke permissions to doctors but IT should _not_ have access 
>>> to the data itself.
>>> 
>>> I have created https://issues.apache.org/jira/browse/CASSANDRA-17501 with 
>>> more details. If anybody has 
>>> any concerns or questions with this functionality I will be happy to 
>>> discuss them.
>>> 
>>> Thx in advance.
>> 


Re: Welcome Aleksandr Sorokoumov as Cassandra committer

2022-03-16 Thread J. D. Jordan
Congratulations!

> On Mar 16, 2022, at 8:43 AM, Ekaterina Dimitrova  
> wrote:
> 
> 
> Great news! Well deserved! Congrats and thank you for all your support!
> 
>> On Wed, 16 Mar 2022 at 9:41, Paulo Motta  wrote:
>> Congratulations Alex, well deserved! :-)
>> 
>>> On Wed, 16 Mar 2022 at 10:15, Benjamin Lerer wrote:
>>> The PMC members are pleased to announce that Aleksandr Sorokoumov has 
>>> accepted
>>> the invitation to become committer.
>>> 
>>> Thanks a lot, Aleksandr , for everything you have done for the project.
>>> 
>>> Congratulations and welcome
>>> 
>>> The Apache Cassandra PMC members


Re: [VOTE] CEP-7: Storage Attached Index

2022-02-17 Thread J. D. Jordan
+1 nb

> On Feb 17, 2022, at 4:25 PM, Brandon Williams  wrote:
> 
> +1
> 
>> On Thu, Feb 17, 2022 at 4:23 PM Caleb Rackliffe
>>  wrote:
>> 
>> Hi everyone,
>> 
>> I'd like to call a vote to approve CEP-7.
>> 
>> Proposal: 
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-7%3A+Storage+Attached+Index
>> 
>> Discussion: https://lists.apache.org/thread/hh67k3t86m7299qkt61gmzb4h96bl90w
>> 
>> The vote will be open for 72 hours.
>> Votes by committers are considered binding.
>> A vote passes if there are at least three binding +1s and no binding vetoes.
>> 
>> Thanks!
>> Caleb


Re: [VOTE] CEP-19: Trie memtable implementation

2022-02-16 Thread J. D. Jordan
+1 nb

> On Feb 16, 2022, at 7:30 AM, Josh McKenzie  wrote:
> 
> 
> +1
> 
>> On Wed, Feb 16, 2022, at 7:33 AM, Ekaterina Dimitrova wrote:
>> +1nb
>> 
>> On Wed, 16 Feb 2022 at 7:30, Brandon Williams  wrote:
>> +1
>> 
>> On Wed, Feb 16, 2022 at 3:00 AM Branimir Lambov  wrote:
>> >
>> > Hi everyone,
>> >
>> > I'd like to propose CEP-19 for approval.
>> >
>> > Proposal: 
>> > https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-19%3A+Trie+memtable+implementation
>> > Discussion: 
>> > https://lists.apache.org/thread/fdvf1wmxwnv5jod59jznbnql23nqosty
>> >
>> > The vote will be open for 72 hours.
>> > Votes by committers are considered binding.
>> > A vote passes if there are at least three binding +1s and no binding 
>> > vetoes.
>> >
>> > Thank you,
>> > Branimir


Re: Client password hashing

2022-02-16 Thread J. D. Jordan
Can we have the discussion on the ticket?

Thanks
-Jeremiah

> On Feb 16, 2022, at 6:23 AM, Bowen Song  wrote:
> 
> To me this doesn't sound very useful. Here are a few threat models I can think 
> of that may be related to this proposal, why this does not address them, and 
> what should be done instead.
> 
> 1. passwords sent over the network in plaintext allow a passive packet 
> sniffer to learn the password
> 
> When the user logging in and authenticating themselves, they will have to 
> send both the username and password to the server in plaintext anyway.
> 
> Securing the connection with TLS should address this concern.
> 
> 2. malicious intermediaries (external loadbalancer, middleware, etc.) are able 
> to learn the password
> 
> The admin user must login against the intermediary before creating/altering 
> other users, this exposes the admin user's credentials to the malicious 
> intermediary.
> 
> Only use trusted intermediaries, and use TLS between the client & Cassandra 
> server wherever possible (e.g. don't terminate TLS at the loadbalancer).
> 
> 3. accidentally logging the password to an insecure log file
> 
> Logging a hashed password to an insecure log file is still very bad
> 
> The logger module should correctly redact the data
> 
> 
> If this proposal helps mitigating a different threat model that you have in 
> mind, please kindly share it with us.
> 
> 
>> On 16/02/2022 07:44, Berenguer Blasi wrote:
>> Hi all,
>> 
>> I would like to propose to add support for client password hashing 
>> (https://issues.apache.org/jira/browse/CASSANDRA-17334). If anybody has any 
>> concerns or question with this functionality I will be happy to discuss them.
>> 
>> Thx in advance.
>> 


Re: Welcome Anthony Grasso, Erick Ramirez and Lorina Poland as Cassandra committers

2022-02-15 Thread J. D. Jordan
Congratulations all of you! Well deserved additions.

> On Feb 15, 2022, at 12:30 PM, Brandon Williams  wrote:
> 
> Congratulations, well deserved!
> 
>> On Tue, Feb 15, 2022 at 12:13 PM Benjamin Lerer  wrote:
>> 
>> The PMC members are pleased to announce that Anthony Grasso, Erick Ramirez 
>> and Lorina Poland have accepted the invitation to become committers.
>> 
>> Thanks a lot, Anthony, Erick and Lorina for all the work you have done on 
>> the website and documentation.
>> 
>> Congratulations and welcome
>> 
>> The Apache Cassandra PMC members


Re: [DISCUSS] Hotfix release procedure

2022-02-15 Thread J. D. Jordan
Correct. No need to revert anything or keep extra branches around. You just 
checkout the tag and then make a branch with the single fix on it.

> On Feb 15, 2022, at 10:08 AM, Josh McKenzie  wrote:
> 
> 
> Was thinking that too after I wrote this. Means we'd only need to change our 
> process for future hotfixes and keep everything else as-is.
> 
>> On Tue, Feb 15, 2022, at 10:55 AM, Brandon Williams wrote:
>> On Tue, Feb 15, 2022 at 9:53 AM Josh McKenzie  wrote:
>> >
>> > The only way I'd be in favor of a release that removes all other committed 
>> > patches
>> >
>> > Couldn't we just have a snapshot branch for each supported major/minor 
>> > release branch that we patch for hotfixes and we bump up whenever we have 
>> > a GA on a parent branch?
>> 
>> I think you could just checkout the tag of the previous release into a
>> new branch and apply the fix to it.
>> 
>> Kind Regards,
>> Brandon
>> 
> 


Re: [DISCUSS] Hotfix release procedure

2022-02-15 Thread J. D. Jordan
We already advertise that we are preparing a security release when ever we 
release all of our patch versions at the same time. So I don’t think there is 
an issue there.
I was not involved in any PMC discussions and had no knowledge of the CVE, but 
when three branches got release votes at the same moment I knew one of the 
final couple patches that was on all three must be an un-announced CVE. It is 
especially more obvious when said patches mention JIRA ticket numbers with no 
information in the ticket. Nobody is being sneaky here as long as the vote and 
code are in the open.

> On Feb 15, 2022, at 9:15 AM, bened...@apache.org wrote:
> 
> 
> One issue with this approach is that we are advertising that we are preparing 
> a security release by preparing such a release candidate.
>  
> I wonder if we need to find a way to produce binaries without leaving an 
> obvious public mark (i.e. private CI, private branch)
>  
>  
> From: Josh McKenzie 
> Date: Tuesday, 15 February 2022 at 14:09
> To: dev@cassandra.apache.org 
> Subject: [DISCUSS] Hotfix release procedure
> 
> On the release thread for 4.0.2 Jeremiah brought up a point about hotfix 
> releases and CI: 
> https://lists.apache.org/thread/7zc22z5vw5b58hdzpx2nypwfzjzo3qbr
>  
> If we are making this release for a security incident/data loss/hot fix 
> reason, then I would expect to see the related change set only containing 
> those patches. But the change set in the tag here contains the latest 4.0-dev commits.
>  
> I'd like to propose that in the future, regardless of the state of CI, if we 
> need to cut a hotfix release we do so from the previous released SHA + only 
> the changes required to address the hotfix to minimally impact our end users 
> and provide them with as minimally disruptive a fix as possible.


Re: [DISCUSS] Hotfix release procedure

2022-02-15 Thread J. D. Jordan
+1. If we want to take our release quality seriously then I think this would be 
a great policy to have.

> On Feb 15, 2022, at 8:09 AM, Josh McKenzie  wrote:
> 
> 
> On the release thread for 4.0.2 Jeremiah brought up a point about hotfix 
> releases and CI: 
> https://lists.apache.org/thread/7zc22z5vw5b58hdzpx2nypwfzjzo3qbr
> 
>> If we are making this release for a security incident/data loss/hot fix 
>> reason, then I would expect to see the related change set only containing 
>> those patches. But the change set in the tag here contains the latest 4.0-dev commits.
> 
> I'd like to propose that in the future, regardless of the state of CI, if we 
> need to cut a hotfix release we do so from the previous released SHA + only 
> the changes required to address the hotfix to minimally impact our end users 
> and provide them with as minimally disruptive a fix as possible.


Re: [DISCUSS] CEP-7 Storage Attached Index

2022-02-07 Thread J. D. Jordan
Given this discussion +1 from me to move OR to its own CEP separate from the 
new index implementation.
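
For anyone skimming, the kind of query at stake looks something like the following; the index DDL is illustrative only, since the exact SAI syntax is part of what the CEP defines, and the table is invented:

CREATE CUSTOM INDEX ON users (age) USING 'StorageAttachedIndex';
CREATE CUSTOM INDEX ON users (city) USING 'StorageAttachedIndex';

-- OR across two independently indexed columns, without ALLOW FILTERING
SELECT * FROM users WHERE age = 30 OR city = 'Paris';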

> On Feb 7, 2022, at 6:51 AM, Benjamin Lerer  wrote:
> 
> 
>> This was since extended to also support ALLOW FILTERING mode as well as OR 
>> with clustering key columns.
> 
> If the code is able to support queries using clustering columns without the 
> need for filtering + filtering queries then it should be relatively easy to 
> have full support for CQL.
> We also need some proper test coverage and ideally some validation with Harry.
> 
>>   * Since OR support nevertheless is a feature of SAI, it needs to be at 
>> least unit tested, but ideally even would be exposed so that it is possible 
>> to test on the CQL level. Is there some mechanism such as experimental 
>> flags, which would allow the SAI-only OR support to be merged into trunk, 
>> while a separate CEP is focused on implementing "proper" general purpose OR 
>> support? I should note that there is no guarantee that the OR CEP would be 
>> implemented in time for the next release. So the answer to this point needs 
>> to be something that doesn't violate the desire for good user experience.
> 
> This is currently what we have with SASI. Currently SASI is behind an 
> experimental flag but nevertheless the LIKE restriction code has been 
> introduced as part of the code base and its use will result in an error 
> without a SASI index.
> SASI has been there for multiple years and we still do not support LIKE 
> restrictions for other use cases.
> I am against that approach because I do believe that it is what has led us 
> where we are today. We need to stop adding bits of CQL grammar to fulfill the 
> need of a given feature and start considering CQL as a whole.
> 
> I am in favor of moving forward with SAI without OR support until OR can be 
> properly added to CQL. 
> 
>  
>  
> 
>> On Mon, Feb 7, 2022 at 13:11, Henrik Ingo wrote:
>> Thanks Benjamin for reviewing and raising this.
>> 
>> While I don't speak for the CEP authors, just some thoughts from me:
>> 
>>> On Mon, Feb 7, 2022 at 11:18 AM Benjamin Lerer  wrote:
>> 
>>> I would like to raise 2 points regarding the current CEP proposal:
>>> 
>>> 1. There are mention of some target versions and of the removal of SASI 
>>> 
>>> At this point, we have not agreed on any version numbers and I do not feel 
>>> that removing SASI should be part of the proposal for now.
>>> It seems to me that we should see first the adoption surrounding SAI before 
>>> talking about deprecating other solutions.
>>> 
>> 
>> This seems rather uncontroversial. I think the CEP template and previous 
>> CEPs invite  the discussion on whether the new feature will or may replace 
>> an existing feature. But at the same time that's of course out of scope for 
>> the work at hand. I have no opinion one way or the other myself.
>> 
>>  
>>> 2. OR queries
>>> 
>>> It is unclear to me if the proposal is about adding OR support only for SAI 
>>> index or for other types of queries too.
>>> In the past, we had the nasty habit for CQL to provide only partially 
>>> implemented features which resulted in a bad user experience.
>>> Some examples are:
>>> * LIKE restrictions which were introduced for the needs of SASI and were 
>>> never supported for other types of queries
>>> * IS NOT NULL restrictions for MATERIALIZED VIEWS that are not supported 
>>> elsewhere
>>> * != operator only supported for conditional inserts or updates
>>> And there are unfortunately many more.
>>> 
>>> We are currently slowly trying to fix those issues and make CQL a more 
>>> mature language. By consequence, I would like that we change our way of 
>>> doing things. If we introduce support for OR it should also cover all the 
>>> other type of queries and be fully tested.
>>> I also believe that it is a feature that due to its complexity fully 
>>> deserves its own CEP.
>>> 
>> 
>> The current code that would be submitted for review after the CEP is 
>> adopted, contains OR support beyond just SAI indexes. An initial 
>> implementation first targeted only such queries where all columns in a WHERE 
>> clause using OR needed to be backed by an SAI index. This was since extended 
>> to also support ALLOW FILTERING mode as well as OR with clustering key 
>> columns. The current implementation is by no means perfect as a general 
>> purpose OR support, the focus all the time was on implementing OR support in 
>> SAI. I'll leave it to others to enumerate exactly the limitations of the 
>> current implementation.
>> 
>> Seeing that also Benedict supports your point of view, I would steer the 
>> conversation more into a project management perspective:
>> * How can we advance CEP-7 so that the bulk of the SAI code can still be 
>> added to Cassandra, so that  users can benefit from this new index type, 
>> albeit without OR?
>> * This is also an important question from the point of view that this is a 
>> large block of code that will 

Re: Recent log4j vulnerability

2021-12-14 Thread J. D. Jordan
Doesn’t hurt to upgrade. But no exploit there as far as I can see?  If someone 
can update your config files to point them to JNDI, you have worse problems 
than that.  Like they can probably update your config files to just completely 
open up JMX access or whatever else also.
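For reference, the hot reloading mentioned below is logback's configuration
scanning, controlled roughly like this in conf/logback.xml (values
illustrative); turning scan off, or restricting who can write the file,
removes the attack path described in the PoC:

    <!-- logback re-reads this file periodically when scan="true" -->
    <configuration scan="true" scanPeriod="60 seconds">
      ...
    </configuration>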

> On Dec 14, 2021, at 9:17 AM, Brandon Williams  wrote:
> 
> The POC seems to require the attacker be able to upload a file that
> overwrites the configuration, with hot reloading enabled.  We do have
> hot reloading enabled but there's no inherent way to overwrite the
> config.
> 
> That said with logback currently at 1.2.3 (in trunk), perhaps we
> should consider an upgrade for safety.
> 
>> On Tue, Dec 14, 2021 at 8:50 AM Steinmaurer, Thomas
>>  wrote:
>> 
>> Any thoughts what the logback folks have been filed here?
>> https://jira.qos.ch/browse/LOGBACK-1591
>> 
>> Thanks!
>> 
>> -Original Message-
>> From: Brandon Williams 
>> Sent: Sonntag, 12. Dezember 2021 18:56
>> To: dev@cassandra.apache.org
>> Subject: Recent log4j vulnerability
>> 
>> I replied to a user- post about this, but thought it was worth repeating it 
>> here.
>> 
>> In https://issues.apache.org/jira/browse/CASSANDRA-5883 you can see where 
>> Apache Cassandra never chose to use log4j2 (preferring logback instead), and 
>> thus is not, and has never been, vulnerable to this RCE.
>> 
>> Kind Regards,
>> Brandon
>> 
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: dev-h...@cassandra.apache.org
>> 
>> 
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: dev-h...@cassandra.apache.org
>> 
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: dev-h...@cassandra.apache.org
> 

-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org



Re: [DISCUSS] Disabling MIME-part filtering on this mailing list

2021-12-04 Thread J. D. Jordan
+1 (nb) from me.

I have always wondered why the signatures were broken. That JIRA thread is very 
enlightening on how those email features work :).

> On Dec 4, 2021, at 11:18 AM, C. Scott Andreas  wrote:
> 
> +1, this would be great to have fixed. Thanks for talking with Infra about 
> this, Bowen.
> 
>> On Dec 4, 2021, at 9:16 AM, Bowen Song  wrote:
>> 
>> Hello,
>> 
>> 
>> Currently this mailing list has MIME-part filtering turned on, which will 
>> result in "From:" address munging (appending ".INVALID" to the sender's 
>> email address) for domains enforcing strict DMARC rules, such as apple.com, 
>> zoho.com and all Yahoo.** domains. This behaviour may cause some emails 
>> being treated as spam by the recipients' email service providers, because 
>> the result "From:" address, such as "some...@yahoo.com.INVALID" is not valid 
>> and cannot be verified.
>> 
>> I have created a Jira ticket INFRA-22548 
>>  asking to change this, 
>> but the Infra team said dropping certain MIME part types is to prevent spam 
>> and harmful attachments, and would require a consensus from the project 
>> before they can make the change. Therefore I'm sending this email asking for 
>> your opinions on this.
>> 
>> To be clear, turning off the MIME-part filtering will not turn off the 
>> anti-spam and anti-virus feature on the mailing list, all emails sent to the 
>> list will still need to pass the checks before being forwarded to 
>> subscribers. Modern (since the 90s?) anti-spam and anti-virus software will scan 
>> the MIME parts too, in addition to the plain-text and/or HTML email body. 
>> Your email service provider is also almost certainly going to have their own 
>> anti-spam and anti-virus software, in addition to the one on the mailing 
>> list. The difference is whether the mailing list proactively removing MIME 
>> parts not in the predefined whitelist.
>> 
>> To help you understand the change, here's the difference between the two 
>> behaviours:
>> 
>> 
>> With the MIME-part filtering enabled (current behaviour)
>> 
>> * the mailing list will remove certain MIME-part types, such as executable 
>> file attachments, before forwarding it
>> 
>> * the mailing list will append ".INVALID" to some senders' email address
>> 
>> * the emails from the "*@*.INVALID" sender address are more likely to end up 
>> in recipients' spam folder
>> 
>> * it's harder for people to directly reply to someone who's email address 
>> has been modified in this way
>> 
>> * recipients running their own email server without anti-spam and/or 
>> anti-virus software on it have some extra protections
>> 
>> 
>> With MIME-part filtering disabled
>> 
>> * the mailing list forward all non-spam and non-infected emails as it is 
>> without changing them
>> 
>> * the mailing list will not change senders' email address
>> 
>> * the emails from this mailing list are less likely to end up in recipients' 
>> spam folder
>> 
>> * it's easier for people to directly reply to anyone in this mailing list
>> 
>> * recipients running their own email server without anti-spam and/or 
>> anti-virus software on it may be exposed to some threats
>> 
>> 
>> What's your opinion on this? Do you support or oppose disabling the 
>> MIME-part filtering on the Cassandra-dev mailing list?
>> 
>> 
>> p.s.: as you can see, my email address has the ".INVALID" appended to it by 
>> this mailing list.
>> 
>> 
>> Regards,
>> 
>> Bowen
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: dev-h...@cassandra.apache.org
> 

-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org



Re: Resurrection of CASSANDRA-9633 - SSTable encryption

2021-11-15 Thread J. D. Jordan
Another comment here. I tried to find the patch to check but couldn’t find it 
linked to the ticket. If it is not already the case: given that the TDE key class is 
pluggable in the yaml, then when a file is written, everything needed to instantiate 
the class to decrypt it should be stored in the file's metadata, just like happens 
now for compression. That way, if someone switches to a different TDE class, you 
still know to instantiate the old one to read existing files. The class from the 
yaml config should only be used for encrypting new files.
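For reference, the pluggable key provider configuration looks roughly like the
existing transparent_data_encryption_options block in cassandra.yaml, shown here
with illustrative values (today that block governs commitlog/hints encryption).
The point above is that whichever provider class wrote an SSTable should also be
recorded in that SSTable's metadata:

    transparent_data_encryption_options:
      enabled: true
      chunk_length_kb: 64
      cipher: AES/CBC/PKCS5Padding
      key_alias: testing:1
      key_provider:
        - class_name: org.apache.cassandra.security.JKSKeyProvider
          parameters:
            - keystore: conf/.keystore
              keystore_password: cassandra
              store_type: JCEKS
              key_password: cassandra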

> On Nov 14, 2021, at 3:54 PM, Stefan Miklosovic 
>  wrote:
> 
> Hey,
> 
> there are two points we are not completely sure about.
> 
> The first one is streaming. If there is a cluster of 5 nodes, each
> node has its own unique encryption key. Hence, if a SSTable is stored
> on a disk with the key for node 1 and this is streamed to node 2 -
> which has a different key - it would not be able to decrypt that. Our
> idea is to actually send data over the wire _decrypted_ however it
> would be still secure if internode communication is done via TLS. Is
> this approach good with you?
> 
> The second question is about key rotation. If an operator needs to
> roll the key because it was compromised or there is some policy around
> that, we should be able to provide some way to rotate it. Our idea is
> to write a tool (either a subcommand of nodetool (rewritesstables)
> command or a completely standalone one in tools) which would take the
> first, original key, the second, new key and dir with sstables as
> input and it would literally take the data and it would rewrite it to 
> the second set of sstables which would be encrypted with the second
> key. What do you think about this?
> 
> Regards
> 
>> On Sat, 13 Nov 2021 at 19:35,  wrote:
>> 
>> Same reaction here - great to have traction on this ticket. Shylaja, thanks 
>> for your work on this and to Stefan as well! It would be wonderful to have 
>> the feature complete.
>> 
>> One thing I’d mention is that a lot’s changed about the project’s testing 
>> strategy since the original patch was written. I see that the 2016 version 
>> adds a couple round-trip unit tests with a small amount of static data. It 
>> would be good to see randomized tests fleshed out that exercise more of the 
>> read/write path; or which add variants of existing read/write path tests 
>> that enable encryption.
>> 
>> – Scott
>> 
 On Nov 13, 2021, at 7:53 AM, Brandon Williams  wrote:
>>> 
>>> We already have a ticket and this predated CEPs, and being an
>>> obviously good improvement to have that many have been asking for for
>>> some time now, I don't see the need for a CEP here.
>>> 
>>> On Sat, Nov 13, 2021 at 5:01 AM Stefan Miklosovic
>>>  wrote:
 
 Hi list,
 
 an engineer from Intel - Shylaja Kokoori (who is watching this list
 closely) has retrofitted the original code from CASSANDRA-9633 work in
 times of 3.4 to the current trunk with my help here and there, mostly
 cosmetic.
 
 I would like to know if there is a general consensus about me going to
 create a CEP for this feature or what is your perception on this. I
 know we have it a little bit backwards here as we should first discuss
 and then code but I am super glad that we have some POC we can
 elaborate further on and CEP would just cement  and summarise the
 approach / other implementation aspects of this feature.
 
 I think that having 9633 merged will fill quite a big operational gap
 when it comes to security. There are a lot of enterprises who desire
 this feature so much. I can not remember when I last saw a ticket with
 50 watchers which was inactive for such a long time.
 
 Regards
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
 For additional commands, e-mail: dev-h...@cassandra.apache.org
 
>>> 
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
>>> For additional commands, e-mail: dev-h...@cassandra.apache.org
>>> 
>> 
>> 
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: dev-h...@cassandra.apache.org
>> 
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: dev-h...@cassandra.apache.org
> 

-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org



Re: [VOTE] CEP-17: SSTable format API

2021-11-15 Thread J. D. Jordan
+1 nb

> On Nov 15, 2021, at 1:47 PM, Brandon Williams  wrote:
> 
> +1
> 
>> On Mon, Nov 15, 2021 at 1:43 PM Branimir Lambov  wrote:
>> 
>> Hi everyone,
>> 
>> I would like to start a vote on this CEP.
>> 
>> Proposal:
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-17%3A+SSTable+format+API
>> 
>> Discussion:
>> https://lists.apache.org/thread.html/r636bebcab4e678dbee042285449193e8e75d3753200a1b404fcc7196%40%3Cdev.cassandra.apache.org%3E
>> 
>> The vote will be open for 72 hours.
>> A vote passes if there are at least three binding +1s and no binding vetoes.
>> 
>> Regards,
>> Branimir
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: dev-h...@cassandra.apache.org
> 

-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org



Re: [VOTE] CEP-3: Guardrails

2021-11-11 Thread J. D. Jordan
+1 (nb)

> On Nov 11, 2021, at 7:12 AM, Brandon Williams  wrote:
> 
> +1
> 
>> On Thu, Nov 11, 2021 at 5:37 AM Andrés de la Peña  
>> wrote:
>> 
>> Hi everyone,
>> 
>> I would like to start a vote on this CEP.
>> 
>> Proposal:
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-3%3A+Guardrails
>> 
>> Discussion:
>> https://lists.apache.org/thread/7f6lntfdnkpqr7o0h2d2jlg8q7gf54w2
>> https://lists.apache.org/thread/0bd6fo4hdnwc8q2sq4xwvv4nqpxw10ds
>> 
>> The vote will be open for 72 hours.
>> A vote passes if there are at least three binding +1s and no binding vetoes.
>> 
>> Thanks,
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: dev-h...@cassandra.apache.org
> 

-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org



Re: Welcome Sumanth Pasupuleti as Apache Cassandra Committer

2021-11-05 Thread J. D. Jordan
Congrats!

> On Nov 5, 2021, at 1:52 PM, Yifan Cai  wrote:
> 
> Congratulations Sumanth!
> 
> - Yifan
> 
>> On Nov 5, 2021, at 11:37 AM, Patrick McFadin  wrote:
>> 
>> Great to see this. Congrats Sumanth!
>> 
 On Fri, Nov 5, 2021 at 11:34 AM Brandon Williams  wrote:
>>> 
>>> Congratulations Sumanth!
>>> 
 On Fri, Nov 5, 2021 at 1:17 PM Oleksandr Petrov
  wrote:
 
 The PMC members are pleased to announce that Sumanth Pasupuleti has
 recently accepted the invitation to become committer.
 
 Sumanth, thank you for all your contributions to the project over the
>>> years.
 
 Congratulations and welcome!
 
 The Apache Cassandra PMC members
>>> 
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
>>> For additional commands, e-mail: dev-h...@cassandra.apache.org
>>> 
>>> 
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: dev-h...@cassandra.apache.org
> 

-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org



Re: [VOTE] CEP-15: General Purpose Transactions

2021-10-17 Thread J. D. Jordan
1. +1(nb)
2. +1(nb)
3. +1(nb)

> On Oct 17, 2021, at 5:19 AM, Gary Dusbabek  wrote:
> 
> +1 for all three.
> 
>> On Thu, Oct 14, 2021 at 11:31 AM bened...@apache.org 
>> wrote:
>> 
>> Hi everyone,
>> 
>> I would like to start a vote on this CEP, split into three sub-decisions,
>> as discussion has been circular for some time.
>> 
>> 1. Do you support adopting this CEP?
>> 2. Do you support the transaction semantics proposed by the CEP for
>> Cassandra?
>> 3. Do you support an incremental approach to developing transactions in
>> Cassandra, leaving scope for future development?
>> 
>> The first vote is a consensus vote of all committers, the second and third
>> however are about project direction and therefore are simple majority votes
>> of the PMC.
>> 
>> Recall that all -1 votes must be accompanied by an explanation. If you
>> reject the CEP only on grounds (2) or (3) you should not veto the proposal.
>> If a majority reject grounds (2) or (3) then transaction developments will
>> halt for the time being.
>> 
>> This vote will be open for 72 hours.
>> 

-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org



Re: [VOTE] CEP-16 - Auth Plugin Support for CQLSH

2021-10-11 Thread J. D. Jordan
+1 nb

> On Oct 11, 2021, at 6:16 AM, Ekaterina Dimitrova  
> wrote:
> 
> +1
> 
>> On Mon, 11 Oct 2021 at 6:54, Benjamin Lerer  wrote:
>> 
>> +1
>> 
>> On Mon, Oct 11, 2021 at 11:50, Stefan Miklosovic <stefan.mikloso...@instaclustr.com> wrote:
>> 
>>> Hi list,
>>> 
>>> based on the discussion thread about CEP-16 (1), I would like to have
>>> a vote on that.
>>> 
>>> It seems to me CEP-16 is so straightforward there is more or less
>>> nothing to discuss in more depth as the feedback it gathered was
>>> mostly formal and nobody has had any objections so far having the
>>> discussion thread open for such a long time.
>>> 
>>> The vote is open for 72 hours based on the guidelines, it needs at
>>> least 3 binding +1's and no vetoes.
>>> 
>>> I am +1 on this.
>>> 
>>> Regards
>>> 
>>> (1)
>>> 
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-16%3A+Auth+Plugin+Support+for+CQLSH
>>> 
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
>>> For additional commands, e-mail: dev-h...@cassandra.apache.org
>>> 
>>> 
>> 

-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org



Re: [DISCUSS] Java support roadmap

2021-08-26 Thread J. D. Jordan
+1 from me on both as well.

> On Aug 26, 2021, at 12:44 PM, Paulo Motta  wrote:
> 
> +1 to both removal of experimental for 11, and moving trunk to 11+17
> 
>> On Thu, Aug 26, 2021 at 14:40, Brandon Williams wrote:
>> 
>> +1 to both removal of experimental for 11, and moving trunk to 11+17
>> 
>>> On Thu, Aug 26, 2021 at 12:35 PM Mick Semb Wever  wrote:
>>> 
 
 I and contributors I work with have deployed 4.0 + JDK11 in production,
 have found no issues, and would treat any issues that arise as ones we’re
 able to jump on and contribute development + review resources to resolve in
 the project.
 
>>> 
>>> 
>>> That's everything I need to hear. Let's remove the experimental label with
>>> 4.0.1
>>> 
>>> And I'm all for moving trunk to support only JDK11 and JDK17 (which was
>>> someone else's suggestion).
>> 
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: dev-h...@cassandra.apache.org
>> 
>> 

-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org



Re: [VOTE] CEP-11: Pluggable memtable implementations

2021-08-19 Thread J. D. Jordan
+1

> On Aug 19, 2021, at 12:35 PM, Andrés de la Peña  
> wrote:
> 
> +1
> 
>> On Thu, 19 Aug 2021 at 18:33, Joshua McKenzie  wrote:
>> 
>> +1
>> 
>> On Thu, Aug 19, 2021 at 12:19 PM bened...@apache.org 
>> wrote:
>> 
>>> +1
>>> 
>>> From: Brandon Williams 
>>> Date: Thursday, 19 August 2021 at 17:16
>>> To: dev@cassandra.apache.org 
>>> Subject: Re: [VOTE] CEP-11: Pluggable memtable implementations
>>> +1
>>> 
>>> On Thu, Aug 19, 2021 at 11:11 AM Branimir Lambov 
>>> wrote:
 
 Hello everyone,
 
 I am proposing the CEP-11 (Pluggable memtable implementations) for
>>> adoption
 
 Discussion thread:
 
>>> 
>> https://lists.apache.org/thread.html/rb5e950f882196764744c31bc3c13dfbf0603cb9f8bc2f6cfb976d285%40%3Cdev.cassandra.apache.org%3E
 
 
 The vote will be open for 72 hours.
 Votes by PMC members are considered binding.
 A vote passes if there are at least three binding +1s and no binding
>>> vetoes.
>>> 
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
>>> For additional commands, e-mail: dev-h...@cassandra.apache.org
>>> 
>> 

-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org



Re: Welcome Adam Holmberg as Cassandra committer

2021-08-16 Thread J. D. Jordan
Congrats Adam! Thanks for all your work getting 4.0 out the door!

> On Aug 16, 2021, at 8:03 AM, Brandon Williams  wrote:
> 
> Congratulations, Adam!
> 
>> On Mon, Aug 16, 2021 at 5:57 AM Benjamin Lerer  wrote:
>> 
>> The PMC members are pleased to announce that Adam Holmberg has accepted
>> the invitation to become committer.
>> 
>> Thanks a lot, Adam, for everything you have done for the project all these
>> years.
>> 
>> Congratulations and welcome
>> 
>> The Apache Cassandra PMC members
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: dev-h...@cassandra.apache.org
> 

-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org



Re: Welcome Jon Meredith as Cassandra committer

2021-07-30 Thread J. D. Jordan
Congrats Jon!

> On Jul 30, 2021, at 9:26 AM, Paulo Motta  wrote:
> 
> Congratulations and welcome Jon! Always exciting to see the project
> recognizing more committers!
> 
>> On Fri, Jul 30, 2021 at 11:20, Benjamin Lerer wrote:
>> 
>> Congratulations Jon. :-)
>> 
>> On Fri, Jul 30, 2021 at 15:42, Ekaterina Dimitrova wrote:
>> 
>>> Congrats!!! Well deserved!!!  
>>> 
 On Fri, 30 Jul 2021 at 9:32, Jonathan Ellis  wrote:
>>> 
 Congratulations, Jon!
 
 On Fri, Jul 30, 2021 at 8:29 AM Brandon Williams 
>>> wrote:
 
> The Project Management Committee (PMC) for Apache Cassandra
> has invited Jon Meredith to become a committer and we are pleased
> to announce that he has accepted.
> 
> Thanks for all helping make Cassandra great!
> 
> Congratulations,
> The Apache Cassandra PMC members
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: dev-h...@cassandra.apache.org
> 
> 
 
 --
 Jonathan Ellis
 co-founder, http://www.datastax.com
 @spyced
 
>>> 
>> 

-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org



Re: [DISCUSS] Virtual Tables and the future of NodeTool/JMX

2021-07-15 Thread J. D. Jordan
As long as the implemented nodetool commands are still going over JMX I see no 
issue.

The problem lies when you need native transport access. In a secured cluster 
you possibly need an entirely new set of access information (username, 
password, ssl certificate) depending on how JMX access and native transport 
access are setup. You also possibly need access to different ports allowed if 
the admin client is remote.  I think it is burdensome on the end user if we 
require them to have both JMX and native transport setup and configured for 
operational access and then also to know which set of access information to use 
on a per command basis.

-Jeremiah
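For context, Paulo's suggestion quoted below amounts to something like the
following sketch; the method name comes from his mail, everything else
(package, javadoc, return shape) is an assumption rather than the actual
Cassandra API:

    import javax.management.openmbean.TabularData;

    // Hypothetical addition to the existing StorageServiceMBean: expose virtual
    // tables over the JMX connection nodetool already has, so no separate native
    // transport credentials are needed for operational commands.
    public interface StorageServiceMBean
    {
        // existing operations elided ...

        /** Returns the contents of the named virtual table, e.g. "system_views.settings". */
        TabularData queryVirtualTable(String name);
    }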

> On Jul 15, 2021, at 2:59 PM, Brandon Williams  wrote:
> 
> On Thu, Jul 15, 2021 at 8:59 AM Paulo Motta  wrote:
>> 
>> Perhaps one approach to expose VirtualTables via nodetool without requiring
>> the user to provide CQL credentials would be to provide a generic
>> StorageServiceMBean.queryVirtualTable(String name) JMX method returning a
>> TabularData result. This would allow to keep a consistent nodetool frontend
>> to users while progressively switching the backend from JMX to
>> VirtualTables.
> 
> I like this idea. +1
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: dev-h...@cassandra.apache.org
> 

-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org



Re: [DISCUSS] Virtual Tables and the future of NodeTool/JMX

2021-07-15 Thread J. D. Jordan
I see no problem with continuing to add JMX commands for the foreseeable future.

> On Jul 15, 2021, at 2:07 PM, Stefan Miklosovic 
>  wrote:
> 
> Can I have a clear response from you, community, if my work on 16725
> is rendered totally useless in the light of this discussion? The time
> on that was already spent and I honestly can not see why it would be a
> problem to merge that command in.
> 
> I am particularly objecting to Paulo's idea about dropping JMX command
> implementations altogether, I find it quite radical without any
> meaningful justification except "wasting somebody's time" but since it
> is my time I spent on this, I am not sure why anybody would care?
> While I do understand that we are trying to move forward with cql and
> so on, I find it quite ridiculous to stop "5 minutes before 12" just
> because somebody happened to drop an email to the dev list about this
> before I managed to finish it.
> 
> In other words, I find it just easier to finish it and voila, we can
> query audit's config, when we are super close to it and all who spend
> time on that was me - rather than waiting for weeks and months until
> this discussion settles, living without that until then.
> 
> Regards
> 
>> On Thu, 15 Jul 2021 at 20:38, J. D. Jordan  wrote:
>> 
>> I also am in favor of continuing to support nodetool in parallel with 
>> developing a command line tool and associated virtual tables to replace 
>> nodetool/JMX at some point in the future.
>> I don’t think “native transport is not currently available during startup” 
>> is something to halt progress towards this goal. There are many ways to 
>> change the system to make that a non-problem.  But it is something to 
>> remember while moving towards the goal of node management without using JMX.
>> 
>> -Jeremiah
>> 
>>>> On Jul 15, 2021, at 12:21 PM, Mick Semb Wever  wrote:
>>> 
>>> 
>>>> 
>>>> 
>>>> 
>>>> What is your opinion on this?
>>>> 
>>> 
>>> 
>>> This discussion was touched when implementing Diagnostics Events, at least
>>> the discussion of JMX vs native (rather than nodetool vs cqlsh).  At that
>>> time JMX was chosen because there was no way for a client to specify the
>>> host you wanted the information from. Some more info in CASSANDRA-13459
>>> and CASSANDRA-13472.
>>> 
>>> The java and python drivers have since added this functionality. But if
>>> it's not widely adopted by all the drivers, and the functionality may have
>>> programmatic uses, this can be problematic.
>> 
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: dev-h...@cassandra.apache.org
>> 
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: dev-h...@cassandra.apache.org
> 

-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org



Re: Welcome Dinesh Joshi as Cassandra PMC member

2021-06-02 Thread J. D. Jordan
Congrats! Well deserved addition 

> On Jun 2, 2021, at 6:10 PM, Paulo Motta  wrote:
> 
> Very happy to see Dinesh as a PMC member, congratulations!
> 
>> On Wed, Jun 2, 2021 at 17:52, Joseph Lynch wrote:
>> 
>> Congratulations Dinesh! Well deserved!
>> 
>> -Joey
>> 
>>> On Wed, Jun 2, 2021 at 12:23 PM Benjamin Lerer  wrote:
>>> 
>>> The PMC's members are pleased to announce that Dinesh Joshi has accepted
>>> the invitation to become a PMC member.
>>> 
>>> Thanks a lot, Dinesh, for everything you have done for the project all
>>> these years.
>>> 
>>> Congratulations and welcome
>>> 
>>> The Apache Cassandra PMC members
>> 
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: dev-h...@cassandra.apache.org
>> 
>> 

-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org



Re: Welcome Stefan Miklosovic as Cassandra committer

2021-05-03 Thread J. D. Jordan
Well deserved!  Congrats Stefan.

> On May 3, 2021, at 10:46 AM, Sumanth Pasupuleti 
>  wrote:
> 
> Congratulations Stefan!!
> 
>> On Mon, May 3, 2021 at 8:41 AM Brandon Williams  wrote:
>> 
>> Congratulations, Stefan!
>> 
>>> On Mon, May 3, 2021 at 10:38 AM Benjamin Lerer  wrote:
>>> 
>>> The PMC's members are pleased to announce that Stefan Miklosovic has
>>> accepted the invitation to become committer last Wednesday.
>>> 
>>> Thanks a lot, Stefan,  for all your contributions!
>>> 
>>> Congratulations and welcome
>>> 
>>> The Apache Cassandra PMC members
>> 
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: dev-h...@cassandra.apache.org
>> 
>> 

-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org



Re: [DISCUSS] Releases after 4.0

2021-03-29 Thread J. D. Jordan
+1 that deprecation schedule seems reasonable and a good thing to move to.

> On Mar 29, 2021, at 10:23 AM, Benjamin Lerer  wrote:
> 
> The proposal sounds good to me too.
> 
>> On Mon, Mar 29, 2021 at 16:48, Brandon Williams wrote:
>> 
>>> On Mon, Mar 29, 2021 at 9:41 AM Joseph Lynch 
>>> wrote:
>>> I like the idea of the 3-year support cycles, but I think since
>>> 3.0/3.11/4.0 took so long to stabilize to a point folks could upgrade
>>> to, we should reset the clock somewhat.
>> 
>> I agree, the length of time to release 4.0 and the initialization of a
>> new release cycle requires some special consideration for current
>> releases.
>> 
>>> 4.0: Fully supported until April 2023 and high severity bugs until
>>> April 2024 (2 year full, 1 year bugfix)
>>> 3.11: Fully supported until April 2022 and high severity bugs until
>>> April 2023 (1 year full, 1 year bugfix).
>>> 3.0: Supported for high severity correctness/performance bugs until
>>> April 2022 (1 year bugfix)
>>> 2.2+2.1: EOL immediately.
>>> 
>>> Then going forward we could have this nice pattern when we cut the
>>> yearly release:
>>> Y(n-0): Support for 3 years from now (2 full, 1 bugfix)
>>> Y(n-1): Fully supported for 1 more year and supported for high
>>> severity correctness/perf bugs 1 year after that (1 full, 1 bugfix)
>>> Y(n-2): Supported for high severity correctness/bugs for 1 more year (1 bugfix)
>> 
>> This sounds excellent to me, +1.
>> 
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: dev-h...@cassandra.apache.org
>> 
>> 

-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org



Re: Welcome Berenguer Blasi as Cassandra committer

2021-03-25 Thread J. D. Jordan
Congratulations  

> On Mar 25, 2021, at 5:10 AM, Benjamin Lerer  wrote:
> 
>  The PMC's members are pleased to announce that Berenguer Blasi has
> accepted the invitation to become committer today.
> 
> Thanks a lot,  Berenguer,  for all the work you have done!
> 
> Congratulations and welcome
> 
> The Apache Cassandra PMC members

-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org



Re: Welcome Paulo Motta as Cassandra PMC member

2021-02-09 Thread J. D. Jordan
Congrats Paulo! A great addition to the PMC.

> On Feb 9, 2021, at 9:59 AM, Jonathan Ellis  wrote:
> 
> Congratulations, Paulo!  Well deserved.
> 
>> On Tue, Feb 9, 2021 at 9:54 AM Benjamin Lerer 
>> wrote:
>> 
>> The PMC's members are pleased to announce that Paulo Motta has accepted
>> the invitation to become a PMC member yesterday.
>> 
>> Thanks a lot, Paulo, for everything you have done for the project all these
>> years.
>> 
>> Congratulations and welcome
>> 
>> The Apache Cassandra PMC members
>> 
> 
> 
> -- 
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced

-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org



Re: Welcome Yifan Cai as Cassandra committer

2020-12-21 Thread J. D. Jordan
Congrats!

> On Dec 21, 2020, at 11:36 AM, Paulo Motta  wrote:
> 
> Congratulations and welcome! :)
> 
>> On Mon, Dec 21, 2020 at 14:27, sankalp kohli wrote:
>> 
>> Congratulations Yifan.
>> 
>> On Mon, Dec 21, 2020 at 9:10 AM Benjamin Lerer <
>> benjamin.le...@datastax.com>
>> wrote:
>> 
>>> The PMC's members are pleased to announce that Yifan Cai has accepted
>> the
>>> invitation to become committer last Friday.
>>> 
>>> Thanks a lot, Yifan,  for everything you have done!
>>> 
>>> Congratulations and welcome
>>> 
>>> The Apache Cassandra PMC members
>>> 
>> 

-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org



Re: [DISCUSS] Revisiting Java 11's experimental status

2020-08-19 Thread J. D. Jordan
This makes sense to me. A bug is a bug regardless of the JVM that exposes it.

Java 11 is still considered experimental.  Users should understand they are on the 
less trodden path when using it.

-Jeremiah

> On Aug 19, 2020, at 7:36 PM, David Capwell  wrote:
> 

-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org



Re: [DISCUSS] Future of MVs

2020-06-30 Thread J. D. Jordan
>>> Instead of ripping it out, we could instead disable them in the yaml
>>> with big fat warning comments around it. 


FYI we have already disabled use of materialized views, SASI, and transient 
replication by default in 4.0

https://github.com/apache/cassandra/blob/trunk/conf/cassandra.yaml#L1393
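The relevant 4.0 defaults look roughly like this (see the linked cassandra.yaml
for the exact wording and surrounding comments):

    # Experimental features are off by default in 4.0; operators must opt in explicitly.
    enable_materialized_views: false
    enable_sasi_indexes: false
    enable_transient_replication: false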

> On Jun 30, 2020, at 6:53 PM, joshua.mcken...@gmail.com wrote:
> 
> I followed up with the clarification about unit and dtests for that reason 
> Dinesh. We test experimental features now.
> 
> If we’re talking about adding experimental features to the 40 quality testing 
> effort, how does that differ from just saying “we won’t release until we’ve 
> tested and stabilized these features and they’re no longer experimental”?
> 
> Maybe I’m just misunderstanding something here?
> 
>> On Jun 30, 2020, at 7:12 PM, Dinesh Joshi  wrote:
>> 
>> 
>>> 
 On Jun 30, 2020, at 4:05 PM, Brandon Williams  wrote:
>>> 
>>> Instead of ripping it out, we could instead disable them in the yaml
>>> with big fat warning comments around it.  That way people already
>>> using them can just enable them again, but it will raise the bar for
>>> new users who ignore/miss the warnings in the logs and just use them.
>> 
>> Not a bad idea. Although, the real issue is that users enable MV on a 3 node 
>> cluster with a few megs of data and conclude that MVs will horizontally 
>> scale with the size of data. This is what causes issues for users who 
>> naively roll it out in production and discover that MVs do not scale with 
>> their data growth. So whatever we do, the big fat warning should educate the 
>> unsuspecting operator.
>> 
>> Dinesh
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: dev-h...@cassandra.apache.org
>> 
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: dev-h...@cassandra.apache.org
> 


Re: [VOTE] Project governance wiki doc (take 2)

2020-06-22 Thread J. D. Jordan
+1 non-binding

> On Jun 22, 2020, at 1:18 PM, Stefan Podkowinski  wrote:
> 
> +1
> 
>> On 22.06.20 20:12, Blake Eggleston wrote:
>> +1
>> 
 On Jun 20, 2020, at 8:12 AM, Joshua McKenzie  wrote:
>>> 
>>> Link to doc:
>>> https://cwiki.apache.org/confluence/display/CASSANDRA/Apache+Cassandra+Project+Governance
>>> 
>>> Change since previous cancelled vote:
>>> "A simple majority of this electorate becomes the low-watermark for votes
>>> in favour necessary to pass a motion, with new PMC members added to the
>>> calculation."
>>> 
>>> This previously read "super majority". We have lowered the low water mark
>>> to "simple majority" to balance strong consensus against risk of stall due
>>> to low participation.
>>> 
>>> 
>>>   - Vote will run through 6/24/20
>>>   - pmc votes considered binding
>>>   - simple majority of binding participants passes the vote
>>>   - committer and community votes considered advisory
>>> 
>>> Lastly, I propose we take the count of pmc votes in this thread as our
>>> initial roll call count for electorate numbers and low watermark
>>> calculation on subsequent votes.
>>> 
>>> Thanks again everyone (and specifically Benedict and Jon) for the time and
>>> collaboration on this.
>>> 
>>> ~Josh
>> 
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: dev-h...@cassandra.apache.org
>> 
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: dev-h...@cassandra.apache.org
> 

-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org



Re: [DISCUSS] Switch to using GitHub pull requests?

2020-01-22 Thread J. D. Jordan
Doesn’t this github review workflow as described work right now?  It’s just not 
the “only” way people do things?

I don’t think we need to forbid other methods of contribution as long as the 
review and testing needs are met.

-Jeremiah

> On Jan 22, 2020, at 6:35 PM, Yifan Cai  wrote:
> 
> +1 nb to the PR approach for reviewing.
> 
> 
> And thanks David for initiating the discussion. I would like to put my 2
> cents in it.
> 
> 
> IMO, reviews comments are better associated with the changes, precisely to
> the line level, if they are put in the PR rather than in the JIRA comments.
> Discussions regarding each review comments are naturally organized into
> this own dedicated thread. I agree that JIRA comments are more suitable for
> high-level discussion regarding the design. But review comments in PR can
> do a better job at code-level discussion.
> 
> 
> Another benefit is to relieve reviewers’ workload. In the PR approach, we can
> leverage the PR build step to perform an initial qualification. The actual
> review can be deferred until the PR build passes. So reviewers are sure
> that the change is good at certain level, i.e. it builds and the tests can
> pass. Right now, contributors volunteer for providing the link to CI test
> (however, one still needs to open the link to see the result).
> 
>> On Wed, Jan 22, 2020 at 3:16 PM David Capwell  wrote:
>> 
>> Thanks for the links Benedict!
>> 
>> Been reading the links and see the following points being made
>> 
>> *) enabling the spark process would lower the barrier to enter the project
>> *) high level discussions should be in JIRA [1]
>> *) not desirable to annotation JIRA and Github; should only annotate JIRA
>> (reviewer, labels, etc.)
>> *) given the multi branch nature, pull requires are not intuitive [2]
>> *) merging is problematic and should keep the current merge process
>> *) commits@ is not usable with PRs
>> *) commits@ is better because of PRs
>> *) people are more willing to nit-pick with PRs, less likely with current
>> process [3]
>> *) opens potential to "prevent commits that don't pass the tests" [4]
>> *) prefer the current process
>> http://cassandra.apache.org/doc/latest/development/patches.html [5]
>> *) current process is annoying since you have to take the link in github
>> and attach to JIRA for each comment in review
>> *) missed notifications, more trust in commits@
>> *) if someone rewrites history, comments could be hard to see
>> *) its better to leave comments in the source code so people don't need to
>> lookup github
>> 
>> Here is how i see some of the points
>> 
>> 1) I agree with the point that the high level discussions should be in
>> JIRA; PRs are better at specific review and offer no real benefit over JIRA
>> for larger structural changes
>> 2) there are different patterns with multiple branches as well, but some of
>> it is possible to codify and include in CI.  For example, you could take
>> the diff, attempt to apply to 2.2 (maybe if [dtest] in commit?) and forward
>> merge; if any conflicts are found, could annotate JIRA that the change is
>> complex and may be best to submit multiple PRs.  Assuming we want something
>> like this, it is also possible to run the tests against those branches as
>> well.  I am not saying we do this, but saying that it is possible to
>> improve or solve this problem, so doesn't appear a blocker to me.
>> 3) by marking it easier to comment i can definitely see this happen, but
>> don't see this as a reason not to.  I find that you are more willing to
>> actually talk about small sections of the code in PR than in other forms
>> and that its easier to track.  One of the things i see now is that the
>> conversation moves to slack, so is it better not happening, happening in
>> slack, or happening in github?
>> 4) This is actually why i started this thread.  I created a patch a while
>> back that passed review, got merged, and has been failing the build ever
>> since.  I would like to make it more clear that code is likely to do this
>> or not.
>> 5) The link documents the process as submitting patches generate by "git
>> format-patch", which i was told not to do my first patch
>> 
>> Think i summarized all I saw.
>> 
>>> On Wed, Jan 22, 2020 at 2:30 PM Dinesh Joshi  wrote:
>>> 
>>> I personally use Github PRs to discuss the changes if there is feedback
>> on
>>> the code. The discussion does get linked with the JIRA ticket. However,
>>> committing is manual.
>>> 
>>> Dinesh
>>> 
 On Jan 22, 2020, at 2:20 PM, David Capwell  wrote:
 
 When submitting or reviewing a change in JIRA I notice that we have
>> three
 main patterns for doing this: link branch, link diff, and link GitHub
>>> pull
 request (PR); I wanted to bring up the idea of switching over to GitHub
 pull requests as the norm.
 
 
 Why should we do this?  The main reasons I can think of are:
>> consistency
 within the project, common pattern outside and inside Apache (not a new
 

Re: Offering some project management services

2020-01-10 Thread J. D. Jordan
Isn’t doing such things the way people who are not writing code become part of 
a project?  By offering their time to do things that benefit the project?

Why does anyone “with a formal role” need to agree that Patrick is allowed to 
use his time to try and get some people together to discuss contributing?

-Jeremiah Jordan
Person with no formal role in the Apache Cassandra project.

> On Jan 10, 2020, at 7:44 PM, Benedict Elliott Smith  
> wrote:
> 
> This is also great.  But it's a bit of a weird look to have two people, 
> neither of whom have formal roles on the project, making decisions like this 
> without the involvement of the community.  I'm sure everyone will be 
> supportive, but it would help to democratise the decision-making.
> 
> 
> On 11/01/2020, 01:39, "Patrick McFadin"  wrote:
> 
>   Scott and I had a talk this week and we are starting the contributor
>   meetings on 1/22 as we talked about at NGCC. (Yeah that was back in
>   September) Stay tuned for the details and agenda in the project confluence
>   page.
> 
>   Patrick
> 
>>>   On Fri, Jan 10, 2020 at 3:21 PM Jeff Jirsa  wrote:
>>> On Fri, Jan 10, 2020 at 3:19 PM Jeff Jirsa  wrote:
>>> On Fri, Jan 10, 2020 at 2:35 PM Benedict Elliott Smith <
>>> bened...@apache.org> wrote:
   Yes, I also miss those fortnightly (or monthly) summaries that Jeff
 used to do. They were very useful "glue" in the community. I imagine they'd
 also make writing the board report easier.
 +1, those were great
>>> I'll try to either do more of these, or nudge someone else into doing them
>>> from time to time.
>> (I meant ^ if Josh doesnt volunteer. Would love to have Josh do them if
>> he's got time).
> 
> 
> 
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: dev-h...@cassandra.apache.org

-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org



Re: fixing paging state for 4.0

2019-09-24 Thread J. D. Jordan
And as I said, that would be a bug in the driver that did this. Any driver 
implementing a protocol that has a “new” paging state, that supports mixed 
version connections, would need to handle that correctly and not send new 
states over the old protocol or old states over the new protocol.

As far as I know the current java driver does not support mixed protocol 
versions across its connections.  So it would not need such logic. But any 
driver that supported mixed versions would need it.

> On Sep 24, 2019, at 9:20 PM, Blake Eggleston  
> wrote:
> 
> Yes, but if a client is connected to 2 different nodes, and is using a 
> different protocol for each, the paging state formats aren’t going to match 
> if it tries to use the paging date from one connection on the other.

-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org



Re: fixing paging state for 4.0

2019-09-24 Thread J. D. Jordan
It is inherently versioned by the protocol version being used for the 
connection.
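For context, the "treat Integer.MAX_VALUE as no limit" fix discussed further
down this thread is conceptually just the following; a hypothetical helper,
not the actual Cassandra paging code:

    // If the client asked for Integer.MAX_VALUE rows, interpret that as "unlimited"
    // instead of carrying a remaining-row count in the paging state, which is what
    // overflows today (CASSANDRA-14683).
    static int remainingAfterPage(int requestedLimit, int rowsAlreadyReturned)
    {
        if (requestedLimit == Integer.MAX_VALUE)
            return Integer.MAX_VALUE; // never decremented, never overflows
        return Math.max(0, requestedLimit - rowsAlreadyReturned);
    }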

> On Sep 24, 2019, at 9:06 PM, Jon Haddad  wrote:
> 
> The problem is that the payload isn't versioned, because the individual
> fields aren't really part of the protocol.  I think the long term fix
> should be to add the fields of the paging state to the protocol itself
> rather than have it just be some serialized blob.  Then we don't have to
> deal with separately versioning the paging state.
> 
> I think recognizing max int as special number that just means "a lot" is
> fine for now till we have time to rework it is a reasonable approach.
> 
> Jon
> 
>> On Tue, Sep 24, 2019 at 6:52 PM J. D. Jordan 
>> wrote:
>> 
>> Are there drivers that try to do mixed protocol version connections?  If
>> so that would be a mistake on the drivers part if it sent the new paging
>> state to an old server.  Pretty easily protected against in said driver
>> when it implements support for the new protocol version.  The payload is
>> opaque, but that doesn’t mean a driver would send the new payload to an old
>> server.
>> 
>> Many of the drivers I have looked at don’t do mixed version connections.
>> If they start at a higher version they will not connect to older nodes that
>> don’t support it. Or they will connect to the newer nodes with the older
>> protocol version. In either of those cases there is no problem.
>> 
>> Protocol changes aside, I would suggest fixing the bug starting back on
>> 3.x by changing the meaning of MAX. Whether or not the limit is switched to
>> a var int in a bumped protocol version.
>> 
>> -Jeremiah
>> 
>> 
>>> On Sep 24, 2019, at 8:28 PM, Blake Eggleston
>>  wrote:
>>> 
>>> Right, that's the problem with changing the paging state format. It
>> doesn't work in mixed mode.
>>> 
>>>> On Sep 24, 2019, at 4:47 PM, Jeremiah Jordan 
>> wrote:
>>>> 
>>>> Clients do negotiate the protocol version they use when connecting. If
>> the server bumped the protocol version then this larger paging state could
>> be part of the new protocol version. But that doesn’t solve the problem for
>> existing versions.
>>>> 
>>>> The special treatment of Integer.MAX_VALUE can be done back to 3.x and
>> fix the bug in all versions, letting users requests to receive all of their
>> data.  Which realistically is probably what someone who sets the protocol
>> level query limit to Integer.MAX_VALUE is trying to do.
>>>> 
>>>> -Jeremiah
>>>> 
>>>>>> On Sep 24, 2019, at 4:09 PM, Blake Eggleston
>>  wrote:
>>>>> 
>>>>> Right, mixed version clusters. The opaque blob isn't versioned, and
>> there isn't an opportunity for min version negotiation that you have with
>> the messaging service. The result is situations where a client begins a
>> read on one node, and attempts to read the next page from a different node
>> over a protocol version where the paging state serialization format has
>> changed. This causes an exception deserializing the paging state and the
>> read fails.
>>>>> 
>>>>> There are ways around this, but they're not comprehensive (I think),
>> and they're much more involved than just interpreting Integer.MAX_VALUE as
>> unlimited. The "right" solution would be for the paging state to be
>> deserialized/serialized on the client side, but that won't happen in 4.0.
>>>>> 
>>>>>> On Sep 24, 2019, at 1:12 PM, Jon Haddad  wrote:
>>>>>> 
>>>>>> What's the pain point?  Is it because of mixed version clusters or is
>> there
>>>>>> something else that makes it a problem?
>>>>>> 
>>>>>>> On Tue, Sep 24, 2019 at 11:03 AM Blake Eggleston
>>>>>>>  wrote:
>>>>>>> 
>>>>>>> Changing paging state format is kind of a pain since the driver
>> treats it
>>>>>>> as an opaque blob. I'd prefer we went with Sylvain's suggestion to
>> just
>>>>>>> interpret Integer.MAX_VALUE as "no limit", which would be a lot
>> simpler to
>>>>>>> implement.
>>>>>>> 
>>>>>>>> On Sep 24, 2019, at 10:44 AM, Jon Haddad  wrote:
>>>>>>>> 
>>>>>>>> I'm working with a team who just ran into CASSANDRA-14683 [1],
>> which I
>>>>>>>> didn't realize was an issue till now.
