Re: [DISCUSS] Stream Pipelines on hot paths

2024-06-07 Thread Benedict
I have to admit I didn’t expect when I raised this to be in a minority ok with *some* stream use :) Works for me though, definitely preferable to the status quo.

> On 7 Jun 2024, at 10:10, Aleksey Yeshchenko wrote:
> 
> Ban in all new non-test code seems like the most pragmatic approach to me as well.
> 
>> On 7 Jun 2024, at 06:32, Jordan West wrote:
>> 
>> Similarly in the "don't use them in the main project but am ok with tests" camp
>> 
>>> On Thu, Jun 6, 2024 at 4:46 AM Štefan Miklošovič wrote:
>>> 
>>> I have created https://issues.apache.org/jira/browse/CASSANDRA-19673 to gather all your ideas about what to remove. If you stumble upon some code which is susceptible to rewriting, just put it there.
>>> 
>>>> On Wed, Jun 5, 2024 at 6:35 PM  wrote:
>>>> 
>>>> I would like to vote for banning streams in all non-test code. It may not be easy for new contributors to distinguish between hot path and non-hot path. So it would be great if we could simply block them in non-test code and update codestyle to detect the usage.
>>>> 
>>>>> On Jun 4, 2024, at 6:26 PM, Josh McKenzie wrote:
>>>>> 
>>>>> I'm in the "ban in non-test cases, allow in tests" camp. Can sometimes make things more expressive and concise.
>>>>> 
>>>>>> On Mon, Jun 3, 2024, at 12:07 PM, Sam wrote:
>>>>>> 
>>>>>> Added. Here is the 'after' profile.
>>>>>> 
>>>>>>> On Sun, 2 Jun 2024 at 20:50, Mick Semb Wever wrote:
>>>>>>> 
>>>>>>>> On profiling a 90% write workload I found StorageProxy::updateCoordinatorWriteLatencyTableMetric to be a hot-path, consuming between 15-20% of ModificationStatement::executeWithoutCondition cycles. https://github.com/apache/cassandra/pull/3344
>>>>>>> 
>>>>>>> Ouch. Ok, I've no idea what constitutes an ok "slow path" now…
>>>>>>> 
>>>>>>> Sam, can you also share in the ticket the easy-cass-stress profile you used please.



Re: [DISCUSS] Stream Pipelines on hot paths

2024-05-31 Thread Benedict Elliott Smith
I think I have already proposed a simple solution to this problem on the 
thread: if anyone says it’s a hot path (and cannot be persuaded otherwise), it 
should be treated as such. Saves argument, but permits an easy escape hatch if 
everyone agrees with minimal discussion.

I think this is a good general principle for raising standards in the codebase 
like this: if somebody says something is important, and cannot be convinced 
otherwise, then it should generally be treated as important. This is different 
from cases where there are simply competing approaches.

That said, if people want to be absolutist about this I won’t mind.



> On 31 May 2024, at 15:04, Benjamin Lerer  wrote:
> 
> For me the definition of hot path is too vague. We had arguments with 
> Berenger multiple times and it is more a waste of time than anything else at 
> the end. If we are truly concerned about stream efficiency then we should 
> simply forbid them. That will avoid lengthy discussions about what constitute 
> the hot path and what does not.
> 
> Le ven. 31 mai 2024 à 11:08, Berenguer Blasi <berenguerbl...@gmail.com> a écrit :
>> +1 on avoiding streams in hot paths
>> 
>> On 31/5/24 9:48, Benedict wrote:
>>> My concept of hot path is simply anything we can expect to be called 
>>> frequently enough in normal operation that it might show up in a profiler. 
>>> If it’s a library method then it’s reasonable to assume it should be able 
>>> to be used in a hot path unless clearly labelled otherwise. 
>>> 
>>> In my view this includes things that might normally be masked by caching 
>>> but under supported workloads may not be - such as query preparation.
>>> 
>>> In fact, I’d say the default assumption should probably be that a method is 
>>> “in a hot path” unless there’s a good argument it isn’t - such as that the 
>>> operation is likely to be run at some low frequency and the slow part is 
>>> not part of any loop. Repair setup messages perhaps aren’t a hot path for 
>>> instance (unless many of them are sent per repair), but validation 
>>> compaction or merkle tree construction definitely is.
>>> 
>>> I think it’s fine to not have perfect agreement about edge cases, but if 
>>> anyone in a discussion thinks something is a hot path then it should be 
>>> treated as one IMO.
>>> 
>>>> On 30 May 2024, at 18:39, David Capwell <dcapw...@apple.com> wrote:
>>>> 
>>>>  As a general statement I agree with you (same for String.format as 
>>>> well), but one thing to call out is that it can be hard to tell what is 
>>>> the hot path and what isn’t.  When you are doing background work (like 
>>>> repair) its clear, but when touching something internal it can be hard to 
>>>> tell; this can also be hard with shared code as it gets authored outside 
>>>> the hot path then later used in the hot path…
>>>> 
>>>> Also, what defines hot path?  Is this user facing only?  What about 
>>>> Validation/Streaming (stuff processing a large dataset)?  
>>>> 
>>>>> On May 30, 2024, at 9:29 AM, Benedict <bened...@apache.org> wrote:
>>>>> 
>>>>> Since it’s related to the logging discussion we’re already having, I have 
>>>>> seen stream pipelines showing up in a lot of traces recently. I am 
>>>>> surprised; I thought it was understood that they shouldn’t be used on hot 
>>>>> paths as they are not typically as efficient as old skool for-each 
>>>>> constructions done sensibly, especially for small collections that may 
>>>>> normally take zero or one items.
>>>>> 
>>>>> I would like to propose forbidding the use of streams on hot paths 
>>>>> without good justification that the cost:benefit is justified. 
>>>>> 
>>>>> It looks like it was nominally agreed two years ago that we would include 
>>>>> words to this effect in the code style guide, but I forgot to include 
>>>>> them when I transferred the new contents from the Google Doc proposal. So 
>>>>> we could just include the “Performance” section that was meant to be 
>>>>> included at the time.
>>>>> 
>>>>> https://lists.apache.org/thread/1mt8rsg36p1mq8s8578l6js075lrmvlt

[DISCUSS] Stream Pipelines on hot paths

2024-05-30 Thread Benedict
Since it’s related to the logging discussion we’re already having, I have seen 
stream pipelines showing up in a lot of traces recently. I am surprised; I 
thought it was understood that they shouldn’t be used on hot paths as they are 
not typically as efficient as old skool for-each constructions done sensibly, 
especially for small collections that may normally take zero or one items.

I would like to propose forbidding the use of streams on hot paths without good 
justification that the cost:benefit is justified. 

It looks like it was nominally agreed two years ago that we would include words 
to this effect in the code style guide, but I forgot to include them when I 
transferred the new contents from the Google Doc proposal. So we could just 
include the “Performance” section that was meant to be included at the time.

https://lists.apache.org/thread/1mt8rsg36p1mq8s8578l6js075lrmvlt
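For a concrete illustration of the kind of rewrite being proposed, here is a minimal sketch (the class and method names are invented for the example, not taken from the Cassandra codebase). The stream pipeline allocates a Stream object, lambda instances and a collector on every invocation, while the equivalent for-each loop only allocates the result list — a difference that matters chiefly for hot methods operating on collections of zero or one items.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class HotPathExample
{
    // Stream version: allocates a Stream, lambda instances and a collector
    // on every call -- fine on cold paths, measurable on hot ones.
    static List<String> liveNamesStream(List<String> endpoints)
    {
        return endpoints.stream()
                        .filter(e -> !e.isEmpty())
                        .map(String::toUpperCase)
                        .collect(Collectors.toList());
    }

    // Equivalent old-school loop: no intermediate allocations beyond the
    // result list; typically cheaper for the zero-or-one element case.
    static List<String> liveNamesLoop(List<String> endpoints)
    {
        List<String> result = new ArrayList<>(endpoints.size());
        for (String e : endpoints)
            if (!e.isEmpty())
                result.add(e.toUpperCase());
        return result;
    }

    public static void main(String[] args)
    {
        List<String> input = List.of("a", "", "b");
        System.out.println(liveNamesStream(input)); // [A, B]
        System.out.println(liveNamesLoop(input));   // [A, B]
    }
}
```

Both forms are equivalent in behaviour; the point of the proposal is only that the loop form should be preferred where the method can be expected to show up in a profiler.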


> On 30 May 2024, at 13:33, Štefan Miklošovič  
> wrote:
> 
> 
> I see the feedback is overall positive. I will merge that and I will improve 
> the documentation on the website along with what Benedict suggested.
> 
> On Thu, May 30, 2024 at 10:32 AM Mick Semb Wever <m...@apache.org> wrote:
>>   
>>  
>>> Based on these findings, I went through the code and I have incorporated 
>>> these rules and I rewrote it like this:
>>> 
>>> 1) no wrapping in "if" if we are not logging more than 2 parameters.
>>> 2) rewritten log messages to not contain any string concatenation but 
>>> moving it all to placeholders ({}).
>>> 3) wrap it in "if" if we need to execute a method(s) on parameter(s) which 
>>> is resource-consuming.
>> 
>> 
>> +1
>> 
>> 
>> It's a shame slf4j botched it with lambdas, their 2.0 fluent api doesn't 
>> impress me.
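Applied together, the three rules quoted above might look like the following sketch. The logger here is a tiny stand-in written only to keep the example self-contained and runnable; real code would use org.slf4j.Logger, whose trace() performs the same internal level check. The message texts and the size-conversion helper are invented for the example.

```java
// Minimal stand-in for an slf4j-style logger, written only to keep this
// example self-contained; real code would use org.slf4j.Logger.
final class TraceLogger
{
    boolean traceEnabled = false;
    final StringBuilder out = new StringBuilder();

    boolean isTraceEnabled() { return traceEnabled; }

    void trace(String msg, Object... args)
    {
        if (!traceEnabled) return; // slf4j performs this check internally too
        for (Object a : args)
            msg = msg.replaceFirst("\\{\\}", String.valueOf(a));
        out.append(msg).append('\n');
    }
}

public class LoggingRules
{
    static final TraceLogger logger = new TraceLogger();

    // Illustrative stand-in for a resource-consuming conversion (rule 3).
    static String humanReadableSize(long bytes)
    {
        return (bytes >> 10) + " KiB";
    }

    static void demo(long bytes)
    {
        // Rules 1+2: few cheap parameters, placeholders instead of string
        // concatenation, and no "if" wrapper -- trace() checks the level.
        logger.trace("flushed {} bytes", bytes);

        // Rule 3: wrap in "if" when building an argument is itself costly.
        if (logger.isTraceEnabled())
            logger.trace("state: {}", humanReadableSize(bytes));
    }

    public static void main(String[] args)
    {
        demo(4096);                   // TRACE off: nothing is logged
        logger.traceEnabled = true;
        demo(4096);                   // TRACE on: both lines are logged
        System.out.print(logger.out);
    }
}
```

The guard in the second call is what rule 3 buys: with TRACE disabled, humanReadableSize is never invoked at all.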


Re: [DISCUSS] The way we log

2024-05-29 Thread Benedict
The if test beforehand should only be done if the log invocation is expected to 
do unnecessary work, eg construct a varargs array or do some costly string 
translation as part of the parameter construction to the log statement.

I would not want to encode as prescriptive a list as the one proposed, as the 
facts (eg specifics of the API that may evolve) are what matter.

I’ll note, it is safer by far to bias towards the (if) guarded approach, as 
performing two comparisons when trace is enabled is *not impactful* but 
performing wasted work in normal operation on a hot path might be.

I am not certain this stuff is the kind of thing that’s urgent to include in 
the code guide, really, except to perhaps note that it’s important to avoid 
forcing unnecessary work on a hot path, and to note the case of invoking a log 
statement that might not reach a log endpoint.
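The unnecessary work being described is concrete: an unguarded call evaluates its arguments eagerly, so any costly conversion runs even when TRACE is disabled, because the level check inside trace() happens only after the arguments have been built. A small self-contained sketch of the point (the trace method and helper names are stand-ins, not slf4j itself):

```java
public class EagerArgs
{
    static boolean traceEnabled = false;
    static int costlyCalls = 0;

    // Stand-in for Logger.trace: the level check lives inside the method,
    // so it runs only after the caller has already built all arguments.
    static void trace(String msg, Object... args)
    {
        if (!traceEnabled)
            return;
        // format and write the message here
    }

    // Models an expensive string conversion passed as a log parameter.
    static String costlyDescription()
    {
        costlyCalls++;
        return "a very detailed description";
    }

    public static void main(String[] args)
    {
        // Unguarded: costlyDescription() runs even though TRACE is off,
        // and a varargs Object[] is allocated for the call.
        trace("state: {}", costlyDescription());

        // Guarded: with TRACE off, the argument is never computed and
        // no varargs array is allocated.
        if (traceEnabled)
            trace("state: {}", costlyDescription());

        System.out.println(costlyCalls); // 1
    }
}
```

This is also why the extra comparison of the guarded form is harmless: it costs one branch when tracing is enabled, versus potentially significant wasted work per call when it is not.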

> On 29 May 2024, at 22:38, David Capwell  wrote:
> 
> 
>> 
>> I saw a lot of cases like this:
>> 
>> if (logger.isTraceEnabled())
>>     logger.trace("a log message");
>> 
>> and sometimes just:
>> 
>> logger.trace("a log message");
>> 
>> Why do we do it once like that and another time the other way?
> 
> I remember years ago seeing perf numbers where the isTraceEnabled check was 
> faster than calling trace directly, but time has moved on so that might no 
> longer be true (and might have been before slf4j… it's been a while).  It's a 
> useful pattern when you are providing arguments to the string interpolation 
> that are costly (such as converting longs to human sizes), but no clue if it's 
> still useful outside that case… 
> 
>> But if we are on TRACE, we actually check that _twice_
> 
> To me it’s a trade off thing… you almost never enable tracing, so why 
> optimize for that case rather than the normal case?  If you have tracing 
> often then yeah, you double lookup
> 
> 
>> Based on these findings, I went through the code and I have incorporated 
>> these rules and I rewrote it like this:
> 
> 
> Makes sense to me.
> 
>> Do these rules make sense to everybody and do you think this is something we 
>> can stick to? This is quite hard to check-style so I am more for a written 
>> consensus on a ML.
> 
> If we do get enough people to say yes, then we should update our style 
> guides.  There are a lot of things in the style guides that are hard to check 
> style (look at the section “Code Structure”, it has these types of rules)… 
> 
>> On May 29, 2024, at 2:24 PM, Štefan Miklošovič  
>> wrote:
>> 
>> There is a ticket I have a mental block to merge (1). The way it happened is 
>> that in CASSANDRA-19429 a user identified that there is a logging statement 
>> which acquires a lock which slows down query execution. When reworked, under 
>> CASSANDRA-19429, a reporter saw ridiculous performance gains on machines 
>> like r8g.24xlarge and r7i.24xlarge (numbers in the ticket). The diff was 
>> around 3x speed up which is quite hard to believe and myself with Jon Haddad 
>> tried to replicate that but we were not able to do it reliably because our 
>> machines were way smaller and we have not had resources to verify that.
>> 
>> Anyway, that got me thinking ... If there are, potentially, such low-hanging 
>> performance gains just by rewriting how we log, aren't there any other 
>> cases like that where we could see performance gains too? Heck ... how 
>> do we even log anyway? That got me rabbit hole-ing, I went through the 
>> codebase and looked at what our logging on TRACE looks like and it resulted 
>> in (1).
>> 
>> The way we log is pretty inconsistent and it seems to me that various people 
>> approach the logging by some "beliefs" they try to follow and I am not 
>> completely sure where that comes from and whether other people look at it 
>> the same way.
>> 
>> I saw a lot of cases like this:
>> 
>> if (logger.isTraceEnabled())
>>     logger.trace("a log message");
>>
>> and sometimes just:
>> 
>> logger.trace("a log message");
>>
>> Why do we do it once like that and another time the other way?
>> 
>> When we do "if (logger.isTraceEnabled())" ..., if we are _not_ on TRACE, we 
>> check it at most once. But if we are on TRACE, we actually check that 
>> _twice_. The first time in "if" and the second time in "logger.trace()". But 
>> that is not necessary, because logger.trace() is checking that anyway, we do 
>> not need to wrap it. People probably think that if they just wrap it in 
>> "if", it will be faster, but I do not see any meaningful differences when I 
>> benchmark it and the slf4j documentation says the same. The difference 
>> is negligible and in the realm of nanoseconds (!!!).
>> 
>> On the other hand, logger.isTraceEnabled() is handy to use if there is some 
>> heavy-weight object to log, like this
>> 
>> if (logger.isTraceEnabled())
>>     logger.trace("a log message {}", object.takesALongTimeAndMemoryToConstructString());
>>
>> If it takes a considerable amount of time/memory to log, 

Re: [DISCUSSION] CEP-38: CQL Management API

2024-01-08 Thread Benedict Elliott Smith
Syntactically, if we’re updating settings like compaction throughput, I would 
prefer to simply update a virtual settings table

e.g. UPDATE system.settings SET compaction_throughput = 128

Some operations will no doubt require a stored procedure syntax, but perhaps it 
would be a good idea to split the work into two: one part to address settings 
like those above, and another for maintenance operations such as triggering 
major compactions, repair and the like?

I would like to see us move to decentralised structured settings management at 
the same time, so that we can set properties for the whole cluster, or data 
centres, or individual nodes via the same mechanism - all from any node in the 
cluster. I would be happy to help out with this work, if time permits.


> On 8 Jan 2024, at 11:42, Josh McKenzie  wrote:
> 
>> Fundamentally, I think it's better for the project if administration is 
>> fully done over CQL and we have a consistent, single way of doing things. 
> Strongly agree here. With 2 caveats:
> 1. Supporting backwards compat, especially for automated ops (i.e. nodetool, 
>    JMX, etc), is crucial. Painful, but crucial.
> 2. We need something that's available for use before the node comes fully 
>    online; the point Jeff always brings up when we discuss moving away from 
>    JMX. So long as we have some kind of "out-of-band" access to nodes or 
>    accommodation for that, we should be good.
> For context on point 2, see slack: 
> https://the-asf.slack.com/archives/CK23JSY2K/p1688745128122749?thread_ts=1688662169.018449=CK23JSY2K
> 
>> I point out that JMX works before and after the native protocol is running 
>> (startup, shutdown, joining, leaving), and also it's semi-common for us to 
>> disable the native protocol in certain circumstances, so at the very least, 
>> we'd then need to implement a totally different cql protocol interface just 
>> for administration, which nobody has committed to building yet.
> 
> I think this is a solvable problem, and I think the benefits of having a 
> single, elegant way of interacting with a cluster and configuring it 
> justifies the investment for us as a project. Assuming someone has the cycles 
> to, you know, actually do the work. :D
> 
> On Sun, Jan 7, 2024, at 10:41 PM, Jon Haddad wrote:
>> I like the idea of the ability to execute certain commands via CQL, but I 
>> think it only makes sense for the nodetool commands that cause an action to 
>> take place, such as compact or repair.  We already have virtual tables, I 
>> don't think we need another layer to run informational queries.  I see 
>> little value in having the following (I'm using exec here for simplicity):
>> 
>> cqlsh> exec tpstats
>> 
>> which returns a string in addition to:
>> 
>> cqlsh> select * from system_views.thread_pools
>> 
>> which returns structured data.  
>> 
>> I'd also rather see updatable configuration virtual tables instead of
>> 
>> cqlsh> exec setcompactionthroughput 128
>> 
>> Fundamentally, I think it's better for the project if administration is 
>> fully done over CQL and we have a consistent, single way of doing things.  
>> I'm not dead set on it, I just think less is more in a lot of situations, 
>> this being one of them.  
>> 
>> Jon
>> 
>> 
>> On Wed, Jan 3, 2024 at 2:56 PM Maxim Muzafarov wrote:
>> Happy New Year to everyone! I'd like to thank everyone for their
>> questions, because answering them forces us to move towards the right
>> solution, and I also like the ML discussions for the time they give to
>> investigate the code :-)
>> 
>> I'm deliberately trying to limit the scope of the initial solution
>> (e.g. exclude the agent part) to keep the discussion short and clear,
>> but it's also important to have a glimpse of what we can do next once
>> we've finished with the topic.
>> 
>> My view of the Command<> is that it is an abstraction in the broader
>> sense of an operation that can be performed on the local node,
>> involving one of a few internal components. This means that updating a
>> property in the settings virtual table via an update statement, or
>> executing e.g. the setconcurrentcompactors command are just aliases of
>> the same internal command via different APIs. Another example is the
>> netstats command, which simply aggregates the MessageService metrics
>> and returns them in a human-readable format (just another way of
>> looking at key-value metric pairs). More broadly, the command input is
>> Map and String as the result (or List).
>> 
>> As Abe mentioned, Command and CommandRegistry should be largely based
>> on the nodetool command set at the beginning. We have a few options
>> for how we can initially construct command metadata during the
>> registry implementation (when moving command metadata from the
>> nodetool to the core part), so I'm planning to consult the
>> command representations of the k8cassandra project so that any
>> further registry adoptions have zero problems (by writing a test

Re: [DISCUSS] CEP-39: Cost Based Optimizer

2024-01-02 Thread Benedict
The CEP expressly includes an item for coordinated cardinality estimation, by 
producing whole cluster summaries. I’m not sure if you addressed this in your 
feedback, it’s not clear what you’re referring to with distributed estimates, 
but avoiding this was expressly the driver of my suggestion to instead include 
the plan as a payload (which offers users some additional facilities). 


> On 2 Jan 2024, at 21:26, Ariel Weisberg  wrote:
> 
> 
> Hi,
> 
> I am burying the lede, but it's important to keep an eye on runtime-adaptive 
> vs planning time optimization as the cost/benefits vary greatly between the 
> two and runtime adaptive can be a game changer. Basically CBO optimizes for 
> query efficiency and startup time at the expense of not handling some queries 
> well and runtime adaptive is cheap/free for expensive queries and can handle 
> cases that CBO can't.
> 
> Generally speaking I am +1 on the introduction of a CBO, since it seems like 
> there exists things that would benefit from it materially (and many of the 
> associated refactors/cleanup) and it aligns with my north star that includes 
> joins.
> 
> Do we all have the same north star that Cassandra should eventually support 
> joins? Just curious if that is controversial.
> 
> I don't feel like this CEP in particular should need to really nail down 
> exactly how distributed estimates work since we can start with using local 
> estimates as a proxy for the entire cluster and then improve. If someone has 
> bandwidth to do a separate CEP for that then sure that would be great, but 
> this seems big enough in scope already.
> 
> RE testing, continuity of performance of queries is going to be really 
> important. I would really like to see that we have fuzzed the space 
> deterministically and via a collection of hand rolled cases, and can compare 
> performance between versions to catch queries that regress. Hopefully we can 
> agree on a baseline for releasing where we know what prior release to compare 
> to and what acceptable changes in performance are.
> 
> RE prepared statements - It feels to me like trying to send the plan blob 
> back and forth to get more predictable, but not absolutely predictable, plans 
> is not worth it? Feels like a lot for an incremental improvement over a 
> baseline that doesn't exist yet, IOW it doesn't feel like something for V1. 
> Maybe it ends up in YAGNI territory.
> 
> The north star of predictable behavior for queries is a *very* important one 
> because it means the world to users, but CBO is going to make mistakes all 
> over the place. It's simply unachievable even with accurate statistics 
> because it's very hard to tell how predicates will behave on a column.
> 
> This segues nicely into the importance of adaptive execution :-) It's how you 
> rescue the queries that CBO doesn't handle  well for any reason such as bugs, 
> bad statistics, missing features. Re-ordering predicate evaluation, switching 
> indexes, and re-ordering joins can all be done on the fly.
> 
> CBO is really a performance optimization since adaptive approaches will allow 
> any query to complete with some wasted resources.
> 
> If my pager were waking me up at night and I wanted to stem the bleeding I 
> would reach for runtime adaptive over CBO because I know it will catch more 
> cases even if it is slower to execute up front.
> 
> What is the nature of the queries we are looking solve right now? Are they 
> long running heavy hitters, or short queries that explode if run incorrectly, 
> or a mix of both?
> 
> Ariel
> 
>> On Tue, Dec 12, 2023, at 8:29 AM, Benjamin Lerer wrote:
>> Hi everybody,
>> 
>> I would like to open the discussion on the introduction of a cost based 
>> optimizer to allow Cassandra to pick the best execution plan based on the 
>> data distribution, thereby improving the overall query performance.
>> 
>> This CEP should also lay the groundwork for the future addition of features 
>> like joins, subqueries, OR/NOT and index ordering.
>> 
>> The proposal is here: 
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-39%3A+Cost+Based+Optimizer
>> 
>> Thank you in advance for your feedback.
> 


Re: [DISCUSS] CEP-39: Cost Based Optimizer

2023-12-22 Thread Benedict
Close enough, though that’s not quite how I would characterise it. But none of your problems are inherent?

- Clients can re-prepare whenever they want
- Clusters can suggest to clients to re-prepare, should we desire this feature. Or we could permit the cluster to invalidate stale preparations since the query+plan are both part of the id, forcing a client to re-prepare.

Your fourth characteristic is not unique to this proposal.

My main concern however is avoiding this coordinated cardinality statistics problem. I’m not keen on introducing a new complex whole-cluster maintenance process. This is costly to maintain in both developer and system resources. At the very least it feels premature for the value proposition today, as others have said, whereas the minimal version of what I propose requires little work to get moving - just the ability to serialise execution plans, a small client protocol change and a minor tweak to prepared statement hashing.

If we ever care about solving consistent execution we will have to do this anyway, so we both simplify initial delivery and better prepare for the future. We also for free introduce a mechanism for users that want deterministic execution to expressly specify how their query will execute.

> On 20 Dec 2023, at 17:26, Benjamin Lerer wrote:
> 
>> Pick a random replica and ask it to prepare it, and use that. This is probably fine, unless there is significant skew (in which case, our plan is likely to be bad whatever, somewhere)
> 
> To be sure that I understand you correctly. What you are suggesting is to use local statistics to compute the execution plan and then rely on the driver performing the query to cache it? If that is the case, then it would mean that:
> 
> - a bad execution plan will only be corrected on client restart
> - if the data distribution changes, the driver will keep on using its plan
> - different drivers might use different plans
> - 2 similar queries that differ in simple ways might have totally different plans
> 
> Do I understand your proposal and its trade-offs correctly?

Le mer. 20 déc. 2023 à 17:16, Benedict <bened...@apache.org> a écrit :

I see three options:

1. Pick a random replica and ask it to prepare it, and use that. This is probably fine, unless there is significant skew (in which case, our plan is likely to be bad whatever, somewhere)
2. If there already exists a plan for the query, return that
3. Pick a sample of replicas and either ask them for their plan, and pick the most common plan, or ask them for their cardinality estimates, and use some combination thereof

The nice thing is that this can be evolved independently of the rest of the proposal, as the simple option is fine. But also we don’t have to introduce any new whole-cluster operations - any work we might perform is done only on statement preparation, and only for data relevant to the statement that is being prepared, and only on a minority of shards and one replica per shard (in a single DC), rather than on some arbitrary schedule for every replica.

On 20 Dec 2023, at 15:52, Benjamin Lerer <ble...@apache.org> wrote:
> If we are to address that within the CEP itself then we should discuss
> it here, as I would like to fully understand the approach as well as how
> it relates to consistency of execution and the idea of triggering
> re-optimisation.

Sure, that was my plan.

> I’m not sold on the proposed set of characteristics, and think my
> coupling an execution plan to a given prepared statement for clients to
> supply is perhaps simpler to implement and maintain, and has corollary
> benefits - such as providing a mechanism for users to specify their own
> execution plan. Note, my proposal cuts across all of these elements of
> the CEP. There is no obvious need for a cross-cluster re-optimisation
> event or cross cluster statistic management.

I think that I am missing one part of your proposal. How do you plan to build the initial execution plan for a prepared statement?

Le mer. 20 déc. 2023 à 14:05, Benedict <bened...@apache.org> a écrit :

If we are to address that within the CEP itself then we should discuss it here, as I would like to fully understand the approach as well as how it relates to consistency of execution and the idea of triggering re-optimisation. These ideas are all interrelated.

I’m not sold on the proposed set of characteristics, and think my coupling an execution plan to a given prepared statement for clients to supply is perhaps simpler to implement and maintain, and has corollary benefits - such as providing a mechanism for users to specify their own execution plan.

Note, my proposal cuts across all of these elements of the CEP. There is no obvious need for a cross-cluster re-optimisation event or cross cluster statistic management.

We still also need to discuss more concretely how the base statistics themselves will be derived, as there is little detail here today in the proposal.

On 20 Dec 2023, at 12:58, Benjamin Lerer <b.le...@gmail.com> wrote:

After the second phase of the CEP, we will have two optimi…

Re: [DISCUSS] CEP-39: Cost Based Optimizer

2023-12-20 Thread Benedict
I see three options:Pick a random replica and ask it to prepare it, and use that. This is probably fine, unless there is significant skew (in which case, our plan is likely to bad whatever, somewhere)If there already exists a plan for the query, return thatPick a sample of replicas and eitherAsk them for their plan, and pick the most common planAsk them for their cardinality estimates, and use some combination thereof The nice thing is that this can be evolved independently of the rest of the proposal, as the simple option is fine. But also we don’t have to introduce any new whole cluster operations - any work we might perform is done only on statement preparation, and only for data relevant to the statement that is being prepared, and only on a minority of shardsand one replica per shard (in a single DC), rather than on some arbitrary schedule for every replica.On 20 Dec 2023, at 15:52, Benjamin Lerer  wrote:
If we are to address that within the CEP itself then we should discuss it here, as I would like to fully understand the approach as well as how it relates to consistency of execution and the idea of triggering re-optimisation.

Sure, that was my plan.

I’m not sold on the proposed set of characteristics, and think my coupling an execution plan to a given prepared statement for clients to supply is perhaps simpler to implement and maintain, and has corollary benefits - such as providing a mechanism for users to specify their own execution plan. Note, my proposal cuts across all of these elements of the CEP. There is no obvious need for a cross-cluster re-optimisation event or cross cluster
statistic management.  I think that I am missing one part of your proposal. How do you plan to build the initial execution plan for a prepared statement?Le mer. 20 déc. 2023 à 14:05, Benedict <bened...@apache.org> a écrit :If we are to address that within the CEP itself then we should discuss it here, as I would like to fully understand the approach as well as how it relates to consistency of execution and the idea of triggering re-optimisation. These ideas are all interrelated.I’m not sold on the proposed set of characteristics, and think my coupling an execution plan to a given prepared statement for clients to supply is perhaps simpler to implement and maintain, and has corollary benefits - such as providing a mechanism for users to specify their own execution plan.Note, my proposal cuts across all of these elements of the CEP. There is no obvious need for a cross-cluster re-optimisation event or cross cluster statistic management.We still also need to discuss more concretely how the base statistics themselves will be derived, as there is little detail here today in the proposal.On 20 Dec 2023, at 12:58, Benjamin Lerer <b.le...@gmail.com> wrote:After the second phase of the CEP, we will have two optimizer implementations. One will be similar to what we have today and the other one will be the CBO. As those implementations will be behind the new Optimizer API interfaces they will both have support for EXPLAIN and they will both benefit from the simplification/normalization rules. Such as the ones that David mentioned.Regarding functions, we are already able to determine which ones are deterministic (https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/cql3/functions/Function.java#L55). We simply do not take advantage of it.I removed the ALLOW FILTERING part and will open a discussion about it at the beginning of next year.Regarding the statistics management part, I would like to try to address it within the CEP itself, if feasible. 
If it turns out to be too complicated, I will separate it into its own CEP. Le mar. 19 déc. 2023 à 22:23, David Capwell <dcapw...@apple.com> a écrit :even if the only outcome of all this work were to tighten up inconsistencies in our grammar and provide more robust EXPLAIN and EXPLAIN ANALYZE functionality to our end users, I think that would be highly valuableIn my mental model a no-op optimizer just becomes what we have today (since all new features really should be disabled by default, I would hope we support this), so we benefit from having a logical AST + ability to mutate it before we execute it and we can use this to make things nicer for users (as you are calling out)Here is one example that stands out to me in accordLET a = (select * from tbl where pk=0);Insert into tbl2 (pk, …) values (a.pk, …); — this is not allowed as we don’t know the primary key… but this could trivially be written to replace a.pk with 0…With this work we could also rethink what functions are deterministic and which ones are not (not trying to bike shed)… simple example is “now” (select now() from tbl; — each row will have a different timestamp), if we make this deterministic we can avoid calling it for each row and instead just replace it with a constant for the query… Even if the CBO is dropped in favor of no-op (what we do today), I still see value
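The third option listed above — asking a sample of replicas for their plan and keeping the most common one — is simple to sketch. Everything below is illustrative only: a plan is reduced to an opaque identifier, and none of these names are real Cassandra APIs.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: not Cassandra code.
public final class PlanVote
{
    /**
     * Given the plans reported by a sample of replicas, keep the most common
     * one. Ties are broken by natural ordering of the plan identifier so the
     * result is deterministic regardless of the order replicas respond in.
     */
    public static String mostCommon(List<String> replicaPlans)
    {
        Map<String, Integer> counts = new HashMap<>();
        for (String plan : replicaPlans)
            counts.merge(plan, 1, Integer::sum);

        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> e : counts.entrySet())
        {
            if (e.getValue() > bestCount || (e.getValue() == bestCount && e.getKey().compareTo(best) < 0))
            {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}
```

The deterministic tie-break matters for the stated goal of avoiding arbitrary per-coordinator divergence; without it, two coordinators sampling the same replicas could still land on different plans.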

Re: [DISCUSS] CEP-39: Cost Based Optimizer

2023-12-20 Thread Benedict
tencies in our grammar and provide more robust EXPLAIN and EXPLAIN ANALYZE functionality to our end users, I think that would be highly valuable. This path of "only" would be predicated on us not having successful introduction of a robust secondary index implementation and a variety of other things we have a lot of interest in, so I find it unlikely, but worth calling out.re: the removal of ALLOW FILTERING - is there room for compromise here and instead converting it to a guardrail that defaults to being enabled? That could theoretically give us a more gradual path to migration to a cost-based guardrail for instance, and would preserve the current robustness of the system while making it at least a touch more configurable.On Fri, Dec 15, 2023, at 11:03 AM, Chris Lohfink wrote:Thanks for time in addressing concerns. At least with initial versions, as long as there is a way to replace it with noop or disable it I would be happy. This is pretty standard practice with features nowadays but I wanted to highlight it as this might require some pretty tight coupling.ChrisOn Fri, Dec 15, 2023 at 7:57 AM Benjamin Lerer <ble...@apache.org> wrote:Hey Chris,You raise some valid points.I believe that there are 3 points that you mentioned:1) CQL restrictions are some form of safety net and should be kept2) A lot of Cassandra features do not scale and/or are too easy to use in a wrong way that can make the whole system collapse. We should not add more to that list. Especially not joins.3) Should we not start to fix features like secondary index rather than adding new ones? Which is heavily linked to 2).Feel free to correct me if I got them wrong or missed one.Regarding 1), I believe that you refer to the "Removing unnecessary CQL query limitations and inconsistencies" section. We are not planning to remove any safety net here.What we want to remove is a certain amount of limitations which make things confusing for a user trying to write a query for no good reason. 
Like "why can I define a column alias but not use it anywhere in my query?" or "Why can I not create a list with 2 bind parameters?". While refactoring some CQL code, I kept on finding those types of exceptions that we can easily remove while simplifying the code at the same time.For 2), I agree that at a certain scale or for some scenarios, some features simply do not scale or catch users by surprise. The goal of the CEP is to improve things in 2 ways. One is by making Cassandra smarter in the way it chooses how to process queries, hopefully improving its overall scalability. The other by being transparent about how Cassandra will execute the queries through the use of EXPLAIN. One problem of GROUP BY for example is that most users do not realize what is actually happening under the hood and therefore its limitations. I do not believe that EXPLAIN will change everything but it will help people to get a better understanding of the limitations of some features.I do not know which features will be added in the future to C*. That will be discussed through some future CEPs. Nevertheless, I do not believe that it makes sense to write a CEP for a query optimizer without taking into account that we might at some point add some level of support for joins or subqueries. We have been too often delivering features without looking at what could be the possible evolutions which resulted in code where adding new features was more complex than it should have been. I do not want to make the same mistake. I want to create an optimizer that can be improved easily and considering joins or other features simply help to build things in a more generic way.Regarding feature stabilization, I believe that it is happening. I have heard plans of how to solve MVs, range queries, hot partitions, ... and there was a lot of thinking behind those plans. Secondary indexes are being worked on. 
We hope that the optimizer will also help with some index queries.It seems to me that this proposal is going toward the direction that you want without introducing new problems for scalability. Le jeu. 14 déc. 2023 à 16:47, Chris Lohfink <clohfin...@gmail.com> a écrit :I don't wanna be a blocker for this CEP or anything but did want to put my 2 cents in. This CEP is horrifying to me.I have seen thousands of clusters across multiple companies and helped them get working successfully. A vast majority of that involved blocking the use of MVs, GROUP BY, secondary indexes, and even just simple _range queries_. The "unncessary restrictions of cql" are not only necessary IMHO, more restrictions are necessary to be successful at scale. The idea of just opening up CQL to general purpose relational queries and lines like "supporting queries with joins in an efficient way" ... I would really like us to make secondary indexes be a viable option before we start opening up floodgates on stuff like this.ChrisOn Thu, Dec 14, 2023 at 9:37 AM Benedict &
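The now() example raised earlier in the thread suggests one concrete normalization rule: a function that is deterministic within a single query can be evaluated once at planning time and replaced by a constant, instead of being invoked per row. A minimal sketch, with purely illustrative types (Expr, Constant and FunctionCall are not Cassandra classes):

```java
import java.util.function.Supplier;

// Hypothetical logical-plan fragment; illustrative types only.
interface Expr
{
    Object eval();
}

final class Constant implements Expr
{
    final Object value;
    Constant(Object value) { this.value = value; }
    public Object eval() { return value; }
}

final class FunctionCall implements Expr
{
    final Supplier<Object> fn;
    final boolean deterministicPerQuery; // e.g. now() could be fixed once per query

    FunctionCall(Supplier<Object> fn, boolean deterministicPerQuery)
    {
        this.fn = fn;
        this.deterministicPerQuery = deterministicPerQuery;
    }

    public Object eval() { return fn.get(); }

    /** Normalization rule: evaluate once and substitute a constant, rather than calling the function per row. */
    Expr normalize()
    {
        return deterministicPerQuery ? new Constant(fn.get()) : this;
    }
}
```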

Re: [DISCUSS] CEP-39: Cost Based Optimizer

2023-12-14 Thread Benedict
Fwiw Chris, I agree with your concerns, but I think the introduction of a CBO - done right - is in principle a good thing in its own right. It’s independent of the issues you mention, even if it might enable features that exacerbate them.It should also help enable secondary indexes work better, which is I think their main justification. Otherwise there isn’t really a good reason for it today. Though I personally anticipate ongoing issues around 2i that SAI is perhaps over exuberantly sold as solving, and that a CBO will not fix. But we’ll see how that evolves.On 14 Dec 2023, at 15:49, Chris Lohfink  wrote:I don't wanna be a blocker for this CEP or anything but did want to put my 2 cents in. This CEP is horrifying to me.I have seen thousands of clusters across multiple companies and helped them get working successfully. A vast majority of that involved blocking the use of MVs, GROUP BY, secondary indexes, and even just simple _range queries_. The "unncessary restrictions of cql" are not only necessary IMHO, more restrictions are necessary to be successful at scale. The idea of just opening up CQL to general purpose relational queries and lines like "supporting queries with joins in an efficient way" ... I would really like us to make secondary indexes be a viable option before we start opening up floodgates on stuff like this.ChrisOn Thu, Dec 14, 2023 at 9:37 AM Benedict <bened...@apache.org> wrote:> So yes, this physical plan is the structure that you have in mind but the idea of sharing it is not part of the CEP.

I think it should be. This should form a major part of the API on which any CBO is built.

> It seems that there is a difference between the goal of your proposal and the one of the CEP. The goal of the CEP is first to ensure optimal performance. It is ok to change the execution plan for one that delivers better performance. What we want to minimize is having a node performing queries in an inefficient way for a long period of time.

You have made a goal of the CEP synchronising summary statistics across the whole cluster in order to achieve some degree of uniformity of query plan. So this is explicitly a goal of the CEP, and synchronising summary statistics is a hard problem and won’t provide strong guarantees.

> The client side proposal targets consistency for a given query on a given driver instance. In practice, it would be possible to have 2 similar queries with 2 different execution plans on the same driver

This would only be possible if the driver permitted it. A driver could (and should) enforce that it only permits one query plan per query.

The opposite is true for your proposal: some queries may begin degrading because they touch specific replicas that optimise the query differently, and this will be hard to debug.On 14 Dec 2023, at 15:30, Benjamin Lerer <b.le...@gmail.com> wrote:The binding of the parser output to the schema (what is today the Raw.prepare call) will create the logical plan, expressed as a tree of relational operators. Simplification and normalization will happen on that tree to produce a new equivalent logical plan. That logical plan will be used as input to the optimizer. The output will be a physical plan 
producing the output specified by the logical plan. A tree of physical operators specifying how the operations should be performed.That physical plan will be stored as part of the statements (SelectStatement, ModificationStatement, ...) in the prepared statement cache. Upon execution, variables will be bound and the RangeCommands/Mutations will be created based on the physical plan.The string representation of a physical plan will effectively represent the output of an EXPLAIN statement but outside of that the physical plan will stay encapsulated within the statement classes.    Hints will be parameters provided to the optimizer to enforce some specific choices. Like always using an Index Scan instead of a Table Scan, ignoring the cost comparison.So yes, this physical plan is the structure that you have in mind but the idea of sharing it is not part of the CEP. I did not document it because it will simply be a tree of physical operators used internally.
My proposal is that the execution plan of the coordinator that prepares a query gets serialised to the client, which then provides the execution plan to all future coordinators, and coordinators provide it to replicas as necessary. This means it is not possible for any conflict to arise for a single client. It would guarantee consistency of execution for any single client (and avoid any drift over the client’s sessions), without necessarily guaranteeing consistency for all clients.

 It seems that there is a difference between the goal of your proposal and the one of the CEP. The goal of the CEP is first to ensure optimal performance. It is ok to change the execution plan for one that delivers better performance. What we want to minimize is having a node
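The pipeline described above — parse/bind output becomes a logical plan of relational operators, simplification/normalization rules rewrite it, and the optimizer emits a physical plan stored with the prepared statement — can be sketched as follows. All type names here are illustrative, not Cassandra classes.

```java
import java.util.List;

// Hypothetical sketch of the planning pipeline; illustrative types only.
interface LogicalPlan { }

interface PhysicalPlan
{
    String explain(); // the string form doubles as EXPLAIN output
}

interface Rule
{
    LogicalPlan apply(LogicalPlan plan);
}

interface Optimizer
{
    PhysicalPlan optimize(LogicalPlan plan);
}

final class Planner
{
    private final List<Rule> rules;
    private final Optimizer optimizer;

    Planner(List<Rule> rules, Optimizer optimizer)
    {
        this.rules = rules;
        this.optimizer = optimizer;
    }

    /** Rules rewrite the logical plan to an equivalent one; the optimizer then chooses physical operators. */
    PhysicalPlan plan(LogicalPlan logical)
    {
        for (Rule rule : rules)
            logical = rule.apply(logical);
        return optimizer.optimize(logical);
    }
}
```

Keeping Rule and Optimizer as separate interfaces reflects the point made in this exchange: normalization rules and EXPLAIN machinery can be shared, while the Optimizer implementation (cost-based or status-quo) remains pluggable.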

Re: [DISCUSS] CEP-39: Cost Based Optimizer

2023-12-14 Thread Benedict
> I think it should be. This should form a major part of the API on which any CBO is built.

To expand on this a bit: one of the stated goals of the CEP is to support multiple CBOs, and this is a required component of any CBO. If this doesn’t form part of the shared machinery, we aren’t really enabling new CBOs, we’re just refactoring the codebase to implement the CBO intended by this CEP.

This would also mean that all of the additional machinery, like EXPLAIN, HINT etc would need to be implemented for each CBO independently. This is a high hurdle for any new CBO, a lot of wasted work and would lead to a less consistent experience for the user, as each CBO would do this differently. These facilities make sense to build as a shared feature on top of a single execution model - and I think this is true regardless of your stance on my suggestion for managing query execution.

If you disagree, it would help to understand how you expect these facilities to be built in the ecosystem of CBOs envisaged by the CEP, and how we would maintain a consistency of user experience.

On 14 Dec 2023, at 15:37, Benedict  wrote:> So yes, this physical plan is the structure that you have in mind but the idea of sharing it is not part of the CEP.

I think it should be. This should form a major part of the API on which any CBO is built.

> It seems that there is a difference between the goal of your proposal and the one of the CEP. The goal of the CEP is first to ensure optimal performance. It is ok to change the execution plan for one that delivers better performance. What we want to minimize is having a node performing queries in an inefficient way for a long period of time.

You have made a goal of the CEP synchronising summary statistics across the whole cluster in order to achieve some degree of uniformity of query plan. So this is explicitly a goal of the CEP, and synchronising summary statistics is a hard problem and won’t provide strong guarantees.

> The client side proposal targets consistency for a given query on a given driver instance. In practice, it would be possible to have 2 similar queries with 2 different execution plans on the same driver

This would only be possible if the driver permitted it. A driver could (and should) enforce that it only permits one query plan per query.

The opposite is true for your proposal: some queries may begin degrading because they touch specific replicas that optimise the query differently, and this will be hard to debug.On 14 Dec 2023, at 15:30, Benjamin Lerer  wrote:The binding of the parser output to the schema (what is today the Raw.prepare call) will create the logical plan, expressed as a tree of relational operators. Simplification and normalization will happen on that tree to produce a new equivalent logical plan. That logical plan will be used as input to the optimizer. The output will be a physical plan 
producing the output specified by the logical plan. A tree of physical operators specifying how the operations should be performed.That physical plan will be stored as part of the statements (SelectStatement, ModificationStatement, ...) in the prepared statement cache. Upon execution, variables will be bound and the RangeCommands/Mutations will be created based on the physical plan.The string representation of a physical plan will effectively represent the output of an EXPLAIN statement but outside of that the physical plan will stay encapsulated within the statement classes.    Hints will be parameters provided to the optimizer to enforce some specific choices. Like always using an Index Scan instead of a Table Scan, ignoring the cost comparison.So yes, this physical plan is the structure that you have in mind but the idea of sharing it is not part of the CEP. I did not document it because it will simply be a tree of physical operators used internally.
My proposal is that the execution plan of the coordinator that prepares a query gets serialised to the client, which then provides the execution plan to all future coordinators, and coordinators provide it to replicas as necessary. This means it is not possible for any conflict to arise for a single client. It would guarantee consistency of execution for any single client (and avoid any drift over the client’s sessions), without necessarily guaranteeing consistency for all clients.

 It seems that there is a difference between the goal of your proposal and the one of the CEP. The goal of the CEP is first to ensure optimal performance. It is ok to change the execution plan for one that delivers better performance. What we want to minimize is having a node performing queries in an inefficient way for a long period of time.The client side proposal targets consistency for a given query on a given driver instance. In practice, it would be possible to have 2 similar queries with 2 different execution plans on the same driver making things really confusing. Identifying the source of an inefficient query will al

Re: [DISCUSS] CEP-39: Cost Based Optimizer

2023-12-14 Thread Benedict
> So yes, this physical plan is the structure that you have in mind but the idea of sharing it is not part of the CEP.

I think it should be. This should form a major part of the API on which any CBO is built.

> It seems that there is a difference between the goal of your proposal and the one of the CEP. The goal of the CEP is first to ensure optimal performance. It is ok to change the execution plan for one that delivers better performance. What we want to minimize is having a node performing queries in an inefficient way for a long period of time.

You have made a goal of the CEP synchronising summary statistics across the whole cluster in order to achieve some degree of uniformity of query plan. So this is explicitly a goal of the CEP, and synchronising summary statistics is a hard problem and won’t provide strong guarantees.

> The client side proposal targets consistency for a given query on a given driver instance. In practice, it would be possible to have 2 similar queries with 2 different execution plans on the same driver

This would only be possible if the driver permitted it. A driver could (and should) enforce that it only permits one query plan per query.

The opposite is true for your proposal: some queries may begin degrading because they touch specific replicas that optimise the query differently, and this will be hard to debug.On 14 Dec 2023, at 15:30, Benjamin Lerer  wrote:The binding of the parser output to the schema (what is today the Raw.prepare call) will create the logical plan, expressed as a tree of relational operators. Simplification and normalization will happen on that tree to produce a new equivalent logical plan. That logical plan will be used as input to the optimizer. The output will be a physical plan 
producing the output specified by the logical plan. A tree of physical operators specifying how the operations should be performed.That physical plan will be stored as part of the statements (SelectStatement, ModificationStatement, ...) in the prepared statement cache. Upon execution, variables will be bound and the RangeCommands/Mutations will be created based on the physical plan.The string representation of a physical plan will effectively represent the output of an EXPLAIN statement but outside of that the physical plan will stay encapsulated within the statement classes.    Hints will be parameters provided to the optimizer to enforce some specific choices. Like always using an Index Scan instead of a Table Scan, ignoring the cost comparison.So yes, this physical plan is the structure that you have in mind but the idea of sharing it is not part of the CEP. I did not document it because it will simply be a tree of physical operators used internally.
My proposal is that the execution plan of the coordinator that prepares a query gets serialised to the client, which then provides the execution plan to all future coordinators, and coordinators provide it to replicas as necessary. This means it is not possible for any conflict to arise for a single client. It would guarantee consistency of execution for any single client (and avoid any drift over the client’s sessions), without necessarily guaranteeing consistency for all clients.

 It seems that there is a difference between the goal of your proposal and the one of the CEP. The goal of the CEP is first to ensure optimal performance. It is ok to change the execution plan for one that delivers better performance. What we want to minimize is having a node performing queries in an inefficient way for a long period of time.The client side proposal targets consistency for a given query on a given driver instance. In practice, it would be possible to have 2 similar queries with 2 different execution plans on the same driver making things really confusing. Identifying the source of an inefficient query will also be pretty hard.Interestingly, having 2 nodes with 2 different execution plans might not be a serious problem. It simply means that based on cardinality at t1, the optimizer on node 1 chose plan 1 while the one on node 2 chose plan 2 at t2. In practice if the cost estimates reflect properly the actual cost those 2 plans should have pretty similar efficiency. The problem is more about the fact that you would ideally want a uniform behavior around your cluster.Changes of execution plans should only occur at certain points. So the main problematic scenario is when the data distribution is around one of those points. Which is also the point where the change should have the least impact.Le jeu. 14 déc. 2023 à 11:38, Benedict <bened...@apache.org> a écrit :There surely needs to be a more succinct and abstract representation in order to perform transformations on the query plan? You don’t intend to manipulate the object graph directly as you apply any transformations when performing simplification or cost based analysis? This would also (I expect) be the form used to support EXPLAIN functionality, an

Re: [DISCUSS] CEP-39: Cost Based Optimizer

2023-12-14 Thread Benedict
I think it would be worth considering providing the execution plan to the client as part of query preparation, as an opaque payload to supply to coordinators on first contact, as this might simplify the problem of ensuring queries behave the same without adopting a lot of complexity for synchronising statistics (which will never provide strong guarantees). Of course, re-preparing a query might lead to a new plan, though any coordinators with the query in their cache should be able to retrieve it cheaply. If the execution model is efficiently serialised this might have the ancillary benefit of improving the occupancy of our prepared query cache.

I am not sure that I understand your proposal. If 2 nodes build a different execution plan how do you solve that conflict?Le mer. 13 déc. 2023 à 09:55, Benedict <bened...@apache.org> a écrit :A CBO can only make worse decisions than the status quo for what I presume are the majority of queries - i.e. those that touch only primary indexes. In general, there are plenty of use cases that prefer determinism. So I agree that there should at least be a CBO implementation that makes the same decisions as the status quo, deterministically.I do support the proposal, but would like to see some elements discussed in more detail. The maintenance and distribution of summary statistics in particular is worthy of its own CEP, and it might be preferable to split it out. The proposal also seems to imply we are aiming for coordinators to all make the same decision for a query, which I think is challenging, and it would be worth fleshing out the design here a little (perhaps just in Jira).While I’m not a fan of ALLOW FILTERING, I’m not convinced that this CEP deprecates it. It is a concrete qualitative guard rail, that I expect some users will prefer to a cost-based guard rail. Perhaps this could be left to the CBO to decide how to treat.There’s also not much discussion of the execution model: I think it would make most sense for this to be independent of any cost and optimiser models (though they might want to operate on them), so that EXPLAIN and hints can work across optimisers (a suitable hint might essentially bypass the optimiser, if the optimiser permits it, by providing a standard execution model)I think it would be worth considering providing the execution plan to the client as part of query preparation, as an opaque payload to supply to coordinators on first contact, as this might simplify the problem of ensuring queries behave the same without adopting a lot of complexity for synchronising statistics (which will never provide strong guarantees). 
Of course, re-preparing a query might lead to a new plan, though any coordinators with the query in their cache should be able to retrieve it cheaply. If the execution model is efficiently serialised this might have the ancillary benefit of improving the occupancy of our prepared query cache.On 13 Dec 2023, at 00:44, Jon Haddad <j...@jonhaddad.com> wrote:I think it makes sense to see what the actual overhead is of CBO before making the assumption it'll be so high that we need to have two code paths.  I'm happy to provide thorough benchmarking and analysis when it reaches a testing phase.  I'm excited to see where this goes.  I think it sounds very forward looking and opens up a lot of possibilities.JonOn Tue, Dec 12, 2023 at 4:25 PM guo Maxwell <cclive1...@gmail.com> wrote:Nothing expresses my thoughts better than +1,It feels like it means a lot to Cassandra.I have a question. Is it easy to turn off cbo's optimizer or by pass in some way? Because some simple read and write requests will have better performance without cbo, which is also the advantage of Cassandra compared to some rdbms.David Capwell <dcapw...@apple.com>于2023年12月13日 周三上午3:37写道:Overall LGTM.  On Dec 12, 2023, at 5:29 AM, Benjamin Lerer <ble...@apache.org> wrote:Hi everybody,I would like to open the discussion on the introduction of a cost based optimizer to allow Cassandra to pick the best execution plan based on the data distribution.Therefore, improving the overall query performance.This CEP should also lay the groundwork for the future addition of features like joins, subqueries, OR/NOT and index ordering.


The proposal is here: https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-39%3A+Cost+Based+OptimizerThank you in advance for your feedback.
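The opaque-payload idea above could be sketched roughly as follows: the preparing coordinator returns the serialised plan alongside the statement id, the driver resupplies it on first contact with any other coordinator, and that coordinator adopts it instead of re-optimizing. All names and shapes here are hypothetical, not Cassandra or driver APIs.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch only.
final class PlanPayload
{
    final byte[] serializedPlan; // opaque to the client
    PlanPayload(byte[] serializedPlan) { this.serializedPlan = serializedPlan; }
}

final class Coordinator
{
    private final Map<String, PlanPayload> planCache = new ConcurrentHashMap<>();

    /** On PREPARE: optimize once, cache the plan, and hand it back with the statement id. */
    PlanPayload prepare(String statementId, byte[] freshlyOptimizedPlan)
    {
        PlanPayload payload = new PlanPayload(freshlyOptimizedPlan);
        planCache.put(statementId, payload);
        return payload;
    }

    /** On first EXECUTE from a client: adopt the client-supplied plan rather than building our own. */
    void adopt(String statementId, PlanPayload clientSupplied)
    {
        planCache.putIfAbsent(statementId, clientSupplied);
    }

    Optional<PlanPayload> cached(String statementId)
    {
        return Optional.ofNullable(planCache.get(statementId));
    }
}
```

Since every coordinator a given client contacts ends up holding the same payload, no plan conflict can arise for that client, which is the single-client consistency property argued for in this thread.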

Re: [DISCUSS] CEP-39: Cost Based Optimizer

2023-12-13 Thread Benedict
 If 2 nodes build a different execution plan how do you solve that conflict?Le mer. 13 déc. 2023 à 09:55, Benedict <bened...@apache.org> a écrit :A CBO can only make worse decisions than the status quo for what I presume are the majority of queries - i.e. those that touch only primary indexes. In general, there are plenty of use cases that prefer determinism. So I agree that there should at least be a CBO implementation that makes the same decisions as the status quo, deterministically.I do support the proposal, but would like to see some elements discussed in more detail. The maintenance and distribution of summary statistics in particular is worthy of its own CEP, and it might be preferable to split it out. The proposal also seems to imply we are aiming for coordinators to all make the same decision for a query, which I think is challenging, and it would be worth fleshing out the design here a little (perhaps just in Jira).While I’m not a fan of ALLOW FILTERING, I’m not convinced that this CEP deprecates it. It is a concrete qualitative guard rail, that I expect some users will prefer to a cost-based guard rail. Perhaps this could be left to the CBO to decide how to treat.There’s also not much discussion of the execution model: I think it would make most sense for this to be independent of any cost and optimiser models (though they might want to operate on them), so that EXPLAIN and hints can work across optimisers (a suitable hint might essentially bypass the optimiser, if the optimiser permits it, by providing a standard execution model)I think it would be worth considering providing the execution plan to the client as part of query preparation, as an opaque payload to supply to coordinators on first contact, as this might simplify the problem of ensuring queries behave the same without adopting a lot of complexity for synchronising statistics (which will never provide strong guarantees). 
Of course, re-preparing a query might lead to a new plan, though any coordinators with the query in their cache should be able to retrieve it cheaply. If the execution model is efficiently serialised this might have the ancillary benefit of improving the occupancy of our prepared query cache.On 13 Dec 2023, at 00:44, Jon Haddad <j...@jonhaddad.com> wrote:I think it makes sense to see what the actual overhead is of CBO before making the assumption it'll be so high that we need to have two code paths.  I'm happy to provide thorough benchmarking and analysis when it reaches a testing phase.  I'm excited to see where this goes.  I think it sounds very forward looking and opens up a lot of possibilities.JonOn Tue, Dec 12, 2023 at 4:25 PM guo Maxwell <cclive1...@gmail.com> wrote:Nothing expresses my thoughts better than +1,It feels like it means a lot to Cassandra.I have a question. Is it easy to turn off cbo's optimizer or by pass in some way? Because some simple read and write requests will have better performance without cbo, which is also the advantage of Cassandra compared to some rdbms.David Capwell <dcapw...@apple.com>于2023年12月13日 周三上午3:37写道:Overall LGTM.  On Dec 12, 2023, at 5:29 AM, Benjamin Lerer <ble...@apache.org> wrote:Hi everybody,I would like to open the discussion on the introduction of a cost based optimizer to allow Cassandra to pick the best execution plan based on the data distribution.Therefore, improving the overall query performance.This CEP should also lay the groundwork for the future addition of features like joins, subqueries, OR/NOT and index ordering.


The proposal is here: https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-39%3A+Cost+Based+OptimizerThank you in advance for your feedback.

Re: [DISCUSS] CEP-39: Cost Based Optimizer

2023-12-13 Thread Benedict
A CBO can only make worse decisions than the status quo for what I presume are the majority of queries - i.e. those that touch only primary indexes. In general, there are plenty of use cases that prefer determinism. So I agree that there should at least be a CBO implementation that makes the same decisions as the status quo, deterministically.

I do support the proposal, but would like to see some elements discussed in more detail. The maintenance and distribution of summary statistics in particular is worthy of its own CEP, and it might be preferable to split it out. The proposal also seems to imply we are aiming for coordinators to all make the same decision for a query, which I think is challenging, and it would be worth fleshing out the design here a little (perhaps just in Jira).

While I’m not a fan of ALLOW FILTERING, I’m not convinced that this CEP deprecates it. It is a concrete qualitative guard rail, which I expect some users will prefer to a cost-based guard rail. Perhaps this could be left to the CBO to decide how to treat.

There’s also not much discussion of the execution model: I think it would make most sense for this to be independent of any cost and optimiser models (though they might want to operate on them), so that EXPLAIN and hints can work across optimisers (a suitable hint might essentially bypass the optimiser, if the optimiser permits it, by providing a standard execution model).

I think it would be worth considering providing the execution plan to the client as part of query preparation, as an opaque payload to supply to coordinators on first contact, as this might simplify the problem of ensuring queries behave the same without adopting a lot of complexity for synchronising statistics (which will never provide strong guarantees). Of course, re-preparing a query might lead to a new plan, though any coordinators with the query in their cache should be able to retrieve it cheaply.
If the execution model is efficiently serialised, this might have the ancillary benefit of improving the occupancy of our prepared query cache.

On 13 Dec 2023, at 00:44, Jon Haddad wrote:

I think it makes sense to see what the actual overhead of a CBO is before making the assumption it'll be so high that we need to have two code paths. I'm happy to provide thorough benchmarking and analysis when it reaches a testing phase. I'm excited to see where this goes. I think it sounds very forward looking and opens up a lot of possibilities.

Jon

On Tue, Dec 12, 2023 at 4:25 PM guo Maxwell wrote:

Nothing expresses my thoughts better than +1; it feels like it means a lot to Cassandra. I have a question: is it easy to turn off the CBO's optimizer, or bypass it in some way? Because some simple read and write requests will have better performance without a CBO, which is also an advantage of Cassandra compared to some RDBMSs.

On Wed, 13 Dec 2023 at 03:37, David Capwell wrote:

Overall LGTM.

On Dec 12, 2023, at 5:29 AM, Benjamin Lerer wrote:

Hi everybody,

I would like to open the discussion on the introduction of a cost-based optimizer to allow Cassandra to pick the best execution plan based on the data distribution, thereby improving overall query performance. This CEP should also lay the groundwork for the future addition of features like joins, subqueries, OR/NOT and index ordering.


The proposal is here: https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-39%3A+Cost+Based+Optimizer
Thank you in advance for your feedback.
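The core idea debated in this thread - picking among candidate plans by a cost estimated from statistics, while keeping the choice deterministic for identical inputs so coordinators agree - can be illustrated with a minimal sketch. All names and numbers below are hypothetical and are not part of the CEP or of Cassandra's code:

```java
import java.util.Comparator;
import java.util.List;

public class CboSketch
{
    // A candidate execution plan whose cost is estimated from table statistics.
    record Plan(String name, double estimatedRows, double costPerRow)
    {
        double cost() { return estimatedRows * costPerRow; }
    }

    // Pick the cheapest plan, breaking ties deterministically by name so that
    // coordinators holding identical statistics choose identical plans.
    static Plan choose(List<Plan> candidates)
    {
        return candidates.stream()
                         .min(Comparator.comparingDouble(Plan::cost).thenComparing(Plan::name))
                         .orElseThrow();
    }

    public static void main(String[] args)
    {
        // A secondary-index lookup estimated at 500 rows beats a full scan.
        Plan chosen = choose(List.of(new Plan("seq-scan", 1_000_000, 1.0),
                                     new Plan("2i-lookup", 500, 10.0)));
        System.out.println(chosen.name()); // prints "2i-lookup"
    }
}
```

The deterministic tie-break is what lets a statistics-free fallback mode behave exactly like the status quo, per Benedict's point above.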




Re: Long tests, Burn tests, Simulator tests, Fuzz tests - can we clarify the diffs?

2023-12-12 Thread Benedict
Could you give (or link to) some examples of how this would actually benefit our test suites?

On 12 Dec 2023, at 10:51, Jacek Lewandowski wrote:

I have two major pros for JUnit 5:
- much better support for parameterized tests
- global test hooks (automatically detectable extensions) + multi-inheritance

On Mon, 11 Dec 2023 at 13:38, Benedict <bened...@apache.org> wrote:

Why do we want to move to JUnit 5? I’m generally opposed to churn unless well justified, which it may be - just not immediately obvious to me.

On 11 Dec 2023, at 08:33, Jacek Lewandowski <lewandowski.ja...@gmail.com> wrote:

Nobody has referred so far to the idea of moving to JUnit 5 - what are the opinions?

On Sun, 10 Dec 2023 at 11:03, Benedict <bened...@apache.org> wrote:

Alex’s suggestion was that we meta-randomise, i.e. we randomise the config parameters to gain better rather than lesser coverage overall. This means we cover these specific configs and more - just not necessarily on any single commit.

I strongly endorse this approach over the status quo.

On 8 Dec 2023, at 13:26, Mick Semb Wever <m...@apache.org> wrote:

> I think everyone agrees here, but…. these variations are still catching failures, and until we have an improvement or replacement we do rely on them. I'm not in favour of removing them until we have proof/confidence that any replacement is catching the same failures. Especially oa, tries, vnodes. (Not tries and offheap is being replaced with "latest", which will be a valuable simplification.)

What kind of proof do you expect? I cannot imagine how we could prove that, because the ability to detect failures results from the randomness of those tests. That's why when such a test fails you usually cannot reproduce it easily.

> Unit tests that fail consistently but only on one configuration should not be removed/replaced until the replacement also catches the failure.

We could extrapolate that to - why do we only have those configurations? Why don't we test trie / oa + compression, or CDC, or system memtable? Because, along the way, people have decided a certain configuration deserves additional testing, and it has been done this way in lieu of any other more efficient approach.
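As a concrete illustration of the parameterized-test benefit Jacek mentions: in JUnit 5 a table of cases becomes a single annotated method, whereas JUnit 4 needs a dedicated `@RunWith(Parameterized.class)` fixture class. The plain-Java sketch below (the `tokenBuckets` function is a made-up test subject) shows the hand-rolled loop that `@ParameterizedTest` replaces:

```java
public class ParamSketch
{
    // Hypothetical function under test.
    static int tokenBuckets(int tokens, int bucketSize)
    {
        return (tokens + bucketSize - 1) / bucketSize;
    }

    public static void main(String[] args)
    {
        // In JUnit 5 this loop collapses into a single method:
        //   @ParameterizedTest
        //   @CsvSource({ "0,4,0", "1,4,1", "4,4,1", "5,4,2" })
        //   void tokenBuckets(int tokens, int bucketSize, int expected) { ... }
        // and each tuple is reported as its own named test case.
        int[][] cases = { { 0, 4, 0 }, { 1, 4, 1 }, { 4, 4, 1 }, { 5, 4, 2 } };
        for (int[] c : cases)
        {
            if (tokenBuckets(c[0], c[1]) != c[2])
                throw new AssertionError(java.util.Arrays.toString(c));
        }
        System.out.println("all cases pass"); // prints "all cases pass"
    }
}
```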




Re: Long tests, Burn tests, Simulator tests, Fuzz tests - can we clarify the diffs?

2023-12-11 Thread Benedict
Why do we want to move to JUnit 5? I’m generally opposed to churn unless well justified, which it may be - just not immediately obvious to me.

On 11 Dec 2023, at 08:33, Jacek Lewandowski wrote:

Nobody has referred so far to the idea of moving to JUnit 5 - what are the opinions?

On Sun, 10 Dec 2023 at 11:03, Benedict <bened...@apache.org> wrote:

Alex’s suggestion was that we meta-randomise, i.e. we randomise the config parameters to gain better rather than lesser coverage overall. This means we cover these specific configs and more - just not necessarily on any single commit.

I strongly endorse this approach over the status quo.

On 8 Dec 2023, at 13:26, Mick Semb Wever <m...@apache.org> wrote:

> I think everyone agrees here, but…. these variations are still catching failures, and until we have an improvement or replacement we do rely on them. I'm not in favour of removing them until we have proof/confidence that any replacement is catching the same failures. Especially oa, tries, vnodes. (Not tries and offheap is being replaced with "latest", which will be a valuable simplification.)

What kind of proof do you expect? I cannot imagine how we could prove that, because the ability to detect failures results from the randomness of those tests. That's why when such a test fails you usually cannot reproduce it easily.

> Unit tests that fail consistently but only on one configuration should not be removed/replaced until the replacement also catches the failure.

We could extrapolate that to - why do we only have those configurations? Why don't we test trie / oa + compression, or CDC, or system memtable? Because, along the way, people have decided a certain configuration deserves additional testing, and it has been done this way in lieu of any other more efficient approach.



Re: Long tests, Burn tests, Simulator tests, Fuzz tests - can we clarify the diffs?

2023-12-10 Thread Benedict
Alex’s suggestion was that we meta-randomise, i.e. we randomise the config 
parameters to gain better rather than lesser coverage overall. This means we 
cover these specific configs and more - just not necessarily on any single 
commit.

I strongly endorse this approach over the status quo.

> On 8 Dec 2023, at 13:26, Mick Semb Wever  wrote:
> 
> 
>  
>  
>  
>> 
>>> I think everyone agrees here, but…. these variations are still catching 
>>> failures, and until we have an improvement or replacement we do rely on 
>>> them.   I'm not in favour of removing them until we have proof /confidence 
>>> that any replacement is catching the same failures.  Especially oa, tries, 
>>> vnodes. (Not tries and offheap is being replaced with "latest", which will 
>>> be valuable simplification.)  
>> 
>> What kind of proof do you expect? I cannot imagine how we could prove that 
>> because the ability of detecting failures results from the randomness of 
>> those tests. That's why when such a test fails you usually cannot reproduce 
>> that easily.
> 
> 
> Unit tests that fail consistently but only on one configuration, should not 
> be removed/replaced until the replacement also catches the failure.
> 
>  
>> We could extrapolate that to - why we only have those configurations? why 
>> don't test trie / oa + compression, or CDC, or system memtable? 
> 
> 
> Because, along the way, people have decided a certain configuration deserves 
> additional testing and it has been done this way in lieu of any other more 
> efficient approach.
> 
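The meta-randomisation approach endorsed above - deriving the test configuration from a logged seed so that coverage accumulates across commits and any failure is reproducible - could be sketched roughly as follows. The config axes (memtable class, vnodes, compression) and the `test.seed` property are illustrative only, not the project's actual harness:

```java
import java.util.Random;

public class MetaRandomConfig
{
    // Deterministically derive a test configuration from a seed: the same
    // seed always yields the same configuration, so a CI failure can be
    // reproduced by re-running with the printed seed.
    static String pickConfig(long seed)
    {
        Random rnd = new Random(seed);
        String memtable = new String[]{ "skiplist", "trie" }[rnd.nextInt(2)];
        boolean vnodes = rnd.nextBoolean();
        boolean compression = rnd.nextBoolean();
        return "memtable=" + memtable + " vnodes=" + vnodes + " compression=" + compression;
    }

    public static void main(String[] args)
    {
        // Fresh seed per run unless one is supplied to reproduce a failure.
        long seed = Long.getLong("test.seed", new Random().nextLong());
        System.out.println("seed=" + seed + " -> " + pickConfig(seed));
    }
}
```

Over many runs this covers the fixed config matrix and more, at the cost that any single commit may not exercise every combination - which is exactly the trade-off debated in the quoted exchange.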


Re: Long tests, Burn tests, Simulator tests, Fuzz tests - can we clarify the diffs?

2023-12-07 Thread Benedict
I think the biggest impediment to that is that most tests are probably not sufficiently robust for simulation. If things happen in a surprising order, many tests fail, as they implicitly rely on the normal timing of things.

Another issue is that the simulator does potentially slow things down a little at the moment. Not sure what the impact would be overall.

It would be great to set up a JUnitRunner using the simulator and find out, though.

On 7 Dec 2023, at 15:43, Alex Petrov wrote:

We have been extensively using the simulator for TCM, and I think we have to make simulator tests more approachable. I think many of the existing tests should be run under the simulator instead of CQLTester, for example. This will both strengthen the simulator and make things better in terms of determinism. Of course, not to say that CQLTester tests are the biggest beneficiary there.

On Thu, Dec 7, 2023, at 4:09 PM, Benedict wrote:

To be fair, the lack of a coherent framework doesn’t mean we can’t merge them from a naming perspective. I don’t mind losing one of burn or fuzz, and merging them.

Today simulator tests are kept under the simulator test tree, but that primarily exists for the simulator itself and testing it. It’s quite a complex source tree, as you might expect, and it exists primarily for managing its own complexity. It might make sense to bring the Paxos and Accord simulator entry points out into the burn/fuzz trees, though I'm not sure it’s all that important.

> On 7 Dec 2023, at 15:05, Benedict <bened...@apache.org> wrote:
> 
> Yes, the only system/real-time timeout is a progress one, wherein if nothing happens for ten minutes we assume the simulation has locked up. Hitting this is indicative of a bug, and the timeout is so long that no realistic system variability could trigger it.
> 
>> On 7 Dec 2023, at 14:56, Brandon Williams <dri...@gmail.com> wrote:
>> 
>> On Thu, Dec 7, 2023 at 8:50 AM Alex Petrov <al...@coffeenco.de> wrote:
>>>> I've noticed many "sleeps" in the tests - is it possible with simulation tests to artificially move the clock forward by, say, 5 seconds instead of sleeping, just to test, for example, whether TTL works?
>>> 
>>> Yes, the simulator will skip the sleep and do a simulated sleep with a simulated clock instead.
>> 
>> Since it uses an artificial clock, does this mean that the simulator
>> is also impervious to timeouts caused by the underlying environment?
>> 
>> Kind Regards,
>> Brandon
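The simulated-sleep idea from this exchange - code under test reads time through an injectable clock, and the harness advances that clock instantly instead of sleeping - can be sketched in plain Java. `MutableClock` and the TTL check are illustrative; this is not the simulator's actual API:

```java
import java.time.Duration;
import java.time.Instant;

public class ClockSketch
{
    // A clock the test harness controls, standing in for wall-clock time.
    static class MutableClock
    {
        private Instant now = Instant.EPOCH;
        Instant now() { return now; }
        void advance(Duration d) { now = now.plus(d); }
    }

    // Hypothetical TTL check written against the injected clock rather
    // than System.currentTimeMillis(), making it simulation-friendly.
    static boolean expired(Instant writtenAt, Duration ttl, MutableClock clock)
    {
        return !clock.now().isBefore(writtenAt.plus(ttl));
    }

    public static void main(String[] args)
    {
        MutableClock clock = new MutableClock();
        Instant writtenAt = clock.now();
        Duration ttl = Duration.ofSeconds(5);
        System.out.println(expired(writtenAt, ttl, clock)); // prints "false"
        clock.advance(Duration.ofSeconds(5));               // no real sleep
        System.out.println(expired(writtenAt, ttl, clock)); // prints "true"
    }
}
```

Because no real time passes, such a test is also immune to environment-induced timeouts - Brandon's question above - at the cost that anything reading the real clock directly escapes the simulation.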

Re: Long tests, Burn tests, Simulator tests, Fuzz tests - can we clarify the diffs?

2023-12-07 Thread Benedict
To be fair, the lack of coherent framework doesn’t mean we can’t merge them 
from a naming perspective. I don’t mind losing one of burn or fuzz, and merging 
them.

Today simulator tests are kept under the simulator test tree but that primarily 
exists for the simulator itself and testing it. It’s quite a complex source 
tree, as you might expect, and it exists primarily for managing its own 
complexity. It might make sense to bring the Paxos and Accord simulator entry 
points out into the burn/fuzz trees, though not sure it’s all that important.


> On 7 Dec 2023, at 15:05, Benedict  wrote:
> 
> Yes, the only system/real-time timeout is a progress one, wherein if nothing 
> happens for ten minutes we assume the simulation has locked up. Hitting this 
> is indicative of a bug, and the timeout is so long that no realistic system 
> variability could trigger it.
> 
>> On 7 Dec 2023, at 14:56, Brandon Williams  wrote:
>> 
>> On Thu, Dec 7, 2023 at 8:50 AM Alex Petrov  wrote:
>>>> I've noticed many "sleeps" in the tests - is it possible with simulation 
>>>> tests to artificially move the clock forward by, say, 5 seconds instead of 
>>>> sleeping just to test, for example whether TTL works?)
>>> 
>>> Yes, simulator will skip the sleep and do a simulated sleep with a 
>>> simulated clock instead.
>> 
>> Since it uses an artificial clock, does this mean that the simulator
>> is also impervious to timeouts caused by the underlying environment?
>> 
>> Kind Regards,
>> Brandon



Re: Long tests, Burn tests, Simulator tests, Fuzz tests - can we clarify the diffs?

2023-12-07 Thread Benedict
Yes, the only system/real-time timeout is a progress one, wherein if nothing 
happens for ten minutes we assume the simulation has locked up. Hitting this is 
indicative of a bug, and the timeout is so long that no realistic system 
variability could trigger it.

> On 7 Dec 2023, at 14:56, Brandon Williams  wrote:
> 
> On Thu, Dec 7, 2023 at 8:50 AM Alex Petrov  wrote:
>>> I've noticed many "sleeps" in the tests - is it possible with simulation 
>>> tests to artificially move the clock forward by, say, 5 seconds instead of 
>>> sleeping just to test, for example whether TTL works?)
>> 
>> Yes, simulator will skip the sleep and do a simulated sleep with a simulated 
>> clock instead.
> 
> Since it uses an artificial clock, does this mean that the simulator
> is also impervious to timeouts caused by the underlying environment?
> 
> Kind Regards,
> Brandon



Re: Long tests, Burn tests, Simulator tests, Fuzz tests - can we clarify the diffs?

2023-11-30 Thread Benedict
I don’t know - I’m not sure what "fuzz test" means in this context. It’s a newer concept that I didn’t introduce.

On 30 Nov 2023, at 20:06, Jacek Lewandowski wrote:

How do those burn tests then compare to the fuzz tests (the new ones)?

On Thu, 30 Nov 2023 at 20:22, Benedict <bened...@apache.org> wrote:

By “could run indefinitely” I don’t mean that by default they run forever. There will be parameters that change how much work is done for a given run, but just running repeatedly (each time with a different generated seed) is the expected usage - until you run out of compute or patience.

I agree they are only of value pre-commit to check they haven’t been broken in some way by changes.

On 30 Nov 2023, at 18:36, Josh McKenzie <jmcken...@apache.org> wrote:

> that may be long-running and that could be run indefinitely

Perfect. That was the distinction I wasn't aware of. That also means having the burn target as part of regular CI runs is probably a mistake, yes? i.e. if someone adds a burn test that runs indefinitely, are there any guardrails, built-in checks or timeouts to keep it from running right up to the job timeout and then failing?

On Thu, Nov 30, 2023, at 1:11 PM, Benedict wrote:

A burn test is a randomised test targeting broad coverage of a single system, subsystem or utility, that may be long-running and that could be run indefinitely, each run providing incrementally more assurance of quality of the system.

A long test is a unit test that sometimes takes a long time to run, no more no less. I’m not sure any of these offer all that much value anymore, and perhaps we could look to deprecate them.

On 30 Nov 2023, at 17:20, Josh McKenzie <jmcken...@apache.org> wrote:

Strongly agree. I started working on a declarative refactor of our CI configuration so circle, ASF CI, and other systems could inherit from it (for instance, see the pre-commit pipeline declaration here); I had to set that down while I finished up implementing an internal CI system, since the code in neither the ASF CI structure nor the circle structure (.sh embedded in .yml /cry) was re-usable in its current form.

Having a jvm.options and cassandra.yaml file per suite and referencing them from a declarative job definition would make things a lot easier to wrap our heads around and maintain, I think.

As for what qualifies as burn vs. long... /shrug, couldn't tell you. Would have to go down the git blame + dev ML + JIRA rabbit hole. :) Maybe someone else on-list knows.

On Thu, Nov 30, 2023, at 4:25 AM, Jacek Lewandowski wrote:

Hi,

I'm getting a bit lost - what are the exact differences between those test scenarios? What are the criteria for qualifying a test to be part of a certain scenario?

I'm working a little bit with tests and build scripts, and the number of different configurations for which we have a separate target in the build starts to be problematic - I cannot imagine how problematic it is for a new contributor.

It is not urgent, but we should at least have a plan on how to simplify and unify things.

I'm in favour of reducing the number of test targets to the minimum - for different configurations I think we should provide a parameter pointing to a jvm options file and maybe to cassandra.yaml. I know that we currently do some super hacky things with cassandra.yaml for different configs - like concatenating parts of it. I presume it is not necessary - we could have a default test config yaml and a directory with overriding yamls; while building, we could have a tool which is able to load the default configuration, apply the override and save the resulting yaml somewhere in build/test/configs, for example. That would allow us to easily use those yamls in the IDE as well - currently it is impossible.

What do you think?

Thank you, and my apologies for bothering you about lower priority stuff while we have a 5.0 release headache...

Jacek
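The burn-test pattern Benedict describes - run repeatedly, each run with a freshly generated seed, surfacing the seed on failure so the run can be replayed exactly - could be sketched like this. The deque-invariant workload is a stand-in for a real subsystem under test:

```java
import java.util.ArrayDeque;
import java.util.Random;

public class BurnSketch
{
    // One randomised run: all behaviour derives from the seed, so any
    // failure can be reproduced exactly by re-running with the same seed.
    static void runOnce(long seed)
    {
        Random rnd = new Random(seed);
        ArrayDeque<Integer> deque = new ArrayDeque<>();
        int expectedSize = 0;
        for (int op = 0; op < 1000; op++)
        {
            if (rnd.nextBoolean()) { deque.addLast(rnd.nextInt()); expectedSize++; }
            else if (!deque.isEmpty()) { deque.pollFirst(); expectedSize--; }
            // A real burn test checks a real subsystem invariant here.
            if (deque.size() != expectedSize)
                throw new AssertionError("invariant broken; reproduce with seed " + seed);
        }
    }

    public static void main(String[] args)
    {
        // "Run indefinitely" by raising the iteration count or looping in CI;
        // each iteration adds incrementally more assurance.
        for (int i = 0; i < 10; i++)
            runOnce(new Random().nextLong());
        System.out.println("ok"); // prints "ok"
    }
}
```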


Re: Long tests, Burn tests, Simulator tests, Fuzz tests - can we clarify the diffs?

2023-11-30 Thread Benedict
By “could run indefinitely” I don’t mean that by default they run forever. There will be parameters that change how much work is done for a given run, but just running repeatedly (each time with a different generated seed) is the expected usage - until you run out of compute or patience.

I agree they are only of value pre-commit to check they haven’t been broken in some way by changes.

On 30 Nov 2023, at 18:36, Josh McKenzie wrote:

> that may be long-running and that could be run indefinitely

Perfect. That was the distinction I wasn't aware of. That also means having the burn target as part of regular CI runs is probably a mistake, yes? i.e. if someone adds a burn test that runs indefinitely, are there any guardrails, built-in checks or timeouts to keep it from running right up to the job timeout and then failing?

On Thu, Nov 30, 2023, at 1:11 PM, Benedict wrote:

A burn test is a randomised test targeting broad coverage of a single system, subsystem or utility, that may be long-running and that could be run indefinitely, each run providing incrementally more assurance of quality of the system.

A long test is a unit test that sometimes takes a long time to run, no more no less. I’m not sure any of these offer all that much value anymore, and perhaps we could look to deprecate them.

On 30 Nov 2023, at 17:20, Josh McKenzie wrote:

Strongly agree. I started working on a declarative refactor of our CI configuration so circle, ASF CI, and other systems could inherit from it (for instance, see the pre-commit pipeline declaration here); I had to set that down while I finished up implementing an internal CI system, since the code in neither the ASF CI structure nor the circle structure (.sh embedded in .yml /cry) was re-usable in its current form.

Having a jvm.options and cassandra.yaml file per suite and referencing them from a declarative job definition would make things a lot easier to wrap our heads around and maintain, I think.

As for what qualifies as burn vs. long... /shrug, couldn't tell you. Would have to go down the git blame + dev ML + JIRA rabbit hole. :) Maybe someone else on-list knows.

On Thu, Nov 30, 2023, at 4:25 AM, Jacek Lewandowski wrote:

Hi,

I'm getting a bit lost - what are the exact differences between those test scenarios? What are the criteria for qualifying a test to be part of a certain scenario?

I'm working a little bit with tests and build scripts, and the number of different configurations for which we have a separate target in the build starts to be problematic - I cannot imagine how problematic it is for a new contributor.

It is not urgent, but we should at least have a plan on how to simplify and unify things.

I'm in favour of reducing the number of test targets to the minimum - for different configurations I think we should provide a parameter pointing to a jvm options file and maybe to cassandra.yaml. I know that we currently do some super hacky things with cassandra.yaml for different configs - like concatenating parts of it. I presume it is not necessary - we could have a default test config yaml and a directory with overriding yamls; while building, we could have a tool which is able to load the default configuration, apply the override and save the resulting yaml somewhere in build/test/configs, for example. That would allow us to easily use those yamls in the IDE as well - currently it is impossible.

What do you think?

Thank you, and my apologies for bothering you about lower priority stuff while we have a 5.0 release headache...

Jacek

Re: Long tests, Burn tests, Simulator tests, Fuzz tests - can we clarify the diffs?

2023-11-30 Thread Benedict
A burn test is a randomised test targeting broad coverage of a single system, subsystem or utility, that may be long-running and that could be run indefinitely, each run providing incrementally more assurance of quality of the system.

A long test is a unit test that sometimes takes a long time to run, no more no less. I’m not sure any of these offer all that much value anymore, and perhaps we could look to deprecate them.

On 30 Nov 2023, at 17:20, Josh McKenzie wrote:

Strongly agree. I started working on a declarative refactor of our CI configuration so circle, ASF CI, and other systems could inherit from it (for instance, see the pre-commit pipeline declaration here); I had to set that down while I finished up implementing an internal CI system, since the code in neither the ASF CI structure nor the circle structure (.sh embedded in .yml /cry) was re-usable in its current form.

Having a jvm.options and cassandra.yaml file per suite and referencing them from a declarative job definition would make things a lot easier to wrap our heads around and maintain, I think.

As for what qualifies as burn vs. long... /shrug, couldn't tell you. Would have to go down the git blame + dev ML + JIRA rabbit hole. :) Maybe someone else on-list knows.

On Thu, Nov 30, 2023, at 4:25 AM, Jacek Lewandowski wrote:

Hi,

I'm getting a bit lost - what are the exact differences between those test scenarios? What are the criteria for qualifying a test to be part of a certain scenario?

I'm working a little bit with tests and build scripts, and the number of different configurations for which we have a separate target in the build starts to be problematic - I cannot imagine how problematic it is for a new contributor.

It is not urgent, but we should at least have a plan on how to simplify and unify things.

I'm in favour of reducing the number of test targets to the minimum - for different configurations I think we should provide a parameter pointing to a jvm options file and maybe to cassandra.yaml. I know that we currently do some super hacky things with cassandra.yaml for different configs - like concatenating parts of it. I presume it is not necessary - we could have a default test config yaml and a directory with overriding yamls; while building, we could have a tool which is able to load the default configuration, apply the override and save the resulting yaml somewhere in build/test/configs, for example. That would allow us to easily use those yamls in the IDE as well - currently it is impossible.

What do you think?

Thank you, and my apologies for bothering you about lower priority stuff while we have a 5.0 release headache...

Jacek
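The override tool Jacek proposes - load the default test config, layer an override on top, write the result to build/test/configs - reduces to a deep map merge once the YAML is parsed. A minimal sketch, with YAML parsing omitted and the keys purely illustrative:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ConfigMergeSketch
{
    // Deep-merge: override wins for scalars; nested maps are merged
    // recursively so an override yaml only needs to state what differs.
    @SuppressWarnings("unchecked")
    static Map<String, Object> merge(Map<String, Object> base, Map<String, Object> override)
    {
        Map<String, Object> result = new LinkedHashMap<>(base);
        for (Map.Entry<String, Object> e : override.entrySet())
        {
            Object existing = result.get(e.getKey());
            if (existing instanceof Map && e.getValue() instanceof Map)
                result.put(e.getKey(), merge((Map<String, Object>) existing,
                                             (Map<String, Object>) e.getValue()));
            else
                result.put(e.getKey(), e.getValue());
        }
        return result;
    }

    public static void main(String[] args)
    {
        // Stand-ins for the parsed default yaml and one override yaml.
        Map<String, Object> base = Map.of("num_tokens", 16,
                                          "memtable", Map.of("class", "SkipListMemtable"));
        Map<String, Object> override = Map.of("memtable", Map.of("class", "TrieMemtable"));
        Map<String, Object> merged = merge(base, override);
        System.out.println(merged.get("num_tokens") + " " + merged.get("memtable"));
        // prints "16 {class=TrieMemtable}"
    }
}
```

Writing the merged map back out as a complete yaml is what would make the per-suite configs directly loadable in an IDE, which the string-concatenation approach prevents.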

Re: Road to 5.0-GA (was: [VOTE] Release Apache Cassandra 5.0-alpha2)

2023-11-04 Thread Benedict
Yep, data loss bugs are not any old bug. I’m concretely -1 (binding) on releasing a beta with one that’s either under investigation or confirmed.

As Scott says, hopefully it won’t come to that - the joy of deterministic testing is that this should be straightforward to triage.

On 4 Nov 2023, at 17:30, C. Scott Andreas wrote:

I’d happily be the first to vote -1 (nb) on a release containing a known and reproducible bug that can result in data loss or an incorrect response to a query. And I certainly wouldn’t run it.

Since we have a programmatic repro within just a few seconds, this should not take long to root-cause.

On Friday, Alex worked to get this reproducing on a Cassandra branch rather than via unstaged changes. We should have a published/shareable example with details near the beginning of the week.

– Scott

On Nov 4, 2023, at 10:17 AM, Josh McKenzie wrote:

> I think before we cut a beta we need to have diagnosed and fixed 18993 (assuming it is a bug).

Before a beta? I could see that for rc or GA definitely, but having a known (especially non-regressive) data loss bug in a beta seems like it's compatible with the guarantees we're providing for it: https://cwiki.apache.org/confluence/display/CASSANDRA/Release+Lifecycle

> This release is recommended for test/QA clusters where short (order of minutes) downtime during upgrades is not an issue

On Sat, Nov 4, 2023, at 12:56 PM, Ekaterina Dimitrova wrote:

Totally agree with the others. Such an issue on its own should be a priority in any release. Looking forward to the reproduction test mentioned on the ticket. Thanks to Alex for his work on Harry!

On Sat, 4 Nov 2023 at 12:47, Benedict <bened...@apache.org> wrote:

Alex can confirm, but I think it actually turns out to be a new bug in 5.0; either way we should not cut a release with such a serious potential known issue.

> On 4 Nov 2023, at 16:18, J. D. Jordan <jeremiah.jor...@gmail.com> wrote:
> 
> Sounds like 18993 is not a regression in 5.0, but present in 4.1 as well? So I would say we should fix it with the highest priority and get a new 4.1.x released. Blocking 5.0 beta voting is a secondary issue to me if we have a “data not being returned” issue in an existing release?
> 
>> On Nov 4, 2023, at 11:09 AM, Benedict <bened...@apache.org> wrote:
>> 
>> I think before we cut a beta we need to have diagnosed and fixed 18993 (assuming it is a bug).
>> 
>>> On 4 Nov 2023, at 16:04, Mick Semb Wever <m...@apache.org> wrote:
>>> 
>>>> With the publication of this release I would like to switch the default 'latest' docs on the website from 4.1 to 5.0. Are there any objections to this?
>>> 
>>> I would also like to propose the next 5.0 release to be 5.0-beta1.
>>> 
>>> With the aim of reaching GA for the Summit, I would like to suggest we work towards the best-case scenario of 5.0-beta1 in two weeks and 5.0-rc1 first week Dec.
>>> 
>>> I know this is a huge ask with lots of unknowns we can't actually commit to. But I believe it is a worthy goal, and possible if nothing sideswipes us – but we'll need all the help we can get this month to make it happen.

Re: Road to 5.0-GA (was: [VOTE] Release Apache Cassandra 5.0-alpha2)

2023-11-04 Thread Benedict
Alex can confirm but I think it actually turns out to be a new bug in 5.0, but 
either way we should not cut a release with such a serious potential known 
issue.

> On 4 Nov 2023, at 16:18, J. D. Jordan  wrote:
> 
> Sounds like 18993 is not a regression in 5.0? But present in 4.1 as well?  
> So I would say we should fix it with the highest priority and get a new 4.1.x 
> released. Blocking 5.0 beta voting is a secondary issue to me if we have a 
> “data not being returned” issue in an existing release?
> 
>> On Nov 4, 2023, at 11:09 AM, Benedict  wrote:
>> 
>> I think before we cut a beta we need to have diagnosed and fixed 18993 
>> (assuming it is a bug).
>> 
>>>> On 4 Nov 2023, at 16:04, Mick Semb Wever  wrote:
>>> 
>>> 
>>>> 
>>>> With the publication of this release I would like to switch the
>>>> default 'latest' docs on the website from 4.1 to 5.0.  Are there any
>>>> objections to this ?
>>> 
>>> 
>>> I would also like to propose the next 5.0 release to be 5.0-beta1
>>> 
>>> With the aim of reaching GA for the Summit, I would like to suggest we
>>> work towards the best-case scenario of 5.0-beta1 in two weeks and
>>> 5.0-rc1 first week Dec.
>>> 
>>> I know this is a huge ask with lots of unknowns we can't actually
>>> commit to.  But I believe it is a worthy goal, and possible if nothing
>>> sideswipes us – but we'll need all the help we can get this month to
>>> make it happen.
>> 



Re: Road to 5.0-GA (was: [VOTE] Release Apache Cassandra 5.0-alpha2)

2023-11-04 Thread Benedict
I think before we cut a beta we need to have diagnosed and fixed 18993 
(assuming it is a bug).

> On 4 Nov 2023, at 16:04, Mick Semb Wever  wrote:
> 
> 
>> 
>> With the publication of this release I would like to switch the
>> default 'latest' docs on the website from 4.1 to 5.0.  Are there any
>> objections to this ?
> 
> 
> I would also like to propose the next 5.0 release to be 5.0-beta1
> 
> With the aim of reaching GA for the Summit, I would like to suggest we
> work towards the best-case scenario of 5.0-beta1 in two weeks and
> 5.0-rc1 first week Dec.
> 
> I know this is a huge ask with lots of unknowns we can't actually
> commit to.  But I believe it is a worthy goal, and possible if nothing
> sideswipes us – but we'll need all the help we can get this month to
> make it happen.



Re: Push TCM (CEP-21) and Accord (CEP-15) to 5.1 (and cut an immediate 5.1-alpha1)

2023-11-02 Thread Benedict
> Projects MUST direct outsiders towards official releases rather than raw source repositories, nightly builds, snapshots, release candidates, or any other similar packages.

Admittedly, “direct” here is ambiguous, but I think the sentiment that users should only be invited to use voted releases is reasonable either way.

On 2 Nov 2023, at 15:59, Josh McKenzie wrote:

> My reading of ASF policy is that directing users to CEP preview releases that are not formally voted upon is not acceptable. The policy you quote indicates they should be intended only for active participants on dev@

I disagree with this interpretation; it'd be good to get some clarification, as I don't see the narrow requirement of "developers on dev@". This interpretation would actively stifle any project's ability to get early user input and testing on things that are in development. The primary reason I read it differently (aside from the negative implications) is the following text (emphasis mine):

> Projects SHOULD make available developer resources to support individuals actively participating in development or following the dev list and thus aware of the conditions placed on unreleased materials.

For example, a user downloading a snapshot release with the unified compaction strategy in it to test it against their data set and provide feedback to the engineers working on it on the dev ML or dev Slack is very much someone actively participating in the development. It shouldn't just be contributors or committers actively working on the code who touch it before it's merged to trunk, should it?

On Thu, Nov 2, 2023, at 10:16 AM, Benedict wrote:

My view is that we wait and see what the CI looks like at that time.

My reading of ASF policy is that directing users to CEP preview releases that are not formally voted upon is not acceptable. The policy you quote indicates they should be intended only for active participants on dev@, whereas our explicit intention is to enable them to be advertised to users at the summit.

On 2 Nov 2023, at 13:27, Josh McKenzie wrote:

> I’m not sure we need any additional mechanisms beyond DISCUSS threads, polls and lazy consensus?
> ...
> This likely means at least another DISCUSS thread and lazy consensus if you want to knowingly go against it, or want to modify or clarify what’s meant.
> ...
> It can be chucked out or rewoven at zero cost, but if the norms have taken hold and are broadly understood in the same way, it won’t change much or at all, because the actual glue is the norm, not the words, which only serve to broadcast some formulation of the norm.

100% agree on all counts. Hopefully this discussion is useful for other folks as well.

So - with the clarification that our agreement on green CI represents a polled majority consensus of the folks participating in the discussion at the time, but not some kind of hard unbendable obligation - is this something we want to consider relaxing for TCM and Accord?

This thread ran long (and got detoured - mea culpa) - the tradeoffs seem like:

1. We merge them without green CI and cut a cassandra-5.1 branch so we can release an alpha-1 snapshot from that branch. This likely leaves cassandra-5.1 and trunk in an unstable place w/regards to CI. TCM/Accord devs can be expected to be pulled into fixing core issues / finalizing the features, and the burden for test stabilization "leaks out" across others in the community who don't have context on their breakage (see: CASSANDRA-8099, the cassandra-4.0 release, the cassandra-4.1 release, and now the push for cassandra-5.0 QA stabilization).

2. Push for green CI on Accord / TCM before merge and alpha availability, almost certainly delaying their availability to the community.

3. Cut a preview / snapshot release from the accord feature branch, made available to the dev community. We could automate creation / update of docker images with snapshot releases of HEAD for trunk and feature branches.

4. Some other approach I'm not thinking of / missed.

So, as Mick asked earlier in the thread:

> Is anyone up for looking into adding a "preview" qualifier to our release process?

I'm in favor of this. If we cut preview snapshots from trunk and all feature branches periodically (nightly? weekly?), preferably as docker images, this satisfies the desire to get these features into the hands of the dev and user community to test them out and provide feedback to the dev process, while also allowing us to keep a high bar for merge to trunk.

Referencing the ASF Release Policy: https://www.apache.org/legal/release-policy.html#release-definition, this is consistent with the guidance:

> During the process of developing software and preparing a release, various packages are made available to the development community for testing purposes. Projects MUST direct outsiders towards official releases rather than raw source repositories, nightly builds, snapshots, release candidates, or any other similar packages. Projects SHOULD make available developer resources to support individuals actively participating in development or following the dev list and thus aware of the conditions placed on unreleased materials.

Re: Push TCM (CEP-21) and Accord (CEP-15) to 5.1 (and cut an immediate 5.1-alpha1)

2023-11-02 Thread Benedict
My view is that we wait and see what the CI looks like at that time.My reading of ASF policy is that directing users to CEP preview releases that are not formally voted upon is not acceptable. The policy you quote indicates they should be intended only for active participants on dev@, whereas our explicit intention is to enable them to be advertised to users at the summit.On 2 Nov 2023, at 13:27, Josh McKenzie  wrote:I’m not sure we need any additional mechanisms beyond DISCUSS threads, polls and lazy consensus?...This likely means at least another DISCUSS thread and lazy consensus if you want to knowingly go against it, or want to modify or clarify what’s meant. ...It can be chucked out or rewoven at zero cost, but if the norms have taken hold and are broadly understood in the same way, it won’t change much or at all, because the actual glue is the norm, not the words, which only serve to broadcast some formulation of the norm.100% agree on all counts. Hopefully this discussion is useful for other folks as well.So - with the clarification that our agreement on green CI represents a polled majority consensus of the folks participating on the discussion at the time but not some kind of hard unbendable obligation, is this something we want to consider relaxing for TCM and Accord?This thread ran long (and got detoured - mea culpa) - the tradeoffs seem like:We merge them without green CI and cut a cassandra-5.1 branch so we can release an alpha-1 snapshot from that branch. This likely leaves cassandra-5.1 and trunk in an unstable place w/regards to CI. 
TCM/Accord devs can be expected to be pulled into fixing core issues / finalizing the features and the burden for test stabilization "leaking out" across others in the community who don't have context on their breakage (see: CASSANDRA-8099, cassandra-4.0 release, cassandra-4.1 release, now push for cassandra-5.0 QA stabilization).Push for green CI on Accord / TCM before merge and alpha availability, almost certainly delaying their availability to the community.Cut a preview / snapshot release from the accord feature branch, made available to the dev community. We could automate creation / update of docker images with snapshot releases of all HEAD for trunk and feature branches.Some other approach I'm not thinking of / missedSo as Mick asked earlier in the thread:Is anyone up for looking into adding a "preview" qualifier to our release process? I'm in favor of this. If we cut preview snapshots from trunk and all feature branches periodically (nightly? weekly?), preferably as docker images, this satisfies the desire to get these features into the hands of the dev and user community to test them out and provide feedback to the dev process while also allowing us to keep a high bar for merge to trunk.Referencing the ASF Release Policy: https://www.apache.org/legal/release-policy.html#release-definition, this is consistent with the guidance:During the process of developing software and preparing a release, various packages are made available to the development community for testing purposes. Projects MUST direct outsiders towards official releases rather than raw source repositories, nightly builds, snapshots, release candidates, or any other similar packages. 
Projects SHOULD make available developer resources to support individuals actively participating in development or following the dev list and thus aware of the conditions placed on unreleased materials.

We direct people to the official downloads on the website and add a section below that references the latest snapshot releases for CEP-approved feature branch work in progress + trunk.

Generically, a release is anything that is published beyond the group that owns it. For an Apache project, that means any publication outside the development community, defined as individuals actively participating in development or following the dev list.

I think so long as we're clear about them being preview / snapshot releases of in-development work where we're looking for feedback on the dev process, as well as clearly directing people to the dev ML and #cassandra-dev on slack, this would be a pretty big win for the project.

So - that's my bid. What do others think?

On Wed, Nov 1, 2023, at 8:11 PM, Benedict wrote:

So my view is that the community is strongly built on consensus, so expressions of sentiment within the community have strong normative weight even without any specific legislative effect. You shouldn’t knowingly go against what appears to be a consensus (or even widely-held) view, even if it has no formal weight. So I’m not sure we need any additional mechanisms beyond DISCUSS threads, polls and lazy consensus?

Let’s treat your thread as a POLL for argument’s sake: lots of folk voted, and every vote was positive. So clearly there’s strong endorsement for the approach, or parts thereof, in some form. Given the goal of consensus in decision-making, it would not be reasonable to ignore this widely held v

Re: Push TCM (CEP-21) and Accord (CEP-15) to 5.1 (and cut an immediate 5.1-alpha1)

2023-11-01 Thread Benedict
So my view is that the community is strongly built on consensus, so expressions of sentiment within the community have strong normative weight even without any specific legislative effect. You shouldn’t knowingly go against what appears to be a consensus (or even widely-held) view, even if it has no formal weight. So I’m not sure we need any additional mechanisms beyond DISCUSS threads, polls and lazy consensus?

Let’s treat your thread as a POLL for argument’s sake: lots of folk voted, and every vote was positive. So clearly there’s strong endorsement for the approach, or parts thereof, in some form. Given the goal of consensus in decision-making, it would not be reasonable to ignore this widely held view on contributions. This likely means at least another DISCUSS thread and lazy consensus if you want to knowingly go against it, or want to modify or clarify what’s meant. This just falls naturally out of how we do things here I think, and is how we go about a lot of business already. It retains the agility you were talking about, setting norms cheaply.

It isn’t however a tightly held policy or legislative cudgel, it’s just what those who were talking and paying attention at the time agreed. It can be chucked out or rewoven at zero cost, but if the norms have taken hold and are broadly understood in the same way, it won’t change much or at all, because the actual glue is the norm, not the words, which only serve to broadcast some formulation of the norm.

On 1 Nov 2023, at 23:41, Josh McKenzie  wrote:

but binding to the same extent 2 committers reviewing something we later need to revert is binding.

To elaborate a bit - what I mean is "it's a bar we apply to help establish a baseline level of consensus but it's very much a 2-way door". 
Obviously 2 committers +1'ing code is a formal agreed upon voting mechanism.On Wed, Nov 1, 2023, at 7:26 PM, Josh McKenzie wrote:Community voting is also entirely by consensus, there is no such thing as a simple majority community vote, technical or otherwise.Ah hah! You're absolutely correct in that this isn't one of our "blessed" ways we vote. There's nothing written down about "committers are binding, simple majority" for any specific category of discussion.Are we ok with people creatively applying different ways to vote for things where there's not otherwise guidance if they feel it helps capture sentiment and engagement? Obviously the outcome of that isn't binding in the same way other votes by the pmc are, but binding to the same extent 2 committers reviewing something we later need to revert is binding.I'd rather we have a bunch of committers weigh in if we're talking about changing import ordering, or Config.java structure, or refactoring out singletons, or gatekeeping CI - things we've had come up over the years where we've had a lot of people chime in and we benefit from more than just "2 committers agree on it" but less than "We need a CEP or pmc vote for this".On Wed, Nov 1, 2023, at 5:10 PM, Benedict wrote:The project governance document does not list any kind of general purpose technical change vote. There are only three very specific kinds of community vote: code contributions, CEP and release votes.  Community voting is also entirely by consensus, there is no such thing as a simple majority community vote, technical or otherwise. I suggest carefully re-reading the document we both formulated!If it is a technical contribution, as you contest, we only need a normal technical contribution vote to override it - i.e. two committer +1s. If that’s how we want to roll with it, I guess we’re not really in disagreement.None of this really fundamentally changes anything. 
There’s a strong norm for a commit gate on CI, and nobody is going to go about breaking this norm willy-nilly. But equally there’s no need to panic and waste all this time debating hypothetical mechanisms to avoid this supposedly ironclad rule.We clearly need to address confusion over governance though. The idea that agreeing things carefully costs us agility is one I cannot endorse. The project has leaned heavily into the consensus side of the Apache Way, as evidenced by our governance document. That doesn’t mean things can’t change quickly, it just means before those changes become formal requirements there needs to be broad consensus, as defined in the governing document. That’s it.The norm existed before the vote, and it exists whether the vote was valid or not. That is how things evolve on the project, we just formalise them a little more slowly.On 1 Nov 2023, at 20:07, Josh McKenzie  wrote:First off, I appreciate your time and attention on this stuff. Want to be up front about that since these kinds of discussions can get prickly all too easily. I'm at least as guilty as anyone else about getting my back up on stuff like this. Figuring out the right things to "harden" as shared contractual ways we behave and what to leave loose and case

Re: Push TCM (CEP-21) and Accord (CEP-15) to 5.1 (and cut an immediate 5.1-alpha1)

2023-11-01 Thread Benedict
> The idea that agreeing things carefully costs us agility is one I cannot endorsenot one I can endorse On 1 Nov 2023, at 21:11, Benedict  wrote:The project governance document does not list any kind of general purpose technical change vote. There are only three very specific kinds of community vote: code contributions, CEP and release votes.  Community voting is also entirely by consensus, there is no such thing as a simple majority community vote, technical or otherwise. I suggest carefully re-reading the document we both formulated!If it is a technical contribution, as you contest, we only need a normal technical contribution vote to override it - i.e. two committer +1s. If that’s how we want to roll with it, I guess we’re not really in disagreement.None of this really fundamentally changes anything. There’s a strong norm for a commit gate on CI, and nobody is going to go about breaking this norm willy-nilly. But equally there’s no need to panic and waste all this time debating hypothetical mechanisms to avoid this supposedly ironclad rule.We clearly need to address confusion over governance though. The idea that agreeing things carefully costs us agility is one I cannot endorse. The project has leaned heavily into the consensus side of the Apache Way, as evidenced by our governance document. That doesn’t mean things can’t change quickly, it just means before those changes become formal requirements there needs to be broad consensus, as defined in the governing document. That’s it.The norm existed before the vote, and it exists whether the vote was valid or not. That is how things evolve on the project, we just formalise them a little more slowly.On 1 Nov 2023, at 20:07, Josh McKenzie  wrote:First off, I appreciate your time and attention on this stuff. Want to be up front about that since these kinds of discussions can get prickly all too easily. I'm at least as guilty as anyone else about getting my back up on stuff like this. 
Figuring out the right things to "harden" as shared contractual ways we behave and what to leave loose and case-by-case is going to continue to be a challenge for us as we grow.

The last thing I personally want is for us to have too many extraneous rules formalizing things that just serve to slow down people's ability to contribute to the project. The flip side of that - for all of us to work in a shared space and collectively remain maximally productive, some individual freedoms (ability to merge a bunch of broken code and/or ninja in things as we see fit, needing 2 committers' eyes on things, etc) will have to be given up.

At its core the discussion we had was prompted by divergence between circle and ASF CI and our release process dragging on repeatedly during the "stabilize ASF CI" phase. The "do we require green ci before merge of tickets" seems like it came along as an intuitive rider; best I can recall my thinking was "how else could we have a manageable load to stabilize in ASF CI if we didn't even require green circle before merging things in", but we didn't really dig into details; from a re-reading now, that portion of the discussion was just taken for granted as us being in alignment. Given it was codifying a norm and everyone else in the discussion generally agreed, I don't think I or anyone thought to question it.

“Votes on project structure and governance”. Governance, per Wikipedia, is "the way rules, norms and actions are structured and sustained.”

Bluntly, I'm not that worried about what wikipedia or a dictionary says about the topic. What I'm worried about here is what we collectively as a community think of as governance. "Do we have green CI pre-merge or not", to me personally, didn't qualify as a governance issue but rather a technical change. 
I'm open to being convinced otherwise, that things like that should qualify for a higher bar of voting, but again, I'm leery of that slowing down other workflow optimizations, changes, or community-wide impacting improvements people are making. Or muddying the waters to where people aren't sure what does or doesn't qualify as governance so they end up not pursuing things they're interested in improving as they're off-put by the bureaucratic burden of getting supermajority buy-in from pmc members who are much harder to get to participate in discussions on the ML compared to showing up for roll-call. :innocent:My understanding of "The Apache Way" is that to move at speed and at scale, we need to trust each other to do the right thing and know we can back out if things go awry. So if some folks talk through mutating config through virtual tables for instance, or folks working on TCM put things up for review and I don't have cycles, I trust the folks doing that work (the committers working on it or review it) that I personally just stay out of it knowing that if things need refining going forward we'll do so. Different things have a different cos

Re: Push TCM (CEP-21) and Accord (CEP-15) to 5.1 (and cut an immediate 5.1-alpha1)

2023-11-01 Thread Benedict
that side of the discussion you'd want to open now? i.e. the case for merging bodies of work without green CI, when we do that, how we do that, why we do that? We very well could have missed a very meaningful and useful scenario that would have changed the collective conversation since nobody brought it up at the time. We simple majority committer voted in; we can simple majority committer vote out if we think this is too constricting a policy or if we want to add an exception to it right?That's the blessing and the curse of decisions made with a lower bar; lower bar to undo.And I suppose secondly - if you disagree on whether something qualifies for the super majority governance bar vs. the simple majority committer bar... how do we navigate that?On Wed, Nov 1, 2023, at 12:33 PM, Benedict wrote:Your conceptualisation implies no weight to the decision, as a norm is not binding?The community voting section mentions only three kinds of decision, and this was deliberate: code contributions, CEP and releases - the latter of which non-PMC members are only permitted to veto; their votes do not count positively[1]. Everything else is a PMC decision.> I think you're arguing that voting to change our bar for merging when it comes to CI falls under "votes on project structure”“Votes on project structure and governance”. Governance, per Wikipedia, is "the way rules, norms and actions are structured and sustained.”I do not see any ambiguity here. The community side provides no basis for a vote of this kind, while the PMC side specifically reserves this kind of decision. But evidently we need to make this clearer.Regarding the legitimacy of questioning this now: I have not come up against this legislation before. The norm of requiring green CI has been around for a lot longer than this vote, so nothing much changed until we started questioning the specifics of this legislation. At this point, the legitimacy of the decision also matters. 
Clearly there is broad support for a policy of this kind, but is this specific policy adequate?While I endorse the general sentiment of the policy, I do not endorse a policy that has no wiggle room. I have made every effort in all of my policy-making to ensure there are loosely-defined escape hatches for the community to use, in large part to minimise this kind of legalistic logjam, which is just wasted cycles.On 1 Nov 2023, at 15:31, Josh McKenzie  wrote:That vote thread also did not reach the threshold; it was incorrectly counted, as committer votes are not binding for procedural changes. I counted at most 8 PMC +1 votes. This piqued my curiosity.Link to how we vote: https://cwiki.apache.org/confluence/display/CASSANDRA/Cassandra+Project+GovernanceSTATUS: Ratified 2020/06/25Relevant bits here:On dev@:Discussion / binding votes on releases (Consensus: min 3 PMC +1, no -1)Discussion / binding votes on project structure and governance changes (adopting subprojects, how we vote and govern, etc). (super majority)The thread where we voted on the CI bar Jeremiah referenced: https://lists.apache.org/thread/2shht9rb0l8fh2gfqx6sz9pxobo6sr60Particularly relevant bit:Committer / pmc votes binding.
Simple majority passes.I think you're arguing that voting to change our bar for merging when it comes to CI falls under "votes on project structure"? I think when I called that vote I was conceptualizing it as a technical discussion about a shared norm on how we as committers deal with code contributions, where the "committer votes are binding, simple majority" applies.I can see credible arguments in either direction, though I'd have expected those concerns or counter-arguments to have come up back in Jan of 2022 when we voted on the CI changes, not almost 2 years later after us operating under this new shared norm. The sentiments expressed on the discuss and vote thread were consistently positive and uncontentious; this feels to me like it falls squarely under the spirit of lazy consensus only at a much larger buy-in level than usual: https://community.apache.org/committers/decisionMaking.html#lazy-consensusWe've had plenty of time to call this vote and merge bar into question (i.e. every ticket we merge we're facing the "no regressions" bar), and the only reason I'd see us treating TCM or Accord differently would be because they're much larger bodies of work at merge so it's going to be a bigger lift to get to non-regression CI, and/or we would want a release cut from a formal branch rather than a feature branch for preview.An alternative approach to keep this merge and CI burden lower would have been more incremental work merged into trunk periodically, an argument many folks in the community have made in the past. I personally have mixed feelings about it; there's pros and cons to both approaches.All that said, I'm in favor of us continuing with this as a valid and ratified vote (technical norms == committer binding + simple majority). If we want to open a

Re: Push TCM (CEP-21) and Accord (CEP-15) to 5.1 (and cut an immediate 5.1-alpha1)

2023-11-01 Thread Benedict
Your conceptualisation implies no weight to the decision, as a norm is not binding?

The community voting section mentions only three kinds of decision, and this was deliberate: code contributions, CEP and releases - the latter of which non-PMC members are only permitted to veto; their votes do not count positively[1]. Everything else is a PMC decision.

> I think you're arguing that voting to change our bar for merging when it comes to CI falls under "votes on project structure”

“Votes on project structure and governance”. Governance, per Wikipedia, is "the way rules, norms and actions are structured and sustained.”

I do not see any ambiguity here. The community side provides no basis for a vote of this kind, while the PMC side specifically reserves this kind of decision. But evidently we need to make this clearer.

Regarding the legitimacy of questioning this now: I have not come up against this legislation before. The norm of requiring green CI has been around for a lot longer than this vote, so nothing much changed until we started questioning the specifics of this legislation. At this point, the legitimacy of the decision also matters. Clearly there is broad support for a policy of this kind, but is this specific policy adequate?

While I endorse the general sentiment of the policy, I do not endorse a policy that has no wiggle room. I have made every effort in all of my policy-making to ensure there are loosely-defined escape hatches for the community to use, in large part to minimise this kind of legalistic logjam, which is just wasted cycles.

On 1 Nov 2023, at 15:31, Josh McKenzie  wrote:

That vote thread also did not reach the threshold; it was incorrectly counted, as committer votes are not binding for procedural changes. I counted at most 8 PMC +1 votes. 
This piqued my curiosity.Link to how we vote: https://cwiki.apache.org/confluence/display/CASSANDRA/Cassandra+Project+GovernanceSTATUS: Ratified 2020/06/25Relevant bits here:On dev@:Discussion / binding votes on releases (Consensus: min 3 PMC +1, no -1)Discussion / binding votes on project structure and governance changes (adopting subprojects, how we vote and govern, etc). (super majority)The thread where we voted on the CI bar Jeremiah referenced: https://lists.apache.org/thread/2shht9rb0l8fh2gfqx6sz9pxobo6sr60Particularly relevant bit:Committer / pmc votes binding.
Simple majority passes.I think you're arguing that voting to change our bar for merging when it comes to CI falls under "votes on project structure"? I think when I called that vote I was conceptualizing it as a technical discussion about a shared norm on how we as committers deal with code contributions, where the "committer votes are binding, simple majority" applies.I can see credible arguments in either direction, though I'd have expected those concerns or counter-arguments to have come up back in Jan of 2022 when we voted on the CI changes, not almost 2 years later after us operating under this new shared norm. The sentiments expressed on the discuss and vote thread were consistently positive and uncontentious; this feels to me like it falls squarely under the spirit of lazy consensus only at a much larger buy-in level than usual: https://community.apache.org/committers/decisionMaking.html#lazy-consensusWe've had plenty of time to call this vote and merge bar into question (i.e. every ticket we merge we're facing the "no regressions" bar), and the only reason I'd see us treating TCM or Accord differently would be because they're much larger bodies of work at merge so it's going to be a bigger lift to get to non-regression CI, and/or we would want a release cut from a formal branch rather than a feature branch for preview.An alternative approach to keep this merge and CI burden lower would have been more incremental work merged into trunk periodically, an argument many folks in the community have made in the past. I personally have mixed feelings about it; there's pros and cons to both approaches.All that said, I'm in favor of us continuing with this as a valid and ratified vote (technical norms == committer binding + simple majority). 
If we want to open a formal discussion about instead considering that a procedural change and rolling things back based on those grounds I'm fine with that, but we'll need to discuss that and think about the broader implications since things like changing import ordering, tooling, or other ecosystem-wide impacting changes (CI systems we all share, etc) would similarly potentially run afoul of needing supermajority pmc participation of we categorize that type of work as "project structure" as per the governance rules.On Tue, Oct 31, 2023, at 1:25 PM, Jeremy Hanna wrote:I think the goal is to say "how could we get some working version of TCM/Accord into people's hands to try out at/by Summit?"  That's all.  People are eager to see it and try it out.On Oct 31, 2023, at 12:16 PM, Benedict  wrote:No
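[Editor's note: the two vote-counting rules contrasted in the exchange above — a supermajority of the PMC for governance changes versus a binding simple majority of committers for technical norms — can be sketched as follows. This is an illustrative aside only, not project tooling; the threshold functions, the 2/3 supermajority cutoff, and the example PMC size are assumptions for demonstration, not figures from the thread.]

```python
def supermajority_passes(pmc_plus_ones: int, pmc_total: int) -> bool:
    """Governance change: requires a supermajority of the PMC
    (taken here, for illustration, as at least 2/3 of all PMC members)."""
    return pmc_plus_ones * 3 >= pmc_total * 2

def simple_majority_passes(plus_ones: int, minus_ones: int) -> bool:
    """Technical norm: committer votes binding, simple majority of votes cast passes."""
    return plus_ones > minus_ones

# 8 PMC +1s do not clear a 2/3 bar on a hypothetical 21-member PMC...
print(supermajority_passes(8, 21))   # False
# ...but a vote with 8 +1s and no -1s passes as a simple-majority committer vote.
print(simple_majority_passes(8, 0))  # True
```

This is why the same tally can be read as "passed" under one framing and "did not reach the threshold" under the other, which is the crux of the disagreement in this thread.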

Re: Push TCM (CEP-21) and Accord (CEP-15) to 5.1 (and cut an immediate 5.1-alpha1)

2023-10-31 Thread Benedict
No, if I understand it correctly we’re in weird hypothetical land where people are inventing new release types (“preview”) to avoid merging TCM[1] in the event we want to cut a 5.1 release from the PR prior to the summit if there’s some handful of failing tests in the PR. This may or may not be a waste of everyone’s time.

Jeremiah, I’m not questioning: it was procedurally invalid. How we handle that is, as always, a matter for community decision making.

[1] how this helps isn’t entirely clear

On 31 Oct 2023, at 17:08, Paulo Motta  wrote:

Even if it was not formally prescribed as far as I understand, we have been following the "only merge on Green CI" custom as much as possible for the past several years. Is the proposal to relax this rule for 5.0?

On Tue, Oct 31, 2023 at 1:02 PM Jeremiah Jordan <jeremiah.jor...@gmail.com> wrote:
You are free to argue validity.  I am just stating what I see on the mailing list and in the wiki.  We had a vote which was called passing and was not contested at that time.  The vote was on a process which includes as #3 in the list:

Before a merge, a committer needs either a non-regressing (i.e. no new failures) run of circleci with the required test suites (TBD; see below) or of ci-cassandra. Non-regressing is defined here as "Doesn't introduce any new test failures; any new failures in CI are clearly not attributable to this diff"

(NEW) After merging tickets, ci-cassandra runs against the SHA and the author gets an advisory update on the related JIRA for any new errors on CI. The author of the ticket will take point on triaging this new failure and either fixing (if clearly reproducible or related to their work) or opening a JIRA for the intermittent failure and linking it in butler (https://butler.cassandra.apache.org/#/)

Which clearly says that before merge we ensure there are no known new regressions to CI.

The allowance for releases without CI being green, and merges without the CI being completely green, is from the fact that our trunk CI has rarely been completely green, so we allow merging things which do not introduce NEW regressions, and we allow releases with known regressions that are deemed acceptable.

We can indeed always vote to override it, and if it comes to that we can consider that as an option.

-Jeremiah


On Oct 31, 2023 at 11:41:29 AM, Benedict <bened...@apache.org> wrote:

That vote thread also did not reach the threshold; it was incorrectly counted, as committer votes are not binding for procedural changes. I counted at most 8 PMC +1 votes. The focus of that thread was also clearly GA releases and merges on such branches, since there was a focus on releases being failure-free. But this predates the more general release lifecycle vote that allows for alphas to have failing tests - which logically would be impossible if nothing were merged with failing or flaky tests.

Either way, the vote and discussion specifically allow for this to be overridden. 🤷‍♀️

On 31 Oct 2023, at 16:29, Jeremiah Jordan <jeremiah.jor...@gmail.com> wrote:
I never said there was a need for green CI for alpha.  We do have a requirement for not merging things to trunk that have known regressions in CI.Vote here: https://lists.apache.org/thread/j34mrgcy9wrtn04nwwymgm6893h0xwo9


On Oct 31, 2023 at 3:23:48 AM, Benedict <bened...@apache.org> wrote:

There is no requirement for green CI on alpha. We voted last year to require running all tests before commit and to require green CI for beta releases. This vote was invalid because it didn’t reach the vote floor for a procedural change but anyway is not inconsistent with knowingly and selectively merging work without green CI.

If we reach the summit we should take a look at the state of the PRs and make a decision about if they are alpha quality; if so, and we want a release, we should simply merge it and release. Making up a new release type when the work meets alpha standard to avoid an arbitrary and not mandated commit bar seems the definition of silly.

On 31 Oct 2023, at 04:34, J. D. Jordan <jeremiah.jor...@gmail.com> wrote:

That is my understanding as well. If the TCM and Accord based on TCM branches are ready to commit by ~12/1 we can cut a 5.1 branch and then a 5.1-alpha release.

Where “ready to commit” means our usual things of two committer +1 and green CI etc.

If we are not ready to commit then I propose that as long as everything in the accord+tcm Apache repo branch has had two committer +1’s, but maybe people are still working on fixes for getting CI green or similar, we cut a 5.1-preview build from the feature branch to vote on with known issues documented.  This would not be the preferred path, but would be a way to have a voted on release for summit.

-Jeremiah

On Oct 30, 2023, at 5:59 PM, Mick Semb Wever <m...@apache.org> wrote:

Hoping we can get clarity on this.

The proposal was, once TCM and Accord merges to trunk,  then immediately branch ca

Re: Push TCM (CEP-21) and Accord (CEP-15) to 5.1 (and cut an immediate 5.1-alpha1)

2023-10-31 Thread Benedict
That vote thread also did not reach the threshold; it was incorrectly counted, as committer votes are not binding for procedural changes. I counted at most 8 PMC +1 votes. The focus of that thread was also clearly GA releases and merges on such branches, since there was a focus on releases being failure-free. But this predates the more general release lifecycle vote that allows for alphas to have failing tests - which logically would be impossible if nothing were merged with failing or flaky tests.

Either way, the vote and discussion specifically allow for this to be overridden. 🤷‍♀️

On 31 Oct 2023, at 16:29, Jeremiah Jordan  wrote:
I never said there was a need for green CI for alpha.  We do have a requirement for not merging things to trunk that have known regressions in CI.Vote here: https://lists.apache.org/thread/j34mrgcy9wrtn04nwwymgm6893h0xwo9


On Oct 31, 2023 at 3:23:48 AM, Benedict <bened...@apache.org> wrote:

There is no requirement for green CI on alpha. We voted last year to require running all tests before commit and to require green CI for beta releases. This vote was invalid because it didn’t reach the vote floor for a procedural change but anyway is not inconsistent with knowingly and selectively merging work without green CI.If we reach the summit we should take a look at the state of the PRs and make a decision about if they are alpha quality; if so, and we want a release, we should simply merge it and release. Making up a new release type when the work meets alpha standard to avoid an arbitrary and not mandated commit bar seems the definition of silly.On 31 Oct 2023, at 04:34, J. D. Jordan <jeremiah.jor...@gmail.com> wrote:That is my understanding as well. If the TCM and Accord based on TCM branches are ready to commit by ~12/1 we can cut a 5.1 branch and then a 5.1-alpha release.Where “ready to commit” means our usual things of two committer +1 and green CI etc.If we are not ready to commit then I propose that as long as everything in the accord+tcm Apache repo branch has had two committer +1’s, but maybe people are still working on fixes for getting CI green or similar, we cut a 5.1-preview  build from the feature branch to vote on with known issues documented.  
This would not be the preferred path, but would be a way to have a voted on release for summit.-Jeremiah On Oct 30, 2023, at 5:59 PM, Mick Semb Wever <m...@apache.org> wrote:Hoping we can get clarity on this.The proposal was, once TCM and Accord merges to trunk,  then immediately branch cassandra-5.1 and cut an immediate 5.1-alpha1 release.This was to focus on stabilising TCM and Accord as soon as it lands, hence the immediate branching.And the alpha release as that is what our Release Lifecycle states it to be.https://cwiki.apache.org/confluence/display/CASSANDRA/Release+Lifecycle My understanding is that there was no squeezing in extra features into 5.1 after TCM+Accord lands, and there's no need for a "preview" release – we move straight to the alpha, as our lifecycle states.  And we will describe all usability shortcomings and bugs with the alpha, our lifecycle docs permit this, if we feel the need to.All this said, if TCM does not merge before the Summit, and we want to get a release into user hands, it has been suggested we cut a preview release 5.1-preview1 off the feature branch.  This is a different scenario, and only a mitigation plan.  On Thu, 26 Oct 2023 at 14:20, Benedict <bened...@apache.org> wrote:The time to stabilise is orthogonal to the time we branch. Once we branch we stop accepting new features for the branch, and work to stabilise.My understanding is we will branch as soon as we have a viable alpha containing TCM and Accord. That means pretty soon after they land in the project, which we expect to be around the summit.If this isn’t the expectation we should make that clear, as it will affect how this decision is made.On 26 Oct 2023, at 10:14, Benjamin Lerer <b.le...@gmail.com> wrote:
Regarding the release of 5.1, I understood the proposal to be that we cut an actual alpha, thereby sealing the 5.1 release from new features. Only features merged before we cut the alpha would be permitted, and the alpha should be cut as soon as practicable. What exactly would we be waiting for? The problem I believe is about expectations. It seems that your expectation is that a release with only TCM and Accord will reach GA quickly. Based on the time it took us to release 4.1, I am simply expecting more delays (a GA around end of May, June). In which case it seems to me that we could be interested in shipping more stuff in the meantime (thinking of CASSANDRA-15254 or CEP-29 for example). I do not have a strong opinion, I just want to make sure that we all share the same understanding and fully understand what we agree upon.

Le jeu. 26 oct. 2023 à 10:59, Benjamin Lerer <b.le...@gmail.com> a écrit :
I am surprised this needs to be said, but - espec

Re: Push TCM (CEP-21) and Accord (CEP-15) to 5.1 (and cut an immediate 5.1-alpha1)

2023-10-31 Thread Benedict
There is no requirement for green CI on alpha. We voted last year to require running all tests before commit and to require green CI for beta releases. This vote was invalid because it didn’t reach the vote floor for a procedural change but anyway is not inconsistent with knowingly and selectively merging work without green CI.If we reach the summit we should take a look at the state of the PRs and make a decision about if they are alpha quality; if so, and we want a release, we should simply merge it and release. Making up a new release type when the work meets alpha standard to avoid an arbitrary and not mandated commit bar seems the definition of silly.On 31 Oct 2023, at 04:34, J. D. Jordan  wrote:That is my understanding as well. If the TCM and Accord based on TCM branches are ready to commit by ~12/1 we can cut a 5.1 branch and then a 5.1-alpha release.Where “ready to commit” means our usual things of two committer +1 and green CI etc.If we are not ready to commit then I propose that as long as everything in the accord+tcm Apache repo branch has had two committer +1’s, but maybe people are still working on fixes for getting CI green or similar, we cut a 5.1-preview  build from the feature branch to vote on with known issues documented.  
This would not be the preferred path, but would be a way to have a voted on release for summit.-Jeremiah On Oct 30, 2023, at 5:59 PM, Mick Semb Wever  wrote:Hoping we can get clarity on this.The proposal was, once TCM and Accord merges to trunk,  then immediately branch cassandra-5.1 and cut an immediate 5.1-alpha1 release.This was to focus on stabilising TCM and Accord as soon as it lands, hence the immediate branching.And the alpha release as that is what our Release Lifecycle states it to be.https://cwiki.apache.org/confluence/display/CASSANDRA/Release+Lifecycle My understanding is that there was no squeezing in extra features into 5.1 after TCM+Accord lands, and there's no need for a "preview" release – we move straight to the alpha, as our lifecycle states.  And we will describe all usability shortcomings and bugs with the alpha, our lifecycle docs permit this, if we feel the need to.All this said, if TCM does not merge before the Summit, and we want to get a release into user hands, it has been suggested we cut a preview release 5.1-preview1 off the feature branch.  This is a different scenario, and only a mitigation plan.  On Thu, 26 Oct 2023 at 14:20, Benedict <bened...@apache.org> wrote:The time to stabilise is orthogonal to the time we branch. Once we branch we stop accepting new features for the branch, and work to stabilise.My understanding is we will branch as soon as we have a viable alpha containing TCM and Accord. That means pretty soon after they land in the project, which we expect to be around the summit.If this isn’t the expectation we should make that clear, as it will affect how this decision is made.On 26 Oct 2023, at 10:14, Benjamin Lerer <b.le...@gmail.com> wrote:
Regarding the release of 5.1, I understood the proposal to be that we cut an actual alpha, thereby sealing the 5.1 release from new features. Only features merged before we cut the alpha would be permitted, and the alpha should be cut as soon as practicable. What exactly would we be waiting for? The problem I believe is about expectations. It seems that your expectation is that a release with only TCM and Accord will reach GA quickly. Based on the time it took us to release 4.1, I am simply expecting more delays (a GA around end of May, June). In which case it seems to me that we could be interested in shipping more stuff in the meantime (thinking of CASSANDRA-15254 or CEP-29 for example). I do not have a strong opinion, I just want to make sure that we all share the same understanding and fully understand what we agree upon.

Le jeu. 26 oct. 2023 à 10:59, Benjamin Lerer <b.le...@gmail.com> a écrit :
I am surprised this needs to be said, but - especially for long-running CEPs - you must involve yourself early, and certainly within some reasonable time of being notified the work is ready for broader input and review. In this case, more than six months ago. It is unfortunately more complicated than that because six months ago Ekaterina and I were working on supporting Java 17 and dropping Java 8, which was needed by different ongoing works. We both missed the announcement that TCM was ready for review and anyway would not have been available at that time. Maxim asked me for a review of CASSANDRA-15254 more than 6 months ago and I have not been able to help him so far. We all have limited bandwidth and can miss some announcements.

The project has grown and a lot of things are going on in parallel. There are also more interdependencies between the different projects. In my opinion what we are lacking is a global overview of the different things going on in the project and some rough ideas of the status of the different signif

Re: Push TCM (CEP-21) and Accord (CEP-15) to 5.1 (and cut an immediate 5.1-alpha1)

2023-10-30 Thread Benedict
[1] Another little thing that I'd like to mention is a release management story. In the Apache Ignite project, we've got used to creating a release thread and posting the release status updates and/or problems, and/or delays there, and maybe some of the benchmarks at the end. Of course, this was done by the release manager who volunteered to do this work. I'm not saying we're doing anything wrong here, no, but the publicity and openness, coupled with regular updates, could help create a real sense of the remaining work in progress. These are my personal feelings, and definitely not actions to be taken. The example is here: [2]. On Thu, 26 Oct 2023 at 11:15, Benjamin Lerer <b.le...@gmail.com> wrote: >> >> Regarding the release of 5.1, I understood the proposal to be that we cut an actual alpha, thereby sealing the 5.1 release from new features. Only features merged before we cut the alpha would be permitted, and the alpha should be cut as soon as practicable.
 What exactly would we be waiting for? > > > The problem I believe is about expectations. It seems that your expectation is that a release with only TCM and Accord will reach GA quickly. Based on the time it took us to release 4.1, I am simply expecting more delays (a GA around end of May, June). In
 which case it seems to me that we could be interested in shipping more stuff in the meantime (thinking of CASSANDRA-15254 or CEP-29 for example). > I do not have a strong opinion, I just want to make sure that we all share the same understanding and fully understand what we agree upon. > > Le jeu. 26 oct. 2023 à 10:59, Benjamin Lerer <b.le...@gmail.com> a écrit : >>> >>> I am surprised this needs to be said, but - especially for long-running CEPs - you must involve yourself early, and certainly within some reasonable time of being notified the work is ready for broader input and review. In this case, more than six months
 ago. >> >> >> It is unfortunately more complicated than that because six month ago Ekaterina and I were working on supporting Java 17 and dropping Java 8 which was needed by different ongoing works. We both missed the announcement that TCM was ready for review and anyway
 would not have been available at that time. Maxim has asked me ages ago for a review of CASSANDRA-15254  more than 6 months ago and I have not been able to help him so far. We all have a limited bandwidth and can miss some announcements. >> >> The project has grown and a lot of things are going on in parallel. There are also more interdependencies between the different projects. In my opinion what we are lacking is a global overview of the different things going on in the project and some rough
 ideas of the status of the different significant pieces. It would allow us to better organize ourselves. >> >> Le jeu. 26 oct. 2023 à 00:26, Benedict <bened...@apache.org> a écrit : >>> >>> I have spoken privately with Ekaterina, and to clear up some possible ambiguity: I realise nobody has demanded a delay to this work to conduct additional reviews; a couple of folk have however said they would prefer one. >>> >>> >>> My point is that, as a community, we need to work on ensuring folk that care about a CEP participate at an appropriate time. If they aren’t able to, the consequences of that are for them to bear. >>> >>> >>> We should be working to avoid surprises as CEP start to land. To this end, I think we should work on some additional paragraphs for the governance doc covering expectations around the landing of CEPs. >>> >>> >>> On 25 Oct 2023, at 21:55, Benedict <bened...@apache.org> wrote: >>> >>>  >>> >>> I am surprised this needs to be said, but - especially for long-running CEPs - you must involve yourself early, and certainly within some reasonable time of being notified the work is ready for broader input and review. In this case, more than six months
 ago. >>> >>> >>> This isn’t the first time this has happened, and it is disappointing to see it again. Clearly we need to make this explicit in the guidance docs. >>> >>> >>> Regarding the release of 5.1, I understood the proposal to be that we cut an actual alpha, thereby sealing the 5.1 release from new features. Only features merged before we cut the alpha would be permitted, and the alpha should be cut as soon as practicable.
 What exactly would we be waiting for? >>> >>> >>> If we don’t have a clear and near-term trigger for branching 5.1 for its own release, shortly after Accord and TCM merge, then I am in favour of instead delaying 5.0. 

Re: Push TCM (CEP-21) and Accord (CEP-15) to 5.1 (and cut an immediate 5.1-alpha1)

2023-10-26 Thread Benedict
I don’t want to take a view on how long stabilisation will take, as it’s guesswork. I hope it won’t be lengthy, but I don’t think these guesses should affect the branch date. The question is only when we branch. The proposal I would endorse is that we branch as soon as TCM and Accord land. I think our normal branch rules should apply besides that - work landing in trunk between now and then makes the cut, but once branched new work lands in trunk, not 5.1. Since we’re also stabilising 5.0, I don’t expect lots of work to land between now and then. But that’s a question of focus and practicality, not a procedural position. On 26 Oct 2023, at 13:36, Ekaterina Dimitrova wrote: Benedict, what is your expectation for stabilization time? And what is the suggestion for the patches Benjamin mentioned, which are on their way to land in trunk? (Or any other patch on its way to be merged) On Thu, 26 Oct 2023 at 8:20, Benedict <bened...@apache.org> wrote: The time to stabilise is orthogonal to the time we branch. Once we branch we stop accepting new features for the branch, and work to stabilise. My understanding is we will branch as soon as we have a viable alpha containing TCM and Accord. That means pretty soon after they land in the project, which we expect to be around the summit. If this isn’t the expectation we should make that clear, as it will affect how this decision is made. On 26 Oct 2023, at 10:14, Benjamin Lerer wrote:
Regarding the release of 5.1, I understood the proposal to be that we cut an actual alpha, thereby sealing the 5.1 release from new features. Only features merged before we cut the alpha would be permitted, and the alpha should be cut as soon as practicable. What exactly would we be waiting for? The problem I believe is about expectations. It seems that your expectation is that a release with only TCM and Accord will reach GA quickly. Based on the time it took us to release 4.1, I am simply expecting more delays (a GA around end of May, June). In which case it seems to me that we could be interested in shipping more stuff in the meantime (thinking of CASSANDRA-15254 or CEP-29 for example). I do not have a strong opinion, I just want to make sure that we all share the same understanding and fully understand what we agree upon.

Le jeu. 26 oct. 2023 à 10:59, Benjamin Lerer <b.le...@gmail.com> a écrit :
I am surprised this needs to be said, but - especially for long-running CEPs - you must involve yourself early, and certainly within some reasonable time of being notified the work is ready for broader input and review. In this case, more than six months ago. It is unfortunately more complicated than that because six months ago Ekaterina and I were working on supporting Java 17 and dropping Java 8, which was needed by different ongoing works. We both missed the announcement that TCM was ready for review and anyway would not have been available at that time. Maxim asked me for a review of CASSANDRA-15254 more than 6 months ago and I have not been able to help him so far. We all have limited bandwidth and can miss some announcements.

The project has grown and a lot of things are going on in parallel. There are also more interdependencies between the different projects. In my opinion what we are lacking is a global overview of the different things going on in the project and some rough ideas of the status of the different significant pieces. It would allow us to better organize ourselves.    

Le jeu. 26 oct. 2023 à 00:26, Benedict <bened...@apache.org> a écrit :I have spoken privately with Ekaterina, and to clear up some possible ambiguity: I realise nobody has demanded a delay to this work to conduct additional reviews; a couple of folk have however said they would prefer one.

My point is that, as a community, we need to work on ensuring folk that care about a CEP participate at an appropriate time. If they aren’t able to, the consequences of that are for them to bear. We should be working to avoid surprises as CEPs start to land. To this end, I think we should work on some additional paragraphs for the governance doc covering expectations around the landing of CEPs. On 25 Oct 2023, at 21:55, Benedict <bened...@apache.org> wrote: I am surprised this needs to be said, but - especially for long-running CEPs - you must involve yourself early, and certainly within some reasonable time of being notified the work is ready for broader input and review. In this case, more than six months ago.

This isn’t the first time this has happened, and it is disappointing to see it again. Clearly we need to make this explicit in the guidance docs.

Regarding the release of 5.1, I understood the proposal to be that we cut an actual alpha, thereby sealing the 5.1 release from new features. Only features merged before we cut the alpha would be permitted, and the alpha should be cut as soon as practicable. What exactly would we be w

Re: Push TCM (CEP-21) and Accord (CEP-15) to 5.1 (and cut an immediate 5.1-alpha1)

2023-10-26 Thread Benedict
The time to stabilise is orthogonal to the time we branch. Once we branch we stop accepting new features for the branch, and work to stabilise.My understanding is we will branch as soon as we have a viable alpha containing TCM and Accord. That means pretty soon after they land in the project, which we expect to be around the summit.If this isn’t the expectation we should make that clear, as it will affect how this decision is made.On 26 Oct 2023, at 10:14, Benjamin Lerer  wrote:
Regarding the release of 5.1, I understood the proposal to be that we cut an actual alpha, thereby sealing the 5.1 release from new features. Only features merged before we cut the alpha would be permitted, and the alpha should be cut as soon as practicable. What exactly would we be waiting for? The problem I believe is about expectations. It seems that your expectation is that a release with only TCM and Accord will reach GA quickly. Based on the time it took us to release 4.1, I am simply expecting more delays (a GA around end of May, June). In which case it seems to me that we could be interested in shipping more stuff in the meantime (thinking of CASSANDRA-15254 or CEP-29 for example). I do not have a strong opinion, I just want to make sure that we all share the same understanding and fully understand what we agree upon.

Le jeu. 26 oct. 2023 à 10:59, Benjamin Lerer <b.le...@gmail.com> a écrit :
I am surprised this needs to be said, but - especially for long-running CEPs - you must involve yourself early, and certainly within some reasonable time of being notified the work is ready for broader input and review. In this case, more than six months ago. It is unfortunately more complicated than that because six months ago Ekaterina and I were working on supporting Java 17 and dropping Java 8, which was needed by different ongoing works. We both missed the announcement that TCM was ready for review and anyway would not have been available at that time. Maxim asked me for a review of CASSANDRA-15254 more than 6 months ago and I have not been able to help him so far. We all have limited bandwidth and can miss some announcements.

The project has grown and a lot of things are going on in parallel. There are also more interdependencies between the different projects. In my opinion what we are lacking is a global overview of the different things going on in the project and some rough ideas of the status of the different significant pieces. It would allow us to better organize ourselves.    

Le jeu. 26 oct. 2023 à 00:26, Benedict <bened...@apache.org> a écrit :I have spoken privately with Ekaterina, and to clear up some possible ambiguity: I realise nobody has demanded a delay to this work to conduct additional reviews; a couple of folk have however said they would prefer one.

My point is that, as a community, we need to work on ensuring folk that care about a CEP participate at an appropriate time. If they aren’t able to, the consequences of that are for them to bear. We should be working to avoid surprises as CEPs start to land. To this end, I think we should work on some additional paragraphs for the governance doc covering expectations around the landing of CEPs. On 25 Oct 2023, at 21:55, Benedict <bened...@apache.org> wrote: I am surprised this needs to be said, but - especially for long-running CEPs - you must involve yourself early, and certainly within some reasonable time of being notified the work is ready for broader input and review. In this case, more than six months ago.

This isn’t the first time this has happened, and it is disappointing to see it again. Clearly we need to make this explicit in the guidance docs.

Regarding the release of 5.1, I understood the proposal to be that we cut an actual alpha, thereby sealing the 5.1 release from new features. Only features merged before we cut the alpha would be permitted, and the alpha should be cut as soon as practicable. What exactly would we be waiting for? 

If we don’t have a clear and near-term trigger for branching 5.1 for its own release, shortly after Accord and TCM merge, then I am in favour of instead delaying 5.0.On 25 Oct 2023, at 19:40, Mick Semb Wever <m...@apache.org> wrote:I'm open to the suggestions of not branching cassandra-5.1 and/or naming a preview release something other than 5.1-alpha1.But… the codebases and release process (and upgrade tests) do not currently support releases with qualifiers that are not alpha, beta, or rc.  We can add a preview qualifier to this list, but I would not want to block getting a preview release out only because this wasn't yet in place.  Hence the proposal used 5.1-alpha1 simply because that's what we know we can do today.  An alpha release also means (typically) the branch.Is anyone up for looking into adding a "preview" qualifier to our release process? This may also solve our previous discussions and desire to have quarterly releases that 

Re: Push TCM (CEP-21) and Accord (CEP-15) to 5.1 (and cut an immediate 5.1-alpha1)

2023-10-25 Thread Benedict
I have spoken privately with Ekaterina, and to clear up some possible ambiguity: I realise nobody has demanded a delay to this work to conduct additional reviews; a couple of folk have however said they would prefer one.

My point is that, as a community, we need to work on ensuring folk that care about a CEP participate at an appropriate time. If they aren’t able to, the consequences of that are for them to bear. We should be working to avoid surprises as CEPs start to land. To this end, I think we should work on some additional paragraphs for the governance doc covering expectations around the landing of CEPs. On 25 Oct 2023, at 21:55, Benedict wrote: I am surprised this needs to be said, but - especially for long-running CEPs - you must involve yourself early, and certainly within some reasonable time of being notified the work is ready for broader input and review. In this case, more than six months ago.

This isn’t the first time this has happened, and it is disappointing to see it again. Clearly we need to make this explicit in the guidance docs.

Regarding the release of 5.1, I understood the proposal to be that we cut an actual alpha, thereby sealing the 5.1 release from new features. Only features merged before we cut the alpha would be permitted, and the alpha should be cut as soon as practicable. What exactly would we be waiting for? 

If we don’t have a clear and near-term trigger for branching 5.1 for its own release, shortly after Accord and TCM merge, then I am in favour of instead delaying 5.0.On 25 Oct 2023, at 19:40, Mick Semb Wever  wrote:I'm open to the suggestions of not branching cassandra-5.1 and/or naming a preview release something other than 5.1-alpha1.But… the codebases and release process (and upgrade tests) do not currently support releases with qualifiers that are not alpha, beta, or rc.  We can add a preview qualifier to this list, but I would not want to block getting a preview release out only because this wasn't yet in place.  Hence the proposal used 5.1-alpha1 simply because that's what we know we can do today.  An alpha release also means (typically) the branch.Is anyone up for looking into adding a "preview" qualifier to our release process? This may also solve our previous discussions and desire to have quarterly releases that folk can use through the trunk dev cycle.Personally, with my understanding of timelines in front of us to fully review and stabilise tcm, it makes sense to branch it as soon as it's merged.  It's easiest to stabilise it on a branch, and there's definitely the desire and demand to do so, so it won't be getting forgotten or down-prioritised.   On Wed, 25 Oct 2023 at 18:07, Jeremiah Jordan <jeremiah.jor...@gmail.com> wrote:
If we do a 5.1 release why not take it as an opportunity to release more things. I am not saying that we will, just that we should leave that door open.
Agreed.  This is the reason I brought up the possibility of not branching off 5.1 immediately.


On Oct 25, 2023 at 3:17:13 AM, Benjamin Lerer <b.le...@gmail.com> wrote:

The proposal includes 3 things:
1. Do not include TCM and Accord in 5.0 to avoid delaying 5.0
2. The next release will be 5.1 and will include only Accord and TCM
3. Merge TCM and Accord right now in 5.1 (making an initial release)
I am fine with question 1 and do not have a strong opinion on which way to go. 2. means that every new feature will have to wait for post-5.1 even if it is ready before 5.1 is stabilized and shipped. If we do a 5.1 release why not take it as an opportunity to release more things. I am not saying that we will, just that we should leave that door open. 3. There is a need to merge TCM and Accord as maintaining those separate branches is costly in terms of time and energy. I fully understand that. On the other hand merging TCM and Accord will make the TCM review harder and I do believe that this second round of review is valuable as it already uncovered a valid issue. Nevertheless, I am fine with merging TCM as soon as it passes CI and continuing the review after the merge. If we cannot meet at least that quality level (green CI) we should not merge just for creating a 5.1-alpha release for the summit. Now, I am totally fine with a preview release without numbering and with big warnings that will only serve as a preview for the summit. Le mer. 25 oct. 2023 à 06:33, Berenguer Blasi <berenguerbl...@gmail.com> a écrit : I also think there are many good new features in 5.0 already; they'd make a good release even on their own. My 2 cts.

On 24/10/23 23:20, Brandon Williams wrote:
> The catch here is that we don't publish docker images currently.  The
> C* docker images available are not made by us.
>
> Kind Regards,
> Brandon
>
> On Tue, Oct 24, 2023 at 3:31 PM Patrick McFadin <pmcfa...@gmail.com> wrote:
>> Let me make that really easy. Hell yes
>>
>> Not everybody runs CCM, I've tried but I've met resistance.
&

Re: Push TCM (CEP-21) and Accord (CEP-15) to 5.1 (and cut an immediate 5.1-alpha1)

2023-10-25 Thread Benedict
ing a cronned job that'd do nightly docker container builds on trunk + feature branches, archived for N days, and we make that generally known to the dev@ list here so folks that want to poke at the current state of trunk or other branches could do so with very low friction. We'd probably see more engagement on feature branches if it was turn-key easy for other C* devs to spin the up and check them out.
>>>
>>> For what you're talking about here Patrick (a docker image for folks outside the dev@ audience and more user-facing), we'd want to vote on it and go through the formal process.
>>>
>>> On Tue, Oct 24, 2023, at 3:10 PM, Jeremiah Jordan wrote:
>>>
>>> In order for the project to advertise the release outside the dev@ list it needs to be a formal release.  That just means that there was a release vote and at least 3 PMC members +1’ed it, and there are more +1 than there are -1, and we follow all the normal release rules.  The ASF release process doesn’t care what branch you cut the artifacts from or what version you call it.
>>>
>>> So the project can cut artifacts for and release a 5.1-alpha1, 5.1-dev-preview1, what ever we want to version this thing, from trunk or any other branch name we want.
>>>
>>> -Jeremiah
>>>
>>> On Oct 24, 2023 at 2:03:41 PM, Patrick McFadin <pmcfa...@gmail.com> wrote:
>>>
>>> I would like to have something for developers to use ASAP to try the Accord syntax. Very few people have seen it, and I think there's a learning curve we can start earlier.
>>>
>>> It's my understanding that ASF policy is that it needs to be a project release to create a docker image.
>>>
>>> On Tue, Oct 24, 2023 at 11:54 AM Jeremiah Jordan <jeremiah.jor...@gmail.com> wrote:
>>>
>>> If we decide to go the route of not merging TCM to the 5.0 branch.  Do we actually need to immediately cut a 5.1 branch?  Can we work on stabilizing things while it is in trunk and cut the 5.1 branch when we actually think we are near releasing?  I don’t see any reason we can not cut “preview” artifacts from trunk?
>>>
>>> -Jeremiah
>>>
>>> On Oct 24, 2023 at 11:54:25 AM, Jon Haddad <rustyrazorbl...@apache.org> wrote:
>>>
>>> I guess at the end of the day, shipping a release with a bunch of awesome features is better than holding it back.  If there's 2 big releases in 6 months the community isn't any worse off.
>>>
>>> We either ship something, or nothing, and something is probably better.
>>>
>>> Jon
>>>
>>>
>>> On 2023/10/24 16:27:04 Patrick McFadin wrote:
>>>
>>> +1 to what you are saying, Josh. Based on the last survey, yes, everyone was excited about Accord, but SAI and UCS were pretty high on the list.
>>>
>>> Benedict and I had a good conversation last night, and now I understand more essential details for this conversation. TCM is taking far more work than initially scoped, and Accord depends on a stable TCM. TCM is months behind and that's a critical fact, and one I personally just learned of. I thought things were wrapping up this month, and we were in the testing phase. I get why that's a topic we are dancing around. Nobody wants to say ship dates are slipping because that's part of our culture. It's disappointing and, if new information, an unwelcome surprise, but none of us should be angry or in a blamey mood because I guarantee every one of us has shipped the code late. My reaction yesterday was based on an incorrect assumption. Now that I have a better picture, my point of view is changing.
>>>
>>> Josh's point about what's best for users is crucial. Users deserve stable code with a regular cadence of features that make their lives easier. If we put 5.0 on hold for TCM + Accord, users will get neither for a very long time. And I mentioned a disaster yesterday. A bigger disaster would be shipping Accord with a major bug that causes data loss, eroding community trust. Accord has to be the most bulletproof of all bulletproof features. The pressure to ship is only going to increase and that's fertile ground for that sort of bug.
>>>
>>> So, taking a step back and with a clearer picture, I support the 5.0 + 5.1
>>>
>&

Re: [EXTERNAL] Re: [DISCUSS] Add JVector as a dependency for CEP-30

2023-10-24 Thread Benedict
[LEGAL-656] Application of Generative AI policy to dependencies - ASF JIRA (issues.apache.org)

Legal’s opinion is that this is not an acceptable workaround to the policy.

On 22 Sep 2023, at 23:51, German Eichberger via dev wrote:






+1 with taking it to legal




Like anyone else, I enjoy speculating about legal stuff, and I think for jars you probably need plausible deniability, aka no paper trail that we knowingly... but that horse is out of the barn. So really interested in what legal says





If you can stomach non-Java, here is an alternate DiskANN implementation: 
microsoft/DiskANN: Graph-structured Indices for Scalable, Fast, Fresh and Filtered Approximate Nearest Neighbor Search (github.com)




Thanks,

German





From: Josh McKenzie 
Sent: Friday, September 22, 2023 7:43 AM
To: dev 
Subject: [EXTERNAL] Re: [DISCUSS] Add JVector as a dependency for CEP-30
 


I highly doubt liability works like that in all jurisdictions

That's a fantastic point. When speculating there, I overlooked the fact that there are literally dozens of legal jurisdictions in which this project is used and the foundation operates.


As a PMC let's take this to legal.


On Fri, Sep 22, 2023, at 9:16 AM, Jeff Jirsa wrote:

To do that, the cassandra PMC can open a legal JIRA and ask for a (durable, concrete) opinion.




On Fri, Sep 22, 2023 at 5:59 AM Benedict <bened...@apache.org> wrote:






my understanding is that with the former the liability rests on the provider of the lib to ensure it's in compliance with their claims to copyright



I highly doubt liability works like that in all jurisdictions, even if it might in some. I can even think of some historic cases related to Linux where patent trolls went after users of Linux, though I’m not sure where that got
 to and I don’t remember all the details.


But anyway, none of us are lawyers and we shouldn’t be depending on this kind of analysis. At minimum we should invite legal to proffer an opinion on whether dependencies are a valid loophole to the policy.







On 22 Sep 2023, at 13:48, J. D. Jordan <jeremiah.jor...@gmail.com> wrote:





This Gen AI generated code use thread should probably be its own mailing list DISCUSS thread?  It applies to all source code we take in, and accept copyright assignment of, not to jars we depend on and not only to vector related
 code contributions.



On Sep 22, 2023, at 7:29 AM, Josh McKenzie <jmcken...@apache.org> wrote:



So if we're going to chat about GenAI on this thread here, 2 things:

A dependency we pull in != a code contribution (I am not a lawyer but my understanding is that with the former the liability rests on the provider of the lib to ensure it's in compliance with their claims to copyright and it's not
 sticky). Easier to transition to a different dep if there's something API compatible or similar.
With code contributions we take in, we take on some exposure in terms of copyright and infringement. git revert can be painful.

For this thread, here's an excerpt from the ASF policy:


> a recommended practice when using generative AI tooling is to use tools with features that identify any included content that is similar to parts of the tool’s training data, as well as the license of that content.
>
> Given the above, code generated in whole or in part using AI can be contributed if the contributor ensures that:
>
> 1. The terms and conditions of the generative AI tool do not place any restrictions on use of the output that would be inconsistent with the Open Source Definition (e.g., ChatGPT’s terms are inconsistent).
> 2. At least one of the following conditions is met:
>    2.1. The output is not copyrightable subject matter (and would not be even if produced by a human)
>    2.2. No third party materials are included in the output
>    2.3. Any third party materials that are included in the output are being used with permission (e.g., under a compatible open source license) of the third party copyright holders and in compliance with the applicable license terms
> 3. A contributor obtains reasonable certainty that conditions 2.2 or 2.3 are met if the AI tool itself provides sufficient information about materials that may have been copied, or from code scanning results. E.g. AWS CodeWhisperer recently added a feature that provides notice and attribution.
>
> When providing contributions authored using generative AI tooling, a recommended practice is for contributors to indicate the tooling used to create the contribution. This should be included as a token in the source control commit message, for example including the phrase “Generated-by”



I think the real challenge right now is ensuring that the output from an LLM doesn't include a string of tokens that's identical to something in its input training dataset if it's trained on non-permissively licensed inputs. That
 plus the risk of, at least in the US, the courts landing on the side of saying that not only is the output of generative AI not copyrightable, but that there's legal l

Re: Push TCM (CEP-21) and Accord (CEP-15) to 5.1 (and cut an immediate 5.1-alpha1)

2023-10-23 Thread Benedict
To be clear, I’m not making an argument either way about the path forwards we should take, just concurring about a likely downside of this proposal. I don’t have a strong opinion about how we should proceed.

On 23 Oct 2023, at 18:16, Benedict wrote:

I agree. If we go this route we should essentially announce an immediate 5.1 alpha at the same time as 5.0 GA, and I can’t see almost anybody rolling out 5.0 with 5.1 so close on its heels.

On 23 Oct 2023, at 18:11, Aleksey Yeshchenko wrote:

I’m not so sure that many folks will choose to go 4.0->5.0->5.1 path instead of just waiting longer for TCM+Accord to be in, and go 4.0->5.1 in one hop. Nobody likes going through these upgrades. So I personally expect 5.0 to be a largely ghost release if we go this route, adopted by few, just a permanent burden on the merge path to trunk.

Not to say that there isn’t valuable stuff in 5.0 without TCM and Accord - there most certainly is - but with the expectation that 5.1 will follow up reasonably shortly after with all that *and* two highly anticipated features on top, I just don’t see the point. It will be another 2.2 release.

On 23 Oct 2023, at 17:43, Josh McKenzie wrote:

We discussed that at length in various other mailing threads Jeff - kind of settled on "we're committing to cutting a major (semver MAJOR or MINOR) every 12 months but want to remain flexible for exceptions when appropriate". And then we discussed our timeline for 5.0 this year and settled on the "let's try and get it out this calendar year so it's 12 months after 4.1, but we'll grandfather in TCM and Accord past freeze date if they can make it by October". So that's the history for how we landed here.

> 2) Do we drop the support of 3.0 and 3.11 after 5.0.0 is out or after 5.1.0 is?

This is my understanding, yes. Deprecation and support drop is predicated on the 5.0 release, not any specific features or anything.

On Mon, Oct 23, 2023, at 12:29 PM, Jeff Jirsa wrote:

On Mon, Oct 23, 2023 at 4:52 AM Mick Semb Wever <m...@apache.org> wrote:

> The TCM work (CEP-21) is in its review stage but being well past our cut-off date¹ for merging, and now jeopardising 5.0 GA efforts, I would like to propose the following.

I think this presumes that 5.0 GA is date driven instead of feature driven. I'm sure there's a conversation elsewhere, but why isn't this date movable?

Re: Push TCM (CEP-21) and Accord (CEP-15) to 5.1 (and cut an immediate 5.1-alpha1)

2023-10-23 Thread Benedict
I agree. If we go this route we should essentially announce an immediate 5.1 alpha at the same time as 5.0 GA, and I can’t see almost anybody rolling out 5.0 with 5.1 so close on its heels.

On 23 Oct 2023, at 18:11, Aleksey Yeshchenko wrote:

I’m not so sure that many folks will choose to go 4.0->5.0->5.1 path instead of just waiting longer for TCM+Accord to be in, and go 4.0->5.1 in one hop. Nobody likes going through these upgrades. So I personally expect 5.0 to be a largely ghost release if we go this route, adopted by few, just a permanent burden on the merge path to trunk.

Not to say that there isn’t valuable stuff in 5.0 without TCM and Accord - there most certainly is - but with the expectation that 5.1 will follow up reasonably shortly after with all that *and* two highly anticipated features on top, I just don’t see the point. It will be another 2.2 release.

On 23 Oct 2023, at 17:43, Josh McKenzie wrote:

We discussed that at length in various other mailing threads Jeff - kind of settled on "we're committing to cutting a major (semver MAJOR or MINOR) every 12 months but want to remain flexible for exceptions when appropriate". And then we discussed our timeline for 5.0 this year and settled on the "let's try and get it out this calendar year so it's 12 months after 4.1, but we'll grandfather in TCM and Accord past freeze date if they can make it by October". So that's the history for how we landed here.

> 2) Do we drop the support of 3.0 and 3.11 after 5.0.0 is out or after 5.1.0 is?

This is my understanding, yes. Deprecation and support drop is predicated on the 5.0 release, not any specific features or anything.

On Mon, Oct 23, 2023, at 12:29 PM, Jeff Jirsa wrote:

On Mon, Oct 23, 2023 at 4:52 AM Mick Semb Wever wrote:

> The TCM work (CEP-21) is in its review stage but being well past our cut-off date¹ for merging, and now jeopardising 5.0 GA efforts, I would like to propose the following.

I think this presumes that 5.0 GA is date driven instead of feature driven. I'm sure there's a conversation elsewhere, but why isn't this date movable?

Re: Push TCM (CEP-21) and Accord (CEP-15) to 5.1 (and cut an immediate 5.1-alpha1)

2023-10-23 Thread Benedict
I’m cool with this.

We may have to think about numbering as I think TCM will break some backwards 
compatibility and we might technically expect the follow-up release to be 6.0

Maybe it’s not so bad to have such rapid releases either way.

> On 23 Oct 2023, at 12:52, Mick Semb Wever  wrote:
> 
> 
> 
> The TCM work (CEP-21) is in its review stage but being well past our cut-off 
> date¹ for merging, and now jeopardising 5.0 GA efforts, I would like to 
> propose the following.
> 
> We merge TCM and Accord only to trunk.  Then branch cassandra-5.1 and cut an 
> immediate 5.1-alpha1 release.
> 
> I see this as a win-win scenario for us, considering our current situation.  
> (Though it is unfortunate that Accord is included in this scenario because we 
> agreed it to be based upon TCM.)
> 
> This will mean…
>  - We get to focus on getting 5.0 to beta and GA, which already has a ton of 
> features users want.
>  - We get an alpha release with TCM and Accord into users hands quickly for 
> broader testing and feedback.
>  - We isolate GA efforts on TCM and Accord – giving oss and downstream 
> engineers time and patience reviewing and testing.  TCM will be the biggest 
> patch ever to land in C*.
>  - Give users a choice for a more incremental upgrade approach, given just 
> how many new features we're putting on them in one year.
>  - 5.1 w/ TCM and Accord will maintain its upgrade compatibility with all 4.x 
> versions, just as if it had landed in 5.0.
> 
> 
> The risks/costs this introduces are
>  - If we cannot stabilise TCM and/or Accord on the cassandra-5.1 branch, and 
> at some point decide to undo this work, while we can throw away the 
> cassandra-5.1 branch we would need to do a bit of work reverting the changes 
> in trunk.  This is a _very_ edge case, as confidence levels on the design and 
> implementation of both are already tested and high.
>  - We will have to maintain an additional branch.  I propose that we treat 
> the 5.1 branch in the same maintenance window as 5.0 (like we have with 3.0 
> and 3.11).  This also adds the merge path overhead.
>  - Reviewing of TCM and Accord will continue to happen post-merge.  This is 
> not our normal practice, but this work will have already received its two +1s 
> from committers, and such ongoing review effort is akin to GA stabilisation 
> work on release branches.
> 
> 
> I see no other ok solution in front of us that gets us at least both the 5.0 
> beta and TCM+Accord alpha releases this year.  Keeping in mind users demand 
> to start experimenting with these features, and our Cassandra Summit in 
> December.
> 
> 
> 1) https://lists.apache.org/thread/9c5cnn57c7oqw8wzo3zs0dkrm4f17lm3
> 
> 


Re: [DISCUSS] CommitLog default disk access mode

2023-10-16 Thread Benedict
I have some plans to (eventually) use the commit log as memtable payload 
storage (ie memtables would reference the commit log entries directly, storing 
only indexing info), and to back first level of sstables by reference to commit 
log entries. This will permit us to deliver not only much bigger memtables 
(cutting compaction throughput requirements by the ratio of size increase - so 
pretty dramatically), and faster flushing (so better behaviour during write 
bursts), but also a fairly cheap and simple way to support MVCC - which will be 
helpful for transaction throughput.

There is also a new commit log (“journal”) coming with Accord, that the rest of 
C* may or may not transition to.

I only say this because this makes the utility of direct IO for commit log 
suspect, as we will be reading from the files as a matter of course should this 
go ahead; and we may end up relying on a different commit log implementation 
before long anyway.

This is obviously a big suggestion and is not guaranteed to transpire, and 
probably won’t within the next year or so, but it should perhaps form some 
minimal part of any calculus. If the patch is otherwise simple and beneficial I 
don’t have anything against it, and the use of direct IO could well be of 
benefit eg in compaction - and also in future if we manage to bring page 
management in-process. So laying foundations there could be of benefit, even if 
the commit log eventually does not use it.

> On 16 Oct 2023, at 17:00, Jon Haddad  wrote:
> 
> I haven't looked at the patch, but at a high level, defaulting to direct I/O 
> for commit logs makes a lot of sense to me.  
> 
>> On 2023/10/16 06:34:05 "Pawar, Amit" wrote:
>> [Public]
>> 
>> Hi,
>> 
>> CommitLog uses mmap (memory mapped ) segments by default. Direct-IO feature 
>> is proposed through new PR[1] to improve the CommitLog IO speed. Enabling 
>> this by default could be useful feature to address IO bottleneck seen during 
>> peak load.
>> 
>> Need your input regarding changing this default. Please suggest.
>> 
>> https://issues.apache.org/jira/browse/CASSANDRA-18464
>> 
>> thanks,
>> Amit Pawar
>> 
>> [1] - https://github.com/apache/cassandra/pull/2777
>> 
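For context on the two access modes being compared in this thread, here is a minimal, self-contained sketch (not Cassandra's actual CommitLog code; the class name and length-prefixed record format are invented for illustration) of an mmap-style segment append, with `force()` standing in for a commit log sync. The proposed direct-IO mode would instead open the `FileChannel` with `com.sun.nio.file.ExtendedOpenOption.DIRECT` (JDK 10+), bypassing the page cache at the cost of requiring block-aligned buffers, offsets, and lengths.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative sketch of mmap-based segment IO (the current CommitLog default
// access mode); this is not Cassandra's implementation.
public class MmapSegmentDemo {
    static final int SEGMENT_SIZE = 1 << 20; // 1 MiB, arbitrary for the sketch

    // Append one length-prefixed record at `offset`; returns the next free offset.
    static int append(MappedByteBuffer segment, int offset, byte[] payload) {
        segment.position(offset);
        segment.putInt(payload.length); // length prefix
        segment.put(payload);           // record body
        segment.force();                // analogous to a commit log sync
        return segment.position();
    }

    // Read back the record written at `offset`.
    static byte[] read(MappedByteBuffer segment, int offset) {
        segment.position(offset);
        byte[] out = new byte[segment.getInt()];
        segment.get(out);
        return out;
    }

    public static String roundTrip(String msg) {
        try {
            Path file = Files.createTempFile("segment", ".log");
            try (FileChannel ch = FileChannel.open(file,
                    StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                MappedByteBuffer seg = ch.map(FileChannel.MapMode.READ_WRITE, 0, SEGMENT_SIZE);
                append(seg, 0, msg.getBytes(StandardCharsets.UTF_8));
                return new String(read(seg, 0), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("mutation-1")); // prints mutation-1
    }
}
```

Direct IO is deliberately not exercised here, since `O_DIRECT` fails on some filesystems (e.g. tmpfs); the mmap path runs anywhere.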


Re: [VOTE] Accept java-driver

2023-10-07 Thread Benedict
+1

On 7 Oct 2023, at 10:03, Mick Semb Wever wrote:

LEGAL-658

On Fri, 6 Oct 2023 at 17:43, J. D. Jordan <jeremiah.jor...@gmail.com> wrote:

The software grant agreement covers all donated code.  The ASF does not need any historical agreements. The agreement giving the ASF copyright etc is the Software Grant Agreement. Yes, any future work done after donation needs to be covered by ASF CLAs. But happy to see someone ask legal@ to confirm this so we can move forward.

On Oct 6, 2023, at 3:33 AM, Benedict <bened...@apache.org> wrote:

Are we certain about that? It’s unclear to me from the published guidance; would be nice to get legal to weigh in to confirm to make sure we aren’t skipping any steps, as we haven’t been involved until now so haven’t the visibility. At the very least it reads to me that anyone expected to be maintaining the software going forwards should have a CLA on file with ASF, but I’d have expected the ASF to also want a record of the historic CLAs.

On 6 Oct 2023, at 09:28, Mick Semb Wever <m...@apache.org> wrote:

On Thu, 5 Oct 2023 at 17:50, Jeremiah Jordan <jeremiah.jor...@gmail.com> wrote:

I think this is covered by the grant agreement?
https://www.apache.org/licenses/software-grant-template.pdf

2. Licensor represents that, to Licensor's knowledge, Licensor is
legally entitled to grant the above license. Licensor agrees to notify
the Foundation of any facts or circumstances of which Licensor becomes
aware and which makes or would make Licensor's representations in this
License Agreement inaccurate in any respect.



On Oct 5, 2023 at 4:35:08 AM, Benedict <bened...@apache.org> wrote:

Surely it needs to be shared with the foundation and the PMC so we can verify? Or at least have ASF legal confirm they have received and are satisfied with the tarball? It certainly can’t be kept private to DS, AFAICT. Of course it shouldn’t be shared publicly, but not sure how the PMC can fulfil its verification function here without it.

Correct, thanks JD. These are CLAs that were submitted to DS, not to ASF. It is DS's legal responsibility to ensure what they are donating they have the right to (i.e. have the copyright), when submitting the SGA.  It's not on the ASF or the PMC to verify this.  Here we're simply demonstrating that we (DS) have done that due diligence, and are keeping record of it.



Re: [VOTE] Accept java-driver

2023-10-06 Thread Benedict
Are we certain about that? It’s unclear to me from the published guidance; would be nice to get legal to weigh in to confirm to make sure we aren’t skipping any steps, as we haven’t been involved until now so haven’t the visibility. At the very least it reads to me that anyone expected to be maintaining the software going forwards should have a CLA on file with ASF, but I’d have expected the ASF to also want a record of the historic CLAs.

On 6 Oct 2023, at 09:28, Mick Semb Wever wrote:

On Thu, 5 Oct 2023 at 17:50, Jeremiah Jordan <jeremiah.jor...@gmail.com> wrote:

I think this is covered by the grant agreement?
https://www.apache.org/licenses/software-grant-template.pdf

2. Licensor represents that, to Licensor's knowledge, Licensor is
legally entitled to grant the above license. Licensor agrees to notify
the Foundation of any facts or circumstances of which Licensor becomes
aware and which makes or would make Licensor's representations in this
License Agreement inaccurate in any respect.



On Oct 5, 2023 at 4:35:08 AM, Benedict <bened...@apache.org> wrote:

Surely it needs to be shared with the foundation and the PMC so we can verify? Or at least have ASF legal confirm they have received and are satisfied with the tarball? It certainly can’t be kept private to DS, AFAICT. Of course it shouldn’t be shared publicly, but not sure how the PMC can fulfil its verification function here without it.

Correct, thanks JD. These are CLAs that were submitted to DS, not to ASF. It is DS's legal responsibility to ensure what they are donating they have the right to (i.e. have the copyright), when submitting the SGA.  It's not on the ASF or the PMC to verify this.  Here we're simply demonstrating that we (DS) have done that due diligence, and are keeping record of it.


Re: [VOTE] Accept java-driver

2023-10-05 Thread Benedict
Surely it needs to be shared with the foundation and the PMC so we can verify? Or at least have ASF legal confirm they have received and are satisfied with the tarball? It certainly can’t be kept private to DS, AFAICT. Of course it shouldn’t be shared publicly, but not sure how the PMC can fulfil its verification function here without it.

On 5 Oct 2023, at 10:23, Mick Semb Wever wrote:

On Tue, 3 Oct 2023 at 13:25, Josh McKenzie wrote:

> I see now this will likely be instead apache/cassandra-java-driver

I was wondering about that. apache/java-driver seemed pretty broad. :)

> From the linked page: Check that all active committers have a signed CLA on record. TODO – attach list

I've been part of these discussions and work so am familiar with the status of it (as well as guidance and clearance from the foundation re: folks we couldn't reach) - but might be worthwhile to link to the sheet or perhaps instead provide a summary of the 49 java contributors, their CLA signing status, attempts to reach out, etc for other PMC members that weren't actively involved back when we were working through it.

We have a spreadsheet with this information, and the tarball of all the signed CLAs. The tarball we should keep private to DS, but know that we have it for governance's sake. I've attached the spreadsheet to the CEP confluence page.


Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2023-09-26 Thread Benedict
I agree with Ariel, the more suitable insertion point is probably the JDK level 
FileSystemProvider and FileSystem abstraction.

It might also be that we can reuse existing work here in some cases?

> On 26 Sep 2023, at 17:49, Ariel Weisberg  wrote:
> 
> 
> Hi,
> 
> Support for multiple storage backends including remote storage backends is a 
> pretty high value piece of functionality. I am happy to see there is interest 
> in that.
> 
> I think that `ChannelProxyFactory` as an integration point is going to 
> quickly turn into a dead end as we get into really using multiple storage 
> backends. We need to be able to list files and really the full range of 
> filesystem interactions that Java supports should work with any backend to 
> make development, testing, and using existing code straightforward.
> 
> It's a little more work to get C* to creates paths for alternate backends 
> where appropriate, but that works is probably necessary even with 
> `ChanelProxyFactory` and munging UNIX paths (vs supporting multiple 
> Fileystems). There will probably also be backend specific behaviors that show 
> up above the `ChannelProxy` layer that will depend on the backend.
> 
> Ideally there would be some config to specify several backend filesystems and 
> their individual configuration that can be used, as well as configuration and 
> support for a "backend file router" for file creation (and opening) that can 
> be used to route files to the backend most appropriate.
> 
> Regards,
> Ariel
> 
>> On Mon, Sep 25, 2023, at 2:48 AM, Claude Warren, Jr via dev wrote:
>> I have just filed CEP-36 [1] to allow for keyspace/table storage outside of 
>> the standard storage space.  
>> 
>> There are two desires  driving this change:
>> The ability to temporarily move some keyspaces/tables to storage outside the 
>> normal directory tree to other disk so that compaction can occur in 
>> situations where there is not enough disk space for compaction and the 
>> processing to the moved data can not be suspended.
>> The ability to store infrequently used data on slower cheaper storage layers.
>> I have a working POC implementation [2] though there are some issues still 
>> to be solved and much logging to be reduced.
>> 
>> I look forward to productive discussions,
>> Claude
>> 
>> [1] 
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-36%3A+A+Configurable+ChannelProxy+to+alias+external+storage+locations
>> [2] https://github.com/Claudenw/cassandra/tree/channel_proxy_factory 
>> 
>> 
> 
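Ariel's suggested integration point can be seen concretely with a backend the JDK already ships: the zip filesystem provider. The sketch below is purely illustrative (the `/data` directory and sstable-style file name are invented, and a real remote backend would supply its own `FileSystemProvider`), but it shows the property that matters for this CEP: code written against `java.nio.file.Files` runs unchanged when handed a `Path` from a different `FileSystem`, which a `ChannelProxy`-only seam would not give us.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Sketch of the FileSystem-level integration point: the same Files calls work
// against any backend. The JDK's bundled zip filesystem stands in for a
// hypothetical remote-storage backend here.
public class FileSystemBackendDemo {

    // Write a file and list its directory through whatever backend we were handed.
    static List<String> writeAndList(FileSystem backend) throws IOException {
        Path dir = backend.getPath("/data");
        Files.createDirectories(dir);
        Files.write(dir.resolve("nb-1-big-Data.db"),
                "rows".getBytes(StandardCharsets.UTF_8));
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.map(p -> p.getFileName().toString())
                          .collect(Collectors.toList());
        }
    }

    public static List<String> demo() {
        try {
            Path zip = Files.createTempFile("backend", ".zip");
            Files.delete(zip); // let zipfs create it fresh via "create"=true
            URI uri = URI.create("jar:" + zip.toUri());
            try (FileSystem zipFs = FileSystems.newFileSystem(uri, Map.of("create", "true"))) {
                return writeAndList(zipFs);
            } finally {
                Files.deleteIfExists(zip);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [nb-1-big-Data.db]
    }
}
```

The design point is that `writeAndList` has no idea which provider backs the `Path`; the default filesystem, a zip archive, or an object-store provider would all satisfy it.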


Re: [DISCUSS] Add JVector as a dependency for CEP-30

2023-09-22 Thread Benedict
> my understanding is that with the former the liability rests on the provider of the lib to ensure it's in compliance with their claims to copyright

I highly doubt liability works like that in all jurisdictions, even if it might in some. I can even think of some historic cases related to Linux where patent trolls went after users of Linux, though I’m not sure where that got to and I don’t remember all the details.

But anyway, none of us are lawyers and we shouldn’t be depending on this kind of analysis. At minimum we should invite legal to proffer an opinion on whether dependencies are a valid loophole to the policy.

On 22 Sep 2023, at 13:48, J. D. Jordan wrote:

This Gen AI generated code use thread should probably be its own mailing list DISCUSS thread?  It applies to all source code we take in, and accept copyright assignment of, not to jars we depend on and not only to vector related code contributions.

On Sep 22, 2023, at 7:29 AM, Josh McKenzie wrote:

So if we're going to chat about GenAI on this thread here, 2 things:

- A dependency we pull in != a code contribution (I am not a lawyer but my understanding is that with the former the liability rests on the provider of the lib to ensure it's in compliance with their claims to copyright and it's not sticky). Easier to transition to a different dep if there's something API compatible or similar.
- With code contributions we take in, we take on some exposure in terms of copyright and infringement. git revert can be painful.

For this thread, here's an excerpt from the ASF policy:

> a recommended practice when using generative AI tooling is to use tools with features that identify any included content that is similar to parts of the tool’s training data, as well as the license of that content.
>
> Given the above, code generated in whole or in part using AI can be contributed if the contributor ensures that:
>
> 1. The terms and conditions of the generative AI tool do not place any restrictions on use of the output that would be inconsistent with the Open Source Definition (e.g., ChatGPT’s terms are inconsistent).
> 2. At least one of the following conditions is met:
>    2.1. The output is not copyrightable subject matter (and would not be even if produced by a human)
>    2.2. No third party materials are included in the output
>    2.3. Any third party materials that are included in the output are being used with permission (e.g., under a compatible open source license) of the third party copyright holders and in compliance with the applicable license terms
> 3. A contributor obtains reasonable certainty that conditions 2.2 or 2.3 are met if the AI tool itself provides sufficient information about materials that may have been copied, or from code scanning results. E.g. AWS CodeWhisperer recently added a feature that provides notice and attribution.
>
> When providing contributions authored using generative AI tooling, a recommended practice is for contributors to indicate the tooling used to create the contribution. This should be included as a token in the source control commit message, for example including the phrase “Generated-by”

I think the real challenge right now is ensuring that the output from an LLM doesn't include a string of tokens that's identical to something in its input training dataset if it's trained on non-permissively licensed inputs. That plus the risk of, at least in the US, the courts landing on the side of saying that not only is the output of generative AI not copyrightable, but that there's legal liability on either the users of the tools or the creators of the models for some kind of copyright infringement. That can be sticky; if we take PR's that end up with that liability exposure, we end up in a place where either the foundation could be legally exposed and/or we'd need to revert some pretty invasive code / changes.

For example, Microsoft and OpenAI have publicly committed to paying legal fees for people sued for copyright infringement for using their tools: https://www.verdict.co.uk/microsoft-to-pay-legal-fees-for-customers-sued-while-using-its-ai-products/?cf-view. Pretty interesting, and not a step a provider would take in an environment where things were legally clear and settled.

So while the usage of these things is apparently incredibly pervasive right now, "everybody is doing it" is a pretty high risk legal defense. :)

On Fri, Sep 22, 2023, at 8:04 AM, Mick Semb Wever wrote:

On Thu, 21 Sept 2023 at 10:41, Benedict <bened...@apache.org> wrote:

At some point we have to discuss this, and here’s as good a place as any. There’s a great news article published talking about how generative AI was used to assist in developing the new vector search feature, which is itself really cool. Unfortunately it *sounds* like it runs afoul of the ASF legal policy on use for contributions to the project. This proposal is to include a dependency, but I’m not sure if that avoids the issue, and I’m equally uncertain how much this issue is isolated to the dependency (or affects it at all?)

Anyway, thi

Re: [DISCUSS] Add JVector as a dependency for CEP-30

2023-09-22 Thread Benedict
My reading is quite different; in fact it is quite explicit that e.g. ChatGPT is forbidden from use, whereas AWS CodeWhisperer may be permitted depending on the attribution.

I assume you are reading clause 2.1, but this requires that work "would not be [copyrightable] even if produced by a human”, which is clearly not the case for most code. I suspect most generated code is forbidden in practice.

Either way, the portions of any contribution produced by the code assistant must be included in a separate commit with the tooling used clearly marked in the commit, including any source attribution. This is likely a challenging task to undertake retrospectively, and we may need advice on how to proceed unless there is an audit trail of some kind that can be followed to ensure this is done accurately - particularly since multiple generative code tools appear to have been used in the production of this work.

As I said, an annoying topic.

On 22 Sep 2023, at 13:06, Mick Semb Wever wrote:

On Thu, 21 Sept 2023 at 10:41, Benedict <bened...@apache.org> wrote:

At some point we have to discuss this, and here’s as good a place as any. There’s a great news article published talking about how generative AI was used to assist in developing the new vector search feature, which is itself really cool. Unfortunately it *sounds* like it runs afoul of the ASF legal policy on use for contributions to the project. This proposal is to include a dependency, but I’m not sure if that avoids the issue, and I’m equally uncertain how much this issue is isolated to the dependency (or affects it at all?)

Anyway, this is an annoying discussion we need to have at some point, so raising it here now so we can figure it out.

[1] https://thenewstack.io/how-ai-helped-us-add-vector-search-to-cassandra-in-6-weeks/
[2] https://www.apache.org/legal/generative-tooling.html

My reading of the ASF's GenAI policy is that any generated work in the jvector library (and cep-30 ?) are not copyrightable, and that makes them ok for us to include. If there was a trace to copyrighted work, or the tooling imposed a copyright or restrictions, we would then have to take considerations.


Re: [DISCUSS] Add JVector as a dependency for CEP-30

2023-09-21 Thread Benedict
At some point we have to discuss this, and here’s as good a place as any. There’s a great news article published talking about how generative AI was used to assist in developing the new vector search feature, which is itself really cool. Unfortunately it *sounds* like it runs afoul of the ASF legal policy on use for contributions to the project. This proposal is to include a dependency, but I’m not sure if that avoids the issue, and I’m equally uncertain how much this issue is isolated to the dependency (or affects it at all?)

Anyway, this is an annoying discussion we need to have at some point, so raising it here now so we can figure it out.

[1] https://thenewstack.io/how-ai-helped-us-add-vector-search-to-cassandra-in-6-weeks/
[2] https://www.apache.org/legal/generative-tooling.html

On 21 Sep 2023, at 09:04, Mick Semb Wever wrote:

On Wed, 20 Sept 2023 at 18:31, Mike Adamson wrote:

The original patch for CEP-30 brought several modified Lucene classes in-tree to implement the concurrent HNSW graph used by the vector index. These classes are now being replaced with the io.github.jbellis.jvector library, which contains an improved diskANN implementation for the on-disk graph format. The repo for this library is here: https://github.com/jbellis/jvector. The library does not replace any code used by SAI or other parts of the codebase and is used solely by the vector index. I would welcome any feedback on this change.

+1

but to nit-pick on legalities… it would be nice to avoid including a library copyrighted to DataStax (for historical reasons). The Jamm library is in a similar state in that it has a license that refers to the copyright owner but does not state the copyright owner anywhere. Can we get a copyright on Jamm, and can both not be Datastax (pls)?


Re: [DISCUSS] Vector type and empty value

2023-09-20 Thread Benedict
ne if empty is null or not.
>>>> 
>>>>> I also think that it would be good to standardize on one approach to 
>>>>> avoid confusion.
>>>> 
>>>> I agree, but also don’t feel it’s a perfect one-size-fits-all thing…. 
>>>> Let’s say I have a “blob” type and I write an empty byte… what does this 
>>>> mean?  What does it mean for "text" type?  The fact I get back a null in 
>>>> both those cases was very confusing to me… I do feel that some types 
>>>> should support empty, and the common code of empty == null I think is very 
>>>> brittle (blob/text was not correct in different places due to this...)… so 
>>>> I am cool with removing that relationship, but don’t think we should have 
>>>> a rule blocking empty for all current / future types as it some times does 
>>>> make sense.
>>>> 
>>>>> empty vector (I presume) for the vector type?
>>>> 
>>>> Empty vectors (vector[0]) are blocked at the type level, the smallest 
>>>> vector is vector[1]
>>>> 
>>>>> as types that can never be null
>>>> 
>>>> One pro here is that “null” is cheaper (in some regards) than delete 
>>>> (though we can never purge), but having 2 similar behaviors (write null, 
>>>> do a delete) at the type level is a bit confusing… Right now I am allowed 
>>>> to do the following (the below isn’t valid CQL, its a hybrid of CQL + Java 
>>>> code…)
>>>> 
>>>> CREATE TABLE fluffykittens (pk int primary key, cuteness int);
>>>> INSERT INTO fluffykittens (pk, cuteness) VALUES (0, new byte[0])
>>>> 
>>>> CREATE TABLE typesarehard (pk1 int, pk2 int, cuteness int, PRIMARY KEY 
>>>> ((pk1, pk2));
>>>> INSERT INTO typesarehard (pk1, pk2, cuteness) VALUES (new byte[0], new 
>>>> byte[0], new byte[0]) — valid as the partition key is not empty as its a 
>>>> composite of 2 empty values, this is the same as new byte[2]
>>>> 
>>>> The first time I ever found out that empty bytes was valid was when a user 
>>>> was trying to abuse this in collections (also the fact collections support 
>>>> null in some cases and not others is fun…)…. It was blowing up in random 
>>>> places… good times!
>>>> 
>>>> I am personally not in favor of allowing empty bytes (other than for blob 
>>>> / text as that is actually valid for the domain), but having similar types 
>>>> having different semantics I feel is more problematic...
>>>> 
>>>>> On Sep 19, 2023, at 8:56 AM, Josh McKenzie  wrote:
>>>>> 
>>>>>> I am strongly in favour of permitting the table definition forbidding 
>>>>>> nulls - and perhaps even defaulting to this behaviour. But I don’t think 
>>>>>> we should have types that are inherently incapable of being null.
>>>>> I'm with Benedict. Seems like this could help prevent whatever "nulls in 
>>>>> primary key columns" problems Aleksey was alluding to on those tickets 
>>>>> back in the day that pushed us towards making the new types non-emptiable 
>>>>> as well (i.e. primary keys are non-null in table definition).
>>>>> 
>>>>> Furthering Alex' question, having a default value for unset fields in any 
>>>>> non-collection context seems... quite surprising to me in a database. I 
>>>>> could see the argument for making container / collection types 
>>>>> non-nullable, maybe, but that just keeps us in a potential straddle case 
>>>>> (some types nullable, some not).
>>>>> 
>>>>> On Tue, Sep 19, 2023, at 8:22 AM, Benedict wrote:
>>>>>> 
>>>>>> If I understand this suggestion correctly it is a whole can of worms, as 
>>>>>> types that can never be null prevent us ever supporting outer joins that 
>>>>>> return these types.
>>>>>> 
>>>>>> I am strongly in favour of permitting the table definition forbidding 
>>>>>> nulls - and perhaps even defaulting to this behaviour. But I don’t think 
>>>>>> we should have types that are inherently incapable of being null. I also 
>>>>>> certainly don’t think we should have bifurcated our behaviour between 
>>>>>> types like this.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On 19 Sep 2023, at 11:54, Alex Petrov  wrote:
>>>>>>> 
>>>>>>> To make sure I understand this right; does that mean there will be a 
>>>>>>> default value for unset fields? Like 0 for numerical values, and an 
>>>>>>> empty vector (I presume) for the vector type?
>>>>>>> 
>>>>>>> On Fri, Sep 15, 2023, at 11:46 AM, Benjamin Lerer wrote:
>>>>>>>> Hi everybody,
>>>>>>>> 
>>>>>>>> I noticed that the new Vector type accepts empty ByteBuffer values as 
>>>>>>>> an input representing null.
>>>>>>>> When we introduced TINYINT and SMALLINT (CASSANDRA-895) we started 
>>>>>>>> making types non -emptiable. This approach makes more sense to me as 
>>>>>>>> having to deal with empty value is error prone in my opinion.
>>>>>>>> I also think that it would be good to standardize on one approach to 
>>>>>>>> avoid confusion.
>>>>>>>> 
>>>>>>>> Should we make the Vector type non-emptiable and stick to it for the 
>>>>>>>> new types?
>>>>>>>> 
>>>>>>>> I like to hear your opinion.
>>>> 
>>>> 
>> 
> 
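The distinction driving this thread, a zero-length written value versus no value written at all, is easy to show with plain ByteBuffers. A minimal illustration (not Cassandra code; the class and variable names are invented for the example):

```java
import java.nio.ByteBuffer;

public class EmptyVsNull {
    public static void main(String[] args) {
        ByteBuffer empty = ByteBuffer.allocate(0); // a written, zero-length value
        ByteBuffer absent = null;                  // no value written at all

        // Under "empty == null" semantics both cases collapse to null on read,
        // even though only one of them was actually written by the client.
        System.out.println(empty.remaining());     // 0: present but empty
        System.out.println(absent == null);        // true: genuinely absent
    }
}
```

Non-emptiable types sidestep the ambiguity by rejecting the zero-length encoding at validation time, so a read can only return a written value or a genuine absence.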



Re: [DISCUSS] Vector type and empty value

2023-09-19 Thread Benedict
If I understand this suggestion correctly it is a whole can of worms, as types 
that can never be null prevent us ever supporting outer joins that return these 
types.

I am strongly in favour of permitting the table definition forbidding nulls - 
and perhaps even defaulting to this behaviour. But I don’t think we should have 
types that are inherently incapable of being null. I also certainly don’t think 
we should have bifurcated our behaviour between types like this.


> On 19 Sep 2023, at 11:54, Alex Petrov  wrote:
> 
> 
> To make sure I understand this right; does that mean there will be a default 
> value for unset fields? Like 0 for numerical values, and an empty vector (I 
> presume) for the vector type?
> 
>> On Fri, Sep 15, 2023, at 11:46 AM, Benjamin Lerer wrote:
>> Hi everybody,
>> 
>> I noticed that the new Vector type accepts empty ByteBuffer values as an 
>> input representing null.
>> When we introduced TINYINT and SMALLINT (CASSANDRA-895) we started making 
>> types non -emptiable. This approach makes more sense to me as having to deal 
>> with empty value is error prone in my opinion.
>> I also think that it would be good to standardize on one approach to avoid 
>> confusion.
>> 
>> Should we make the Vector type non-emptiable and stick to it for the new 
>> types?
>> 
>> I like to hear your opinion.
> 


Re: [DISCUSS] Addition of smile-nlp test dependency for CEP-30

2023-09-13 Thread Benedict
There’s a distinction for spotbugs and other build related tools where they can be downloaded and used during the build so long as they’re not critical to the build process. They have to be downloaded dynamically in binary form I believe though, they cannot be included in the release. So it’s not really in conflict with what Jeff is saying, and my recollection accords with Jeff’s.

On 13 Sep 2023, at 17:42, Brandon Williams wrote:
> On Wed, Sep 13, 2023 at 11:37 AM Jeff Jirsa wrote:
>> You can open a legal JIRA to confirm, but based on my understanding (and re-confirming reading https://www.apache.org/legal/resolved.html#category-a ):
> We should probably get clarification here regardless, iirc this came up when we were considering SpotBugs too.


Re: [VOTE] Release Apache Cassandra 5.0-alpha1

2023-08-30 Thread Benedict
Yes, my understanding is that the number is not itself copyrightable. We simply 
attribute in the source code as a courtesy and for future readers.

If you are concerned we can loop in legal.

> On 30 Aug 2023, at 12:29, Mick Semb Wever  wrote:
> 
> 
>> 
>>> - It looks like there might be compiled code in the release? [1][2]
> 
> 
> Non issue. Test resources.
> 
> 
>>> - LICENSE is missing some 3rd party code license information [5] This 
>>> contains code "Copyright DataStax, Inc." under ALv2, python-smhasher under 
>>> MIT, OrderedDict under MIT (copyright Raymond Hettinger) and code from 
>>> MagnetoDB under ALv2.
> 
> 
> CASSANDRA-18807
> 
> 
>>> - LICENSE has no mention of 3rd party CRC code in [10]
>>> - Note that any code under CC 4.0 is incompatible with the ALv2. [11]
> 
> 
> This comes down to using an int number from Philip Koopman's CRC work.
> `private static final int CRC24_POLY = 0x1974F0B;`
> 
> It was questioned whether a number can be copyrighted, in which case
> we would not be including third-party work here. The code comment
> explains this too.
> 
> Benedict?
> 
> 
>>> - LICENSE also doesn't mention this file [9]
> 
> 
> CASSANDRA-18807
> 
> 
>>> - In LICENSE LongTimSort.java incorrectly mentions two different copyright 
>>> owners
>>> - In LICENSE, AbstractGuavaIterator.java is incorrectly mentioned as 
>>> AbstractIterator.java
> 
> 
> CASSANDRA-18807
> 
> 
>>> - NOTICE seems OK but may also be missing some things due to missing 3rd 
>>> party code in LICENSE under ALv2
> 
> 
> No additions required.
> 
> 
>>> - Files are missing ASF headers [3][4][6][7][8] are these 3rd party files?
> 
> 
> Non issue. Doc files, or third-party files.
> Dockerfiles fixed in CASSANDRA-18807
> 
> 
> 
>> 1../test/data/serialization/3.0/utils.BloomFilter1000.bin
>> 2. ./test/data/serialization/4.0/utils.BloomFilter1000.bin
>> 3. ./doc/modules/cassandra/examples/BASH/*.sh
>> 4. ./pylib/Dockerfile.ubuntu.*
>> 5. ./lib/cassandra-driver-internal-only-3.25.0.zip
>> 6. ./lib/cassandra-driver-3.25.0/cassandra/murmur3.py
>> 7. ./lib/cassandra-driver-3.25.0/cassandra/io/asyncioreactor.py
>> 8 ./lib/cassandra-driver-3.25.0/cassandra/io/libevwrapper.c
>> 9. ./tools/fqltool/src/org/apache/cassandra/fqltool/commands/Dump.java
>> 10. ./src/java/org/apache/cassandra/net/Crc.java
>> 11. https://www.apache.org/legal/resolved.html#cc-by
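For background on the constant in question: a CRC polynomial such as 0x1974F0B is just the generator fed into an ordinary bit-twiddling loop, which is why only the number itself, and no code, is at issue. A generic MSB-first CRC-24 sketch (the initial value and bit-ordering conventions here are assumptions for illustration, not Cassandra's actual Crc.java implementation):

```java
public class Crc24Sketch {
    private static final int CRC24_POLY = 0x1974F0B; // the constant under discussion

    // Plain MSB-first CRC over a 24-bit register; init/final conventions assumed.
    static int crc24(byte[] data, int init) {
        int crc = init & 0xFFFFFF;
        for (byte b : data) {
            crc ^= (b & 0xFF) << 16;          // inject byte into the register's top bits
            for (int i = 0; i < 8; i++) {
                crc <<= 1;
                if ((crc & 0x1000000) != 0)   // bit 24 shifted out: reduce by the polynomial
                    crc ^= CRC24_POLY;
            }
        }
        return crc & 0xFFFFFF;
    }

    public static void main(String[] args) {
        System.out.printf("%06X%n", crc24("hello".getBytes(), 0));
    }
}
```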


Re: Tokenization and SAI query syntax

2023-08-07 Thread Benedict
Yep, this sounds like the potentially least bad approach for now. Sorry Caleb, I jumped in without properly reading the thread and assumed we were proposing changes to CQL.If it’s clear we’re dropping into a sub-language and providing a sub-query to it that’s SAI-specific, that gives us pretty broad leeway IMO.On 7 Aug 2023, at 22:27, Josh McKenzie  wrote:Been chatting a bit w/Caleb about this offline and poking around to better educate myself.using functions (ignoring the implementation complexity) at least removes ambiguity. This, plus using functions lets us kick the can down the road a bit in terms of landing on an integrated grammar we agree on. It seems to me there's a tension between:"SQL-like" (i.e. postgres-like)"Indexing and Search domain-specific-like" (i.e. lucene syntax which, as Benedict points out, doesn't really jell w/what we have in CQL at this point), and??? Some other YOLO CQL / C* specific thing where we go our own roadI don't think we're really going to know what our feature-set in terms of indexing is going to look like or the shape it's going to take for awhile, so backing ourselves into any of the 3 corners above right now feels very premature to me.So I'm coming around to the expr / method call approach to preserve that flexibility. It's maximally explicit and preserves optionality at the expense of being clunky. For now.On Mon, Aug 7, 2023, at 4:00 PM, Caleb Rackliffe wrote:> I do not think we should start using lucene syntax for it, it will make people think they can do everything else lucene allows.I'm sure we won't be supporting everything Lucene allows, but this is going to evolve. Right off the bat, if you introduce support for tokenization and filtering, someone is, for example, going to ask for phrase queries. ("John Smith landed in Virginia" is tokenized, but someone wants to match exactly on "John Smith".) The whole point of the Vector project is to do relevance, right? Are we going to do term boosting? 
Do we need queries like "field: quick brown +fox -news" where fox must be present, news cannot be present, and quick and brown increase relevance?SASI uses "=" and "LIKE" in a way that assumes the user understands the tokenization scheme in use on the target field. I understand that's a bit ambiguous.If we object to allowing expr embedding of a subset of the Lucene syntax, I can't imagine we're okay w/ then jamming a subset of that syntax into the main CQL grammar.If we want to do this in non-expr CQL space, I think using functions (ignoring the implementation complexity) at least removes ambiguity. "token_match", "phrase_match", "token_like", "=", and "LIKE" would all be pretty clear, although there may be other problems. For instance, what happens when I try to use "token_match" on an indexed field whose analyzer does not tokenize? We obviously can't use the index, so we'd be reduced to requiring a filtering query, but maybe that's fine. My point is that, if we're going to make write and read analyzers symmetrical, there's really no way to make the semantics of our queries totally independent of analysis. (ex. "field : foo bar" behaves differently w/ read tokenization than it does without. It could even be an OR or AND query w/ tokenization, depending on our defaults.)On Mon, Aug 7, 2023 at 12:55 PM Atri Sharma <a...@apache.org> wrote:Why not start with SQLish operators supported by many databases (LIKE and CONTAINS)?On Mon, Aug 7, 2023 at 10:01 PM J. D. Jordan <jeremiah.jor...@gmail.com> wrote:I am also -1 on directly exposing lucene like syntax here. Besides being ugly, SAI is not lucene, I do not think we should start using lucene syntax for it, it will make people think they can do everything else lucene allows.On Aug 7, 2023, at 5:13 AM, Benedict <bened...@apache.org> wrote:I’m strongly opposed to : It is very dissimilar to our current operators. 
CQL is already not the prettiest language, but let’s not make it a total mish mash.On 7 Aug 2023, at 10:59, Mike Adamson <madam...@datastax.com> wrote:I am also in agreement with 'column : token' in that 'I don't hate it' but I'd like to offer an alternative to this in 'column HAS token'. HAS is currently not a keyword that we use so wouldn't cause any brain conflicts.While I don't hate ':' I have a particular dislike of the lucene search syntax because of its terseness and lack of easy readability. Saying that, I'm happy to do with ':' if that is the decision. On Fri, 4 Aug 2023 at 00:23, Jon Haddad <rustyrazorbl...@apache.org> wrote:Assuming SAI is a superset of SASI, and we were to set up something so that SASI indexes auto convert to SAI, this gives even more weight to my point regarding how differing behavior for the same syntax can lead to issues.  Imo the best case scenario results in the user not even noticing their indexes have change

Re: Tokenization and SAI query syntax

2023-08-07 Thread Benedict
om/v3/__https://www.elastic.co/guide/en/elasticsearch/reference/8.9/query-dsl-query-string-query.html*query-string-syntax__;Iw!!PbtH5S7Ebw!ZHwYJ2xkivwTzYgjkp5QFAzALXCWPqkga6GBD-m2aK3j06ioSCRPsdZD0CIe50VpRrtW-1rY_m6lrSpp7zVlAXEPP1sK$
> > >>>
> > >>> One idea is to take the basic Lucene index, which it seems we already
> > have
> > >>> some support for, and feed it to "expr". This is nice for two reasons:
> > >>>
> > >>> 1.) People can just write Lucene queries if they already know how.
> > >>> 2.) No changes to the grammar.
> > >>>
> > >>> Lucene has distinct concepts of filtering and querying, and this is
> > kind of
> > >>> the latter. I'm not sure how, for example, we would want "expr" to
> > interact
> > >>> w/ filters on other column indexes in vanilla CQL space...
> > >>>
> > >>>
> > >>>> On Mon, Jul 24, 2023 at 9:37 AM Josh McKenzie <jmcken...@apache.org>
> > wrote:
> > >>>>
> > >>>> `column CONTAINS term`. Contains is used by both Java and Python for
> > >>>> substring searches, so at least some users will be surprised by
> > term-based
> > >>>> behavior.
> > >>>>
> > >>>> I wonder whether users are in their "programming language" headspace
> > or in
> > >>>> their "querying a database" headspace when interacting with CQL? i.e.
> > this
> > >>>> would only present confusion if we expected users to be thinking in
> > the
> > >>>> idioms of their respective programming languages. If they're thinking
> > in
> > >>>> terms of SQL, MATCHES would probably end up confusing them a bit
> > since it
> > >>>> doesn't match the general structure of the MATCH operator.
> > >>>>
> > >>>> That said, I also think CONTAINS loses something important that you
> > allude
> > >>>> to here Jonathan:
> > >>>>
> > >>>> with corresponding query-time tokenization and analysis.  This means
> > that
> > >>>> the query term is not always a substring of the original string!
> > Besides
> > >>>> obvious transformations like lowercasing, you have things like
> > >>>> PhoneticFilter available as well.
> > >>>>
> > >>>> So to me, neither MATCHES nor CONTAINS are particularly great
> > candidates.
> > >>>>
> > >>>> So +1 to the "I don't actually hate it" sentiment on:
> > >>>>
> > >>>> column : term`. Inspired by Lucene’s syntax
> > >>>>
> > >>>>
> > >>>>> On Mon, Jul 24, 2023, at 8:35 AM, Benedict wrote:
> > >>>>
> > >>>>
> > >>>> I have a strong preference not to use the name of an SQL operator,
> > since
> > >>>> it precludes us later providing the SQL standard operator to users.
> > >>>>
> > >>>> What about CONTAINS TOKEN term? Or CONTAINS TERM term?
> > >>>>
> > >>>>
> > >>>>> On 24 Jul 2023, at 13:34, Andrés de la Peña <adelap...@apache.org>
> > wrote:
> > >>>>
> > >>>> 
> > >>>> `column = term` is definitively problematic because it creates an
> > >>>> ambiguity when the queried column belongs to the primary key. For some
> > >>>> queries we wouldn't know whether the user wants a primary key query
> > using
> > >>>> regular equality or an index query using the analyzer.
> > >>>>
> > >>>> `term_matches(column, term)` seems quite clear and hard to
> > misinterpret,
> > >>>> but it's quite long to write and its implementation will be
> > challenging
> > >>>> since we would need a bunch of special casing around SelectStatement
> > and
> > >>>> functions.
> > >>>>
> > >>>> LIKE, MATCHES and CONTAINS could be a bit misleading since they seem
> > to
> > >>>> evoke different behaviours to what they would have.
> > >>>>
> > >>>> `column LIKE :term:` seems a bit redundant compared to just using
> > `column
> > >>>> : term`, and we are still introducing a new symbol.
> > >>>>

Re: Tokenization and SAI query syntax

2023-07-24 Thread Benedict
I have a strong preference not to use the name of an SQL operator, since it precludes us later providing the SQL standard operator to users.What about CONTAINS TOKEN term? Or CONTAINS TERM term?On 24 Jul 2023, at 13:34, Andrés de la Peña  wrote:`column = term` is definitively problematic because it creates an ambiguity when the queried column belongs to the primary key. For some queries we wouldn't know whether the user wants a primary key query using regular equality or an index query using the analyzer.`term_matches(column, term)` seems quite clear and hard to misinterpret, but it's quite long to write and its implementation will be challenging since we would need a bunch of special casing around SelectStatement and functions.LIKE, MATCHES and CONTAINS could be a bit misleading since they seem to evoke different behaviours to what they would have.`column LIKE :term:` seems a bit redundant compared to just using `column : term`, and we are still introducing a new symbol.I think I like `column : term` the most, because it's brief, it's similar to the equivalent Lucene's syntax, and it doesn't seem to clash with other different meanings that I can think of.On Mon, 24 Jul 2023 at 13:13, Jonathan Ellis  wrote:Hi all,With phase 1 of SAI wrapping up, I’d like to start the ball rolling on aligning around phase 2 features.In particular, we need to nail down the syntax for doing non-exact string matches.  We have a proof of concept that includes full Lucene analyzer and filter functionality – just the text transformation pieces, none of the storage parts – which is the gold standard in this space.  For example, the StandardAnalyzer [1] lowercases all terms and removes stopwords (common words like “a”, “is”, “the” that are usually not useful to search against).  
Lucene also has classes that offer stemming, special case handling for email, and many languages besides English [2].What syntax should we use to express “rows whose analyzed tokens match this search term?”The syntax must be clear that we want to look for this term within the column data using the configured index with corresponding query-time tokenization and analysis.  This means that the query term is not always a substring of the original string!  Besides obvious transformations like lowercasing, you have things like PhoneticFilter available as well.Here are my thoughts on some of the options:`column = term`.  This is what the POC does today and it’s super confusing to overload = to mean something other than exact equality.  I am not a fan.`column LIKE term` or `column LIKE %term%`. The closest SQL operator, but neither the wildcarded nor unwildcarded syntax matches the semantics of term-based search.`column MATCHES term`. I rather like this one, although Mike points out that “match” has a meaning in the context of regular expressions that could cause confusion here.`column CONTAINS term`. Contains is used by both Java and Python for substring searches, so at least some users will be surprised by term-based behavior.`term_matches(column, term)`. Postgresql FTS makes you use functions like this for everything.  It’s pretty clunky, and we would need to make the amazingly hairy SelectStatement even hairier to handle “use a function result in a predicate” like this.`column : term`. Inspired by Lucene’s syntax.  I don’t actually hate it.`column LIKE :term:`. Stick with the LIKE operator but add a new symbol to indicate term matching.  Arguably more SQL-ish than a new bare symbol operator.[1] https://lucene.apache.org/core/9_7_0/core/org/apache/lucene/analysis/standard/StandardAnalyzer.html[2] https://lucene.apache.org/core/9_7_0/analysis/common/index.html-- Jonathan Ellisco-founder, http://www.datastax.com@spyced
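The point above, that an analyzed query term need not be a substring of the stored text, can be seen with even a trivial stand-in for Lucene's StandardAnalyzer. This sketch only mimics lowercasing and stopword removal; it is illustrative, not Lucene's actual pipeline:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class AnalyzerSketch {
    static final Set<String> STOPWORDS = Set.of("a", "is", "the");

    // Lowercase, split on non-word characters, drop stopwords.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("\\W+"))
            if (!t.isEmpty() && !STOPWORDS.contains(t))
                tokens.add(t);
        return tokens;
    }

    public static void main(String[] args) {
        String stored = "The Quick Fox";
        // The analyzed term "quick" matches, yet it is not a substring of the
        // stored string (which contains the capitalised "Quick").
        System.out.println(analyze(stored));           // [quick, fox]
        System.out.println(stored.contains("quick"));  // false
    }
}
```

With a phonetic or stemming filter in place of lowercasing, the gap between query term and stored bytes only grows, which is why operators implying substring semantics mislead here.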



Re: Bloom filter calculation

2023-07-11 Thread Benedict
I’m not sure I follow your reasoning. The bloom filter table is false positive 
per sstable given the number of bits *per key*. So for 10 keys you would have 
200 bits, which yields the same false positive rate as 20 bits and 1 key.

It does taper slightly at much larger N, but it’s pretty nominal for practical 
purposes.

I don’t understand what you mean by merging multiple filters together. We do 
lookup multiple bloom filters per query, but only one per sstable, and the 
false positive rate you’re calculating for 10 such lookups would not be 
accurate. This would be 1-(1-0.671)^10 which is still only around a 4%, not 
100%. You seem to be looking at the false positive rate of a bloom filter of 20 
bits with 10 entries, which means only 2 bits per entry?

> On 11 Jul 2023, at 07:14, Claude Warren, Jr via dev 
>  wrote:
> 
> 
> Can someone explain to me how the Bloom filter table in 
> BloomFilterCalculations was derived and how it is supposed to work?  As I 
> read the table it seems to indicate that with 14 hashes and 20 bits you get a 
> fp of 6.71e-05.  But if you plug those numbers into the Bloom filter 
> calculator [1],  that is calculated only for 1 item being in the filter.  If 
> you merge multiple filters together the false positive rate goes up.  And as 
> [1] shows by 5 merges you are over 50% fp rate and by 10 you are at close to 
> 100% fp.  So I have to assume this analysis is wrong.  Can someone point me 
> to the correct calculations?
> 
> Claude
> 
> [1] https://hur.st/bloomfilter/?n==6.71e-05=20=14
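For reference, the figures being discussed follow from the standard Bloom filter approximation p ≈ (1 − e^(−k/b))^k, where b is bits per key and k the number of hashes; since one filter is consulted per sstable, independent lookups compound only roughly linearly for small p. A sketch of the arithmetic (not Cassandra code):

```java
public class BloomFilterMath {
    // Approximate false positive rate for k hash functions and
    // bitsPerKey bits per element: p = (1 - e^(-k/bitsPerKey))^k
    static double falsePositiveRate(int k, double bitsPerKey) {
        return Math.pow(1.0 - Math.exp(-k / bitsPerKey), k);
    }

    public static void main(String[] args) {
        double p = falsePositiveRate(14, 20.0); // ~6.71e-05, matching the table
        // One filter per sstable: 10 lookups compound as 1 - (1 - p)^10,
        // roughly 10x the per-sstable rate, nowhere near 100%.
        double tenSstables = 1.0 - Math.pow(1.0 - p, 10);
        System.out.printf("per-sstable %.2e, across 10 sstables %.2e%n", p, tenSstables);
    }
}
```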


Re: Improved DeletionTime serialization to reduce disk size

2023-07-03 Thread Benedict
I checked and I’m pretty sure we do, but it doesn’t apply any liveness optimisation. I had misunderstood the optimisation you proposed. Ideally we would encode any non-live timestamp with the delta offset, but since that’s a distinct optimisation perhaps that can be left to another patch.

Are we happy, though, that the two different deletion time serialisers can store different ranges of timestamp? Both are large ranges, but I am not 100% comfortable with them diverging.

On 3 Jul 2023, at 05:45, Berenguer Blasi wrote:
> I can look into it. I don't have a deep knowledge of the sstable format hence why I wanted to document it someday. But DeletionTime is being serialized in other places as well iirc and I doubt (finger in the air) we'll have that Epoch handy.
>
> On 29/6/23 17:22, Benedict wrote:
>> So I’m just taking a quick peek at SerializationHeader and we already have a method for reading and writing a deletion time with offsets from EncodingStats.
>>
>> So perhaps we simply have a bug where we are using DeletionTime.Serializer instead of SerializationHeader.writeLocalDeletionTime? It looks to me like this is already available at most (perhaps all) of the relevant call sites.
>>
>>> On 29 Jun 2023, at 15:53, Josh McKenzie wrote:
>>>
>>>> I would prefer we not plan on two distinct changes to this
>>> I agree with this sentiment, and
>>>
>>>> +1, if you have time for this approach and no other in this window.
>>> People are going to use 5.0 for awhile. Better to have an improvement in their hands for that duration than no improvement at all IMO. Justifies the cost of the double implementation and transitions to me.
>>>
>>>> On Tue, Jun 27, 2023, at 5:43 AM, Mick Semb Wever wrote:
>>>> Just for completeness the change is a handful loc. The rest is added tests and we'd loose the sstable format change opportunity window.
>>>>
>>>> +1, if you have time for this approach and no other in this window.
>>>>
>>>> (If you have time for the other, or someone else does, then the technically superior approach should win)

Re: Improved DeletionTime serialization to reduce disk size

2023-06-29 Thread Benedict
So I’m just taking a quick peek at SerializationHeader and we already have a 
method for reading and writing a deletion time with offsets from EncodingStats.

So perhaps we simply have a bug where we are using DeletionTime Serializer 
instead of SerializationHeader.writeLocalDeletionTime? It looks to me like this 
is already available at most (perhaps all) of the relevant call sites.

 

> On 29 Jun 2023, at 15:53, Josh McKenzie  wrote:
> 
> 
>> 
>> I would prefer we not plan on two distinct changes to this
> I agree with this sentiment, and
> 
>> +1, if you have time for this approach and no other in this window.
> People are going to use 5.0 for awhile. Better to have an improvement in 
> their hands for that duration than no improvement at all IMO. Justifies the 
> cost of the double implementation and transitions to me.
> 
>> On Tue, Jun 27, 2023, at 5:43 AM, Mick Semb Wever wrote:
>> Just for completeness the change is a handful loc. The rest is added tests 
>> and we'd loose the sstable format change opportunity window.
>> 
>> 
>> 
>> +1, if you have time for this approach and no other in this window.
>> 
>> (If you have time for the other, or someone else does, then the technically 
>> superior approach should win)
>> 
>> 
> 


Re: Improved DeletionTime serialization to reduce disk size

2023-06-26 Thread Benedict
I would prefer we not plan on two distinct changes to this, particularly when neither change is particularly more complex than the other. There is a modest cost to maintenance from changing this multiple times. But if others feel strongly otherwise I won’t stand in the way.

On 26 Jun 2023, at 05:49, Berenguer Blasi wrote:
> Thanks for the replies.
>
> I intend to javadoc the sstable format in detail someday and more improvements might come up then, along the vint encoding mentioned here. But unless sbdy volunteers to do that in 5.0, is anybody against I try to get the original proposal (1 byte flags for sentinel values) in?
>
> Regards
>
>>> Distant future people will not be happy about this, I can already tell you now.
>> Eh, they'll all be AI's anyway and will just rewrite the code in a background thread.
> LOL
>
> On 23/6/23 15:44, Josh McKenzie wrote:
>>> If we’re doing this, why don’t we delta encode a vint from some per-sstable minimum value? I’d expect that to commonly compress to a single byte or so.
>> +1 to this approach.
>>
>>> Distant future people will not be happy about this, I can already tell you now.
>> Eh, they'll all be AI's anyway and will just rewrite the code in a background thread.
>>
>> On Fri, Jun 23, 2023, at 9:02 AM, Berenguer Blasi wrote:
>>> It's a possibility. Though I haven't coded and benchmarked such an approach and I don't think I would have the time before the freeze to take advantage of the sstable format change opportunity.
>>>
>>> Still it's sthg that can be explored later. If we can shave a few extra % then that would always be great imo.
>>>
>>> On 23/6/23 13:57, Benedict wrote:
>>>> If we’re doing this, why don’t we delta encode a vint from some per-sstable minimum value? I’d expect that to commonly compress to a single byte or so.
>>>>
>>>>> On 23 Jun 2023, at 12:55, Aleksey Yeshchenko wrote:
>>>>>
>>>>> Distant future people will not be happy about this, I can already tell you now.
>>>>>
>>>>> Sounds like a reasonable improvement to me however.
>>>>>
>>>>>> On 23 Jun 2023, at 07:22, Berenguer Blasi wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> DeletionTime.markedForDeleteAt is a long useconds since Unix Epoch. But I noticed that with 7 bytes we can already encode ~2284 years. We can either shed the 8th byte, for reduced IO and disk, or can encode some sentinel values (such as LIVE) as flags there. That would mean reading and writing 1 byte instead of 12 (8 mfda long + 4 ldts int). Yes we already avoid serializing DeletionTime (DT) in sstables at _row_ level entirely but not at _partition_ level and it is also serialized at index, metadata, etc.
>>>>>>
>>>>>> So here's a POC: https://github.com/bereng/cassandra/commits/ldtdeser-trunk and some jmh (1) to evaluate the impact of the new alg (2). It's tested here against a 70% and a 30% LIVE DTs to see how we perform:
>>>>>>
>>>>>> [java] Benchmark (liveDTPcParam)  (sstableParam)  Mode  Cnt  Score   Error  Units
>>>>>> [java] DeletionTimeDeSerBench.testRawAlgReads   70PcLive  NC  avgt   15  0.331 ± 0.001  ns/op
>>>>>> [java] DeletionTimeDeSerBench.testRawAlgReads   70PcLive  OA  avgt   15  0.335 ± 0.004  ns/op
>>>>>> [java] DeletionTimeDeSerBench.testRawAlgReads   30PcLive  NC  avgt   15  0.334 ± 0.002  ns/op
>>>>>> [java] DeletionTimeDeSerBench.testRawAlgReads   30PcLive  OA  avgt   15  0.340 ± 0.008  ns/op
>>>>>> [java] DeletionTimeDeSerBench.testNewAlgWrites  70PcLive  NC  avgt   15  0.337 ± 0.006  ns/op
>>>>>> [java] DeletionTimeDeSerBench.testNewAlgWrites  70PcLive  OA  avgt   15  0.340 ± 0.004  ns/op
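The arithmetic in the quoted proposal holds up: 2^56 microseconds is roughly 2283 years, so the top byte of markedForDeleteAt is free for flag bits. A sketch of the single-byte-sentinel idea (the flag layout and LIVE marker value are assumptions based on the description above, not the POC's actual format):

```java
import java.nio.ByteBuffer;

public class DeletionTimeSketch {
    static final int FLAG_LIVE = 0x80;            // assumed flag bit for the LIVE sentinel
    static final long LIVE_MFDA = Long.MIN_VALUE; // assumed marker value for LIVE

    // LIVE collapses to 1 byte; otherwise 1 flag byte + 7-byte mfda + 4-byte ldts.
    static void serialize(ByteBuffer out, long markedForDeleteAt, int localDeletionTime) {
        if (markedForDeleteAt == LIVE_MFDA) {
            out.put((byte) FLAG_LIVE);            // 1 byte instead of 12
            return;
        }
        out.put((byte) 0);
        for (int shift = 48; shift >= 0; shift -= 8) // low 7 bytes only
            out.put((byte) (markedForDeleteAt >>> shift));
        out.putInt(localDeletionTime);
    }

    public static void main(String[] args) {
        // Headroom of a 7-byte microsecond timestamp, in years:
        double years = Math.pow(2, 56) / 1_000_000.0 / 86_400 / 365.25;
        System.out.printf("%.0f%n", years); // roughly the "~2284 years" from the thread
    }
}
```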

Re: Improved DeletionTime serialization to reduce disk size

2023-06-23 Thread Benedict
If we’re doing this, why don’t we delta encode a vint from some per-sstable 
minimum value? I’d expect that to commonly compress to a single byte or so.
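The delta-plus-vint suggestion amounts to the following shape. This is a hypothetical sketch using a LEB128-style unsigned vint; Cassandra's real vint codec (VIntCoding) differs in detail:

```java
import java.nio.ByteBuffer;

public class DeltaVIntSketch {
    // Write (value - sstableMin) as an unsigned base-128 vint, low groups first.
    static void write(ByteBuffer out, long value, long sstableMin) {
        long delta = value - sstableMin; // assumed non-negative within an sstable
        while ((delta & ~0x7FL) != 0) {
            out.put((byte) ((delta & 0x7F) | 0x80)); // high bit = more bytes follow
            delta >>>= 7;
        }
        out.put((byte) delta);
    }

    static long read(ByteBuffer in, long sstableMin) {
        long delta = 0;
        int shift = 0;
        byte b;
        do {
            b = in.get();
            delta |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return sstableMin + delta;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(16);
        long min = 1_687_000_000_000_000L;  // per-sstable minimum timestamp
        write(buf, min + 100, min);         // a timestamp close to the minimum...
        System.out.println(buf.position()); // ...encodes in a single byte
    }
}
```

Timestamps clustered near the per-sstable minimum, the common case for data written over a short window, encode in one or two bytes instead of eight.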

> On 23 Jun 2023, at 12:55, Aleksey Yeshchenko  wrote:
> 
> Distant future people will not be happy about this, I can already tell you 
> now.
> 
> Sounds like a reasonable improvement to me however.
> 
>> On 23 Jun 2023, at 07:22, Berenguer Blasi  wrote:
>> 
>> Hi all,
>> 
>> DeletionTime.markedForDeleteAt is a long useconds since Unix Epoch. But I 
>> noticed that with 7 bytes we can already encode ~2284 years. We can either 
>> shed the 8th byte, for reduced IO and disk, or can encode some sentinel 
>> values (such as LIVE) as flags there. That would mean reading and writing 1 
>> byte instead of 12 (8 mfda long + 4 ldts int). Yes we already avoid 
>> serializing DeletionTime (DT) in sstables at _row_ level entirely but not at 
>> _partition_ level and it is also serialized at index, metadata, etc.
>> 
>> So here's a POC: https://github.com/bereng/cassandra/commits/ldtdeser-trunk 
>> and some jmh (1) to evaluate the impact of the new alg (2). It's tested here 
>> against a 70% and a 30% LIVE DTs  to see how we perform:
>> 
>> [java] Benchmark                                (liveDTPcParam)  (sstableParam)  Mode  Cnt  Score   Error  Units
>> [java] DeletionTimeDeSerBench.testRawAlgReads   70PcLive         NC              avgt   15  0.331 ± 0.001  ns/op
>> [java] DeletionTimeDeSerBench.testRawAlgReads   70PcLive         OA              avgt   15  0.335 ± 0.004  ns/op
>> [java] DeletionTimeDeSerBench.testRawAlgReads   30PcLive         NC              avgt   15  0.334 ± 0.002  ns/op
>> [java] DeletionTimeDeSerBench.testRawAlgReads   30PcLive         OA              avgt   15  0.340 ± 0.008  ns/op
>> [java] DeletionTimeDeSerBench.testNewAlgWrites  70PcLive         NC              avgt   15  0.337 ± 0.006  ns/op
>> [java] DeletionTimeDeSerBench.testNewAlgWrites  70PcLive         OA              avgt   15  0.340 ± 0.004  ns/op
>> [java] DeletionTimeDeSerBench.testNewAlgWrites  30PcLive         NC              avgt   15  0.339 ± 0.004  ns/op
>> [java] DeletionTimeDeSerBench.testNewAlgWrites  30PcLive         OA              avgt   15  0.343 ± 0.016  ns/op
>> 
>> That was ByteBuffer backed to test the extra bit level operations impact. 
>> But what would be the impact of an end to end test against disk?
>> 
>> [java] Benchmark                                    (diskRAMParam)  (liveDTPcParam)  (sstableParam)  Mode  Cnt  Score          Error       Units
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT  RAM             70PcLive         NC              avgt   15   605236.515 ±  19929.058  ns/op
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT  RAM             70PcLive         OA              avgt   15   586477.039 ±   7384.632  ns/op
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT  RAM             30PcLive         NC              avgt   15   937580.311 ±  30669.647  ns/op
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT  RAM             30PcLive         OA              avgt   15   914097.770 ±   9865.070  ns/op
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT  Disk            70PcLive         NC              avgt   15  1314417.207 ±  37879.012  ns/op
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT  Disk            70PcLive         OA              avgt   15   805256.345 ±  15471.587  ns/op
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT  Disk            30PcLive         NC              avgt   15  1583239.011 ±  50104.245  ns/op
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT  Disk            30PcLive         OA              avgt   15  1439605.006 ±  64342.510  ns/op
>> [java] DeletionTimeDeSerBench.testE2ESerializeDT    RAM             70PcLive         NC              avgt   15   295711.217 ±   5432.507  ns/op
>> [java] DeletionTimeDeSerBench.testE2ESerializeDT    RAM             70PcLive         OA              avgt   15   305282.827 ±   1906.841  ns/op
>> [java] DeletionTimeDeSerBench.testE2ESerializeDT    RAM             30PcLive         NC              avgt   15   446029.899 ±   4038.938  ns/op
>> [java] DeletionTimeDeSerBench.testE2ESerializeDT    RAM             30PcLive         OA              avgt   15   479085.875 ±  10032.804  ns/op
>> [java] DeletionTimeDeSerBench.testE2ESerializeDT    Disk            70PcLive         NC              avgt   15  1789434.838 ± 206455.771  ns/op
>> [java] DeletionTimeDeSerBench.testE2ESerializeDT    Disk            70PcLive         OA              avgt   15   589752.861 ±  31676.265  ns/op
>> [java] DeletionTimeDeSerBench.testE2ESerializeDT    Disk            30PcLive         NC              avgt   15  1754862.122 ± 164903.051  ns/op
>> [java] DeletionTimeDeSerBench.testE2ESerializeDT    Disk            30PcLive         OA              avgt   15  1252162.253 ± 121626.818  ns/op
>> 
>> We can see big improvements when backed with the disk and little impact from 
>> the new alg.
>> 
>> Given we're already introducing a new sstable format (OA) in 5.0 I would 

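The single-byte idea discussed in this thread can be illustrated with a minimal sketch: spend one flag byte so the common LIVE case costs 1 byte instead of 12. Note this is not the layout of the linked POC branch, which instead packs sentinel flags into the spare eighth byte of markedForDeleteAt; the class and constant names here are hypothetical.

```java
import java.nio.ByteBuffer;

public class DeletionTimeSketch
{
    private static final byte LIVE = (byte) 0x01;
    private static final byte DELETED = (byte) 0x00;

    // LIVE: 1 flag byte only. Otherwise: 1 flag byte + 8-byte
    // markedForDeleteAt (microseconds since epoch) + 4-byte
    // localDeletionTime (seconds since epoch) = 13 bytes.
    public static void serialize(boolean isLive, long markedForDeleteAt,
                                 int localDeletionTime, ByteBuffer out)
    {
        if (isLive)
        {
            out.put(LIVE);
            return;
        }
        out.put(DELETED);
        out.putLong(markedForDeleteAt);
        out.putInt(localDeletionTime);
    }
}
```

Since most partition-level DeletionTimes in practice are LIVE (the 70PcLive case above), the savings compound across every partition index and metadata entry that serializes one.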
Re: [DISCUSS] Limiting query results by size (CASSANDRA-11745)

2023-06-12 Thread Benedict
I agree that this is more suitable as a paging option, and not as a CQL LIMIT option. If it were to be a CQL LIMIT option, though, then it should be accurate regarding the result set IMO; there shouldn't be any further results that could have been returned within the LIMIT.

On 12 Jun 2023, at 10:16, Benjamin Lerer wrote:

Thanks Jacek for raising that discussion. I do not have in mind a scenario where it could be useful to specify a LIMIT in bytes. The LIMIT clause is usually used when you know how many rows you wish to display or use. Unless somebody has a useful scenario in mind, I do not think that there is a need for that feature. Paging in bytes makes sense to me, as the paging mechanism is transparent to the user in most drivers. It is simply a way to optimize your memory usage from end to end. I do not like the approach of using both of them simultaneously, because if you request a page with a certain number of rows and do not get it, it is really confusing and can be a problem for some use cases. We have users keeping their session open and the page information to display pages of data.

Le lun. 12 juin 2023 à 09:08, Jacek Lewandowski a écrit :

Hi,

I was working on limiting query results by their size expressed in bytes, and some questions arose that I'd like to bring to the mailing list.

The semantics of queries (without aggregation): data limits are applied to the raw data returned from replicas. While this works fine for row-number limits, as the number of rows is not likely to change after post-processing, it is not as accurate for size-based limits, as the cell sizes may differ after post-processing (for example due to applying some transformation function, projection, or whatever). We can truncate the results after post-processing to stay within the user-provided limit in bytes, but if the result is smaller than the limit, we will not fetch more. In that case, the meaning of "limit" as an actual limit remains valid, though it would be misleading for the page size, because we will not fetch the maximum amount of data that does not exceed the page size.

Such a problem is much more visible for "group by" queries with aggregation. The paging and limiting mechanism is applied to the rows rather than the groups, as it has no information about how much memory a single group uses. For now, I've approximated a group's size as the size of the largest participating row.

The problem concerns the allowed interpretation of the size limit expressed in bytes: whether we want to use this mechanism to let users precisely control the size of the result set, or instead to limit the amount of memory used internally for the data and prevent problems (assuming restricting size and row number can be used simultaneously, in a way that we stop when we reach either of the specified limits).

https://issues.apache.org/jira/browse/CASSANDRA-11745

thanks,
- - -- --- -  -
Jacek Lewandowski
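The "stop when either limit is reached" interpretation at the end of the thread can be sketched as follows. This is a hypothetical helper, not the CASSANDRA-11745 implementation; as a design choice it always admits at least one row, so paging makes progress even when a single row exceeds the byte budget.

```java
import java.util.ArrayList;
import java.util.List;

public class DualLimitPager
{
    // Take rows until either the row-count limit or the byte budget is
    // exhausted, whichever comes first.
    public static List<byte[]> takePage(List<byte[]> rows, int rowLimit, long byteLimit)
    {
        List<byte[]> page = new ArrayList<>();
        long bytes = 0;
        for (byte[] row : rows)
        {
            if (page.size() == rowLimit)
                break; // row-count limit reached
            if (bytes + row.length > byteLimit && !page.isEmpty())
                break; // byte budget exhausted (but never return an empty page)
            page.add(row);
            bytes += row.length;
        }
        return page;
    }
}
```

The `!page.isEmpty()` guard is what makes the byte limit a memory safeguard rather than a hard result-set bound, which is exactly the distinction the thread is debating.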



Re: Agrona vs fastutil and fastutil-concurrent-wrapper

2023-05-25 Thread Benedict
Nope, my awareness of Agrona predates Branimir’s proposal, as does others’. Aleksey intended to propose its inclusion beforehand also. If all we’re getting is lock striping, do we really need a separate library?

On 25 May 2023, at 19:33, Jonathan Ellis wrote:

Let's not fall prey to status quo bias; nobody performed an exhaustive analysis of agrona in November. If Branimir had proposed fastutil at the time, that's what we'd be using today.

On Thu, May 25, 2023 at 10:50 AM Benedict <bened...@apache.org> wrote:

Given they provide no data or explanation, and that benchmarking is hard, I’m not inclined to give much weight to their analysis. Agrona was favoured in large part due to the perceived quality of the library. I’m not inclined to swap it out without proper evidence that fastutil is both materially faster in a manner we care about and of similar quality.

On 25 May 2023, at 16:43, Jonathan Ellis <jbel...@gmail.com> wrote:

Try it out and see; the only data point I have is that the company who has spent more effort here than anyone else I could find likes fastutil better.

On Thu, May 25, 2023 at 10:33 AM Dinesh Joshi <djo...@apache.org> wrote:

> On May 25, 2023, at 6:14 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
> 
> Any objections to adding the concurrent wrapper and switching out agrona for fastutil?

How does fastutil compare to agrona in terms of memory profile and runtime performance? How invasive would it be to switch?

-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced
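Benedict's question above, whether lock striping alone justifies a separate library, refers to the pattern sketched below: guard N map segments with N locks so writers hitting different stripes do not contend. This is a minimal illustrative sketch in plain Java, not the fastutil-concurrent-wrapper implementation; note it uses boxed `Long` values, which is exactly the overhead fastutil/Agrona primitive collections exist to avoid.

```java
import java.util.HashMap;
import java.util.Map;

public class StripedCounterMap
{
    private static final int STRIPES = 16; // power of two, so we can mask
    private final Object[] locks = new Object[STRIPES];
    private final Map<Long, Long>[] maps;

    @SuppressWarnings("unchecked")
    public StripedCounterMap()
    {
        maps = new Map[STRIPES];
        for (int i = 0; i < STRIPES; i++)
        {
            locks[i] = new Object();
            maps[i] = new HashMap<>();
        }
    }

    // Pick a stripe from the key's hash; keys on different stripes
    // never block each other.
    private int stripe(long key)
    {
        return Long.hashCode(key) & (STRIPES - 1);
    }

    public long increment(long key)
    {
        int s = stripe(key);
        synchronized (locks[s])
        {
            long next = maps[s].getOrDefault(key, 0L) + 1;
            maps[s].put(key, next);
            return next;
        }
    }
}
```

If the wrapper's concurrency story really is just this, it could be reproduced over any primitive map in a few dozen lines, which is the thrust of Benedict's question.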


Re: Agrona vs fastutil and fastutil-concurrent-wrapper

2023-05-25 Thread Benedict
I’m far less inclined to take that approach to fundamental libraries, where quality is far more important than presentation.

On 25 May 2023, at 17:29, David Capwell wrote:

Agrona isn’t going anywhere, due to the library being more than basic collections. Now, with regard to single-threaded collections… honestly I dislike Agrona, as I always fight to avoid boxing; carrot was far better in this regard…. I didn’t look at the fastutil versions to see if they are better here, but I do know I am personally not happy with Agrona primitive collections… I do believe the main motivator for this is that fastutil has a concurrent version of their collections, so you gain access to concurrent primitive collections; something we do not have today… Given the desire for concurrent primitive collections, I am cool with it.

> I’m not inclined to swap it out

When it came to random testing libraries, I believe the stance taken before was that we should allow multiple versions and the best one will win eventually… so I am cool having the same stance for primitive collections...

On May 25, 2023, at 8:50 AM, Benedict wrote:

Given they provide no data or explanation, and that benchmarking is hard, I’m not inclined to give much weight to their analysis. Agrona was favoured in large part due to the perceived quality of the library. I’m not inclined to swap it out without proper evidence that fastutil is both materially faster in a manner we care about and of similar quality.

On 25 May 2023, at 16:43, Jonathan Ellis wrote:

Try it out and see; the only data point I have is that the company who has spent more effort here than anyone else I could find likes fastutil better.

On Thu, May 25, 2023 at 10:33 AM Dinesh Joshi <djo...@apache.org> wrote:

> On May 25, 2023, at 6:14 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
> 
> Any objections to adding the concurrent wrapper and switching out agrona for fastutil?

How does fastutil compare to agrona in terms of memory profile and runtime performance? How invasive would it be to switch?

-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced


Re: Agrona vs fastutil and fastutil-concurrent-wrapper

2023-05-25 Thread Benedict
Given they provide no data or explanation, and that benchmarking is hard, I’m not inclined to give much weight to their analysis. Agrona was favoured in large part due to the perceived quality of the library. I’m not inclined to swap it out without proper evidence that fastutil is both materially faster in a manner we care about and of similar quality.

On 25 May 2023, at 16:43, Jonathan Ellis wrote:

Try it out and see; the only data point I have is that the company who has spent more effort here than anyone else I could find likes fastutil better.

On Thu, May 25, 2023 at 10:33 AM Dinesh Joshi wrote:

> On May 25, 2023, at 6:14 AM, Jonathan Ellis wrote:
> 
> Any objections to adding the concurrent wrapper and switching out agrona for fastutil?

How does fastutil compare to agrona in terms of memory profile and runtime performance? How invasive would it be to switch?

-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced


Re: [DISCUSS] Bring cassandra-harry in tree as a submodule

2023-05-25 Thread Benedict
I would really like us to split out utilities into a common project, personally. It would be nice to work with a shared palette, including for dtest-api, accord, Harry etc. I think it would help clean up the codebase a bit too, as we have some (minimal) tight coupling with utilities and the C* process. But I doubt we have the time for that anytime soon.

On 25 May 2023, at 05:04, Caleb Rackliffe wrote:

Isn’t the other reason Accord works well as a submodule that it has no dependencies on C* proper? Harry does at the moment, right? (Not that we couldn’t address that… just trying to think this through…)

On May 24, 2023, at 6:54 PM, Benedict wrote:

In this case Harry is a testing module - it’s not something we will develop in tandem with C* releases, and we will want improvements to be applied across all branches. So it seems a natural fit for submodules to me.

On 24 May 2023, at 21:09, Caleb Rackliffe wrote:

> Submodules do have their own overhead and edge cases, so I am mostly in favor of using for cases where the code must live outside of tree (such as jvm-dtest that lives out of tree as all branches need the same interfaces)

Agreed. Basically where I've ended up on this topic.

> We could go over some interesting examples such as testing 2i (SAI)

+100

On Wed, May 24, 2023 at 1:40 PM Alex Petrov <al...@coffeenco.de> wrote:

> I'm about to need to harry test for the paging across tombstone work for https://issues.apache.org/jira/browse/CASSANDRA-18424 (that's where my own overlapping fuzzing came in). In the process, I'll see if I can't distill something really simple along the lines of how React approaches it (https://react.dev/learn).

We can pick that up as an example, sure.

On Wed, May 24, 2023, at 4:53 PM, Josh McKenzie wrote:

> I have submitted a proposal to Cassandra Summit for a 4-hour Harry workshop,

I'm about to need to harry test for the paging across tombstone work for https://issues.apache.org/jira/browse/CASSANDRA-18424 (that's where my own overlapping fuzzing came in). In the process, I'll see if I can't distill something really simple along the lines of how React approaches it (https://react.dev/learn). Ideally we'd be able to get something together that's a high level "In the next 15 minutes, you will know and understand A-G and have access to N% of the power of harry" kind of offer. Honestly, there's a lot in our ecosystem where we could benefit from taking a page from their book in terms of onboarding and getting started IMO.

On Wed, May 24, 2023, at 10:31 AM, Alex Petrov wrote:

> I wonder if a mini-onboarding session would be good as a community session - go over Harry, how to run it, how to add a test? Would that be the right venue? I just would like to see how we can not only plug it in to regular CI but get everyone that wants to add a test be able to know how to get started with it.

I have submitted a proposal to Cassandra Summit for a 4-hour Harry workshop, but unfortunately it got declined. Goes without saying, we can still do it online, time and resources permitting. But again, I do not think it should be barring us from making Harry a part of the codebase, as it already is. In fact, we can be iterating on the development quicker having it in-tree. We could go over some interesting examples such as testing 2i (SAI), modelling Group By tests, or testing repair. If there is enough appetite and collaboration in the community, I will see if we can pull something like that together. Input on _what_ you would like to see / hear / tested is also appreciated. Harry was developed out of a strong need for large-scale testing, which also has informed many of its APIs, but we can make it easier to access for interactive testing / unit tests. We have been doing a lot of that with Transactional Metadata, too.

> I'll hold off on this until Alex Petrov chimes in. @Alex -> got any thoughts here?

Yes, sorry for not responding on this thread earlier. I cannot overstate how excited I am about this, and how important I think this is. Time constraints are somehow hard to overcome, but I hope the results brought by TCM will make it all worth it.

On Wed, May 24, 2023, at 4:23 PM, Alex Petrov wrote:

I think pulling Harry into the tree will make adoption easier for the folks. I have been a bit swamped with Transactional Metadata work, but I wanted to make some of the things we were using for testing TCM available outside of the TCM branch. This includes a bunch of helper methods to perform operations on the clusters, data generation, and more useful stuff. Of course, the question always remains about how much time I want to spend porting it all to Gossip, but I think we can find a reasonable compromise. I would not set this improvement as a prerequisite to pulling Harry into the main branch, but rather interpret it as a commitment from myself to take community input and make it more approachable by the day.

On Wed, May 24, 2023, at 2:44 PM, Josh McKenzie wrote:

importantly it’s a million times

Re: [DISCUSS] Bring cassandra-harry in tree as a submodule

2023-05-24 Thread Benedict
In this case Harry is a testing module - it’s not something we will develop in tandem with C* releases, and we will want improvements to be applied across all branches. So it seems a natural fit for submodules to me.

On 24 May 2023, at 21:09, Caleb Rackliffe wrote:

> Submodules do have their own overhead and edge cases, so I am mostly in favor of using for cases where the code must live outside of tree (such as jvm-dtest that lives out of tree as all branches need the same interfaces)

Agreed. Basically where I've ended up on this topic.

> We could go over some interesting examples such as testing 2i (SAI)

+100

On Wed, May 24, 2023 at 1:40 PM Alex Petrov <al...@coffeenco.de> wrote:

> I'm about to need to harry test for the paging across tombstone work for https://issues.apache.org/jira/browse/CASSANDRA-18424 (that's where my own overlapping fuzzing came in). In the process, I'll see if I can't distill something really simple along the lines of how React approaches it (https://react.dev/learn).

We can pick that up as an example, sure.

On Wed, May 24, 2023, at 4:53 PM, Josh McKenzie wrote:

> I have submitted a proposal to Cassandra Summit for a 4-hour Harry workshop,

I'm about to need to harry test for the paging across tombstone work for https://issues.apache.org/jira/browse/CASSANDRA-18424 (that's where my own overlapping fuzzing came in). In the process, I'll see if I can't distill something really simple along the lines of how React approaches it (https://react.dev/learn). Ideally we'd be able to get something together that's a high level "In the next 15 minutes, you will know and understand A-G and have access to N% of the power of harry" kind of offer. Honestly, there's a lot in our ecosystem where we could benefit from taking a page from their book in terms of onboarding and getting started IMO.

On Wed, May 24, 2023, at 10:31 AM, Alex Petrov wrote:

> I wonder if a mini-onboarding session would be good as a community session - go over Harry, how to run it, how to add a test? Would that be the right venue? I just would like to see how we can not only plug it in to regular CI but get everyone that wants to add a test be able to know how to get started with it.

I have submitted a proposal to Cassandra Summit for a 4-hour Harry workshop, but unfortunately it got declined. Goes without saying, we can still do it online, time and resources permitting. But again, I do not think it should be barring us from making Harry a part of the codebase, as it already is. In fact, we can be iterating on the development quicker having it in-tree. We could go over some interesting examples such as testing 2i (SAI), modelling Group By tests, or testing repair. If there is enough appetite and collaboration in the community, I will see if we can pull something like that together. Input on _what_ you would like to see / hear / tested is also appreciated. Harry was developed out of a strong need for large-scale testing, which also has informed many of its APIs, but we can make it easier to access for interactive testing / unit tests. We have been doing a lot of that with Transactional Metadata, too.

> I'll hold off on this until Alex Petrov chimes in. @Alex -> got any thoughts here?

Yes, sorry for not responding on this thread earlier. I cannot overstate how excited I am about this, and how important I think this is. Time constraints are somehow hard to overcome, but I hope the results brought by TCM will make it all worth it.

On Wed, May 24, 2023, at 4:23 PM, Alex Petrov wrote:

I think pulling Harry into the tree will make adoption easier for the folks. I have been a bit swamped with Transactional Metadata work, but I wanted to make some of the things we were using for testing TCM available outside of the TCM branch. This includes a bunch of helper methods to perform operations on the clusters, data generation, and more useful stuff. Of course, the question always remains about how much time I want to spend porting it all to Gossip, but I think we can find a reasonable compromise. I would not set this improvement as a prerequisite to pulling Harry into the main branch, but rather interpret it as a commitment from myself to take community input and make it more approachable by the day.

On Wed, May 24, 2023, at 2:44 PM, Josh McKenzie wrote:

> importantly it’s a million times better than the dtest-api process - which stymies development due to the friction.

This is my major concern. What prompted this thread was harry being external to the core codebase and the lack of adoption and usage of it having led to atrophy of certain aspects of it, which then led to redundant implementation of some fuzz testing and lost time. We'd all be better served to have this closer to the main codebase as a forcing function to smooth out the rough edges, integrate it, and make it a collective artifact and first class citizen IMO. I have similar opinions about the dtest-api.

On Wed, May 24, 2023, at 4:05 AM, Benedict wrote:

It’s not without

Re: [DISCUSS] Bring cassandra-harry in tree as a submodule

2023-05-24 Thread Benedict
It’s not without hiccups, and I’m sure we have more to learn. But it mostly just works, and importantly it’s a million times better than the dtest-api process - which stymies development due to the friction.

On 24 May 2023, at 08:39, Mick Semb Wever wrote:

WRT git submodules and CASSANDRA-18204, are we happy with how it is working for accord? The time spent on getting that running has been a fair few hours, where we could have cut many manual module releases in that time. David and folks working on accord?

On Tue, 23 May 2023 at 20:09, Josh McKenzie wrote:

I'll hold off on this until Alex Petrov chimes in. @Alex -> got any thoughts here?

On Tue, May 16, 2023, at 5:17 PM, Jeremy Hanna wrote:

I think it would be great to onboard Harry more officially into the project. However, it would be nice to perhaps do some sanity checking outside of Apple folks to see how approachable it is. That is, can someone take it and just run it with the current readme without any additional context? I wonder if a mini-onboarding session would be good as a community session - go over Harry, how to run it, how to add a test? Would that be the right venue? I just would like to see how we can not only plug it in to regular CI but get everyone that wants to add a test be able to know how to get started with it.

Jeremy

On May 16, 2023, at 1:34 PM, Abe Ratnofsky wrote:

Just to make sure I'm understanding the details, this would mean apache/cassandra-harry maintains its status as a separate repository, apache/cassandra references it as a submodule, and clones and builds Harry locally, rather than pulling a released JAR. We can then reference Harry as a library without maintaining public artifacts for it. Is that in line with what you're thinking?

> I'd also like to see us get a Harry run integrated as part of our pre-commit CI

I'm a strong supporter of this, of course.

On May 16, 2023, at 11:03 AM, Josh McKenzie wrote:

Similar to what we've done with accord in https://issues.apache.org/jira/browse/CASSANDRA-18204, I'd like to discuss bringing cassandra-harry in-tree as a submodule. Repo link: https://github.com/apache/cassandra-harry

Given the value it's brought to the project's stabilization efforts and the movement of other things in the ecosystem to being more integrated (accord, build-scripts https://issues.apache.org/jira/browse/CASSANDRA-18133), I think having the testing framework better localized and integrated would be a net benefit for adoption, awareness, maintenance, and tighter workflows as we troubleshoot future failures it surfaces. I'd also like to see us get a Harry run integrated as part of our pre-commit CI (a 5-minute simple soak test, for instance), and having that local in this fashion should make that a cleaner integration as well.

Thoughts?
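For reference, bringing a repository in as a submodule (as CASSANDRA-18204 did for accord) amounts to `git submodule add <url> <path>` followed by `git submodule update --init` on fresh clones, which records an entry like the following in `.gitmodules`. The `modules/harry` path here is hypothetical; only the repository URL comes from the thread above.

```ini
[submodule "modules/harry"]
	path = modules/harry
	url = https://github.com/apache/cassandra-harry.git
```

The superproject then pins a specific commit of the submodule, which is what lets all branches track the same interfaces without cutting manual releases.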


Re: [DISCUSS] Feature branch version hygiene

2023-05-18 Thread Benedict
The .x approach only breaks down for unreleased majors, for which all of our intuitions break down and we rehash it every year. My mental model, though, is that anything that’s not a concrete release number is a target version. Which is where 5.0 goes wrong - it’s not a release, so it should be a target, but for some reason we use it as a placeholder to park work arriving in 5.0.0. If we instead use 5.0.0 for this purpose, we just need to get 5.0-alpha1 labels added when those releases are cut. Then I propose we break the confusion in both directions by scrapping 5.0 entirely and introducing 5.0-target. So tickets go to 5.0-target if they target 5.0, and to 5.0.0 once they are resolved (with additional labels as necessary). Simples?

On 18 May 2023, at 15:21, Josh McKenzie wrote:

> My personal view is that 5.0 should not be used for any resolved tickets - they should go to 5.0-alpha1, since this is the correct release for them. 5.0 can then be the target version, which makes more sense given it isn’t a concrete release.

Well now you're just opening Pandora's box about our strange idioms with FixVersion usage. ;)

> every ticket targeting 5.0 could use fixVersion 5.0.x, since it is pretty clear what this means.

I think this diverges from our current paradigm where "5.x" == next feature release, "5.0.x" == next patch release (i.e. bugfix only). Not to say it's bad, just an adjustment... which if we're open to adjustment... I'm receptive to transitioning the discussion to that either on this thread or another; IMO we remain in a strange and convoluted place with our FixVersioning.

My understanding of our current practice:
- .x is used to denote target version. For example: 5.x, 5.0.x, 5.1.x, 4.1.x
- When a ticket is committed, the FixVersion is transitioned to resolve the x to the next unreleased version in which it'll release
- Weird Things are done to make this work for the release process and release manager on feature releases (alpha, beta, etc)
- There's no clear fit for feature branch tickets in the above schema

And if I take what I think you're proposing here and extrapolate it out:
- .0 is used to denote target version. For example: 5.0, 5.0.0, 5.1.0, 4.1.0
- This appears to break down for patch releases: we _do_ release .0 versions of them rather than alpha/beta/etc, so a ticket targeting 4.1.0 would initially mean 2 different things based on resolved vs. unresolved status (resolved == in release, unresolved == targeting next unreleased), and that distinction would disappear on resolution (i.e. resolved + 4.1.0 would no longer definitively mean "contained in .0 release")
- When a release is cut, we bulk update FixVersions ending in .0 to the release version in which they're contained (not clear how to disambiguate the things from the above bullet point)
- For feature releases, .0 will transition to -alpha1

One possible solution would be to just no longer release a .0 version of things and reserve .0 to indicate "parked". I don't particularly like that, but it's not the worst.

Another possible solution would be to just scrap this approach entirely and go with:
- FixVersion on unreleased _and still advocated for tickets_ always targets the next unreleased version. For other tickets where nobody is advocating for their work / inclusion, we either FixVersion "Backlog" or close as "Later"
- When a release is cut, roll all unresolved tickets w/ that FixVersion to the next unreleased FixVersion
- When we're gearing up to a release, we can do a broad pass on everything that's unreleased w/ the next feature release's FixVersion and move tickets that are desirable but not blockers to the next unreleased FixVersion (patch for bug, minor/major for improvements or new features)
- CEP tickets target the same FixVersion (i.e. next unreleased feature release) as their parents. When the parent epic gets a new FixVersion on resolution, all children get that FixVersion (i.e. when we merge the CEP and update its FixVersion, we bulk update all children tickets)

On Thu, May 18, 2023, at 9:08 AM, Benedict wrote:

I don’t think we should over complicate this with special CEP release targets. If we do, they shouldn’t be versioned. My personal view is that 5.0 should not be used for any resolved tickets - they should go to 5.0-alpha1, since this is the correct release for them. 5.0 can then be the target version, which makes more sense given it isn’t a concrete release. But, in lieu of that, every ticket targeting 5.0 could use fixVersion 5.0.x, since it is pretty clear what this means. Some tickets that don’t hit 5.0.0 can then be postponed to a later version, but it’s not like this is burdensome. Anything marked feature/improvement and 5.0.x gets bumped to 5.1.x.

On 18 May 2023, at 13:58, Josh McKenzie wrote:

CEP-N seems like a good compromise. NextMajorRelease bumps into our interchangeable use of "Major" and "Minor" from a semver perspective and could get confusing. Suppose we could do NextFeatureRelease, b

Re: [DISCUSS] Feature branch version hygiene

2023-05-18 Thread Benedict
So we just rename alpha1 to beta1 if that happens?

Or, we point resolved tickets straight to 5.0.0, and add 5.0-alpha1 to any 
tickets with *only* 5.0.0

This is probably the easiest for folk to understand when browsing.

Finding new features is easy either way - look for 5.0.0.

> On 18 May 2023, at 15:08, Mick Semb Wever  wrote:
> 
> 
> 
> 
>> So when a CEP slips, do we have to create a 5.1-cep-N? 
> 
> 
> No, you'd just rename it, easy to do in just one place.
> I really don't care, but the version would at least helps indicate what the 
> branch is getting rebased off.
> 
> 
>  
>> My personal view is that 5.0 should not be used for any resolved tickets - 
>> they should go to 5.0-alpha1, since this is the correct release for them. 
>> 5.0 can then be the target version, which makes more sense given it isn’t a 
>> concrete release.
> 
> 
> Each time, we don't know if the first release will be an alpha1 or if we're 
> confident enough to go straight to a beta1.
> A goal with stable trunk would make the latter possible.
> 
> And with the additional 5.0 label has been requested by a few to make it easy 
> to search for what's new, this has been the simplest way.
> 


Re: [DISCUSS] Feature branch version hygiene

2023-05-18 Thread Benedict
I don’t think we should overcomplicate this with special CEP release targets. If we do, they shouldn’t be versioned.

My personal view is that 5.0 should not be used for any resolved tickets - they should go to 5.0-alpha1, since this is the correct release for them. 5.0 can then be the target version, which makes more sense given it isn’t a concrete release.

But, in lieu of that, every ticket targeting 5.0 could use fixVersion 5.0.x, since it is pretty clear what this means. Some tickets that don’t hit 5.0.0 can then be postponed to a later version, but it’s not like this is burdensome. Anything marked feature/improvement and 5.0.x gets bumped to 5.1.x.

On 18 May 2023, at 13:58, Josh McKenzie wrote:

CEP-N seems like a good compromise. NextMajorRelease bumps into our interchangeable use of "Major" and "Minor" from a semver perspective and could get confusing. Suppose we could do NextFeatureRelease, but at that point why not just have it linked to the CEP and have the epic set.

On Thu, May 18, 2023, at 12:26 AM, Caleb Rackliffe wrote:

...otherwise I'm fine w/ just the CEP name, like "CEP-7" for SAI, etc.

On Wed, May 17, 2023 at 11:24 PM Caleb Rackliffe wrote:

So when a CEP slips, do we have to create a 5.1-cep-N? Could we just have a version that's "NextMajorRelease" or something like that? It should still be pretty easy to bulk replace if we have something else to filter on, like belonging to an epic?

On Wed, May 17, 2023 at 6:42 PM Mick Semb Wever wrote:

On Tue, 16 May 2023 at 13:02, J. D. Jordan wrote:

Process question/discussion. Should tickets that are merged to CEP feature branches, like https://issues.apache.org/jira/browse/CASSANDRA-18204, have a fixVersion of 5.0 on them after merging to the feature branch?

For the SAI CEP, which is also using the feature branch method, the "reviewed and merged to feature branch" tickets seem to be given a version of NA. Not sure that's the best “waiting for cep to merge” version either? But it seems better than putting 5.0 on them to me.

Why I’m not keen on 5.0 is because if we cut the release today those tickets would not be there. What do other people think? Is there a better version designation we can use?

On a different project I have in the past made a “version number” in JIRA for each long-running feature branch. Tickets merged to the feature branch got the epic ticket number as their version, and then it got updated to the “real” version when the feature branch was merged to trunk.

Thanks for raising the thread, I remember there was some confusion early wrt feature branches too.

To rehash: for everything currently resolved in trunk, 5.0 is the correct fixVersion. (And there should be no unresolved issues today with the 5.0 fixVersion; they should be 5.x.)

When alpha1 is cut, the 5.0-alpha1 fixVersion is created and everything with 5.0 also gets 5.0-alpha1. At the same time the 5.0-alpha2, 5.0-beta, 5.0-rc and 5.0.0 fixVersions are created. Here both 5.0-beta and 5.0-rc are blocking placeholder fixVersions: no resolved issues are left with these fixVersions, the same as the .x placeholder fixVersions. 5.0.0 is also used as a blocking version, though it is also an eventual fixVersion for resolved tickets. Also note, all tickets up to and including 5.0.0 will also have the 5.0 fixVersion.

A particular reason for doing things this way is to make it easy for the release manager to bulk-correct fixVersions, at release time or even later, i.e. without having to read the ticket, go talk to authors, or painstakingly crawl CHANGES.txt.

For feature branches my suggestion is that we create a fixVersion for each of them, e.g. 5.0-cep-15. Yup, that's your suggestion Jeremiah (I wrote this up on the plane before I got to read your post properly). (As you say) This then makes it easy to see where the code is (or what the patch is currently being based on). And when the feature branch is merged it is easy to bulk-replace it with trunk's fixVersion, e.g. 5.0-cep-15 with 5.0.

The NA fixVersion was introduced for the other repositories, e.g. website updates.
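The scheme Mick describes above can be summarised as a lifecycle sketch (the version names are the ones used in the thread; the arrows are illustrative, not a JIRA feature):

```
unresolved, targeting next major ──────────► fixVersion: 5.x (placeholder)
resolved in trunk before alpha1 ───────────► fixVersion: 5.0
alpha1 is cut ─────────────────────────────► everything with 5.0 also gets 5.0-alpha1;
                                             5.0-alpha2 / 5.0-beta / 5.0-rc / 5.0.0 created
                                             (5.0-beta, 5.0-rc, 5.0.0 act as blocking placeholders)
resolved in a feature branch ──────────────► fixVersion: 5.0-cep-15 (one per branch)
feature branch merges to trunk ────────────► bulk-replace 5.0-cep-15 with 5.0
```

The bulk-replace step is the point of the per-branch fixVersion: the release manager can correct versions for a whole branch at once without reading individual tickets or crawling CHANGES.txt.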

Re: [DISCUSS] Feature branch version hygiene

2023-05-16 Thread Benedict
Copying my reply on the ticket…

We have this discussion roughly once per major. If you look back through dev@ 
you'll find the last one a few years back.
I don't recall NA ever being the approved approach, though. ".x" lines are 
target versions, whereas concrete versions are the ones a fix landed in. 
There's always ambiguity over the next release, as it's sort of both. But since 
there is no 5.0 version, only 5.0-alphaN, 5.0-betaN and 5.0.0, perhaps 5.0 is 
the correct label (and makes sense to me). I forget what we landed upon last 
time.
Work that has actually landed should probably be labelled as 5.0-alpha1

> On 16 May 2023, at 21:02, J. D. Jordan  wrote:
> 
> 
> Process question/discussion. Should tickets that are merged to CEP feature 
> branches, like  https://issues.apache.org/jira/browse/CASSANDRA-18204, have a 
> fixver of 5.0 on them After merging to the feature branch?
> 
> For the SAI CEP which is also using the feature branch method the "reviewed 
> and merged to feature branch" tickets seem to be given a version of NA.
> 
> Not sure that's the best “waiting for cep to merge” version either?  But it 
> seems better than putting 5.0 on them to me.
> 
> Why I’m not keen on 5.0 is because if we cut the release today those tickets 
> would not be there.
> 
> What do other people think?  Is there a better version designation we can use?
> 
> On a different project I have in the past made a “version number” in JIRA for 
> each long running feature branch. Tickets merged to the feature branch got 
> the epic ticket number as their version, and then it got updated to the 
> “real” version when the feature branch was merged to trunk.
> 
> -Jeremiah


Re: [DISCUSS] The future of CREATE INDEX

2023-05-15 Thread Benedict
3: CREATE  INDEX (Otherwise 2)

No

If configurable, should be a distributed configuration. This is very different to other local configurations, as the 2i selected has semantic implications, not just performance (and the perf implications are also much greater)

On 15 May 2023, at 10:45, Mike Adamson wrote:

[POLL] Centralize existing syntax or create new syntax?
1.) CREATE INDEX ... USING  WITH OPTIONS...
2.) CREATE LOCAL INDEX ... USING ... WITH OPTIONS...  (same as 1, but adds LOCAL keyword for clarity and separation from future GLOBAL indexes)

1.) CREATE INDEX ... USING  WITH OPTIONS...

[POLL] Should there be a default? (YES/NO)

Yes

[POLL] What to do with the default?
1.) Allow a default, and switch it to SAI (no configurables)
2.) Allow a default, and stay w/ the legacy 2i (no configurables)
3.) YAML config to override default index (legacy 2i remains the default)
4.) YAML config/guardrail to require index type selection (not required by default)

3.) YAML config to override default index (legacy 2i remains the default)

On Mon, 15 May 2023 at 08:54, Mick Semb Wever wrote:

[POLL] Centralize existing syntax or create new syntax?
1.) CREATE INDEX ... USING  WITH OPTIONS...
2.) CREATE LOCAL INDEX ... USING ... WITH OPTIONS...  (same as 1, but adds LOCAL keyword for clarity and separation from future GLOBAL indexes)

(1) CREATE INDEX …

[POLL] Should there be a default? (YES/NO)

Yes (but see below).

[POLL] What to do with the default?
1.) Allow a default, and switch it to SAI (no configurables)
2.) Allow a default, and stay w/ the legacy 2i (no configurables)
3.) YAML config to override default index (legacy 2i remains the default)
4.) YAML config/guardrail to require index type selection (not required by default)

(4) YAML config. Commented out default of 2i.

I agree that the default cannot change in 5.0, but our existing default of 2i can be commented out. For the user this gives them the same feedback, and puts the same requirement to edit one line of yaml, as when we disabled MVs and SASI in 4.0. No one has complained about either of these, which is a clear signal folk understood how to get their existing DDLs to work from 3.x to 4.x.


Re: [DISCUSS] The future of CREATE INDEX

2023-05-12 Thread Benedict
Given we have no data in front of us to make a decision regarding switching defaults, I don’t think it is suitable to include that option in this poll. In fact, until we have sufficient data to discuss that, I’m going to put a hard veto on that on technical grounds.

On 12 May 2023, at 19:41, Caleb Rackliffe wrote:

...and to clarify, answers should be what you'd like to see for 5.0 specifically

On Fri, May 12, 2023 at 1:36 PM Caleb Rackliffe wrote:

[POLL] Centralize existing syntax or create new syntax?
1.) CREATE INDEX ... USING  WITH OPTIONS...
2.) CREATE LOCAL INDEX ... USING ... WITH OPTIONS...  (same as 1, but adds LOCAL keyword for clarity and separation from future GLOBAL indexes)
(In both cases, we deprecate w/ client warnings CREATE CUSTOM INDEX)

[POLL] Should there be a default? (YES/NO)

[POLL] What to do with the default?
1.) Allow a default, and switch it to SAI (no configurables)
2.) Allow a default, and stay w/ the legacy 2i (no configurables)
3.) YAML config to override default index (legacy 2i remains the default)
4.) YAML config/guardrail to require index type selection (not required by default)

On Fri, May 12, 2023 at 12:39 PM Mick Semb Wever wrote:

> Given it seems most DBs have a default index (see Postgres, etc.), I tend to lean toward having one, but that's me...

I'm for it too. Would be nice to enforce the setting is globally uniform to avoid the per-node problem. Or add a keyspace option.

For users replaying <5 DDLs this would just require they set the default index to 2i. This is not a headache, it's a one-off action that can be clearly expressed in NEWS. It acts as a deprecation warning too. This prevents new uneducated users from creating the unintended index, it supports existing users, and it does not present SAI as the battle-tested default.

Agree with the poll, there's a number of different PoVs here already. I'm not fond of the LOCAL addition; I appreciate what it informs, but it's just not important enough IMHO (folk should be reading up on the index type).




Re: [DISCUSS] The future of CREATE INDEX

2023-05-12 Thread Benedict
But then we have to reconsider the existing syntax, or do we want LOCAL to be the default?We should be planning our language evolution along with our feature evolution.On 12 May 2023, at 19:28, Caleb Rackliffe  wrote:If at some point in the glorious future we have global indexes, I'm sure we can add GLOBAL to the syntax...sry, working on an ugly poll...On Fri, May 12, 2023 at 1:24 PM Benedict <bened...@apache.org> wrote:If folk should be reading up on the index type, doesn’t that conflict with your support of a default?Should there be different global and local defaults, once we have global indexes, or should we always default to a local index? Or a global one?On 12 May 2023, at 18:39, Mick Semb Wever <m...@apache.org> wrote:Given it seems most DBs have a default index (see Postgres, etc.), I tend to lean toward having one, but that's me... I'm for it too.  Would be nice to enforce the setting is globally uniform to avoid the per-node problem. Or add a keyspace option. For users replaying <5 DDLs this would just require they set the default index to 2i.This is not a headache, it's a one-off action that can be clearly expressed in NEWS.It acts as a deprecation warning too.This prevents new uneducated users from creating the unintended index, it supports existing users, and it does not present SAI as the battle-tested default.Agree with the poll, there's a number of different PoVs here already.  I'm not fond of the LOCAL addition,  I appreciate what it informs, but it's just not important enough IMHO (folk should be reading up on the index type).



Re: [DISCUSS] The future of CREATE INDEX

2023-05-12 Thread Benedict
If folk should be reading up on the index type, doesn’t that conflict with your 
support of a default?

Should there be different global and local defaults, once we have global 
indexes, or should we always default to a local index? Or a global one?

> On 12 May 2023, at 18:39, Mick Semb Wever  wrote:
> 
> 
>> 
>> Given it seems most DBs have a default index (see Postgres, etc.), I tend to 
>> lean toward having one, but that's me...
> 
>  
> I'm for it too.  Would be nice to enforce the setting is globally uniform to 
> avoid the per-node problem. Or add a keyspace option. 
> 
> For users replaying <5 DDLs this would just require they set the default 
> index to 2i.
> This is not a headache, it's a one-off action that can be clearly expressed 
> in NEWS.
> It acts as a deprecation warning too.
> This prevents new uneducated users from creating the unintended index, it 
> supports existing users, and it does not present SAI as the battle-tested 
> default.
> 
> Agree with the poll, there's a number of different PoVs here already.  I'm 
> not fond of the LOCAL addition,  I appreciate what it informs, but it's just 
> not important enough IMHO (folk should be reading up on the index type).
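Mick's "commented-out default" suggestion could look something like the cassandra.yaml fragment below. This is a sketch only: the option name `default_secondary_index` and its values are illustrative, not settled names from the thread.

```yaml
# Which index implementation CREATE INDEX uses when no explicit
# implementation is requested.
#
# Commented out by default: with no value set, CREATE INDEX without an
# explicit implementation is rejected, so an operator must consciously
# choose a default before legacy DDL replays succeed -- the same one-line
# yaml edit that was required when MVs and SASI were disabled in 4.0.
#
# default_secondary_index: legacy_local_table
# default_secondary_index: sai
```

The point of commenting it out rather than defaulting to 2i silently is that the failure doubles as a deprecation warning, while still letting existing users restore their old behaviour with a single uncomment.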


Re: [DISCUSS] The future of CREATE INDEX

2023-05-12 Thread Benedict
There remains the question of what the new syntax is - whether it augments CREATE INDEX to replace CREATE CUSTOM INDEX or if we introduce new syntax because we think it’s clearer.I can accept settling for modifying CREATE INDEX … USING, but I maintain that CREATE LOCAL  INDEX is betterOn 12 May 2023, at 18:31, Caleb Rackliffe  wrote:Even if we don't want to allow a default, we can keep the same CREATE INDEX syntax in place, and have a guardrail forcing (or not) the selection of an implementation, right? This would be no worse than the YAML option we already have for enabling 2i creation as a whole.On Fri, May 12, 2023 at 12:28 PM Benedict <bened...@apache.org> wrote:I’m not convinced a default index makes any sense, no. The trade-offs in a distributed setting are much more pronounced.Indexes in a local-only RDBMS are much simpler affairs; the trade offs are much more subtle than here. On 12 May 2023, at 18:24, Caleb Rackliffe <calebrackli...@gmail.com> wrote:> Now, giving this thread, there is pushback for a config to allow default impl to change… but there is 0 pushback for new syntax to make this explicit…. So maybe we should [POLL] for what syntax people want?I think the essential question is whether we want the concept of a default index. If we do, we need to figure that out now. If we don't then a new syntax that forces it becomes interesting.Given it seems most DBs have a default index (see Postgres, etc.), I tend to lean toward having one, but that's me...On Fri, May 12, 2023 at 12:20 PM David Capwell <dcapw...@apple.com> wrote:I really dislike the idea of the same CQL doing different things based upon a per-node configuration.I agree with Brandon that changing CQL behaviour like this based on node config is really not ideal. I am cool adding such a config, and also cool keeping CREATE INDEX disabled by default…. But would like to point out that we have many configs that impact CQL and they are almost always local configs…Is CREATE INDEX even allowed?  
This is a per node config. Right now you can block globally, enable on a single instance, create the index for your users, then revert the config change on the instance…. All guardrails that define what we can do are per node configs…Now, giving this thread, there is pushback for a config to allow default impl to change… but there is 0 pushback for new syntax to make this explicit…. So maybe we should [POLL] for what syntax people want?if we decide before the 5.0 release that we have enough information to change the default (#1), we can change it in a matter of minutes.I am strongly against this… SAI is new for 5.0 so should be disabled by default; else we disrespect the idea that new features are disabled by default.  I am cool with our docs recommending if we do find its better in most cases, but we should not change the default in the same reason it lands in.On May 12, 2023, at 10:10 AM, Caleb Rackliffe <calebrackli...@gmail.com> wrote:I don't want to cut over for 5.0 either way. I was more contrasting a configurable cutover in 5.0 vs. a hard cutover later.On Fri, May 12, 2023 at 12:09 PM Benedict <bened...@apache.org> wrote:If the performance characteristics are as clear cut as you think, then maybe it will be an easy decision once the evidence is available for everyone to consider?If not, then we probably can’t do the hard cutover and so the answer is still pretty simple? 
On 12 May 2023, at 18:04, Caleb Rackliffe <calebrackli...@gmail.com> wrote:I don't particularly like the YAML solution either, but absent that, we're back to fighting about whether we introduce entirely new syntax or hard cut over to SAI at some point.We already have per-node configuration in the YAML that determines whether or not we can create a 2i at all, right?What if we just do #2 and #3 and punt on everything else?On Fri, May 12, 2023 at 11:56 AM Benedict <bened...@apache.org> wrote:A table is not a local concept at all, it has a global primary index - that’s the core idea of Cassandra.I agree with Brandon that changing CQL behaviour like this based on node config is really not ideal. New syntax is by far the simplest and safest solution to this IMO. It doesn’t have to use the word LOCAL, but I think that’s anyway an improvement, personally. In future we will hopefully offer GLOBAL indexes, and IMO it is better to reify the distinction in the syntax.On 12 May 2023, at 17:29, Caleb Rackliffe <calebrackli...@gmail.com> wrote:We don't need to know everything about SAI's performance profile to plan and execute some small, reasonable things now for 5.0. I'm going to try to summarize the least controversial package of ideas from the discussion above. I've left out creating any new syntax. For example, I think CREATE LOCAL INDEX, while explicit, is just not necessary. We don't use CREATE LOCAL TABLE, although it has the same locality as our indexes.Okay, so the proposal for 5.0...1.) Add a YAML option that specifies 

Re: [DISCUSS] The future of CREATE INDEX

2023-05-12 Thread Benedict
I’m not convinced a default index makes any sense, no. The trade-offs in a distributed setting are much more pronounced.Indexes in a local-only RDBMS are much simpler affairs; the trade offs are much more subtle than here. On 12 May 2023, at 18:24, Caleb Rackliffe  wrote:> Now, giving this thread, there is pushback for a config to allow default impl to change… but there is 0 pushback for new syntax to make this explicit…. So maybe we should [POLL] for what syntax people want?I think the essential question is whether we want the concept of a default index. If we do, we need to figure that out now. If we don't then a new syntax that forces it becomes interesting.Given it seems most DBs have a default index (see Postgres, etc.), I tend to lean toward having one, but that's me...On Fri, May 12, 2023 at 12:20 PM David Capwell <dcapw...@apple.com> wrote:I really dislike the idea of the same CQL doing different things based upon a per-node configuration.I agree with Brandon that changing CQL behaviour like this based on node config is really not ideal. I am cool adding such a config, and also cool keeping CREATE INDEX disabled by default…. But would like to point out that we have many configs that impact CQL and they are almost always local configs…Is CREATE INDEX even allowed?  This is a per node config. Right now you can block globally, enable on a single instance, create the index for your users, then revert the config change on the instance…. All guardrails that define what we can do are per node configs…Now, giving this thread, there is pushback for a config to allow default impl to change… but there is 0 pushback for new syntax to make this explicit…. 
So maybe we should [POLL] for what syntax people want?if we decide before the 5.0 release that we have enough information to change the default (#1), we can change it in a matter of minutes.I am strongly against this… SAI is new for 5.0 so should be disabled by default; else we disrespect the idea that new features are disabled by default.  I am cool with our docs recommending if we do find its better in most cases, but we should not change the default in the same reason it lands in.On May 12, 2023, at 10:10 AM, Caleb Rackliffe <calebrackli...@gmail.com> wrote:I don't want to cut over for 5.0 either way. I was more contrasting a configurable cutover in 5.0 vs. a hard cutover later.On Fri, May 12, 2023 at 12:09 PM Benedict <bened...@apache.org> wrote:If the performance characteristics are as clear cut as you think, then maybe it will be an easy decision once the evidence is available for everyone to consider?If not, then we probably can’t do the hard cutover and so the answer is still pretty simple? On 12 May 2023, at 18:04, Caleb Rackliffe <calebrackli...@gmail.com> wrote:I don't particularly like the YAML solution either, but absent that, we're back to fighting about whether we introduce entirely new syntax or hard cut over to SAI at some point.We already have per-node configuration in the YAML that determines whether or not we can create a 2i at all, right?What if we just do #2 and #3 and punt on everything else?On Fri, May 12, 2023 at 11:56 AM Benedict <bened...@apache.org> wrote:A table is not a local concept at all, it has a global primary index - that’s the core idea of Cassandra.I agree with Brandon that changing CQL behaviour like this based on node config is really not ideal. New syntax is by far the simplest and safest solution to this IMO. It doesn’t have to use the word LOCAL, but I think that’s anyway an improvement, personally. 
In future we will hopefully offer GLOBAL indexes, and IMO it is better to reify the distinction in the syntax.On 12 May 2023, at 17:29, Caleb Rackliffe <calebrackli...@gmail.com> wrote:We don't need to know everything about SAI's performance profile to plan and execute some small, reasonable things now for 5.0. I'm going to try to summarize the least controversial package of ideas from the discussion above. I've left out creating any new syntax. For example, I think CREATE LOCAL INDEX, while explicit, is just not necessary. We don't use CREATE LOCAL TABLE, although it has the same locality as our indexes.Okay, so the proposal for 5.0...1.) Add a YAML option that specifies a default implementation for CREATE INDEX, and make this the legacy 2i for now. No existing DDL breaks. We don't have to commit to the absolute superiority of SAI.2.) Add USING...WITH... support to CREATE INDEX, so we don't have to go to market using CREATE CUSTOM INDEX, which feels...not so polished. (The backend for this already exists w/ CREATE CUSTOM INDEX.)3.) Leave in place but deprecate (client warnings could work?) CREATE CUSTOM INDEX. Support the syntax for the foreseeable future.Can we live w/ this?I don't think any information about SAI we could possibly acquire before a 5.0 release would affect the reasonableness of this much.On Fri, May 12, 2023 at 10:54 AM Benedict <bened...@apache.org> wrote:

Re: [DISCUSS] The future of CREATE INDEX

2023-05-12 Thread Benedict
I still prefer introducing CREATE LOCAL INDEX, to help users understand the semantics of the index they’re creating.I think it will in future potentially be quite confusing to be able to create global and local indexes using the same DDL statement.But, depending on appetite, that could plausibly be done in future instead.(I don’t endorse the assumption of a future switch of default)On 12 May 2023, at 18:18, Caleb Rackliffe  wrote:So the weakest version of the plan that actually accomplishes something useful for 5.0:1.) Just leave the CREATE INDEX default alone for now. Hard switch the default after 5.0.2.) Add USING...WITH... support to CREATE INDEX, so we don't have to go to market using CREATE CUSTOM INDEX, which feels...not so polished. (The backend for this already exists w/ CREATE CUSTOM INDEX.)3.) Leave in place but deprecate (client warnings could work?) CREATE CUSTOM INDEX. Support the syntax for the foreseeable future.Any objections to that?On Fri, May 12, 2023 at 12:10 PM Caleb Rackliffe <calebrackli...@gmail.com> wrote:I don't want to cut over for 5.0 either way. I was more contrasting a configurable cutover in 5.0 vs. a hard cutover later.On Fri, May 12, 2023 at 12:09 PM Benedict <bened...@apache.org> wrote:If the performance characteristics are as clear cut as you think, then maybe it will be an easy decision once the evidence is available for everyone to consider?If not, then we probably can’t do the hard cutover and so the answer is still pretty simple? 
On 12 May 2023, at 18:04, Caleb Rackliffe <calebrackli...@gmail.com> wrote:I don't particularly like the YAML solution either, but absent that, we're back to fighting about whether we introduce entirely new syntax or hard cut over to SAI at some point.We already have per-node configuration in the YAML that determines whether or not we can create a 2i at all, right?What if we just do #2 and #3 and punt on everything else?On Fri, May 12, 2023 at 11:56 AM Benedict <bened...@apache.org> wrote:A table is not a local concept at all, it has a global primary index - that’s the core idea of Cassandra.I agree with Brandon that changing CQL behaviour like this based on node config is really not ideal. New syntax is by far the simplest and safest solution to this IMO. It doesn’t have to use the word LOCAL, but I think that’s anyway an improvement, personally. In future we will hopefully offer GLOBAL indexes, and IMO it is better to reify the distinction in the syntax.On 12 May 2023, at 17:29, Caleb Rackliffe <calebrackli...@gmail.com> wrote:We don't need to know everything about SAI's performance profile to plan and execute some small, reasonable things now for 5.0. I'm going to try to summarize the least controversial package of ideas from the discussion above. I've left out creating any new syntax. For example, I think CREATE LOCAL INDEX, while explicit, is just not necessary. We don't use CREATE LOCAL TABLE, although it has the same locality as our indexes.Okay, so the proposal for 5.0...1.) Add a YAML option that specifies a default implementation for CREATE INDEX, and make this the legacy 2i for now. No existing DDL breaks. We don't have to commit to the absolute superiority of SAI.2.) Add USING...WITH... support to CREATE INDEX, so we don't have to go to market using CREATE CUSTOM INDEX, which feels...not so polished. (The backend for this already exists w/ CREATE CUSTOM INDEX.)3.) Leave in place but deprecate (client warnings could work?) CREATE CUSTOM INDEX. 
Support the syntax for the foreseeable future.Can we live w/ this?I don't think any information about SAI we could possibly acquire before a 5.0 release would affect the reasonableness of this much.On Fri, May 12, 2023 at 10:54 AM Benedict <bened...@apache.org> wrote:if we didn't have copious amounts of (not all public, I know, working on it) evidenceIf that’s the assumption on which this proposal is based, let’s discuss the evidence base first, as given the fundamentally different way they work (almost diametrically opposite), I would want to see a very high quality of evidence to support the claim.I don’t think we can resolve this conversation effectively until this question is settled.On 12 May 2023, at 16:19, Caleb Rackliffe <calebrackli...@gmail.com> wrote:> This creates huge headaches for everyone successfully using 2i today though, and SAI *is not* guaranteed to perform as well or better - it has a very different performance profile.We wouldn't have even advanced it to this point if we didn't have copious amounts of (not all public, I know, working on it) evidence it did for the vast majority of workloads. Having said that, I don't strongly agree that we should make it the default in 5.0, because performance isn't the only concern. (correctness, DDL back-compat, which we've sort of touched w/ the YAML default option, etc.)This conversation is now going in like 3 different directions, or at least 3 different "packages" of ideas

Re: [DISCUSS] The future of CREATE INDEX

2023-05-12 Thread Benedict
If the performance characteristics are as clear cut as you think, then maybe it will be an easy decision once the evidence is available for everyone to consider?If not, then we probably can’t do the hard cutover and so the answer is still pretty simple? On 12 May 2023, at 18:04, Caleb Rackliffe  wrote:I don't particularly like the YAML solution either, but absent that, we're back to fighting about whether we introduce entirely new syntax or hard cut over to SAI at some point.We already have per-node configuration in the YAML that determines whether or not we can create a 2i at all, right?What if we just do #2 and #3 and punt on everything else?On Fri, May 12, 2023 at 11:56 AM Benedict <bened...@apache.org> wrote:A table is not a local concept at all, it has a global primary index - that’s the core idea of Cassandra.I agree with Brandon that changing CQL behaviour like this based on node config is really not ideal. New syntax is by far the simplest and safest solution to this IMO. It doesn’t have to use the word LOCAL, but I think that’s anyway an improvement, personally. In future we will hopefully offer GLOBAL indexes, and IMO it is better to reify the distinction in the syntax.On 12 May 2023, at 17:29, Caleb Rackliffe <calebrackli...@gmail.com> wrote:We don't need to know everything about SAI's performance profile to plan and execute some small, reasonable things now for 5.0. I'm going to try to summarize the least controversial package of ideas from the discussion above. I've left out creating any new syntax. For example, I think CREATE LOCAL INDEX, while explicit, is just not necessary. We don't use CREATE LOCAL TABLE, although it has the same locality as our indexes.Okay, so the proposal for 5.0...1.) Add a YAML option that specifies a default implementation for CREATE INDEX, and make this the legacy 2i for now. No existing DDL breaks. We don't have to commit to the absolute superiority of SAI.2.) Add USING...WITH... 
support to CREATE INDEX, so we don't have to go to market using CREATE CUSTOM INDEX, which feels...not so polished. (The backend for this already exists w/ CREATE CUSTOM INDEX.)3.) Leave in place but deprecate (client warnings could work?) CREATE CUSTOM INDEX. Support the syntax for the foreseeable future.Can we live w/ this?I don't think any information about SAI we could possibly acquire before a 5.0 release would affect the reasonableness of this much.On Fri, May 12, 2023 at 10:54 AM Benedict <bened...@apache.org> wrote:if we didn't have copious amounts of (not all public, I know, working on it) evidenceIf that’s the assumption on which this proposal is based, let’s discuss the evidence base first, as given the fundamentally different way they work (almost diametrically opposite), I would want to see a very high quality of evidence to support the claim.I don’t think we can resolve this conversation effectively until this question is settled.On 12 May 2023, at 16:19, Caleb Rackliffe <calebrackli...@gmail.com> wrote:> This creates huge headaches for everyone successfully using 2i today though, and SAI *is not* guaranteed to perform as well or better - it has a very different performance profile.We wouldn't have even advanced it to this point if we didn't have copious amounts of (not all public, I know, working on it) evidence it did for the vast majority of workloads. Having said that, I don't strongly agree that we should make it the default in 5.0, because performance isn't the only concern. (correctness, DDL back-compat, which we've sort of touched w/ the YAML default option, etc.)This conversation is now going in like 3 different directions, or at least 3 different "packages" of ideas, so there isn't even a single thing to vote on. 
Let me read through again and try to distill into something that we might be able to do so with...On Fri, May 12, 2023 at 7:56 AM Aleksey Yeshchenko <alek...@apple.com> wrote:This.I would also consider adding CREATE LEGACY INDEX syntax as an alias for today’s CREATE INDEX, the latter to be deprecated and (in very distant future) removed.On 12 May 2023, at 13:14, Benedict <bened...@apache.org> wrote:This creates huge headaches for everyone successfully using 2i today though, and SAI *is not* guaranteed to perform as well or better - it has a very different performance profile.I think we should deprecate CREATE INDEX, and introduce new syntax CREATE LOCAL INDEX to make clear that this is not a global index, and that this should require the USING syntax to avoid this problem in future. We should report warnings to the client when CREATE INDEX is used, indicating it is deprecated.




Re: [DISCUSS] The future of CREATE INDEX

2023-05-12 Thread Benedict
A table is not a local concept at all, it has a global primary index - that’s the core idea of Cassandra.

I agree with Brandon that changing CQL behaviour like this based on node config is really not ideal. New syntax is by far the simplest and safest solution to this IMO. It doesn’t have to use the word LOCAL, but I think that’s anyway an improvement, personally. In future we will hopefully offer GLOBAL indexes, and IMO it is better to reify the distinction in the syntax.

On 12 May 2023, at 17:29, Caleb Rackliffe <calebrackli...@gmail.com> wrote:

We don't need to know everything about SAI's performance profile to plan and execute some small, reasonable things now for 5.0. I'm going to try to summarize the least controversial package of ideas from the discussion above. I've left out creating any new syntax. For example, I think CREATE LOCAL INDEX, while explicit, is just not necessary. We don't use CREATE LOCAL TABLE, although it has the same locality as our indexes.

Okay, so the proposal for 5.0...

1.) Add a YAML option that specifies a default implementation for CREATE INDEX, and make this the legacy 2i for now. No existing DDL breaks. We don't have to commit to the absolute superiority of SAI.
2.) Add USING...WITH... support to CREATE INDEX, so we don't have to go to market using CREATE CUSTOM INDEX, which feels...not so polished. (The backend for this already exists w/ CREATE CUSTOM INDEX.)
3.) Leave in place but deprecate (client warnings could work?) CREATE CUSTOM INDEX. Support the syntax for the foreseeable future.

Can we live w/ this? I don't think any information about SAI we could possibly acquire before a 5.0 release would affect the reasonableness of this much.

On Fri, May 12, 2023 at 10:54 AM Benedict <bened...@apache.org> wrote:

> if we didn't have copious amounts of (not all public, I know, working on it) evidence

If that’s the assumption on which this proposal is based, let’s discuss the evidence base first, as given the fundamentally different way they work (almost diametrically opposite), I would want to see a very high quality of evidence to support the claim. I don’t think we can resolve this conversation effectively until this question is settled.

On 12 May 2023, at 16:19, Caleb Rackliffe <calebrackli...@gmail.com> wrote:

> This creates huge headaches for everyone successfully using 2i today though, and SAI *is not* guaranteed to perform as well or better - it has a very different performance profile.

We wouldn't have even advanced it to this point if we didn't have copious amounts of (not all public, I know, working on it) evidence it did for the vast majority of workloads. Having said that, I don't strongly agree that we should make it the default in 5.0, because performance isn't the only concern. (correctness, DDL back-compat, which we've sort of touched w/ the YAML default option, etc.)

This conversation is now going in like 3 different directions, or at least 3 different "packages" of ideas, so there isn't even a single thing to vote on. Let me read through again and try to distill into something that we might be able to do so with...

On Fri, May 12, 2023 at 7:56 AM Aleksey Yeshchenko <alek...@apple.com> wrote:

This. I would also consider adding CREATE LEGACY INDEX syntax as an alias for today’s CREATE INDEX, the latter to be deprecated and (in very distant future) removed.

On 12 May 2023, at 13:14, Benedict <bened...@apache.org> wrote:

This creates huge headaches for everyone successfully using 2i today though, and SAI *is not* guaranteed to perform as well or better - it has a very different performance profile. I think we should deprecate CREATE INDEX, and introduce new syntax CREATE LOCAL INDEX to make clear that this is not a global index, and that this should require the USING syntax to avoid this problem in future. We should report warnings to the client when CREATE INDEX is used, indicating it is deprecated.



Re: [DISCUSS] The future of CREATE INDEX

2023-05-12 Thread Benedict
if we didn't have copious amounts of (not all public, I know, working on it) evidenceIf that’s the assumption on which this proposal is based, let’s discuss the evidence base first, as given the fundamentally different way they work (almost diametrically opposite), I would want to see a very high quality of evidence to support the claim.I don’t think we can resolve this conversation effectively until this question is settled.On 12 May 2023, at 16:19, Caleb Rackliffe  wrote:> This creates huge headaches for everyone successfully using 2i today though, and SAI *is not* guaranteed to perform as well or better - it has a very different performance profile.We wouldn't have even advanced it to this point if we didn't have copious amounts of (not all public, I know, working on it) evidence it did for the vast majority of workloads. Having said that, I don't strongly agree that we should make it the default in 5.0, because performance isn't the only concern. (correctness, DDL back-compat, which we've sort of touched w/ the YAML default option, etc.)This conversation is now going in like 3 different directions, or at least 3 different "packages" of ideas, so there isn't even a single thing to vote on. 
Let me read through again and try to distill into something that we might be able to do so with...On Fri, May 12, 2023 at 7:56 AM Aleksey Yeshchenko <alek...@apple.com> wrote:This.I would also consider adding CREATE LEGACY INDEX syntax as an alias for today’s CREATE INDEX, the latter to be deprecated and (in very distant future) removed.On 12 May 2023, at 13:14, Benedict <bened...@apache.org> wrote:This creates huge headaches for everyone successfully using 2i today though, and SAI *is not* guaranteed to perform as well or better - it has a very different performance profile.I think we should deprecate CREATE INDEX, and introduce new syntax CREATE LOCAL INDEX to make clear that this is not a global index, and that this should require the USING syntax to avoid this problem in future. We should report warnings to the client when CREATE INDEX is used, indicating it is deprecated.


Re: [DISCUSS] The future of CREATE INDEX

2023-05-12 Thread Benedict
This creates huge headaches for everyone successfully using 2i today though, and SAI *is not* guaranteed to perform as well or better - it has a very different performance profile.I think we should deprecate CREATE INDEX, and introduce new syntax CREATE LOCAL INDEX to make clear that this is not a global index, and that this should require the USING syntax to avoid this problem in future. We should report warnings to the client when CREATE INDEX is used, indicating it is deprecated.On 12 May 2023, at 13:10, Mick Semb Wever  wrote:On Thu, 11 May 2023 at 05:27, Patrick McFadin  wrote:Having pulled a lot of developers out of the 2i fire,Yes.  I'm keen not to leave 2i as the default once SAI lands. Otherwise I agree with the deprecated first principle, but 2i is just too problematic. Just having no default in 5.0, forcing the user to evaluate which index to use would be an improvement.For example, if the default index in cassandra.yaml option exists but is commented out, that would prevent `CREATE INDEX` from working without specifying a `USING`. Then the yaml documentation would be clear about choices.  I'd be ok with that for 5.0, and then make sai the default in the following release.Note, having the option in cassandra.yaml is problematic, as this is not a per-node setting (AFAIK).


Re: [DISCUSS] The future of CREATE INDEX

2023-05-10 Thread Benedict
I’m not convinced by the changing defaults argument here. The characteristics of the two index types are very different, and users with scripts that make indexes today shouldn’t have their behaviour change.We could introduce new syntax that properly appreciates there’s no default index, perhaps CREATE LOCAL [type] INDEX? To also make clear that these indexes involve a partition key or scatter gatherOn 10 May 2023, at 06:26, guo Maxwell  wrote:+1 , as we must Improve the image of your own default indexing ability.and As for CREATE CUSTOM INDEX , should we just left as it is and we can disable the ability for create SAI through  CREATE CUSTOM INDEX  in some version after 5.0? for as I know there may be users using this as a plugin-index interface, like https://github.com/Stratio/cassandra-lucene-index (though these project may be inactive, But if someone wants to do something similar in the future, we don't have to stop).Jonathan Ellis  于2023年5月10日周三 10:01写道:+1 for this, especially in the long term.  CREATE INDEX should do the right thing for most people without requiring extra ceremony.On Tue, May 9, 2023 at 5:20 PM Jeremiah D Jordan  wrote:If the consensus is that SAI is the right default index, then we should just change CREATE INDEX to be SAI, and legacy 2i to be a CUSTOM INDEX.On May 9, 2023, at 4:44 PM, Caleb Rackliffe  wrote:Earlier today, Mick started a thread on the future of our index creation DDL on Slack:https://the-asf.slack.com/archives/C018YGVCHMZ/p1683527794220019At the moment, there are two ways to create a secondary index.1.) CREATE INDEX [IF NOT EXISTS] [name] ON  ()This creates an optionally named legacy 2i on the provided table and column.    ex. CREATE INDEX my_index ON kd.tbl(my_text_col)2.) CREATE CUSTOM INDEX [IF NOT EXISTS] [name] ON  () USING  [WITH OPTIONS = ]This creates a secondary index on the provided table and column using the specified 2i implementation class and (optional) parameters.    ex. 
CREATE CUSTOM INDEX my_index ON ks.tbl(my_text_col) USING 'StorageAttachedIndex'(Note that the work on SAI added aliasing, so `StorageAttachedIndex` is shorthand for the fully-qualified class name, which is also valid.)So what is there to discuss?The concern Mick raised is..."...just folk continuing to use CREATE INDEX  because they think CREATE CUSTOM INDEX is advanced (or just don't know of it), and we leave users doing 2i (when they think they are, and/or we definitely want them to be, using SAI)"To paraphrase, we want people to use SAI once it's available where possible, and the default behavior of CREATE INDEX could be at odds w/ that.The proposal we seem to have landed on is something like the following:For 5.0:1.) Disable by default the creation of new legacy 2i via CREATE INDEX.2.) Leave CREATE CUSTOM INDEX...USING... available by default.(Note: How this would interact w/ the existing secondary_indexes_enabled YAML options isn't clear yet.)Post-5.0:1.) Deprecate and eventually remove SASI when SAI hits full feature parity w/ it.2.) Replace both CREATE INDEX and CREATE CUSTOM INDEX w/ something of a hybrid between the two. For example, CREATE INDEX...USING...WITH. This would both be flexible enough to accommodate index implementation selection and prescriptive enough to force the user to make a decision (and wouldn't change the legacy behavior of the existing CREATE INDEX). In this world, creating a legacy 2i might look something like CREATE INDEX...USING `legacy`.3.) Eventually deprecate CREATE CUSTOM INDEX...USING.Eventually we would have a single enabled DDL statement for index creation that would be minimal but also explicit/able to handle some evolution.What does everyone think?
-- Jonathan Ellis, co-founder, http://www.datastax.com, @spyced
-- you are the apple of my eye !


Re: CEP-30: Approximate Nearest Neighbor(ANN) Vector Search via Storage-Attached Indexes

2023-05-09 Thread Benedict
HNSW can in principle be made into a distributed index. But that would be quite a different paradigm to SAI.On 9 May 2023, at 19:30, Patrick McFadin  wrote:Under the goals section, there is this line:Scatter/gather across replicas, combining topK from each to get global topK.But what I'm hearing is, exactly how will that happen? Maybe this is an SAI question too. How is that verified in SAI?On Tue, May 9, 2023 at 11:07 AM David Capwell  wrote:Approach section doesn’t go over how this will handle cross replica search, this would be good to flesh out… given results have a real ranking, the current 2i logic may yield incorrect results… so would think we need num_ranges / rf queries in the best case, with some new capability to sort the results?  If my assumption is correct, then how errors are handled should also be fleshed out… Example: 1k cluster without vnode and RF=3, so 333 queries fanned out to match, then coordinator needs to sort… if 1 of the queries fails and can’t fall back to peers… does the query fail (I assume so)?On May 8, 2023, at 7:20 PM, Jonathan Ellis  wrote:Hi all,Following the recent discussion threads, I would like to propose CEP-30 to add Approximate Nearest Neighbor (ANN) Vector Search via Storage-Attached Indexes (SAI) to Apache Cassandra.The primary goal of this proposal is to implement ANN vector search capabilities, making Cassandra more useful to AI developers and organizations managing large datasets that can benefit from fast similarity search.The implementation will leverage Lucene's Hierarchical Navigable Small World (HNSW) library and introduce a new CQL data type for vector embeddings, a new SAI index for ANN search functionality, and a new CQL operator for performing ANN search queries.We are targeting the 5.0 release for this feature, in conjunction with the release of SAI. 
The proposed changes will maintain compatibility with existing Cassandra functionality and compose well with the already-approved SAI features.Please find the full CEP document here: https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-30%3A+Approximate+Nearest+Neighbor%28ANN%29+Vector+Search+via+Storage-Attached+Indexes-- Jonathan Ellisco-founder, http://www.datastax.com@spyced
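The "scatter/gather across replicas, combining topK from each to get global topK" goal that David and Patrick question above can be sketched as follows. This is illustrative Python, not Cassandra internals: the `global_top_k` name and the `(score, row)` pair shape are assumptions for the sketch, and it deliberately ignores the replica-failure and cross-replica deduplication concerns David raises.

```python
import heapq
import itertools

def global_top_k(replica_results, k):
    """Combine per-replica top-k lists into a global top-k.

    replica_results: iterable of lists of (score, row) pairs, each list
    already sorted by descending score (one replica's local top-k).
    Returns the k highest-scoring pairs overall, best first.

    Each replica only needs to return its local top-k: a row outside a
    replica's local top-k can never appear in the global top-k.
    """
    # heapq.merge requires inputs sorted ascending by key; negating the
    # score turns descending-by-score lists into ascending-by-key ones.
    merged = heapq.merge(*replica_results, key=lambda pair: -pair[0])
    return list(itertools.islice(merged, k))
```

A real coordinator would additionally have to deduplicate rows returned by more than one replica (RF > 1) and decide whether to fail the query when a range has no reachable replica, which is exactly the open question in the thread.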



Re: [POLL] Vector type for ML

2023-05-04 Thread Benedict
I would expect that the type of index would be specified anyway?I don’t think it’s good API design to have the field define the index you create - only to shape what is permitted.A HNSW index is very specific and should be asked for specifically, not implicitly, IMO.On 4 May 2023, at 11:47, Mike Adamson  wrote:For syntax, I think one option was just FLOAT[N]. In VECTOR FLOAT[N], VECTOR is redundant - FLOAT[N] is fully descriptive by itself. I don’t think VECTOR should be used to simply imply non-null, as this would be very unintuitive. More logical would be NONNULL, if this is the only condition being applied. Alternatively for arrays we could default to NONNULL and later introduce NULLABLE if we want to permit nulls.I have a small issue relating to not having a specific VECTOR tag on the data type. The driver behind adding this datatype is the hnsw index that is being added to consume this data. If we have a generic array datatype, what is the expectation going to be for users who create an index on it? The hnsw index will support only floats initially so we would have to reject any non-float arrays if an attempt was made to create an hnsw index on it. While there is no problem with doing this, there would be a problem if, in the future, we allow indexing in arrays in the same way that we index collections. In this case we would then need to have the user select what type of index they want at creation time.Can I add another proposal that we allow a VECTOR or DENSE (this is a well known term in the ML space) keyword that could be used when the array is going to be used for ML workloads. This would be optional and would function similarly to FROZEN in that it would limit the functionality of the array to ML usage. On Thu, 4 May 2023 at 09:45, Benedict <bened...@apache.org> wrote:Hurrah for initial agreement.For syntax, I think one option was just FLOAT[N]. In VECTOR FLOAT[N], VECTOR is redundant - FLOAT[N] is fully descriptive by itself. 
I don’t think VECTOR should be used to simply imply non-null, as this would be very unintuitive. More logical would be NONNULL, if this is the only condition being applied. Alternatively for arrays we could default to NONNULL and later introduce NULLABLE if we want to permit nulls.If the word vector is to be used it makes more sense to make it look like a list, so VECTOR as here the word VECTOR is clearly not redundant.So, I vote:1) (NON NULL) FLOAT[N]2) FLOAT[N]   (Non null by default)3) VECTOROn 4 May 2023, at 08:52, Mick Semb Wever <m...@apache.org> wrote:Did we agree on a CQL syntax?I don’t believe there has been a pool on CQL syntax… my understanding reading all the threads is that there are ~4-5 options and non are -1ed, so believe we are waiting for majority rule on this?Re-reading that thread, IIUC the valid choices remaining are…1. VECTOR FLOAT[n]2. FLOAT VECTOR[n]3. VECTOR4. VECTOR[n]5. ARRAY6. NON-NULL FROZENYes I'm putting my preference (1) first ;) because (banging on) if the future of CQL will have FLOAT[n] and FROZEN, where the VECTOR keyword is: for general cql users; just meaning "non-null and frozen", these gel best together.Options (5) and (6) are for those that feel we can and should provide this type without introducing the vector keyword. 

-- Mike Adamson, Engineering, +1 650 389 6000 | datastax.com


Re: [POLL] Vector type for ML

2023-05-04 Thread Benedict
Hurrah for initial agreement.

For syntax, I think one option was just FLOAT[N]. In VECTOR FLOAT[N], VECTOR is 
redundant - FLOAT[N] is fully descriptive by itself. I don’t think VECTOR 
should be used to simply imply non-null, as this would be very unintuitive. 
More logical would be NONNULL, if this is the only condition being applied. 
Alternatively for arrays we could default to NONNULL and later introduce 
NULLABLE if we want to permit nulls.

If the word vector is to be used it makes more sense to make it look like a 
list, so VECTOR<FLOAT[N]>, as here the word VECTOR is clearly not redundant.

So, I vote:

1) (NON NULL) FLOAT[N]
2) FLOAT[N]   (Non null by default)
3) VECTOR<FLOAT[N]>



> On 4 May 2023, at 08:52, Mick Semb Wever  wrote:
> 
> 
>>> Did we agree on a CQL syntax?
>> I don’t believe there has been a poll on CQL syntax… my understanding 
>> reading all the threads is that there are ~4-5 options and none are -1ed, so 
>> believe we are waiting for majority rule on this?
> 
> 
> Re-reading that thread, IIUC the valid choices remaining are…
> 
> 1. VECTOR FLOAT[n]
> 2. FLOAT VECTOR[n]
> 3. VECTOR<FLOAT[n]>
> 4. VECTOR[n]
> 5. ARRAY<FLOAT[n]>
> 6. NON-NULL FROZEN<FLOAT[n]>
> 
> 
> Yes I'm putting my preference (1) first ;) because (banging on) if the future 
> of CQL will have FLOAT[n] and FROZEN, where the VECTOR keyword is: 
> for general cql users; just meaning "non-null and frozen", these gel best 
> together.
> 
> Options (5) and (6) are for those that feel we can and should provide this 
> type without introducing the vector keyword.
> 
>  


Re: [POLL] Vector type for ML

2023-05-02 Thread Benedict
But it’s so trivial it was already implemented by David in the span of ten minutes? If anything, we’re slowing progress down by refusing to do the extra types, as we’re busy arguing about it rather than delivering a feature?FWIW, my interpretation of the votes today is that we SHOULD NOT (ever) support types beyond float. Not that we should start with float.So, this whole debate is a mess, I think. But hey ho.On 2 May 2023, at 20:57, Patrick McFadin  wrote:I'll speak up on that one. If you look at my ranked voting, that is where my head is. I get accused of scope creep (a lot) and looking at the initial proposal Jonathan put on the ML it was mostly "Developers are adopting vector search at a furious pace and I think I have a simple way of adding support to keep Cassandra relevant for these use cases" Instead of just focusing on this use case, I feel the arguments have bike shedded into scope creep which means it will take forever to get into the project.My preference is to see one thing validated with an MVP and get it into the hands of developers sooner so we can continue to iterate based on actual usage. It doesn't say your points are wrong or your opinions are broken, I'm voting for what I think will be awesome for users sooner. PatrickOn Tue, May 2, 2023 at 12:29 PM Benedict <bened...@apache.org> wrote:Could folk voting against a general purpose type (that could well be called a vector) briefly explain their reasoning?We established in the other thread that it’s technically trivial, meaning folk must think it is strictly superior to only support float rather than eg all numeric types (note: for the type, not the ANN). I am surprised, and the blurbs accompanying votes so far don’t seem to touch on this, mostly just endorsing the idea of a vector.On 2 May 2023, at 20:20, Patrick McFadin <pmcfa...@gmail.com> wrote:A > B > C on both polls. 
Having talked to several users in the community that are highly excited about this change, this gets to what developers want to do at Cassandra scale: store embeddings and retrieve them. On Tue, May 2, 2023 at 11:47 AM Andrés de la Peña <adelap...@apache.org> wrote:A > B > CI don't think that ML is such a niche application that it can't have its own CQL data type. Also, vectors are mathematical elements that have more applications that ML.On Tue, 2 May 2023 at 19:15, Mick Semb Wever <m...@apache.org> wrote:On Tue, 2 May 2023 at 17:14, Jonathan Ellis <jbel...@gmail.com> wrote:Should we add a vector type to Cassandra designed to meet the needs of machine learning use cases, specifically feature and embedding vectors for training, inference, and vector search?  ML vectors are fixed-dimension (fixed-length) sequences of numeric types, with no nulls allowed, and with no need for random access. The ML industry overwhelmingly uses float32 vectors, to the point that the industry-leading special-purpose vector database ONLY supports that data type.This poll is to gauge consensus subsequent to the recent discussion thread at https://lists.apache.org/thread/0lj1nk9jbhkf1rlgqcvxqzfyntdjrnk0.Please rank the discussed options from most preferred option to least, e.g., A > B > C (A is my preference, followed by B, followed by C) or C > B = A (C is my preference, followed by B or A approximately equally.)(A) I am in favor of adding a vector type for floats; I do not believe we need to tie it to any particular implementation details.(B) I am okay with adding a vector type but I believe we must add array types that compose with all Cassandra types first, and make vectors a special case of arrays-without-null-elements.(C) I am not in favor of adding a built-in vector type.A  > B > CB is stated as "must add array types…".  I think this is a bit loaded.  
If B was the (A + the implementation needs to be a non-null frozen float32 array, serialisation forward compatible with other frozen arrays later implemented) I would put this before (A).  Especially because it's been shown already this is easy to implement. 





Re: [POLL] Vector type for ML

2023-05-02 Thread Benedict
Could folk voting against a general purpose type (that could well be called a vector) briefly explain their reasoning?We established in the other thread that it’s technically trivial, meaning folk must think it is strictly superior to only support float rather than eg all numeric types (note: for the type, not the ANN). I am surprised, and the blurbs accompanying votes so far don’t seem to touch on this, mostly just endorsing the idea of a vector.On 2 May 2023, at 20:20, Patrick McFadin  wrote:A > B > C on both polls. Having talked to several users in the community that are highly excited about this change, this gets to what developers want to do at Cassandra scale: store embeddings and retrieve them. On Tue, May 2, 2023 at 11:47 AM Andrés de la Peña  wrote:A > B > CI don't think that ML is such a niche application that it can't have its own CQL data type. Also, vectors are mathematical elements that have more applications that ML.On Tue, 2 May 2023 at 19:15, Mick Semb Wever  wrote:On Tue, 2 May 2023 at 17:14, Jonathan Ellis  wrote:Should we add a vector type to Cassandra designed to meet the needs of machine learning use cases, specifically feature and embedding vectors for training, inference, and vector search?  ML vectors are fixed-dimension (fixed-length) sequences of numeric types, with no nulls allowed, and with no need for random access. 
The ML industry overwhelmingly uses float32 vectors, to the point that the industry-leading special-purpose vector database ONLY supports that data type.This poll is to gauge consensus subsequent to the recent discussion thread at https://lists.apache.org/thread/0lj1nk9jbhkf1rlgqcvxqzfyntdjrnk0.Please rank the discussed options from most preferred option to least, e.g., A > B > C (A is my preference, followed by B, followed by C) or C > B = A (C is my preference, followed by B or A approximately equally.)(A) I am in favor of adding a vector type for floats; I do not believe we need to tie it to any particular implementation details.(B) I am okay with adding a vector type but I believe we must add array types that compose with all Cassandra types first, and make vectors a special case of arrays-without-null-elements.(C) I am not in favor of adding a built-in vector type.A  > B > CB is stated as "must add array types…".  I think this is a bit loaded.  If B was the (A + the implementation needs to be a non-null frozen float32 array, serialisation forward compatible with other frozen arrays later implemented) I would put this before (A).  Especially because it's been shown already this is easy to implement. 




Re: [POLL] Vector type for ML

2023-05-02 Thread Benedict
This is not the poll I thought we would be conducting, and I don’t really support its framing. There are two parallel questions: what the functionality should be and how they should be exposed. This poll compresses the optionality poorly.Whether or not we support a “vector” concept (or something isomorphic with it), the first question this poll wants to answer is:A) Should we introduce a new CQL collection type that is unique to ML and *only* supports float32B) Should we introduce a type that is general purpose, and supports all Cassandra types, so that this may be used to support ML (and perhaps other) workloadsC) Should we not introduce new types to CQL at allFor this question, I vote B only.Once this question is answered it makes sense to answer how it will be exposed semantically/syntactically. On 2 May 2023, at 16:43, Jonathan Ellis  wrote:My preference: A > B > C.  Vectors are distinct enough from arrays that we should not make adding the latter a prerequisite for adding the former.On Tue, May 2, 2023 at 10:13 AM Jonathan Ellis  wrote:Should we add a vector type to Cassandra designed to meet the needs of machine learning use cases, specifically feature and embedding vectors for training, inference, and vector search?  ML vectors are fixed-dimension (fixed-length) sequences of numeric types, with no nulls allowed, and with no need for random access. 
The ML industry overwhelmingly uses float32 vectors, to the point that the industry-leading special-purpose vector database ONLY supports that data type.This poll is to gauge consensus subsequent to the recent discussion thread at https://lists.apache.org/thread/0lj1nk9jbhkf1rlgqcvxqzfyntdjrnk0.Please rank the discussed options from most preferred option to least, e.g., A > B > C (A is my preference, followed by B, followed by C) or C > B = A (C is my preference, followed by B or A approximately equally.)(A) I am in favor of adding a vector type for floats; I do not believe we need to tie it to any particular implementation details.(B) I am okay with adding a vector type but I believe we must add array types that compose with all Cassandra types first, and make vectors a special case of arrays-without-null-elements.(C) I am not in favor of adding a built-in vector type.-- Jonathan Ellisco-founder, http://www.datastax.com@spyced
-- Jonathan Ellis, co-founder, http://www.datastax.com, @spyced
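The poll text above motivates the float32 vector type by its constraints: fixed dimension, no nulls, no random access. Those constraints are what make the on-disk form compact. The following is a minimal Python stand-in for that serialization (names like `serialize_vector` and the module-level `DIMENSION` are illustrative assumptions, not Cassandra's actual `VectorType` code):

```python
import struct

DIMENSION = 4  # fixed per-column dimension, declared by the schema

def serialize_vector(values, dimension=DIMENSION):
    """Encode a fixed-dimension float32 vector as bytes.

    Because the dimension is fixed by the schema and nulls are disallowed,
    no per-element length prefixes or null markers are needed: the payload
    is exactly 4 * dimension bytes.
    """
    if len(values) != dimension:
        raise ValueError(f"expected {dimension} elements, got {len(values)}")
    if any(v is None for v in values):
        raise ValueError("vector elements may not be null")
    return struct.pack(f">{dimension}f", *values)

def deserialize_vector(blob, dimension=DIMENSION):
    """Decode a payload produced by serialize_vector."""
    return list(struct.unpack(f">{dimension}f", blob))
```

Note that dropping either constraint (allowing nulls, or variable length) forces per-element markers or a length header back into the encoding, which is the crux of the vector-versus-general-array debate in this thread.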


Re: [DISCUSS] New data type for vector search

2023-05-02 Thread Benedict
If we agree we’re delivering some general purpose array type, that supports all types as elements (ie, is logicaly equivalent to a frozen list of fixed length, however it is actually implemented), I think we are in technical agreement and it’s just a matter of presentation.At which point I think we should simply collect the possible syntax options and put them to a poll. I’m not keen on vector for previously stated reasons, but it’s probably not worth litigating further and we should let the silent majority adjudicate.On 2 May 2023, at 12:43, Jonathan Ellis  wrote:To make sure I understand correctly -- are you saying that you're fine with a vector type, but you want to see it implemented as a special case of arrays, or that you are not fine with a vector type because you would prefer to only add arrays and that should be "good enough" for ML?On Mon, May 1, 2023 at 4:27 PM Benedict <bened...@apache.org> wrote:A data type plug-in is actually really easy today, I think? But, developing further hooks should probably be thought through as they’re necessary. I think in this case it would be simpler to deliver a general purpose type, which is why I’m trying to propose types that would be acceptable.I also think we’re pretty close to agreement, really?But if not, let’s flesh out potential plug-in requirements.On 1 May 2023, at 21:58, Josh McKenzie <jmcken...@apache.org> wrote:If we want to make an ML-specific data type, it should be in an ML plug-in.How can we encourage a healthier plug-in ecosystem? 
As far as I know it's been pretty anemic historically:cassandra: https://cassandra.apache.org/doc/latest/cassandra/plugins/index.htmlpostgres: https://www.postgresql.org/docs/current/contrib.htmlI'm really interested to hear if there's more in the ecosystem I'm not aware of or if there's been strides made in this regard; users in the ecosystem being able to write durable extensions to Cassandra that they can then distribute and gain momentum could potentially be a great incubator for new features or functionality in the ecosystem.If our support for extensions remains as bare as I believe it to be, I wouldn't recommend anyone go that route.On Mon, May 1, 2023, at 4:17 PM, Benedict wrote:I have explained repeatedly why I am opposed to ML-specific data types. If we want to make an ML-specific data type, it should be in an ML plug-in. We should not pollute the general purpose language with hastily-considered features that target specific bandwagons - at best partially - no matter how exciting the bandwagon.I think a simple and easy case can be made for fixed length array types that do not seem to create random bits of cruft in the language that dangle by themselves should this play not pan out. This is an easy way for this effort to make progress without negatively impacting the language.That is, unless we want to start supporting totally random types for every use case at the top level language layer. I don’t think this is a good idea, personally, and I’m quite confident we would now be regretting this approach had it been taken for earlier bandwagons.Nor do I think anyone’s priors about how successful this effort will be should matter. As a matter of principle, we should simply never deliver a specialist functionality as a high level CQL language feature without at least baking it for several years as a plug-in.On 1 May 2023, at 21:03, Mick Semb Wever <m...@apache.org> wrote:Yes!  
What you (David) and Benedict write beautifully supports `VECTOR FLOAT[n]` imho.You are definitely bringing up valid implementation details, and that can be dealt with during patch review. This thread is about the CQL API addition.  No matter which way the technical review goes with the implementation details, `VECTOR FLOAT[n]` does not limit it, and gives us the most ML idiomatic approach and the best long-term CQL API.  It's a win-win situation – no matter how you look at it imho it is the best solution api wise.  Unless the suggestion is that an ideal implementation can give us a better CQL API – but I don't see what that could be.   Maybe the suggestion is we deny the possibility of using the VECTOR keyword and bring us back to something like `NON-NULL FROZEN`.   This is odd to me because `VECTOR` here can be just an alias for `NON-NULL FROZEN` while meeting the patch's audience and their idioms.  I have no problems with introducing such an alias to meet the ML crowd.Another way I think of this is `VECTOR FLOAT[n]` is the porcelain ML cql api, `NON-NULL FROZEN` and `FROZEN` and `FLOAT[n]` are the general-use plumbing cql apis. This would allow implementation details to be moved out of this thread and to the review phase.On Mon, 1 May 2023 at 20:57, David Capwell <dcapw...@apple.com> wrote:> I think it is totally reasonable that the ANN patch (and Jonathan) is not asked to implement on top of, or towards, other array (or other) new data types.   This impacts serialization, if you do not think abou

Re: [DISCUSS] New data type for vector search

2023-05-01 Thread Benedict
A data type plug-in is actually really easy today, I think? But developing further hooks should probably be thought through as they're necessary. I think in this case it would be simpler to deliver a general purpose type, which is why I'm trying to propose types that would be acceptable.

I also think we're pretty close to agreement, really?

But if not, let's flesh out potential plug-in requirements.

On 1 May 2023, at 21:58, Josh McKenzie wrote:

If we want to make an ML-specific data type, it should be in an ML plug-in.

How can we encourage a healthier plug-in ecosystem? As far as I know it's been pretty anemic historically:
cassandra: https://cassandra.apache.org/doc/latest/cassandra/plugins/index.html
postgres: https://www.postgresql.org/docs/current/contrib.html

I'm really interested to hear if there's more in the ecosystem I'm not aware of, or if there's been strides made in this regard; users in the ecosystem being able to write durable extensions to Cassandra that they can then distribute and gain momentum could potentially be a great incubator for new features or functionality in the ecosystem. If our support for extensions remains as bare as I believe it to be, I wouldn't recommend anyone go that route.

On Mon, May 1, 2023, at 4:17 PM, Benedict wrote:

I have explained repeatedly why I am opposed to ML-specific data types. If we want to make an ML-specific data type, it should be in an ML plug-in. We should not pollute the general purpose language with hastily-considered features that target specific bandwagons - at best partially - no matter how exciting the bandwagon.

I think a simple and easy case can be made for fixed length array types that do not seem to create random bits of cruft in the language that dangle by themselves should this play not pan out. This is an easy way for this effort to make progress without negatively impacting the language.

That is, unless we want to start supporting totally random types for every use case at the top level language layer. I don't think this is a good idea, personally, and I'm quite confident we would now be regretting this approach had it been taken for earlier bandwagons.

Nor do I think anyone's priors about how successful this effort will be should matter. As a matter of principle, we should simply never deliver a specialist functionality as a high level CQL language feature without at least baking it for several years as a plug-in.

On 1 May 2023, at 21:03, Mick Semb Wever wrote:

Yes! What you (David) and Benedict write beautifully supports `VECTOR FLOAT[n]` imho.

You are definitely bringing up valid implementation details, and that can be dealt with during patch review. This thread is about the CQL API addition. No matter which way the technical review goes with the implementation details, `VECTOR FLOAT[n]` does not limit it, and gives us the most ML-idiomatic approach and the best long-term CQL API. It's a win-win situation – no matter how you look at it, imho it is the best solution API-wise. Unless the suggestion is that an ideal implementation can give us a better CQL API – but I don't see what that could be.

Maybe the suggestion is we deny the possibility of using the VECTOR keyword and bring us back to something like `NON-NULL FROZEN<FLOAT[n]>`. This is odd to me because `VECTOR` here can be just an alias for `NON-NULL FROZEN<FLOAT[n]>` while meeting the patch's audience and their idioms. I have no problems with introducing such an alias to meet the ML crowd.

Another way I think of this is `VECTOR FLOAT[n]` is the porcelain ML CQL API; `NON-NULL FROZEN<FLOAT[n]>` and `FROZEN<FLOAT[n]>` and `FLOAT[n]` are the general-use plumbing CQL APIs. This would allow implementation details to be moved out of this thread and to the review phase.

On Mon, 1 May 2023 at 20:57, David Capwell <dcapw...@apple.com> wrote:

> I think it is totally reasonable that the ANN patch (and Jonathan) is not asked to implement on top of, or towards, other array (or other) new data types.

This impacts serialization; if you do not think about this day 1 you then can't add later on without having to worry about migration and versioning…

Honestly I wanted to better understand the cost to be generic and the impact to ANN, so I took https://github.com/jbellis/cassandra/blob/vsearch/src/java/org/apache/cassandra/db/marshal/VectorType.java and made it handle every requirement I have listed so far (size, null, all types)… the current patch has several bugs at the type level that would need to be fixed, so had to fix those as well…. Total time to do this was 10 minutes… and this includes adding a method "public float[] composeAsFloats(ByteBuffer bytes)" which made the change to existing logic small (change VectorType.Serializer.instance.deserialize(buffer) to type.composeAsFloats(buffer))….

Did this have any impact to the final ByteBuffer? Nope, it had identical layout for the FloatType case, but works for all types…. I didn't change the fact we store the size (felt this could be removed, but then we could

Re: [DISCUSS] New data type for vector search

2023-05-01 Thread Benedict
I have explained repeatedly why I am opposed to ML-specific data types. If we want to make an ML-specific data type, it should be in an ML plug-in. We should not pollute the general purpose language with hastily-considered features that target specific bandwagons - at best partially - no matter how exciting the bandwagon.

I think a simple and easy case can be made for fixed length array types that do not seem to create random bits of cruft in the language that dangle by themselves should this play not pan out. This is an easy way for this effort to make progress without negatively impacting the language.

That is, unless we want to start supporting totally random types for every use case at the top level language layer. I don't think this is a good idea, personally, and I'm quite confident we would now be regretting this approach had it been taken for earlier bandwagons.

Nor do I think anyone's priors about how successful this effort will be should matter. As a matter of principle, we should simply never deliver a specialist functionality as a high level CQL language feature without at least baking it for several years as a plug-in.

On 1 May 2023, at 21:03, Mick Semb Wever wrote:

Yes! What you (David) and Benedict write beautifully supports `VECTOR FLOAT[n]` imho.

You are definitely bringing up valid implementation details, and that can be dealt with during patch review. This thread is about the CQL API addition. No matter which way the technical review goes with the implementation details, `VECTOR FLOAT[n]` does not limit it, and gives us the most ML-idiomatic approach and the best long-term CQL API. It's a win-win situation – no matter how you look at it, imho it is the best solution API-wise. Unless the suggestion is that an ideal implementation can give us a better CQL API – but I don't see what that could be.

Maybe the suggestion is we deny the possibility of using the VECTOR keyword and bring us back to something like `NON-NULL FROZEN<FLOAT[n]>`. This is odd to me because `VECTOR` here can be just an alias for `NON-NULL FROZEN<FLOAT[n]>` while meeting the patch's audience and their idioms. I have no problems with introducing such an alias to meet the ML crowd.

Another way I think of this is `VECTOR FLOAT[n]` is the porcelain ML CQL API; `NON-NULL FROZEN<FLOAT[n]>` and `FROZEN<FLOAT[n]>` and `FLOAT[n]` are the general-use plumbing CQL APIs. This would allow implementation details to be moved out of this thread and to the review phase.

On Mon, 1 May 2023 at 20:57, David Capwell <dcapw...@apple.com> wrote:

> I think it is totally reasonable that the ANN patch (and Jonathan) is not asked to implement on top of, or towards, other array (or other) new data types.


This impacts serialization, if you do not think about this day 1 you then can’t add later on without having to worry about migration and versioning… 

Honestly I wanted to better understand the cost to be generic and the impact to ANN, so I took https://github.com/jbellis/cassandra/blob/vsearch/src/java/org/apache/cassandra/db/marshal/VectorType.java and made it handle every requirement I have listed so far (size, null, all types)… the current patch has several bugs at the type level that would need to be fixed, so had to fix those as well…. Total time to do this was 10 minutes… and this includes adding a method "public float[] composeAsFloats(ByteBuffer bytes)” which made the change to existing logic small (change VectorType.Serializer.instance.deserialize(buffer) to type.composeAsFloats(buffer))….

Did this have any impact to the final ByteBuffer?  Nope, it had identical layout for the FloatType case, but works for all types…. I didn’t change the fact we store the size (felt this could be removed, but then we could never support expanding the vector in the future…)
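
A minimal sketch of the layout point above, for the float case: the serialized form is just `dimension` floats packed back to back, with no per-element headers, so a `composeAsFloats`-style helper can read them straight out of the buffer. Note this is an illustrative toy, not the actual `VectorType` from the linked branch; `FloatVectorCodec` and its method names are invented for the example (and it omits the size prefix discussed above).

```java
import java.nio.ByteBuffer;

// Illustrative sketch only, not Cassandra code: a fixed-dimension float
// vector codec whose layout is simply `dimension` big-endian floats.
public class FloatVectorCodec {
    private final int dimension;

    public FloatVectorCodec(int dimension) {
        this.dimension = dimension;
    }

    // Serialize: 4 bytes per element, fixed total size of 4 * dimension.
    public ByteBuffer serialize(float[] vector) {
        if (vector.length != dimension)
            throw new IllegalArgumentException("expected " + dimension + " elements");
        ByteBuffer out = ByteBuffer.allocate(Float.BYTES * dimension);
        for (float v : vector)
            out.putFloat(v);
        out.flip();
        return out;
    }

    // Analogue of the composeAsFloats(ByteBuffer) helper described above:
    // read the elements back without mutating the caller's buffer.
    public float[] composeAsFloats(ByteBuffer bytes) {
        ByteBuffer in = bytes.duplicate();
        float[] vector = new float[dimension];
        for (int i = 0; i < dimension; i++)
            vector[i] = in.getFloat();
        return vector;
    }

    public static void main(String[] args) {
        FloatVectorCodec codec = new FloatVectorCodec(3);
        float[] v = codec.composeAsFloats(codec.serialize(new float[] {1f, 2f, 3f}));
        System.out.println(v[0] + "," + v[1] + "," + v[2]);
    }
}
```

A generic element type would swap the hard-coded `putFloat`/`getFloat` for the element type's own serializer, which is the substance of the "works for all types" claim above.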

So, given the fact it takes a few minutes to implement all these requirements, I do find it very reasonable to push back and say we should make sure the new type is not leaking details from a special ANN index…. We have spent more time debating this than it takes to support… we also have fuzz testing on trunk so just updating org.apache.cassandra.utils.AbstractTypeGenerators to know about this new type means we get type coverage as well…

I have zero issues helping to review this patch and make sure the testing is on-par with existing types (this is a strong requirement for me)


> On May 1, 2023, at 10:40 AM, Mick Semb Wever <m...@apache.org> wrote:
> 
> 
> > But suggesting that Jonathan should work on implementing general purpose arrays seems to fall outside the scope of this discussion, since the result of such work wouldn't even fill the need Jonathan is targeting for here. 
> 
> Every comment I have made so far I have argued that the v1 work doesn’t need to do some things, but that the limitations proposed so far are not real requirements; there is a big difference between what “could be allowed” and what is implemented day one… I am pushing back on what “could be allo

Re: [DISCUSS] New data type for vector search

2023-05-01 Thread Benedict
Has anybody yet claimed it would be hard? Several folk seem ready to jump to 
the conclusion that this would be onerous, but as somebody with a good 
understanding of the storage layer I can assert with reasonable confidence that 
it would not be. As previously stated, the implementation largely already 
exists for frozen lists.

If we are going to let difficulty of implementation inform our CQL evolution, 
my view is that the bar for additional difficulty should be high, as CQL 
changes need to be well considered as they are not easily revisited - bad 
decisions survive indefinitely. The alternative as David points out is a 
plug-in system.

So, maybe let’s wait until somebody makes a specific and serious claim of how 
challenging it would be, with justification, before we jump to compromising our 
language evolution based on it. I’m not even sure yet that this is really a 
consideration by anyone involved.

> On 1 May 2023, at 18:41, Mick Semb Wever  wrote:
> 
> 
>> 
>> > But suggesting that Jonathan should work on implementing general purpose 
>> > arrays seems to fall outside the scope of this discussion, since the 
>> > result of such work wouldn't even fill the need Jonathan is targeting for 
>> > here. 
>> 
>> Every comment I have made so far I have argued that the v1 work doesn’t need 
>> to do some things, but that the limitations proposed so far are not real 
>> requirements; there is a big difference between what “could be allowed” and 
>> what is implemented day one… I am pushing back on what “could be allowed”, 
>> so far every justification has been that it slows down the ANN work…
>> 
>> Simple examples of this already exists in C* (every example could be 
>> enhanced logically, we just have yet to put in the work)
>> 
>> * updating an element of a list is only allowed for multi-cell
>> * appending to a list is only allowed for multi-cell
>> * etc.
>> 
>> By saying that the type "shall not support", you actively block future work 
>> and future possibilities...
> 
> 
> 
> I am coming around strongly to the `VECTOR FLOAT[n]` option.
> 
> This gives Jonathan the simplest path right now with ths ANN work, while also 
> ensuring the CQL API gets the best future potential.
> 
> With `VECTOR FLOAT[n]` the 'vector' is the ml sugar that means non-null and 
> frozen, and that allows both today and in the future, as desired, for its 
> implementation to be entirely different to `FLOAT[n]`.  This addresses a 
> number of people's concerns that we meet ML's idioms head on.
> 
> IMHO it feels like it will fit into the ideal future CQL, where all 
> `primitive[N]` are implemented, and where we have VECTOR FLOAT[n] (and maybe 
> VECTOR BYTE[n]). This will also permit in the future `FROZEN<FLOAT[n]>` 
> if we wanted nulls in frozen arrays.
> 
> I think it is totally reasonable that the ANN patch (and Jonathan) is not 
> asked to implement on top of, or towards, other array (or other) new data 
> types.
> 
> I also think it is correct that we think about the evolution of CQL's API,  
> and how it might exist in the future when we have both ml vectors and general 
> use arrays.
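
The `VECTOR FLOAT[n]` semantics discussed above — a frozen, fixed-length value whose elements may not be null — can be sketched as a small validation routine. This is a toy model of the proposed semantics only; `VectorValueCheck` and `validate` are invented names, not Cassandra code.

```java
import java.util.Arrays;

// Toy model (not Cassandra code) of the `VECTOR FLOAT[n]` semantics
// discussed above: exactly n elements, none null, written as one frozen blob.
// A plain FROZEN<FLOAT[n]> could, by contrast, permit per-element nulls.
public class VectorValueCheck {
    public static float[] validate(Float[] value, int dimension) {
        if (value == null || value.length != dimension)
            throw new IllegalArgumentException("vector must have exactly " + dimension + " elements");
        float[] frozen = new float[dimension];
        for (int i = 0; i < dimension; i++) {
            if (value[i] == null)
                throw new IllegalArgumentException("vector elements may not be null");
            frozen[i] = value[i];
        }
        // Returned as a single immutable value, like any frozen type.
        return frozen;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(validate(new Float[] {0.1f, 0.2f}, 2)));
    }
}
```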


Re: [DISCUSS] New data type for vector search

2023-04-28 Thread Benedict
I and others have claimed that an array concept will work, since it is isomorphic with a vector. I have seen the following counterclaims:

1. Vectors don't need to support index lookups
2. Vectors don't need to support ordered indexes
3. Vectors don't need to support other types besides float

None of these say that vectors are not arrays. At most these say "ANN indexes should only support float types", which is different, and not something I would dispute.

If the claim is "there is no concept of arrays that is compatible with vector search" then let's focus on that, because that is probably the initial source of the disconnect.

On 28 Apr 2023, at 18:13, Henrik Ingo wrote:

Benedict, I don't quite see why that matters? The argument is merely that this kind of vector, for this use case, a) is different from arrays, and b) arrays apparently don't serve the use case well enough (or at all).

Now, if from the above it follows a discussion that a vector type cannot be a first class Cassandra type... that is of course a possible argument. But suggesting that Jonathan should work on implementing general purpose arrays seems to fall outside the scope of this discussion, since the result of such work wouldn't even fill the need Jonathan is targeting here. I could also ask Jonathan to work on a JSONB data type, and it similarly would not be an interesting proposal to Jonathan, as it wouldn't fill the need for the specific use case he is targeting.

But back to the main question... Why wouldn't a "vector for floats" type be general purpose enough that it should be delegated to some plugin? Machine Learning is a broad field in itself, with dozens of algorithms you could choose to use to build an AI model. And AI can be used in pretty much every industry vertical. If anything, I would claim DECIMAL is much more an industry-specific special case type than these ML vectors would be.

Back to Jonathan:

> So in order of what makes sense to me:
> 1. Add a vector type for just floats; consider adding bytes later if demand materializes. This gives us 99% of the value and limits the scope so we can deliver quickly.
> 2. Add a vector type for floats or bytes. This gives us another 1% of value in exchange for an extra 20% or so of effort.

Is it possible to implement 1 in a way that makes 2 possible in a future version?

henrik

On Fri, Apr 28, 2023 at 7:33 PM Benedict <bened...@apache.org> wrote:

pgvector is a plug-in. If you were proposing a plug-in you could ignore these considerations.

On 28 Apr 2023, at 16:58, Jonathan Ellis <jbel...@gmail.com> wrote:

I'm proposing a vector data type for ML use cases. It's not the same thing as an array or a list and it's not supposed to be.

While it's true that it would be possible to build a vector type on top of an array type, it's not necessary to do it that way, and given the lack of interest in an array type for its own sake I don't see why we would want to make that a requirement.

It's relevant that pgvector, which among the systems offering vector search is based on the most similar system to Cassandra in terms of its query language, adds a vector data type that only supports floats *even though postgresql already has an array data type*, because the semantics are different. Random access doesn't make sense, string and collection and other datatypes don't make sense, typical ordered indexes don't make sense, etc. It's just a different beast from arrays, for a different use case.

On Fri, Apr 28, 2023 at 10:40 AM Benedict <bened...@apache.org> wrote:

But you're proposing introducing a general purpose type - this isn't an ML plug-in, it's modifying the core language in a manner that makes targeting your workload easier. Which is fine, but that means you have to consider its impact on the general language, not just your target use case.

On 28 Apr 2023, at 16:29, Jonathan Ellis <jbel...@gmail.com> wrote:

That's exactly right.

In particular it makes no sense at all from an ML perspective to have vector types of anything other than numerics. And as I mentioned in the POC thread (but I did not mention here), float is overwhelmingly the most frequently used vector type, to the point that Pinecone (by far the most popular vector search engine) ONLY supports that type.

Lucene and Elastic also add support for vectors of bytes (8-bit ints), which are useful for optimizing models that you have already built with floats, but we have no reasonable path towards supporting indexing and searches against any other vector type.

So in order of what makes sense to me:

1. Add a vector type for just floats; consider adding bytes later if demand materializes. This gives us 99% of the value and limits the scope so we can deliver quickly.
2. Add a vector type for floats or bytes. This gives us another 1% of value in exchange for an extra 20% or so of effort.
3. Add a vector type for all numeric primitives, but you can only index floats and bytes. I think this is confusing to us
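
As background to the "why floats" point above: ANN search ranks candidates by a similarity function evaluated over float vectors, along these lines. This is an illustrative sketch of one common measure (cosine similarity), not code from Cassandra, pgvector, Lucene, or any other system mentioned in the thread.

```java
// Background sketch: the kind of per-vector computation an ANN index
// performs, which is why float is the dominant element type in practice.
public class CosineSimilarity {
    public static float cosine(float[] a, float[] b) {
        float dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];      // accumulate the dot product
            normA += a[i] * a[i];    // and each vector's squared norm
            normB += b[i] * b[i];
        }
        return dot / (float) (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Identical vectors have similarity 1.0; orthogonal vectors 0.0.
        System.out.println(cosine(new float[] {1f, 0f}, new float[] {1f, 0f}));
        System.out.println(cosine(new float[] {1f, 0f}, new float[] {0f, 1f}));
    }
}
```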
