Re: [DISCUSS] Enhanced Disk Error Handling

2023-03-09 Thread Bowen Song via dev
From an operator's view, I think the most reliable indicator is not the 
total count of corruption events, but the frequency of the events. Let 
me try to explain that with some examples:


1. many corruption events in a short period of time, then nothing after that
   The disk is probably still healthy.
   The spike in corruption events could be the result of reading some
   bad blocks that haven't been accessed for a long time.
   A warning in the log is preferred.
2. sparse corruption events over many years, the total number is high
   The disk is probably still healthy.
   As long as the frequency does not have an obvious increasing trend,
   it should be fine.
   A warning in the log is preferred.
3. clusters of corruption events that started recently and continue to
   happen for days or weeks
   The disk is probably faulty.
   Unless the access pattern from the application side has changed,
   this is a fairly reliable indicator that the disk has failed or is
   about to.
   Initially, a warning in the log is preferred. If this persists for
   too long (a configurable number of days?), raise the severity level
   to error and, depending on the disk_failure_policy, possibly stop or
   kill the node.
4. many corruption events happening continuously
   The disk is probably faulty.
   Other than a faulty disk or damaged data (e.g. data getting
   overwritten by a rogue application, like a virus), nothing else
   could explain this situation.
   An error in the log is preferred and, depending on the
   disk_failure_policy, the node may be stopped or killed.

Internally, inside Cassandra, this could be implemented as a fixed 
number of scaling-sized time buckets, arranged in such a way that the 
event frequency over differently sized time windows can be calculated 
and compared to other recent time windows of the same size.
For example: 24 hourly buckets, 30 daily buckets and 24 monthly 
buckets would only need to store 78 integers, but would show the 
difference between the above 4 examples.
Externally, exposing those time buckets via MBeans should be 
sufficient; maybe an additional cumulative counter could be added too.
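
To make that concrete, here is a rough, untested sketch (the class and 
method names are made up, not existing Cassandra APIs) of how such 
lazily rotated buckets could look in Java:

    // Rough sketch only: class and method names are made up, not existing
    // Cassandra APIs. 24 hourly + 30 daily + 24 monthly buckets = 78 counters,
    // rotated lazily whenever a corruption event is recorded.
    public class CorruptionEventBuckets
    {
        private final long[] hourly  = new long[24];  // last 24 hours
        private final long[] daily   = new long[30];  // last 30 days
        private final long[] monthly = new long[24];  // last 24 (30-day) months
        private long lastHour = -1, lastDay = -1, lastMonth = -1;
        private long total = 0;                       // optional cumulative counter

        public synchronized void record(long nowMillis)
        {
            long hour  = nowMillis / 3_600_000L;
            long day   = hour / 24;
            long month = day / 30;
            rotate(hourly, lastHour, hour);
            rotate(daily, lastDay, day);
            rotate(monthly, lastMonth, month);
            lastHour = hour; lastDay = day; lastMonth = month;
            hourly[(int) (hour % 24)]++;
            daily[(int) (day % 30)]++;
            monthly[(int) (month % 24)]++;
            total++;
        }

        // Zero out the buckets skipped since the last event, so counts left over
        // from a previous trip around the ring are not mistaken for recent events.
        private static void rotate(long[] buckets, long last, long current)
        {
            if (last < 0 || current - last >= buckets.length)
            {
                java.util.Arrays.fill(buckets, 0);
                return;
            }
            for (long t = last + 1; t <= current; t++)
                buckets[(int) (t % buckets.length)] = 0;
        }

        // Snapshots, e.g. for exposing the buckets through an MBean.
        public synchronized long[] hourlySnapshot()  { return hourly.clone(); }
        public synchronized long[] dailySnapshot()   { return daily.clone(); }
        public synchronized long[] monthlySnapshot() { return monthly.clone(); }
        public synchronized long totalEvents()       { return total; }
    }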


Failing that, a cumulative counter exposed via MBeans is fine. As an 
operator, I can always deal with that in other tools, such as Prometheus.
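
For what it's worth, exposing such a cumulative counter over JMX only 
takes a few lines with the standard platform MBean server (a sketch; 
the ObjectName domain and type names are arbitrary, not Cassandra's 
real metric names, and the interface and class would normally live in 
their own files):

    import java.lang.management.ManagementFactory;
    import java.util.concurrent.atomic.AtomicLong;
    import javax.management.ObjectName;

    public interface CorruptionEventsMBean
    {
        long getTotalCorruptionEvents();
    }

    public class CorruptionEvents implements CorruptionEventsMBean
    {
        private final AtomicLong total = new AtomicLong();

        public void increment() { total.incrementAndGet(); }

        @Override
        public long getTotalCorruptionEvents() { return total.get(); }

        public static void main(String[] args) throws Exception
        {
            CorruptionEvents counter = new CorruptionEvents();
            ManagementFactory.getPlatformMBeanServer().registerMBean(
                counter, new ObjectName("com.example.cassandra:type=CorruptionEvents"));
            counter.increment(); // now visible via JMX, scrapeable by e.g. the Prometheus JMX exporter
        }
    }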


On 09/03/2023 20:57, Abe Ratnofsky wrote:
> there's a point at which a host limping along is better put down and
> replaced


I did a basic literature review and it looks like load (total 
program-erase cycles), disk age, and operating temperature all lead to 
BER (bit error rate) increases. We don't need to build a whole model of 
disk failure; we could probably get a lot of mileage out of a warn / 
failure threshold for the number of automatic corruption repairs.


Under this model, Cassandra could automatically repair X (3?) 
corruption events before warning a user ("time to replace this host"), 
and Y (10?) corruption events before forcing itself down.


But it would be good to get a better sense of user expectations here. 
Bowen - how would you want Cassandra to handle frequent disk 
corruption events?


--
Abe


On Mar 9, 2023, at 12:44 PM, Josh McKenzie  wrote:

> I'm not seeing any reasons why CEP-21 would make this more difficult
> to implement
I think I communicated poorly - I was just trying to point out that 
there's a point at which a host limping along is better put down and 
replaced than piecemeal flagging range after range dead and working 
around it, and there's no immediately obvious "Correct" answer to 
where that point is regardless of what mechanism we're using to hold 
a cluster-wide view of topology.



> ...CEP-21 makes this sequencing safe...
For sure - I wouldn't advocate for any kind of "automated corrupt 
data repair" in a pre-CEP-21 world.


On Thu, Mar 9, 2023, at 2:56 PM, Abe Ratnofsky wrote:
I'm not seeing any reasons why CEP-21 would make this more difficult 
to implement, besides the fact that it hasn't landed yet.


There are two major potential pitfalls that CEP-21 would help us avoid:
1. Bit-errors beget further bit-errors, so we ought to be resistant 
to a high frequency of corruption events
2. Avoid token ownership changes when attempting to stream a 
corrupted token


I found some data supporting (1) - 
https://www.flashmemorysummit.com/English/Collaterals/Proceedings/2014/20140806_T1_Hetzler.pdf


If we detect bit-errors and store them in system_distributed, then 
we need a capacity to throttle that load and ensure that consistency 
is maintained.


When we attempt to rectify any bit-error by streaming data from 
peers, we implicitly take a lock on token ownership. A user needs to 
know that it is unsafe to change token ownership in a cluster that 
is currently in the process of repairing a corruption error on one 
of its instances' disks. CEP-21 makes this sequencing safe, and 
provides abstractions to better expose this information to operators.


--
Abe

On Mar 9, 2023, at 10:55 AM, Josh McKenzie  
wrote:


Personally, I'd like to see the fix for 

Re: Role of Hadoop code in Cassandra 5.0

2023-03-09 Thread Jeremy Hanna
It was mainly to integrate with Hadoop - I used it from 0.6 to 1.2 in 
production prior to starting at DataStax, and at that time I was stitching 
together Cloudera's distribution of Hadoop with Cassandra.  Back then there 
were others that used it as well.  As far as I know, usage dropped off when the 
Spark Cassandra Connector got pretty mature.  It enabled people to take an 
off-the-shelf Hadoop distribution, run the Hadoop processes on the same nodes 
as or external to the Cassandra cluster, and get topology information to do 
things like Hadoop splits through the Hadoop interfaces.  I think the version 
lag is an indication that it hasn't been used recently.  Also, like others have 
said, the Spark Cassandra Connector is really what people should be 
using at this point imo.  That or, depending on the use case, Apple's bulk 
reader: https://github.com/jberragan/spark-cassandra-bulkreader which is 
mentioned on https://issues.apache.org/jira/browse/CASSANDRA-16222.

> On Mar 9, 2023, at 12:00 PM, Rahul Xavier Singh 
>  wrote:
> 
> What is the hadoop code for? For interacting from Hadoop via CQL, or Thrift 
> if it's that old, or directly looking at SSTables? Been using C* since 2 and 
> have never used it. 
> 
> Agree to deprecate in next possible 4.1.x version and remove in 5.0 
> 
> Rahul Singh
> Chief Executive Officer | Business Platform Architect
> m: 202.905.2818 e: rahul.si...@anant.us  li: 
> http://linkedin.com/in/xingh ca: http://calendly.com/xingh
> 
> We create, support, and manage real-time global data & analytics platforms 
> for the modern enterprise.
> 
> Anant | https://anant.us 
> 3 Washington Circle, Suite 301
> Washington, D.C. 20037
> 
> http://Cassandra.Link  : The best resources for 
> Apache Cassandra
> 
> 
> On Thu, Mar 9, 2023 at 12:53 PM Brandon Williams  > wrote:
>> I think if we reach consensus here that decides it. I too vote to
>> deprecate in 4.1.x.  This means we would remove it in 5.0.
>> 
>> Kind Regards,
>> Brandon
>> 
>> On Thu, Mar 9, 2023 at 11:32 AM Ekaterina Dimitrova
>> mailto:e.dimitr...@gmail.com>> wrote:
>> >
>> > Deprecation sounds good to me, but I am not completely sure in which 
>> > version we can do it. If it is possible to add a deprecation warning in 
>> > the 4.x series or at least 4.1.x - I vote for that.
>> >
>> > On Thu, 9 Mar 2023 at 12:14, Jacek Lewandowski 
>> > mailto:lewandowski.ja...@gmail.com>> wrote:
>> >>
>> >> Is it possible to deprecate it in the 4.1.x patch release? :)
>> >>
>> >>
>> >> - - -- --- -  -
>> >> Jacek Lewandowski
>> >>
>> >>
>> >> czw., 9 mar 2023 o 18:11 Brandon Williams > >> > napisał(a):
>> >>>
>> >>> This is my feeling too, but I think we should accomplish this by
>> >>> deprecating it first.  I don't expect anything will change after the
>> >>> deprecation period.
>> >>>
>> >>> Kind Regards,
>> >>> Brandon
>> >>>
>> >>> On Thu, Mar 9, 2023 at 11:09 AM Jacek Lewandowski
>> >>> mailto:lewandowski.ja...@gmail.com>> wrote:
>> >>> >
>> >>> > I vote for removing it entirely.
>> >>> >
>> >>> > thanks
>> >>> > - - -- --- -  -
>> >>> > Jacek Lewandowski
>> >>> >
>> >>> >
>> >>> > czw., 9 mar 2023 o 18:07 Miklosovic, Stefan 
>> >>> > mailto:stefan.mikloso...@netapp.com>> 
>> >>> > napisał(a):
>> >>> >>
>> >>> >> Derek,
>> >>> >>
>> >>> >> I have couple more points ... I do not think that extracting it to a 
>> >>> >> separate repository is "win". That code is on Hadoop 1.0.3. We would 
>> >>> >> be spending a lot of work on extracting it just to extract 10 years 
>> >>> >> old code with occasional updates (in my humble opinion just to make 
>> >>> >> it compilable again if the code around changes). What good is in 
>> >>> >> that? We would have one more place to take care of ... Now we at 
>> >>> >> least have it all in one place.
>> >>> >>
>> >>> >> I believe we have four options:
>> >>> >>
>> >>> >> 1) leave it there so it will be like this is for next years with 
>> >>> >> questionable and diminishing usage
>> >>> >> 2) update it to Hadoop 3.3 (I wonder who is going to do that)
>> >>> >> 3) 2) and extract it to a separate repository but if we do 2) we can 
>> >>> >> just leave it there
>> >>> >> 4) remove it
>> >>> >>
>> >>> >> 
>> >>> >> From: Derek Chen-Becker > >>> >> >
>> >>> >> Sent: Thursday, March 9, 2023 15:55
>> >>> >> To: dev@cassandra.apache.org 
>> >>> >> Subject: Re: Role of Hadoop code in Cassandra 5.0
>> >>> >>
>> >>> >> NetApp Security WARNING: This is an external email. Do not click 
>> >>> >> links or open attachments unless you recognize the sender and know 
>> >>> >> the content is safe.
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> I think the question isn't "Who ... is still using that?" but more 
>> >>> >> "are we 

Re: Role of Hadoop code in Cassandra 5.0

2023-03-09 Thread Rahul Xavier Singh
What is the Hadoop code for? For interacting from Hadoop via CQL, or Thrift
if it's that old, or directly looking at SSTables? Been using C* since 2.x
and have never used it.

Agree to deprecate in next possible 4.1.x version and remove in 5.0

Rahul Singh

Chief Executive Officer | Business Platform Architect m: 202.905.2818 e:
rahul.si...@anant.us li: http://linkedin.com/in/xingh ca:
http://calendly.com/xingh

*We create, support, and manage real-time global data & analytics platforms
for the modern enterprise.*

*Anant | https://anant.us *

3 Washington Circle, Suite 301

Washington, D.C. 20037

*http://Cassandra.Link * : The best resources for
Apache Cassandra


On Thu, Mar 9, 2023 at 12:53 PM Brandon Williams  wrote:

> I think if we reach consensus here that decides it. I too vote to
> deprecate in 4.1.x.  This means we would remove it in 5.0.
>
> Kind Regards,
> Brandon
>
> On Thu, Mar 9, 2023 at 11:32 AM Ekaterina Dimitrova
>  wrote:
> >
> > Deprecation sounds good to me, but I am not completely sure in which
> version we can do it. If it is possible to add a deprecation warning in the
> 4.x series or at least 4.1.x - I vote for that.
> >
> > On Thu, 9 Mar 2023 at 12:14, Jacek Lewandowski <
> lewandowski.ja...@gmail.com> wrote:
> >>
> >> Is it possible to deprecate it in the 4.1.x patch release? :)
> >>
> >>
> >> - - -- --- -  -
> >> Jacek Lewandowski
> >>
> >>
> >> czw., 9 mar 2023 o 18:11 Brandon Williams 
> napisał(a):
> >>>
> >>> This is my feeling too, but I think we should accomplish this by
> >>> deprecating it first.  I don't expect anything will change after the
> >>> deprecation period.
> >>>
> >>> Kind Regards,
> >>> Brandon
> >>>
> >>> On Thu, Mar 9, 2023 at 11:09 AM Jacek Lewandowski
> >>>  wrote:
> >>> >
> >>> > I vote for removing it entirely.
> >>> >
> >>> > thanks
> >>> > - - -- --- -  -
> >>> > Jacek Lewandowski
> >>> >
> >>> >
> >>> > czw., 9 mar 2023 o 18:07 Miklosovic, Stefan <
> stefan.mikloso...@netapp.com> napisał(a):
> >>> >>
> >>> >> Derek,
> >>> >>
> >>> >> I have couple more points ... I do not think that extracting it to
> a separate repository is "win". That code is on Hadoop 1.0.3. We would be
> spending a lot of work on extracting it just to extract 10 years old code
> with occasional updates (in my humble opinion just to make it compilable
> again if the code around changes). What good is in that? We would have one
> more place to take care of ... Now we at least have it all in one place.
> >>> >>
> >>> >> I believe we have four options:
> >>> >>
> >>> >> 1) leave it there so it will be like this is for next years with
> questionable and diminishing usage
> >>> >> 2) update it to Hadoop 3.3 (I wonder who is going to do that)
> >>> >> 3) 2) and extract it to a separate repository but if we do 2) we
> can just leave it there
> >>> >> 4) remove it
> >>> >>
> >>> >> 
> >>> >> From: Derek Chen-Becker 
> >>> >> Sent: Thursday, March 9, 2023 15:55
> >>> >> To: dev@cassandra.apache.org
> >>> >> Subject: Re: Role of Hadoop code in Cassandra 5.0
> >>> >>
> >>> >> NetApp Security WARNING: This is an external email. Do not click
> links or open attachments unless you recognize the sender and know the
> content is safe.
> >>> >>
> >>> >>
> >>> >>
> >>> >> I think the question isn't "Who ... is still using that?" but more
> "are we actually going to support it?" If we're on a version that old it
> would appear that we've basically abandoned it, although there do appear to
> have been refactoring (for other things) commits in the last couple of
> years. I would be in favor of removal from 5.0, but at the very least,
> could it be moved into a separate repo/package so that it's not pulling a
> relatively large dependency subtree from Hadoop into our main codebase?
> >>> >>
> >>> >> Cheers,
> >>> >>
> >>> >> Derek
> >>> >>
> >>> >> On Thu, Mar 9, 2023 at 6:44 AM Miklosovic, Stefan <
> stefan.mikloso...@netapp.com> wrote:
> >>> >> Hi list,
> >>> >>
> >>> >> I stumbled upon Hadoop package again. I think there was some
> discussion about the relevancy of Hadoop code some time ago but I would
> like to ask this again.
> >>> >>
> >>> >> Do you think Hadoop code (1) is still relevant in 5.0? Who in the
> industry is still using that?
> >>> >>
> >>> >> We might drop a lot of code and some Hadoop dependencies too (3)
> (even their scope is "provided"). The version of Hadoop we build upon is
> 1.0.3 which was released 10 years ago. This code does not have any tests
> nor documentation on the website.
> >>> >>
> >>> >> There seems to be issues like this (2) and it seems like the
> solution is to, basically, use Spark Cassandra connector instead which I
> would say is quite reasonable.
> >>> >>
> >>> >> Regards
> >>> >>
> >>> >> (1)
> https://github.com/apache/cassandra/tree/trunk/src/java/org/apache/cassandra/hadoop
> >>> >> (2)

Re: [DISCUSS] Enhanced Disk Error Handling

2023-03-09 Thread Bowen Song via dev

   /When we attempt to rectify any bit-error by streaming data from
   peers, we implicitly take a lock on token ownership. A user needs to
   know that it is unsafe to change token ownership in a cluster that
   is currently in the process of repairing a corruption error on one
   of its instances' disks./

I'm not sure about this.

Based on my knowledge, streaming does not require a lock on token 
ownership. If the node subsequently loses ownership of the token range 
being streamed, it will just end up with some extra SSTable files 
containing useless data, and those files will get deleted when nodetool 
cleanup is run.


BTW, just pointing out the obvious: streaming is neither repairing nor 
bootstrapping. The latter two may require a lock on token ownership.


On 09/03/2023 19:56, Abe Ratnofsky wrote:
I'm not seeing any reasons why CEP-21 would make this more difficult 
to implement, besides the fact that it hasn't landed yet.


There are two major potential pitfalls that CEP-21 would help us avoid:
1. Bit-errors beget further bit-errors, so we ought to be resistant to 
a high frequency of corruption events
2. Avoid token ownership changes when attempting to stream a corrupted 
token


I found some data supporting (1) - 
https://www.flashmemorysummit.com/English/Collaterals/Proceedings/2014/20140806_T1_Hetzler.pdf


If we detect bit-errors and store them in system_distributed, then we 
need a capacity to throttle that load and ensure that consistency is 
maintained.


When we attempt to rectify any bit-error by streaming data from peers, 
we implicitly take a lock on token ownership. A user needs to know 
that it is unsafe to change token ownership in a cluster that is 
currently in the process of repairing a corruption error on one of its 
instances' disks. CEP-21 makes this sequencing safe, and provides 
abstractions to better expose this information to operators.


--
Abe


On Mar 9, 2023, at 10:55 AM, Josh McKenzie  wrote:

> Personally, I'd like to see the fix for this issue come after
> CEP-21. It could be feasible to implement a fix before then, that
> detects bit-errors on the read path and refuses to respond to the
> coordinator, implicitly having speculative execution handle the
> retry against another replica while repair of that range happens.
> But that feels suboptimal to me when a better framework is on the
> horizon.
I originally typed something in agreement with you but the more I 
think about this, the more a node-local "reject queries for specific 
token ranges" degradation profile seems like it _could_ work. I don't 
see an obvious way to remove the need for a human-in-the-loop on 
fixing things in a pre-CEP-21 world without opening pandora's box 
(Gossip + TMD + non-deterministic agreement on ownership state 
cluster-wide /cry).


And even in a post CEP-21 world you're definitely in the "at what 
point is it better to declare a host dead and replace it" fuzzy 
territory where there's no immediately correct answers.


A system_distributed table of corrupt token ranges that are currently 
being rejected by replicas with a mechanism to kick off a repair of 
those ranges could be interesting.


On Thu, Mar 9, 2023, at 1:45 PM, Abe Ratnofsky wrote:
Thanks for proposing this discussion Bowen. I see a few different 
issues here:


1. How do we safely handle corruption of a handful of tokens without 
taking an entire instance offline for re-bootstrap? This includes 
refusal to serve read requests for the corrupted token(s), and 
correct repair of the data.
2. How do we expose the corruption rate to operators, in a way that 
lets them decide whether a full disk replacement is worthwhile?
3. When CEP-21 lands it should become feasible to support ownership 
draining, which would let us migrate read traffic for a given token 
range away from an instance where that range is corrupted. Is it 
worth planning a fix for this issue before CEP-21 lands?


I'm also curious whether there's any existing literature on how 
different filesystems and storage media accommodate bit-errors 
(correctable and uncorrectable), so we can be consistent with those 
behaviors.


Personally, I'd like to see the fix for this issue come after 
CEP-21. It could be feasible to implement a fix before then, that 
detects bit-errors on the read path and refuses to respond to the 
coordinator, implicitly having speculative execution handle the 
retry against another replica while repair of that range happens. 
But that feels suboptimal to me when a better framework is on the 
horizon.


--
Abe

On Mar 9, 2023, at 8:23 AM, Bowen Song via dev 
 wrote:


Hi Jeremiah,

I'm fully aware of that, which is why I said that deleting the 
affected SSTable files is "less safe".


If the "bad blocks" logic is implemented and the node abort the 
current read query when hitting a bad block, it should remain safe, 
as the data in other SSTable files will not be used. The streamed 
data should contain the unexpired tombstones, and that's 

New episode of The Apache Cassandra (R) Corner podcast!

2023-03-09 Thread Aaron Ploetz
Link to the next episode:
https://drive.google.com/file/d/1IePasf681bU-7xRNl4tBzWvVG28y4tQK/view?usp=share_link

s2Ep3 - Loren Sands-Ramshaw
(You may have to download it to play)

FYI - Experimenting with a video podcast on this one.

It will remain in staging for 72 hours, going live (assuming no objections)
by Sunday, March 12th (17:00 UTC).

If anyone should have any questions or comments, or if you want to be a
guest, please reach out to me.

For my guest pipeline, I have recording sessions scheduled with:
- Valeri Karpov (MeanIT Software)

Looking for additional guests, so if you know someone who has a great use
case, let me know!

Thanks, everyone!

Aaron Ploetz


Re: [EXTERNAL] Re: [DISCUSS] Next release date

2023-03-09 Thread Mick Semb Wever
> > > One place we've been weak historically is in distinguishing between 
> > > tickets we consider "nice to have" and things that are "blockers". We 
> > > don't have any metadata that currently distinguishes those two, so 
> > > determining what our burndown leading up to 5.0 looks like is a lot more 
> > > data massaging and hand-waving than I'd prefer right now.
> >
> > We distinguish "blockers" with `Priority=Urgent` or `Severity=Critical`, or 
> > by linking the ticket as blocking to a specific ticket that spells it out. 
> > We do have the metadata, but yes it requires some work…
>
> For everything not urgent or a blocker, does it matter whether something has 
> a fixver of where we think it's going to land or where we'd like to see it 
> land? At the end of the day, neither of those scenarios will actually shift a 
> release date if we're proactively putting "blocker / urgent" status on new 
> features, improvements, and bugs we think are significant enough to delay a 
> release right?


Ooops, actually we were using the -beta and -rc fixVersion
placeholders to denote the blockers once "the bridge was crossed"
(while Urgent and Critical are used more broadly, e.g. for patch
releases). If we use this approach, then we could add a 5.0-alpha
placeholder that indicates a consensus on tickets blocking the
branching (if we agree alpha1 should be cut at the same time we
branch…). IMHO such tickets should also still be marked as Urgent, but
I suggest we use Urgent/Critical as an initial state, and the
fixVersion placeholders where we have consensus or where our release
criteria dictate it :shrug:


Re: [DISCUSS] Enhanced Disk Error Handling

2023-03-09 Thread Abe Ratnofsky
> there's a point at which a host limping along is better put down and replaced

I did a basic literature review and it looks like load (total program-erase 
cycles), disk age, and operating temperature all lead to BER (bit error rate) 
increases. We don't need to build a whole model of disk failure; we could 
probably get a lot of mileage out of a warn / failure threshold for the number 
of automatic corruption repairs.

Under this model, Cassandra could automatically repair X (3?) corruption events 
before warning a user ("time to replace this host"), and Y (10?) corruption 
events before forcing itself down.
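
To illustrate (purely hypothetical names; the X / Y values and the
disk_failure_policy hook are placeholders, not real Cassandra code), the
counting side of that could be as simple as:

    import java.util.concurrent.atomic.AtomicInteger;

    // Sketch only: warns after X automatic corruption repairs, asks the
    // configured failure policy to take the node down after Y.
    public class CorruptionRepairTracker
    {
        private final int warnThreshold;  // "X", e.g. 3
        private final int stopThreshold;  // "Y", e.g. 10
        private final AtomicInteger repairs = new AtomicInteger();

        public CorruptionRepairTracker(int warnThreshold, int stopThreshold)
        {
            this.warnThreshold = warnThreshold;
            this.stopThreshold = stopThreshold;
        }

        // Called after each automatic corruption repair completes.
        public void onCorruptionRepaired()
        {
            int count = repairs.incrementAndGet();
            if (count >= stopThreshold)
                System.err.println(count + " corruption repairs; forcing the node down per disk_failure_policy (placeholder)");
            else if (count >= warnThreshold)
                System.out.println(count + " corruption repairs; time to replace this host");
        }
    }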

But it would be good to get a better sense of user expectations here. Bowen - 
how would you want Cassandra to handle frequent disk corruption events?

--
Abe

> On Mar 9, 2023, at 12:44 PM, Josh McKenzie  wrote:
> 
>> I'm not seeing any reasons why CEP-21 would make this more difficult to 
>> implement
> I think I communicated poorly - I was just trying to point out that there's a 
> point at which a host limping along is better put down and replaced than 
> piecemeal flagging range after range dead and working around it, and there's 
> no immediately obvious "Correct" answer to where that point is regardless of 
> what mechanism we're using to hold a cluster-wide view of topology.
> 
>> ...CEP-21 makes this sequencing safe...
> For sure - I wouldn't advocate for any kind of "automated corrupt data 
> repair" in a pre-CEP-21 world.
> 
> On Thu, Mar 9, 2023, at 2:56 PM, Abe Ratnofsky wrote:
>> I'm not seeing any reasons why CEP-21 would make this more difficult to 
>> implement, besides the fact that it hasn't landed yet.
>> 
>> There are two major potential pitfalls that CEP-21 would help us avoid:
>> 1. Bit-errors beget further bit-errors, so we ought to be resistant to a 
>> high frequency of corruption events
>> 2. Avoid token ownership changes when attempting to stream a corrupted token
>> 
>> I found some data supporting (1) - 
>> https://www.flashmemorysummit.com/English/Collaterals/Proceedings/2014/20140806_T1_Hetzler.pdf
>> 
>> If we detect bit-errors and store them in system_distributed, then we need a 
>> capacity to throttle that load and ensure that consistency is maintained.
>> 
>> When we attempt to rectify any bit-error by streaming data from peers, we 
>> implicitly take a lock on token ownership. A user needs to know that it is 
>> unsafe to change token ownership in a cluster that is currently in the 
>> process of repairing a corruption error on one of its instances' disks. 
>> CEP-21 makes this sequencing safe, and provides abstractions to better 
>> expose this information to operators.
>> 
>> --
>> Abe
>> 
>>> On Mar 9, 2023, at 10:55 AM, Josh McKenzie  wrote:
>>> 
 Personally, I'd like to see the fix for this issue come after CEP-21. It 
 could be feasible to implement a fix before then, that detects bit-errors 
 on the read path and refuses to respond to the coordinator, implicitly 
 having speculative execution handle the retry against another replica 
 while repair of that range happens. But that feels suboptimal to me when a 
 better framework is on the horizon.
>>> I originally typed something in agreement with you but the more I think 
>>> about this, the more a node-local "reject queries for specific token 
>>> ranges" degradation profile seems like it _could_ work. I don't see an 
>>> obvious way to remove the need for a human-in-the-loop on fixing things in 
>>> a pre-CEP-21 world without opening pandora's box (Gossip + TMD + 
>>> non-deterministic agreement on ownership state cluster-wide /cry).
>>> 
>>> And even in a post CEP-21 world you're definitely in the "at what point is 
>>> it better to declare a host dead and replace it" fuzzy territory where 
>>> there's no immediately correct answers.
>>> 
>>> A system_distributed table of corrupt token ranges that are currently being 
>>> rejected by replicas with a mechanism to kick off a repair of those ranges 
>>> could be interesting.
>>> 
>>> On Thu, Mar 9, 2023, at 1:45 PM, Abe Ratnofsky wrote:
 Thanks for proposing this discussion Bowen. I see a few different issues 
 here:
 
 1. How do we safely handle corruption of a handful of tokens without 
 taking an entire instance offline for re-bootstrap? This includes refusal 
 to serve read requests for the corrupted token(s), and correct repair of 
 the data.
 2. How do we expose the corruption rate to operators, in a way that lets 
 them decide whether a full disk replacement is worthwhile?
 3. When CEP-21 lands it should become feasible to support ownership 
 draining, which would let us migrate read traffic for a given token range 
 away from an instance where that range is corrupted. Is it worth planning 
 a fix for this issue before CEP-21 lands?
 
 I'm also curious whether there's any existing literature on how different 
 filesystems and storage media accommodate bit-errors (correctable and 

Re: [DISCUSS] Enhanced Disk Error Handling

2023-03-09 Thread Josh McKenzie
> I'm not seeing any reasons why CEP-21 would make this more difficult to 
> implement
I think I communicated poorly - I was just trying to point out that there's a 
point at which a host limping along is better put down and replaced than 
piecemeal flagging range after range dead and working around it, and there's no 
immediately obvious "Correct" answer to where that point is regardless of what 
mechanism we're using to hold a cluster-wide view of topology.

> ...CEP-21 makes this sequencing safe...
For sure - I wouldn't advocate for any kind of "automated corrupt data repair" 
in a pre-CEP-21 world.

On Thu, Mar 9, 2023, at 2:56 PM, Abe Ratnofsky wrote:
> I'm not seeing any reasons why CEP-21 would make this more difficult to 
> implement, besides the fact that it hasn't landed yet.
> 
> There are two major potential pitfalls that CEP-21 would help us avoid:
> 1. Bit-errors beget further bit-errors, so we ought to be resistant to a high 
> frequency of corruption events
> 2. Avoid token ownership changes when attempting to stream a corrupted token
> 
> I found some data supporting (1) - 
> https://www.flashmemorysummit.com/English/Collaterals/Proceedings/2014/20140806_T1_Hetzler.pdf
> 
> If we detect bit-errors and store them in system_distributed, then we need a 
> capacity to throttle that load and ensure that consistency is maintained.
> 
> When we attempt to rectify any bit-error by streaming data from peers, we 
> implicitly take a lock on token ownership. A user needs to know that it is 
> unsafe to change token ownership in a cluster that is currently in the 
> process of repairing a corruption error on one of its instances' disks. 
> CEP-21 makes this sequencing safe, and provides abstractions to better expose 
> this information to operators.
> 
> --
> Abe
> 
>> On Mar 9, 2023, at 10:55 AM, Josh McKenzie  wrote:
>> 
>>> Personally, I'd like to see the fix for this issue come after CEP-21. It 
>>> could be feasible to implement a fix before then, that detects bit-errors 
>>> on the read path and refuses to respond to the coordinator, implicitly 
>>> having speculative execution handle the retry against another replica while 
>>> repair of that range happens. But that feels suboptimal to me when a better 
>>> framework is on the horizon.
>> I originally typed something in agreement with you but the more I think 
>> about this, the more a node-local "reject queries for specific token ranges" 
>> degradation profile seems like it _could_ work. I don't see an obvious way 
>> to remove the need for a human-in-the-loop on fixing things in a pre-CEP-21 
>> world without opening pandora's box (Gossip + TMD + non-deterministic 
>> agreement on ownership state cluster-wide /cry).
>> 
>> And even in a post CEP-21 world you're definitely in the "at what point is 
>> it better to declare a host dead and replace it" fuzzy territory where 
>> there's no immediately correct answers.
>> 
>> A system_distributed table of corrupt token ranges that are currently being 
>> rejected by replicas with a mechanism to kick off a repair of those ranges 
>> could be interesting.
>> 
>> On Thu, Mar 9, 2023, at 1:45 PM, Abe Ratnofsky wrote:
>>> Thanks for proposing this discussion Bowen. I see a few different issues 
>>> here:
>>> 
>>> 1. How do we safely handle corruption of a handful of tokens without taking 
>>> an entire instance offline for re-bootstrap? This includes refusal to serve 
>>> read requests for the corrupted token(s), and correct repair of the data.
>>> 2. How do we expose the corruption rate to operators, in a way that lets 
>>> them decide whether a full disk replacement is worthwhile?
>>> 3. When CEP-21 lands it should become feasible to support ownership 
>>> draining, which would let us migrate read traffic for a given token range 
>>> away from an instance where that range is corrupted. Is it worth planning a 
>>> fix for this issue before CEP-21 lands?
>>> 
>>> I'm also curious whether there's any existing literature on how different 
>>> filesystems and storage media accommodate bit-errors (correctable and 
>>> uncorrectable), so we can be consistent with those behaviors.
>>> 
>>> Personally, I'd like to see the fix for this issue come after CEP-21. It 
>>> could be feasible to implement a fix before then, that detects bit-errors 
>>> on the read path and refuses to respond to the coordinator, implicitly 
>>> having speculative execution handle the retry against another replica while 
>>> repair of that range happens. But that feels suboptimal to me when a better 
>>> framework is on the horizon.
>>> 
>>> --
>>> Abe
>>> 
 On Mar 9, 2023, at 8:23 AM, Bowen Song via dev  
 wrote:
 
 Hi Jeremiah,
 
 I'm fully aware of that, which is why I said that deleting the affected 
 SSTable files is "less safe".
 
 If the "bad blocks" logic is implemented and the node abort the current 
 read query when hitting a bad block, it should remain safe, as the data in 
 other 

Re: Role of Hadoop code in Cassandra 5.0

2023-03-09 Thread Derek Chen-Becker
Honestly, I don't think moving it out in its current state is a win,
either. I'm +1 to deprecation in 4.1.x and removal in 5.0. If someone in
the community wants or needs the Hadoop code, it should be in a separate
repo/package, just like the Spark Connector.

Derek

On Thu, Mar 9, 2023 at 10:07 AM Miklosovic, Stefan <
stefan.mikloso...@netapp.com> wrote:

> Derek,
>
> I have couple more points ... I do not think that extracting it to a
> separate repository is "win". That code is on Hadoop 1.0.3. We would be
> spending a lot of work on extracting it just to extract 10 years old code
> with occasional updates (in my humble opinion just to make it compilable
> again if the code around changes). What good is in that? We would have one
> more place to take care of ... Now we at least have it all in one place.
>
> I believe we have four options:
>
> 1) leave it there so it will be like this is for next years with
> questionable and diminishing usage
> 2) update it to Hadoop 3.3 (I wonder who is going to do that)
> 3) 2) and extract it to a separate repository but if we do 2) we can just
> leave it there
> 4) remove it
>
> 
> From: Derek Chen-Becker 
> Sent: Thursday, March 9, 2023 15:55
> To: dev@cassandra.apache.org
> Subject: Re: Role of Hadoop code in Cassandra 5.0
>
> NetApp Security WARNING: This is an external email. Do not click links or
> open attachments unless you recognize the sender and know the content is
> safe.
>
>
>
> I think the question isn't "Who ... is still using that?" but more "are we
> actually going to support it?" If we're on a version that old it would
> appear that we've basically abandoned it, although there do appear to have
> been refactoring (for other things) commits in the last couple of years. I
> would be in favor of removal from 5.0, but at the very least, could it be
> moved into a separate repo/package so that it's not pulling a relatively
> large dependency subtree from Hadoop into our main codebase?
>
> Cheers,
>
> Derek
>
> On Thu, Mar 9, 2023 at 6:44 AM Miklosovic, Stefan <
> stefan.mikloso...@netapp.com> wrote:
> Hi list,
>
> I stumbled upon Hadoop package again. I think there was some discussion
> about the relevancy of Hadoop code some time ago but I would like to ask
> this again.
>
> Do you think Hadoop code (1) is still relevant in 5.0? Who in the industry
> is still using that?
>
> We might drop a lot of code and some Hadoop dependencies too (3) (even
> their scope is "provided"). The version of Hadoop we build upon is 1.0.3
> which was released 10 years ago. This code does not have any tests nor
> documentation on the website.
>
> There seems to be issues like this (2) and it seems like the solution is
> to, basically, use Spark Cassandra connector instead which I would say is
> quite reasonable.
>
> Regards
>
> (1)
> https://github.com/apache/cassandra/tree/trunk/src/java/org/apache/cassandra/hadoop
> (2) https://lists.apache.org/thread/jdy5hdc2l7l29h04dqol5ylroqos1y2p
> (3)
> https://github.com/apache/cassandra/blob/trunk/.build/parent-pom-template.xml#L507-L589
>
>
> --
> +---+
> | Derek Chen-Becker |
> | GPG Key available at https://keybase.io/dchenbecker and   |
> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
> +---+
>
>

-- 
+---+
| Derek Chen-Becker |
| GPG Key available at https://keybase.io/dchenbecker and   |
| https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
| Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
+---+


Re: [EXTERNAL] Re: [DISCUSS] Next release date

2023-03-09 Thread Josh McKenzie
> We do have the metadata, but yes it requires some work…
My wording was poor; we have the *potential* to have this metadata, but to my 
knowledge we haven't built the muscle of consistently setting it, or any kind of 
heuristic to determine when something should block a release or not. At least 
on 4.0 and 4.1, it seemed this was a bridge we crossed informally in the run-up 
to a date, trying to figure out what to include or discard.

> The project previously made an agreement to one release a year,
I don't recall the details (and searching our... rather active threads is an 
undertaking) - our site has a blog post here: 
https://cassandra.apache.org/_/blog/Apache-Cassandra-Changelog-7-May-2021.html, 
that states: "The community has agreed to one release every year, plus periodic 
trunk snapshots". While it reads like "one a calendar year" to me, at the end 
of the day what's important to me is we do right by our users. So whether we 
interpret that as every 12 months, once per calendar year, or once every July with 
a freeze in May (train style), it's all fine by me actually. I more or less stand by 
"just not 'release monthly' and not 'release once every three years'". :) Got 
any clarity there?

> I (and others) wish to do the exercise of running through our 5.x list and 
> pushing out everything we can see with no commitment or activity (and also 
> closing out old and now irrelevant/inapplicable tickets) (and this will be 
> done via a proposed filter). But a question here is the fixVersion can infer 
> where a ticket can be applied (appropriateness) or where we foresee it 
> landing (roadmap). 
I'm +1 to this. If people want something to be different they can just toggle 
it back and bring it to the ML or slack.

For everything not urgent or a blocker, does it matter whether something has a 
fixver of where we think it's going to land or where we'd like to see it land? 
At the end of the day, neither of those scenarios will actually shift a release 
date if we're proactively putting "blocker / urgent" status on new features, 
improvements, and bugs we think are significant enough to delay a release, right?

On Thu, Mar 9, 2023, at 3:17 PM, Mick Semb Wever wrote:
>> One place we've been weak historically is in distinguishing between tickets 
>> we consider "nice to have" and things that are "blockers". We don't have any 
>> metadata that currently distinguishes those two, so determining what our 
>> burndown leading up to 5.0 looks like is a lot more data massaging and 
>> hand-waving than I'd prefer right now.
> 
> 
> We distinguish "blockers" with `Priority=Urgent` or `Severity=Critical`, or 
> by linking the ticket as blocking to a specific ticket that spells it out. We 
> do have the metadata, but yes it requires some work…
> 
> The project previously made an agreement to one release a year, akin to a 
> release train model, which helps justify why fixVersion 5.x has just fallen 
> to be "next". (And then there is no "burn-down" in such a model.) 
> 
> Our release criteria, especially post-branch, demonstrates that we do 
> introduce and rely on "blockers". If we agree that certain exceptional CEPs 
> are "blockers", a la warrant delaying the release date, using this approach 
> seems to fit in appropriately.
> 
> When I (just) folded fixVersion 4.2 into 5.0 (and 4.x into 5.x), I also 
> created 5.1.x and 6.x.  I (and others) wish to do the exercise of running 
> through our 5.x list and pushing out everything we can see with no commitment 
> or activity (and also closing out old and now irrelevant/inapplicable 
> tickets) (and this will be done via a proposed filter). But a question here 
> is the fixVersion can infer where a ticket can be applied (appropriateness) 
> or where we foresee it landing (roadmap). For example we mark bugs with the 
> fixVersions ideally they should be applied to, regardless of whether anyone 
> comes to address them or not. 
> 
> 
> 


Re: [EXTERNAL] Re: [DISCUSS] Next release date

2023-03-09 Thread Mick Semb Wever
>
> One place we've been weak historically is in distinguishing between
> tickets we consider "nice to have" and things that are "blockers". We don't
> have any metadata that currently distinguishes those two, so determining
> what our burndown leading up to 5.0 looks like is a lot more data massaging
> and hand-waving than I'd prefer right now.
>


We distinguish "blockers" with `Priority=Urgent` or `Severity=Critical`, or
by linking the ticket as blocking to a specific ticket that spells it out.
We do have the metadata, but yes it requires some work…

The project previously made an agreement to one release a year, akin to a
release train model, which helps justify why fixVersion 5.x has just fallen
to be "next". (And then there is no "burn-down" in such a model.)

Our release criteria, especially post-branch, demonstrates that we do
introduce and rely on "blockers". If we agree that certain exceptional CEPs
are "blockers", a la warrant delaying the release date, using this approach
seems to fit in appropriately.

When I (just) folded fixVersion 4.2 into 5.0 (and 4.x into 5.x), I also
created 5.1.x and 6.x.  I (and others) wish to do the exercise of running
through our 5.x list and pushing out everything we can see with no
commitment or activity (and also closing out old and now
irrelevant/inapplicable tickets) (and this will be done via a proposed
filter). But a question here is the fixVersion can infer where a ticket can
be applied (appropriateness) or where we foresee it landing (roadmap). For
example we mark bugs with the fixVersions ideally they should be applied
to, regardless of whether anyone comes to address them or not.


Re: Role of Hadoop code in Cassandra 5.0

2023-03-09 Thread Jacek Lewandowski
Is there a ticket for that?

- - -- --- -  -
Jacek Lewandowski


czw., 9 mar 2023 o 20:27 Mick Semb Wever  napisał(a):

>
>
> On Thu, 9 Mar 2023 at 18:54, Brandon Williams  wrote:
>
>> I think if we reach consensus here that decides it. I too vote to
>> deprecate in 4.1.x.  This means we would remove it in 5.0.
>>
>
>
> +1
>
>
>


Re: [DISCUSS] Enhanced Disk Error Handling

2023-03-09 Thread Abe Ratnofsky
I'm not seeing any reasons why CEP-21 would make this more difficult to 
implement, besides the fact that it hasn't landed yet.

There are two major potential pitfalls that CEP-21 would help us avoid:
1. Bit-errors beget further bit-errors, so we ought to be resistant to a high 
frequency of corruption events
2. Avoid token ownership changes when attempting to stream a corrupted token

I found some data supporting (1) - 
https://www.flashmemorysummit.com/English/Collaterals/Proceedings/2014/20140806_T1_Hetzler.pdf

If we detect bit-errors and store them in system_distributed, then we need a 
capacity to throttle that load and ensure that consistency is maintained.

When we attempt to rectify any bit-error by streaming data from peers, we 
implicitly take a lock on token ownership. A user needs to know that it is 
unsafe to change token ownership in a cluster that is currently in the process 
of repairing a corruption error on one of its instances' disks. CEP-21 makes 
this sequencing safe, and provides abstractions to better expose this 
information to operators.

--
Abe

> On Mar 9, 2023, at 10:55 AM, Josh McKenzie  wrote:
> 
>> Personally, I'd like to see the fix for this issue come after CEP-21. It 
>> could be feasible to implement a fix before then, that detects bit-errors on 
>> the read path and refuses to respond to the coordinator, implicitly having 
>> speculative execution handle the retry against another replica while repair 
>> of that range happens. But that feels suboptimal to me when a better 
>> framework is on the horizon.
> I originally typed something in agreement with you but the more I think about 
> this, the more a node-local "reject queries for specific token ranges" 
> degradation profile seems like it _could_ work. I don't see an obvious way to 
> remove the need for a human-in-the-loop on fixing things in a pre-CEP-21 
> world without opening pandora's box (Gossip + TMD + non-deterministic 
> agreement on ownership state cluster-wide /cry).
> 
> And even in a post CEP-21 world you're definitely in the "at what point is it 
> better to declare a host dead and replace it" fuzzy territory where there's 
> no immediately correct answers.
> 
> A system_distributed table of corrupt token ranges that are currently being 
> rejected by replicas with a mechanism to kick off a repair of those ranges 
> could be interesting.
> 
> On Thu, Mar 9, 2023, at 1:45 PM, Abe Ratnofsky wrote:
>> Thanks for proposing this discussion Bowen. I see a few different issues 
>> here:
>> 
>> 1. How do we safely handle corruption of a handful of tokens without taking 
>> an entire instance offline for re-bootstrap? This includes refusal to serve 
>> read requests for the corrupted token(s), and correct repair of the data.
>> 2. How do we expose the corruption rate to operators, in a way that lets 
>> them decide whether a full disk replacement is worthwhile?
>> 3. When CEP-21 lands it should become feasible to support ownership 
>> draining, which would let us migrate read traffic for a given token range 
>> away from an instance where that range is corrupted. Is it worth planning a 
>> fix for this issue before CEP-21 lands?
>> 
>> I'm also curious whether there's any existing literature on how different 
>> filesystems and storage media accommodate bit-errors (correctable and 
>> uncorrectable), so we can be consistent with those behaviors.
>> 
>> Personally, I'd like to see the fix for this issue come after CEP-21. It 
>> could be feasible to implement a fix before then, that detects bit-errors on 
>> the read path and refuses to respond to the coordinator, implicitly having 
>> speculative execution handle the retry against another replica while repair 
>> of that range happens. But that feels suboptimal to me when a better 
>> framework is on the horizon.
>> 
>> --
>> Abe
>> 
>>> On Mar 9, 2023, at 8:23 AM, Bowen Song via dev  
>>> wrote:
>>> 
>>> Hi Jeremiah,
>>> 
>>> I'm fully aware of that, which is why I said that deleting the affected 
>>> SSTable files is "less safe".
>>> 
>>> If the "bad blocks" logic is implemented and the node abort the current 
>>> read query when hitting a bad block, it should remain safe, as the data in 
>>> other SSTable files will not be used. The streamed data should contain the 
>>> unexpired tombstones, and that's enough to keep the data consistent on the 
>>> node.
>>> 
>>> 
>>> Cheers,
>>> Bowen
>>> 
>>> 
>>> 
>>> On 09/03/2023 15:58, Jeremiah D Jordan wrote:
 It is actually more complicated than just removing the sstable and running 
 repair.
 
 In the face of expired tombstones that might be covering data in other 
 sstables the only safe way to deal with a bad sstable is wipe the token 
 range in the bad sstable and rebuild/bootstrap that range (or wipe/rebuild 
 the whole node which is usually the easier way).  If there are expired 
 tombstones in play, it means they could have already been compacted away 
 on the other 

Re: Role of Hadoop code in Cassandra 5.0

2023-03-09 Thread Mick Semb Wever
On Thu, 9 Mar 2023 at 18:54, Brandon Williams  wrote:

> I think if we reach consensus here that decides it. I too vote to
> deprecate in 4.1.x.  This means we would remove it in 5.0.
>


+1


Re: [EXTERNAL] Re: [DISCUSS] Next release date

2023-03-09 Thread Ekaterina Dimitrova
There is also this roadmap page, but we haven’t updated it lately. It
still contains 4.1 updates from last year.

https://cwiki.apache.org/confluence/display/CASSANDRA/Roadmap

On Thu, 9 Mar 2023 at 13:51, Josh McKenzie  wrote:

> Added an "Epics" quick filter; could help visualize what our high priority
> features are for given releases:
>
>
> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=484=2649
>
> Our cumulative flow diagram of 5.0 related tickets is pretty large.
> Probably not a great indicator for the body of what we expect to land in
> the release:
>
>
> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=484=CASSANDRA=reporting=cumulativeFlowDiagram=1212=1412=1413=2116=2117=2118=2130=2133=2124=2127=2021-12-20=2023-03-09
>
> One place we've been weak historically is in distinguishing between
> tickets we consider "nice to have" and things that are "blockers". We don't
> have any metadata that currently distinguishes those two, so determining
> what our burndown leading up to 5.0 looks like is a lot more data massaging
> and hand-waving than I'd prefer right now.
>
> I've been deep on some other issues for awhile but hope to get more
> involved in this space + ci within the next month or so.
>
> On Thu, Mar 9, 2023, at 9:15 AM, Mick Semb Wever wrote:
>
> I've also found some useful Cassandra's JIRA dashboards for previous
> releases to track progress and scope, but we don't have anything
> similar for the next release. Should we create it?
> Cassandra 4.0GAScope
> Cassandra 4.1 GA scope
>
>
>
> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=484
>
>
>


Re: [DISCUSS] Enhanced Disk Error Handling

2023-03-09 Thread Josh McKenzie
> Personally, I'd like to see the fix for this issue come after CEP-21. It 
> could be feasible to implement a fix before then, that detects bit-errors on 
> the read path and refuses to respond to the coordinator, implicitly having 
> speculative execution handle the retry against another replica while repair 
> of that range happens. But that feels suboptimal to me when a better 
> framework is on the horizon.
I originally typed something in agreement with you but the more I think about 
this, the more a node-local "reject queries for specific token ranges" 
degradation profile seems like it _could_ work. I don't see an obvious way to 
remove the need for a human-in-the-loop on fixing things in a pre-CEP-21 world 
without opening pandora's box (Gossip + TMD + non-deterministic agreement on 
ownership state cluster-wide /cry).

And even in a post CEP-21 world you're definitely in the "at what point is it 
better to declare a host dead and replace it" fuzzy territory where there are 
no immediately correct answers.

A system_distributed table of corrupt token ranges that are currently being 
rejected by replicas with a mechanism to kick off a repair of those ranges 
could be interesting.
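
As a sketch of what that might look like (a hypothetical schema - no such
table exists today, and in practice Cassandra itself, not a client, would
create and maintain it; the DataStax Java driver calls below are only there
to make the shape of the data concrete):

    import com.datastax.oss.driver.api.core.CqlSession;

    // Hypothetical sketch only: a client may not even be allowed to create
    // tables in system_distributed; this just shows the shape of the data.
    public class CorruptRangeRegistrySketch
    {
        public static void main(String[] args)
        {
            try (CqlSession session = CqlSession.builder().build())
            {
                session.execute(
                    "CREATE TABLE IF NOT EXISTS system_distributed.corrupt_token_ranges (" +
                    "  keyspace_name text, table_name text," +
                    "  range_start bigint, range_end bigint," +   // Murmur3 token bounds
                    "  replica inet, detected_at timestamp," +
                    "  PRIMARY KEY ((keyspace_name, table_name), range_start, range_end, replica))");

                // A replica that hit a bad block registers the affected range; a repair
                // job could then scan this table to decide which ranges to stream.
                session.execute(
                    "INSERT INTO system_distributed.corrupt_token_ranges " +
                    "(keyspace_name, table_name, range_start, range_end, replica, detected_at) " +
                    "VALUES ('ks', 'tbl', -1537228672809129302, -1537228672809129301, '10.0.0.1', toTimestamp(now()))");
            }
        }
    }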

On Thu, Mar 9, 2023, at 1:45 PM, Abe Ratnofsky wrote:
> Thanks for proposing this discussion Bowen. I see a few different issues here:
> 
> 1. How do we safely handle corruption of a handful of tokens without taking 
> an entire instance offline for re-bootstrap? This includes refusal to serve 
> read requests for the corrupted token(s), and correct repair of the data.
> 2. How do we expose the corruption rate to operators, in a way that lets them 
> decide whether a full disk replacement is worthwhile?
> 3. When CEP-21 lands it should become feasible to support ownership draining, 
> which would let us migrate read traffic for a given token range away from an 
> instance where that range is corrupted. Is it worth planning a fix for this 
> issue before CEP-21 lands?
> 
> I'm also curious whether there's any existing literature on how different 
> filesystems and storage media accommodate bit-errors (correctable and 
> uncorrectable), so we can be consistent with those behaviors.
> 
> Personally, I'd like to see the fix for this issue come after CEP-21. It 
> could be feasible to implement a fix before then, that detects bit-errors on 
> the read path and refuses to respond to the coordinator, implicitly having 
> speculative execution handle the retry against another replica while repair 
> of that range happens. But that feels suboptimal to me when a better 
> framework is on the horizon.
> 
> --
> Abe
> 
>> On Mar 9, 2023, at 8:23 AM, Bowen Song via dev  
>> wrote:
>> 
>> Hi Jeremiah,
>> 
>> I'm fully aware of that, which is why I said that deleting the affected 
>> SSTable files is "less safe".
>> 
>> If the "bad blocks" logic is implemented and the node abort the current read 
>> query when hitting a bad block, it should remain safe, as the data in other 
>> SSTable files will not be used. The streamed data should contain the 
>> unexpired tombstones, and that's enough to keep the data consistent on the 
>> node.
>> 
>> 
>> Cheers,
>> Bowen
>> 
>> 
>> 
>> On 09/03/2023 15:58, Jeremiah D Jordan wrote:
>>> It is actually more complicated than just removing the sstable and running 
>>> repair.
>>> 
>>> In the face of expired tombstones that might be covering data in other 
>>> sstables the only safe way to deal with a bad sstable is wipe the token 
>>> range in the bad sstable and rebuild/bootstrap that range (or wipe/rebuild 
>>> the whole node which is usually the easier way).  If there are expired 
>>> tombstones in play, it means they could have already been compacted away on 
>>> the other replicas, but may not have compacted away on the current replica, 
>>> meaning the data they cover could still be present in other sstables on 
>>> this node.  Removing the sstable will mean resurrecting that data.  And 
>>> pulling the range from other nodes does not help because they can have 
>>> already compacted away the tombstone, so you won’t get it back.
>>> 
>>> Tl;DR you can’t just remove the one sstable you have to remove all data in 
>>> the token range covered by the sstable (aka all data that sstable may have 
>>> had a tombstone covering).  Then you can stream from the other nodes to get 
>>> the data back.
>>> 
>>> -Jeremiah
>>> 
 On Mar 8, 2023, at 7:24 AM, Bowen Song via dev  
 wrote:
 
 At the moment, when a read error, such as unrecoverable bit error or data 
 corruption, occurs in the SSTable data files, regardless of the 
 disk_failure_policy configuration, manual (or to be precise, external) 
 intervention is required to recover from the error.
 
 Commonly, there are two approaches to recover from such an error:
 
  1. The safer, but slower recovery strategy: replace the entire node.
  2. The less safe, but faster recovery strategy: shut down the node, delete 
 the 

Re: [EXTERNAL] Re: [DISCUSS] Next release date

2023-03-09 Thread Josh McKenzie
Added an "Epics" quick filter; could help visualize what our high priority 
features are for given releases:

https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=484=2649

Our cumulative flow diagram of 5.0 related tickets is pretty large. Probably 
not a great indicator for the body of what we expect to land in the release:

https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=484=CASSANDRA=reporting=cumulativeFlowDiagram=1212=1412=1413=2116=2117=2118=2130=2133=2124=2127=2021-12-20=2023-03-09

One place we've been weak historically is in distinguishing between tickets we 
consider "nice to have" and things that are "blockers". We don't have any 
metadata that currently distinguishes those two, so determining what our 
burndown leading up to 5.0 looks like is a lot more data massaging and 
hand-waving than I'd prefer right now.

I've been deep on some other issues for awhile but hope to get more involved in 
this space + ci within the next month or so.

On Thu, Mar 9, 2023, at 9:15 AM, Mick Semb Wever wrote:
>> I've also found some useful Cassandra's JIRA dashboards for previous
>> releases to track progress and scope, but we don't have anything
>> similar for the next release. Should we create it?
>> Cassandra 4.0GAScope
>> Cassandra 4.1 GA scope
> 
> 
> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=484  
> 


Re: Role of Hadoop code in Cassandra 5.0

2023-03-09 Thread Francisco Guerrero
+1 (nb) for deprecation in 4.x and removal in 5.0

On 2023/03/09 18:04:27 Jeremy Hanna wrote:
> +1 from me to deprecate in 4.x and remove in 5.0.
> 
> > On Mar 9, 2023, at 12:01 PM, J. D. Jordan  wrote:
> > 
> > +1 from me to deprecate in 4.x and remove in 5.0.
> > 
> > -Jeremiah
> > 
> >> On Mar 9, 2023, at 11:53 AM, Brandon Williams  wrote:
> >> 
> >> I think if we reach consensus here that decides it. I too vote to
> >> deprecate in 4.1.x.  This means we would remove it in 5.0.
> >> 
> >> Kind Regards,
> >> Brandon
> >> 
> >>> On Thu, Mar 9, 2023 at 11:32 AM Ekaterina Dimitrova
> >>>  wrote:
> >>> 
> >>> Deprecation sounds good to me, but I am not completely sure in which 
> >>> version we can do it. If it is possible to add a deprecation warning in 
> >>> the 4.x series or at least 4.1.x - I vote for that.
> >>> 
>  On Thu, 9 Mar 2023 at 12:14, Jacek Lewandowski 
>   wrote:
>  
>  Is it possible to deprecate it in the 4.1.x patch release? :)
>  
>  
>  - - -- --- -  -
>  Jacek Lewandowski
>  
>  
>  czw., 9 mar 2023 o 18:11 Brandon Williams  napisał(a):
> > 
> > This is my feeling too, but I think we should accomplish this by
> > deprecating it first.  I don't expect anything will change after the
> > deprecation period.
> > 
> > Kind Regards,
> > Brandon
> > 
> > On Thu, Mar 9, 2023 at 11:09 AM Jacek Lewandowski
> >  wrote:
> >> 
> >> I vote for removing it entirely.
> >> 
> >> thanks
> >> - - -- --- -  -
> >> Jacek Lewandowski
> >> 
> >> 
> >> czw., 9 mar 2023 o 18:07 Miklosovic, Stefan 
> >>  napisał(a):
> >>> 
> >>> Derek,
> >>> 
> >>> I have couple more points ... I do not think that extracting it to a 
> >>> separate repository is "win". That code is on Hadoop 1.0.3. We would 
> >>> be spending a lot of work on extracting it just to extract 10 years 
> >>> old code with occasional updates (in my humble opinion just to make 
> >>> it compilable again if the code around changes). What good is in 
> >>> that? We would have one more place to take care of ... Now we at 
> >>> least have it all in one place.
> >>> 
> >>> I believe we have four options:
> >>> 
> >>> 1) leave it there so it will be like this is for next years with 
> >>> questionable and diminishing usage
> >>> 2) update it to Hadoop 3.3 (I wonder who is going to do that)
> >>> 3) 2) and extract it to a separate repository but if we do 2) we can 
> >>> just leave it there
> >>> 4) remove it
> >>> 
> >>> 
> >>> From: Derek Chen-Becker 
> >>> Sent: Thursday, March 9, 2023 15:55
> >>> To: dev@cassandra.apache.org
> >>> Subject: Re: Role of Hadoop code in Cassandra 5.0
> >>> 
> >>> NetApp Security WARNING: This is an external email. Do not click 
> >>> links or open attachments unless you recognize the sender and know 
> >>> the content is safe.
> >>> 
> >>> 
> >>> 
> >>> I think the question isn't "Who ... is still using that?" but more 
> >>> "are we actually going to support it?" If we're on a version that old 
> >>> it would appear that we've basically abandoned it, although there do 
> >>> appear to have been refactoring (for other things) commits in the 
> >>> last couple of years. I would be in favor of removal from 5.0, but at 
> >>> the very least, could it be moved into a separate repo/package so 
> >>> that it's not pulling a relatively large dependency subtree from 
> >>> Hadoop into our main codebase?
> >>> 
> >>> Cheers,
> >>> 
> >>> Derek
> >>> 
> >>> On Thu, Mar 9, 2023 at 6:44 AM Miklosovic, Stefan 
> >>> mailto:stefan.mikloso...@netapp.com>> 
> >>> wrote:
> >>> Hi list,
> >>> 
> >>> I stumbled upon Hadoop package again. I think there was some 
> >>> discussion about the relevancy of Hadoop code some time ago but I 
> >>> would like to ask this again.
> >>> 
> >>> Do you think Hadoop code (1) is still relevant in 5.0? Who in the 
> >>> industry is still using that?
> >>> 
> >>> We might drop a lot of code and some Hadoop dependencies too (3) 
> >>> (even their scope is "provided"). The version of Hadoop we build upon 
> >>> is 1.0.3 which was released 10 years ago. This code does not have any 
> >>> tests nor documentation on the website.
> >>> 
> >>> There seems to be issues like this (2) and it seems like the solution 
> >>> is to, basically, use Spark Cassandra connector instead which I would 
> >>> say is quite reasonable.
> >>> 
> >>> Regards
> >>> 
> >>> (1) 
> >>> https://github.com/apache/cassandra/tree/trunk/src/java/org/apache/cassandra/hadoop
> >>> (2) https://lists.apache.org/thread/jdy5hdc2l7l29h04dqol5ylroqos1y2p
> 

Re: [DISCUSS] Enhanced Disk Error Handling

2023-03-09 Thread Abe Ratnofsky
Thanks for proposing this discussion Bowen. I see a few different issues here:

1. How do we safely handle corruption of a handful of tokens without taking an 
entire instance offline for re-bootstrap? This includes refusal to serve read 
requests for the corrupted token(s), and correct repair of the data.
2. How do we expose the corruption rate to operators, in a way that lets them 
decide whether a full disk replacement is worthwhile?
3. When CEP-21 lands it should become feasible to support ownership draining, 
which would let us migrate read traffic for a given token range away from an 
instance where that range is corrupted. Is it worth planning a fix for this 
issue before CEP-21 lands?

I'm also curious whether there's any existing literature on how different 
filesystems and storage media accommodate bit-errors (correctable and 
uncorrectable), so we can be consistent with those behaviors.

Personally, I'd like to see the fix for this issue come after CEP-21. It could 
be feasible to implement a fix before then, that detects bit-errors on the read 
path and refuses to respond to the coordinator, implicitly having speculative 
execution handle the retry against another replica while repair of that range 
happens. But that feels suboptimal to me when a better framework is on the 
horizon.
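
As a rough illustration of that pre-CEP-21 shape, a minimal Java sketch is below. 
None of these names exist in Cassandra; it only shows the control flow of "stay 
silent on corruption, let speculative execution retry another replica, and queue a 
repair of the affected range":

    import java.util.function.Consumer;

    final class CorruptionAwareReplicaRead
    {
        /** Signals a checksum/bit error hit during the local read (hypothetical). */
        static final class CorruptBlockException extends RuntimeException {}

        interface LocalReader { byte[] read(long token); }             // hypothetical
        interface RepairScheduler { void scheduleRange(long token); }  // hypothetical

        private final LocalReader reader;
        private final RepairScheduler repair;

        CorruptionAwareReplicaRead(LocalReader reader, RepairScheduler repair)
        {
            this.reader = reader;
            this.repair = repair;
        }

        /** Reply to the coordinator only on success; on corruption, send nothing so
         *  speculative execution retries against another replica, and schedule a
         *  background repair of the affected range. */
        void handle(long token, Consumer<byte[]> replyToCoordinator)
        {
            try
            {
                replyToCoordinator.accept(reader.read(token));
            }
            catch (CorruptBlockException e)
            {
                repair.scheduleRange(token);
            }
        }
    }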

--
Abe

> On Mar 9, 2023, at 8:23 AM, Bowen Song via dev  
> wrote:
> 
> Hi Jeremiah,
> 
> I'm fully aware of that, which is why I said that deleting the affected 
> SSTable files is "less safe".
> 
> If the "bad blocks" logic is implemented and the node abort the current read 
> query when hitting a bad block, it should remain safe, as the data in other 
> SSTable files will not be used. The streamed data should contain the 
> unexpired tombstones, and that's enough to keep the data consistent on the 
> node.
> 
> Cheers,
> Bowen
> 
> 
> 
> On 09/03/2023 15:58, Jeremiah D Jordan wrote:
>> It is actually more complicated than just removing the sstable and running 
>> repair.
>> 
>> In the face of expired tombstones that might be covering data in other 
>> sstables the only safe way to deal with a bad sstable is wipe the token 
>> range in the bad sstable and rebuild/bootstrap that range (or wipe/rebuild 
>> the whole node which is usually the easier way).  If there are expired 
>> tombstones in play, it means they could have already been compacted away on 
>> the other replicas, but may not have compacted away on the current replica, 
>> meaning the data they cover could still be present in other sstables on this 
>> node.  Removing the sstable will mean resurrecting that data.  And pulling 
>> the range from other nodes does not help because they can have already 
>> compacted away the tombstone, so you won’t get it back.
>> 
>> Tl;DR you can’t just remove the one sstable you have to remove all data in 
>> the token range covered by the sstable (aka all data that sstable may have 
>> had a tombstone covering).  Then you can stream from the other nodes to get 
>> the data back.
>> 
>> -Jeremiah
>> 
>>> On Mar 8, 2023, at 7:24 AM, Bowen Song via dev  
>>>  wrote:
>>> 
>>> At the moment, when a read error, such as unrecoverable bit error or data 
>>> corruption, occurs in the SSTable data files, regardless of the 
>>> disk_failure_policy configuration, manual (or to be precise, external) 
>>> intervention is required to recover from the error.
>>> 
>>> Commonly, there's two approach to recover from such error:
>>> 
>>> The safer, but slower recover strategy: replace the entire node.
>>> The less safe, but faster recover strategy: shut down the node, delete the 
>>> affected SSTable file(s), and then bring the node back online and run 
>>> repair.
>>> Based on my understanding of Cassandra, it should be possible to recover 
>>> from such error by marking the affected token range in the existing SSTable 
>>> as "corrupted" and stop reading from them (e.g. creating a "bad block" file 
>>> or in memory), and then streaming the affected token range from the healthy 
>>> replicas. The corrupted SSTable file can then be removed upon the next 
>>> successful compaction involving it, or alternatively an anti-compaction is 
>>> performed on it to remove the corrupted data.
>>> 
>>> The advantage of this strategy is:
>>> 
>>> Reduced node down time - node restart or replacement is not needed
>>> Less data streaming is required - only the affected token range
>>> Faster recovery time - less streaming and delayed compaction or 
>>> anti-compaction
>>> No less safe than replacing the entire node
>>> This process can be automated internally, removing the need for operator 
>>> inputs
>>> The disadvantage is added complexity on the SSTable read path and it may 
>>> mask disk failures from the operator who is not paying attention to it.
>>> 
>>> What do you think about this?
>>> 
>> 



Re: Role of Hadoop code in Cassandra 5.0

2023-03-09 Thread Jeremy Hanna
+1 from me to deprecate in 4.x and remove in 5.0.

> On Mar 9, 2023, at 12:01 PM, J. D. Jordan  wrote:
> 
> +1 from me to deprecate in 4.x and remove in 5.0.
> 
> -Jeremiah
> 
>> On Mar 9, 2023, at 11:53 AM, Brandon Williams  wrote:
>> 
>> I think if we reach consensus here that decides it. I too vote to
>> deprecate in 4.1.x.  This means we would remove it in 5.0.
>> 
>> Kind Regards,
>> Brandon
>> 
>>> On Thu, Mar 9, 2023 at 11:32 AM Ekaterina Dimitrova
>>>  wrote:
>>> 
>>> Deprecation sounds good to me, but I am not completely sure in which 
>>> version we can do it. If it is possible to add a deprecation warning in the 
>>> 4.x series or at least 4.1.x - I vote for that.
>>> 
 On Thu, 9 Mar 2023 at 12:14, Jacek Lewandowski 
  wrote:
 
 Is it possible to deprecate it in the 4.1.x patch release? :)
 
 
 - - -- --- -  -
 Jacek Lewandowski
 
 
 czw., 9 mar 2023 o 18:11 Brandon Williams  napisał(a):
> 
> This is my feeling too, but I think we should accomplish this by
> deprecating it first.  I don't expect anything will change after the
> deprecation period.
> 
> Kind Regards,
> Brandon
> 
> On Thu, Mar 9, 2023 at 11:09 AM Jacek Lewandowski
>  wrote:
>> 
>> I vote for removing it entirely.
>> 
>> thanks
>> - - -- --- -  -
>> Jacek Lewandowski
>> 
>> 
>> czw., 9 mar 2023 o 18:07 Miklosovic, Stefan 
>>  napisał(a):
>>> 
>>> Derek,
>>> 
>>> I have couple more points ... I do not think that extracting it to a 
>>> separate repository is "win". That code is on Hadoop 1.0.3. We would be 
>>> spending a lot of work on extracting it just to extract 10 years old 
>>> code with occasional updates (in my humble opinion just to make it 
>>> compilable again if the code around changes). What good is in that? We 
>>> would have one more place to take care of ... Now we at least have it 
>>> all in one place.
>>> 
>>> I believe we have four options:
>>> 
>>> 1) leave it there so it will be like this is for next years with 
>>> questionable and diminishing usage
>>> 2) update it to Hadoop 3.3 (I wonder who is going to do that)
>>> 3) 2) and extract it to a separate repository but if we do 2) we can 
>>> just leave it there
>>> 4) remove it
>>> 
>>> 
>>> From: Derek Chen-Becker 
>>> Sent: Thursday, March 9, 2023 15:55
>>> To: dev@cassandra.apache.org
>>> Subject: Re: Role of Hadoop code in Cassandra 5.0
>>> 
>>> NetApp Security WARNING: This is an external email. Do not click links 
>>> or open attachments unless you recognize the sender and know the 
>>> content is safe.
>>> 
>>> 
>>> 
>>> I think the question isn't "Who ... is still using that?" but more "are 
>>> we actually going to support it?" If we're on a version that old it 
>>> would appear that we've basically abandoned it, although there do 
>>> appear to have been refactoring (for other things) commits in the last 
>>> couple of years. I would be in favor of removal from 5.0, but at the 
>>> very least, could it be moved into a separate repo/package so that it's 
>>> not pulling a relatively large dependency subtree from Hadoop into our 
>>> main codebase?
>>> 
>>> Cheers,
>>> 
>>> Derek
>>> 
>>> On Thu, Mar 9, 2023 at 6:44 AM Miklosovic, Stefan 
>>> mailto:stefan.mikloso...@netapp.com>> 
>>> wrote:
>>> Hi list,
>>> 
>>> I stumbled upon Hadoop package again. I think there was some discussion 
>>> about the relevancy of Hadoop code some time ago but I would like to 
>>> ask this again.
>>> 
>>> Do you think Hadoop code (1) is still relevant in 5.0? Who in the 
>>> industry is still using that?
>>> 
>>> We might drop a lot of code and some Hadoop dependencies too (3) (even 
>>> their scope is "provided"). The version of Hadoop we build upon is 
>>> 1.0.3 which was released 10 years ago. This code does not have any 
>>> tests nor documentation on the website.
>>> 
>>> There seems to be issues like this (2) and it seems like the solution 
>>> is to, basically, use Spark Cassandra connector instead which I would 
>>> say is quite reasonable.
>>> 
>>> Regards
>>> 
>>> (1) 
>>> https://github.com/apache/cassandra/tree/trunk/src/java/org/apache/cassandra/hadoop
>>> (2) https://lists.apache.org/thread/jdy5hdc2l7l29h04dqol5ylroqos1y2p
>>> (3) 
>>> https://github.com/apache/cassandra/blob/trunk/.build/parent-pom-template.xml#L507-L589
>>> 
>>> 
>>> --
>>> +---+
>>> | Derek Chen-Becker |
>>> | GPG Key available at 
>>> 

Re: Role of Hadoop code in Cassandra 5.0

2023-03-09 Thread J. D. Jordan
+1 from me to deprecate in 4.x and remove in 5.0.

-Jeremiah

> On Mar 9, 2023, at 11:53 AM, Brandon Williams  wrote:
> 
> I think if we reach consensus here that decides it. I too vote to
> deprecate in 4.1.x.  This means we would remove it in 5.0.
> 
> Kind Regards,
> Brandon
> 
>> On Thu, Mar 9, 2023 at 11:32 AM Ekaterina Dimitrova
>>  wrote:
>> 
>> Deprecation sounds good to me, but I am not completely sure in which version 
>> we can do it. If it is possible to add a deprecation warning in the 4.x 
>> series or at least 4.1.x - I vote for that.
>> 
>>> On Thu, 9 Mar 2023 at 12:14, Jacek Lewandowski 
>>>  wrote:
>>> 
>>> Is it possible to deprecate it in the 4.1.x patch release? :)
>>> 
>>> 
>>> - - -- --- -  -
>>> Jacek Lewandowski
>>> 
>>> 
>>> czw., 9 mar 2023 o 18:11 Brandon Williams  napisał(a):
 
 This is my feeling too, but I think we should accomplish this by
 deprecating it first.  I don't expect anything will change after the
 deprecation period.
 
 Kind Regards,
 Brandon
 
 On Thu, Mar 9, 2023 at 11:09 AM Jacek Lewandowski
  wrote:
> 
> I vote for removing it entirely.
> 
> thanks
> - - -- --- -  -
> Jacek Lewandowski
> 
> 
> czw., 9 mar 2023 o 18:07 Miklosovic, Stefan 
>  napisał(a):
>> 
>> Derek,
>> 
>> I have couple more points ... I do not think that extracting it to a 
>> separate repository is "win". That code is on Hadoop 1.0.3. We would be 
>> spending a lot of work on extracting it just to extract 10 years old 
>> code with occasional updates (in my humble opinion just to make it 
>> compilable again if the code around changes). What good is in that? We 
>> would have one more place to take care of ... Now we at least have it 
>> all in one place.
>> 
>> I believe we have four options:
>> 
>> 1) leave it there so it will be like this is for next years with 
>> questionable and diminishing usage
>> 2) update it to Hadoop 3.3 (I wonder who is going to do that)
>> 3) 2) and extract it to a separate repository but if we do 2) we can 
>> just leave it there
>> 4) remove it
>> 
>> 
>> From: Derek Chen-Becker 
>> Sent: Thursday, March 9, 2023 15:55
>> To: dev@cassandra.apache.org
>> Subject: Re: Role of Hadoop code in Cassandra 5.0
>> 
>> NetApp Security WARNING: This is an external email. Do not click links 
>> or open attachments unless you recognize the sender and know the content 
>> is safe.
>> 
>> 
>> 
>> I think the question isn't "Who ... is still using that?" but more "are 
>> we actually going to support it?" If we're on a version that old it 
>> would appear that we've basically abandoned it, although there do appear 
>> to have been refactoring (for other things) commits in the last couple 
>> of years. I would be in favor of removal from 5.0, but at the very 
>> least, could it be moved into a separate repo/package so that it's not 
>> pulling a relatively large dependency subtree from Hadoop into our main 
>> codebase?
>> 
>> Cheers,
>> 
>> Derek
>> 
>> On Thu, Mar 9, 2023 at 6:44 AM Miklosovic, Stefan 
>> mailto:stefan.mikloso...@netapp.com>> 
>> wrote:
>> Hi list,
>> 
>> I stumbled upon Hadoop package again. I think there was some discussion 
>> about the relevancy of Hadoop code some time ago but I would like to ask 
>> this again.
>> 
>> Do you think Hadoop code (1) is still relevant in 5.0? Who in the 
>> industry is still using that?
>> 
>> We might drop a lot of code and some Hadoop dependencies too (3) (even 
>> their scope is "provided"). The version of Hadoop we build upon is 1.0.3 
>> which was released 10 years ago. This code does not have any tests nor 
>> documentation on the website.
>> 
>> There seems to be issues like this (2) and it seems like the solution is 
>> to, basically, use Spark Cassandra connector instead which I would say 
>> is quite reasonable.
>> 
>> Regards
>> 
>> (1) 
>> https://github.com/apache/cassandra/tree/trunk/src/java/org/apache/cassandra/hadoop
>> (2) https://lists.apache.org/thread/jdy5hdc2l7l29h04dqol5ylroqos1y2p
>> (3) 
>> https://github.com/apache/cassandra/blob/trunk/.build/parent-pom-template.xml#L507-L589
>> 
>> 
>> --
>> +---+
>> | Derek Chen-Becker |
>> | GPG Key available at https://keybase.io/dchenbecker and   |
>> | 
>> 

Re: Role of Hadoop code in Cassandra 5.0

2023-03-09 Thread Brandon Williams
I think if we reach consensus here that decides it. I too vote to
deprecate in 4.1.x.  This means we would remove it in 5.0.

Kind Regards,
Brandon

On Thu, Mar 9, 2023 at 11:32 AM Ekaterina Dimitrova
 wrote:
>
> Deprecation sounds good to me, but I am not completely sure in which version 
> we can do it. If it is possible to add a deprecation warning in the 4.x 
> series or at least 4.1.x - I vote for that.
>
> On Thu, 9 Mar 2023 at 12:14, Jacek Lewandowski  
> wrote:
>>
>> Is it possible to deprecate it in the 4.1.x patch release? :)
>>
>>
>> - - -- --- -  -
>> Jacek Lewandowski
>>
>>
>> czw., 9 mar 2023 o 18:11 Brandon Williams  napisał(a):
>>>
>>> This is my feeling too, but I think we should accomplish this by
>>> deprecating it first.  I don't expect anything will change after the
>>> deprecation period.
>>>
>>> Kind Regards,
>>> Brandon
>>>
>>> On Thu, Mar 9, 2023 at 11:09 AM Jacek Lewandowski
>>>  wrote:
>>> >
>>> > I vote for removing it entirely.
>>> >
>>> > thanks
>>> > - - -- --- -  -
>>> > Jacek Lewandowski
>>> >
>>> >
>>> > czw., 9 mar 2023 o 18:07 Miklosovic, Stefan 
>>> >  napisał(a):
>>> >>
>>> >> Derek,
>>> >>
>>> >> I have couple more points ... I do not think that extracting it to a 
>>> >> separate repository is "win". That code is on Hadoop 1.0.3. We would be 
>>> >> spending a lot of work on extracting it just to extract 10 years old 
>>> >> code with occasional updates (in my humble opinion just to make it 
>>> >> compilable again if the code around changes). What good is in that? We 
>>> >> would have one more place to take care of ... Now we at least have it 
>>> >> all in one place.
>>> >>
>>> >> I believe we have four options:
>>> >>
>>> >> 1) leave it there so it will be like this is for next years with 
>>> >> questionable and diminishing usage
>>> >> 2) update it to Hadoop 3.3 (I wonder who is going to do that)
>>> >> 3) 2) and extract it to a separate repository but if we do 2) we can 
>>> >> just leave it there
>>> >> 4) remove it
>>> >>
>>> >> 
>>> >> From: Derek Chen-Becker 
>>> >> Sent: Thursday, March 9, 2023 15:55
>>> >> To: dev@cassandra.apache.org
>>> >> Subject: Re: Role of Hadoop code in Cassandra 5.0
>>> >>
>>> >> NetApp Security WARNING: This is an external email. Do not click links 
>>> >> or open attachments unless you recognize the sender and know the content 
>>> >> is safe.
>>> >>
>>> >>
>>> >>
>>> >> I think the question isn't "Who ... is still using that?" but more "are 
>>> >> we actually going to support it?" If we're on a version that old it 
>>> >> would appear that we've basically abandoned it, although there do appear 
>>> >> to have been refactoring (for other things) commits in the last couple 
>>> >> of years. I would be in favor of removal from 5.0, but at the very 
>>> >> least, could it be moved into a separate repo/package so that it's not 
>>> >> pulling a relatively large dependency subtree from Hadoop into our main 
>>> >> codebase?
>>> >>
>>> >> Cheers,
>>> >>
>>> >> Derek
>>> >>
>>> >> On Thu, Mar 9, 2023 at 6:44 AM Miklosovic, Stefan 
>>> >> mailto:stefan.mikloso...@netapp.com>> 
>>> >> wrote:
>>> >> Hi list,
>>> >>
>>> >> I stumbled upon Hadoop package again. I think there was some discussion 
>>> >> about the relevancy of Hadoop code some time ago but I would like to ask 
>>> >> this again.
>>> >>
>>> >> Do you think Hadoop code (1) is still relevant in 5.0? Who in the 
>>> >> industry is still using that?
>>> >>
>>> >> We might drop a lot of code and some Hadoop dependencies too (3) (even 
>>> >> their scope is "provided"). The version of Hadoop we build upon is 1.0.3 
>>> >> which was released 10 years ago. This code does not have any tests nor 
>>> >> documentation on the website.
>>> >>
>>> >> There seems to be issues like this (2) and it seems like the solution is 
>>> >> to, basically, use Spark Cassandra connector instead which I would say 
>>> >> is quite reasonable.
>>> >>
>>> >> Regards
>>> >>
>>> >> (1) 
>>> >> https://github.com/apache/cassandra/tree/trunk/src/java/org/apache/cassandra/hadoop
>>> >> (2) https://lists.apache.org/thread/jdy5hdc2l7l29h04dqol5ylroqos1y2p
>>> >> (3) 
>>> >> https://github.com/apache/cassandra/blob/trunk/.build/parent-pom-template.xml#L507-L589
>>> >>
>>> >>
>>> >> --
>>> >> +---+
>>> >> | Derek Chen-Becker |
>>> >> | GPG Key available at https://keybase.io/dchenbecker and   |
>>> >> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
>>> >> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
>>> >> +---+
>>> >>


Re: Role of Hadoop code in Cassandra 5.0

2023-03-09 Thread Miklosovic, Stefan
Deprecation would mean that the code has to stay there for the whole of 5.0 so we 
can remove it for real in 6.0?


From: Ekaterina Dimitrova 
Sent: Thursday, March 9, 2023 18:32
To: dev@cassandra.apache.org
Subject: Re: Role of Hadoop code in Cassandra 5.0

NetApp Security WARNING: This is an external email. Do not click links or open 
attachments unless you recognize the sender and know the content is safe.



Deprecation sounds good to me, but I am not completely sure in which version we 
can do it. If it is possible to add a deprecation warning in the 4.x series or 
at least 4.1.x - I vote for that.

On Thu, 9 Mar 2023 at 12:14, Jacek Lewandowski 
mailto:lewandowski.ja...@gmail.com>> wrote:
Is it possible to deprecate it in the 4.1.x patch release? :)


- - -- --- -  -
Jacek Lewandowski


czw., 9 mar 2023 o 18:11 Brandon Williams 
mailto:dri...@gmail.com>> napisał(a):
This is my feeling too, but I think we should accomplish this by
deprecating it first.  I don't expect anything will change after the
deprecation period.

Kind Regards,
Brandon

On Thu, Mar 9, 2023 at 11:09 AM Jacek Lewandowski
mailto:lewandowski.ja...@gmail.com>> wrote:
>
> I vote for removing it entirely.
>
> thanks
> - - -- --- -  -
> Jacek Lewandowski
>
>
> czw., 9 mar 2023 o 18:07 Miklosovic, Stefan 
> mailto:stefan.mikloso...@netapp.com>> 
> napisał(a):
>>
>> Derek,
>>
>> I have couple more points ... I do not think that extracting it to a 
>> separate repository is "win". That code is on Hadoop 1.0.3. We would be 
>> spending a lot of work on extracting it just to extract 10 years old code 
>> with occasional updates (in my humble opinion just to make it compilable 
>> again if the code around changes). What good is in that? We would have one 
>> more place to take care of ... Now we at least have it all in one place.
>>
>> I believe we have four options:
>>
>> 1) leave it there so it will be like this is for next years with 
>> questionable and diminishing usage
>> 2) update it to Hadoop 3.3 (I wonder who is going to do that)
>> 3) 2) and extract it to a separate repository but if we do 2) we can just 
>> leave it there
>> 4) remove it
>>
>> 
>> From: Derek Chen-Becker mailto:de...@chen-becker.org>>
>> Sent: Thursday, March 9, 2023 15:55
>> To: dev@cassandra.apache.org
>> Subject: Re: Role of Hadoop code in Cassandra 5.0
>>
>> NetApp Security WARNING: This is an external email. Do not click links or 
>> open attachments unless you recognize the sender and know the content is 
>> safe.
>>
>>
>>
>> I think the question isn't "Who ... is still using that?" but more "are we 
>> actually going to support it?" If we're on a version that old it would 
>> appear that we've basically abandoned it, although there do appear to have 
>> been refactoring (for other things) commits in the last couple of years. I 
>> would be in favor of removal from 5.0, but at the very least, could it be 
>> moved into a separate repo/package so that it's not pulling a relatively 
>> large dependency subtree from Hadoop into our main codebase?
>>
>> Cheers,
>>
>> Derek
>>
>> On Thu, Mar 9, 2023 at 6:44 AM Miklosovic, Stefan 
>> mailto:stefan.mikloso...@netapp.com>>>
>>  wrote:
>> Hi list,
>>
>> I stumbled upon Hadoop package again. I think there was some discussion 
>> about the relevancy of Hadoop code some time ago but I would like to ask 
>> this again.
>>
>> Do you think Hadoop code (1) is still relevant in 5.0? Who in the industry 
>> is still using that?
>>
>> We might drop a lot of code and some Hadoop dependencies too (3) (even their 
>> scope is "provided"). The version of Hadoop we build upon is 1.0.3 which was 
>> released 10 years ago. This code does not have any tests nor documentation 
>> on the website.
>>
>> There seems to be issues like this (2) and it seems like the solution is to, 
>> basically, use Spark Cassandra connector instead which I would say is quite 
>> reasonable.
>>
>> Regards
>>
>> (1) 
>> https://github.com/apache/cassandra/tree/trunk/src/java/org/apache/cassandra/hadoop
>> (2) https://lists.apache.org/thread/jdy5hdc2l7l29h04dqol5ylroqos1y2p
>> (3) 
>> https://github.com/apache/cassandra/blob/trunk/.build/parent-pom-template.xml#L507-L589
>>
>>
>> --
>> +---+
>> | Derek Chen-Becker |
>> | GPG Key available at https://keybase.io/dchenbecker and   |
>> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
>> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
>> +---+
>>


Re: Role of Hadoop code in Cassandra 5.0

2023-03-09 Thread Ekaterina Dimitrova
Deprecation sounds good to me, but I am not completely sure in which
version we can do it. If it is possible to add a deprecation warning in the
4.x series or at least 4.1.x - I vote for that.

On Thu, 9 Mar 2023 at 12:14, Jacek Lewandowski 
wrote:

> Is it possible to deprecate it in the 4.1.x patch release? :)
>
>
> - - -- --- -  -
> Jacek Lewandowski
>
>
> czw., 9 mar 2023 o 18:11 Brandon Williams  napisał(a):
>
>> This is my feeling too, but I think we should accomplish this by
>> deprecating it first.  I don't expect anything will change after the
>> deprecation period.
>>
>> Kind Regards,
>> Brandon
>>
>> On Thu, Mar 9, 2023 at 11:09 AM Jacek Lewandowski
>>  wrote:
>> >
>> > I vote for removing it entirely.
>> >
>> > thanks
>> > - - -- --- -  -
>> > Jacek Lewandowski
>> >
>> >
>> > czw., 9 mar 2023 o 18:07 Miklosovic, Stefan <
>> stefan.mikloso...@netapp.com> napisał(a):
>> >>
>> >> Derek,
>> >>
>> >> I have couple more points ... I do not think that extracting it to a
>> separate repository is "win". That code is on Hadoop 1.0.3. We would be
>> spending a lot of work on extracting it just to extract 10 years old code
>> with occasional updates (in my humble opinion just to make it compilable
>> again if the code around changes). What good is in that? We would have one
>> more place to take care of ... Now we at least have it all in one place.
>> >>
>> >> I believe we have four options:
>> >>
>> >> 1) leave it there so it will be like this is for next years with
>> questionable and diminishing usage
>> >> 2) update it to Hadoop 3.3 (I wonder who is going to do that)
>> >> 3) 2) and extract it to a separate repository but if we do 2) we can
>> just leave it there
>> >> 4) remove it
>> >>
>> >> 
>> >> From: Derek Chen-Becker 
>> >> Sent: Thursday, March 9, 2023 15:55
>> >> To: dev@cassandra.apache.org
>> >> Subject: Re: Role of Hadoop code in Cassandra 5.0
>> >>
>> >> NetApp Security WARNING: This is an external email. Do not click links
>> or open attachments unless you recognize the sender and know the content is
>> safe.
>> >>
>> >>
>> >>
>> >> I think the question isn't "Who ... is still using that?" but more
>> "are we actually going to support it?" If we're on a version that old it
>> would appear that we've basically abandoned it, although there do appear to
>> have been refactoring (for other things) commits in the last couple of
>> years. I would be in favor of removal from 5.0, but at the very least,
>> could it be moved into a separate repo/package so that it's not pulling a
>> relatively large dependency subtree from Hadoop into our main codebase?
>> >>
>> >> Cheers,
>> >>
>> >> Derek
>> >>
>> >> On Thu, Mar 9, 2023 at 6:44 AM Miklosovic, Stefan <
>> stefan.mikloso...@netapp.com> wrote:
>> >> Hi list,
>> >>
>> >> I stumbled upon Hadoop package again. I think there was some
>> discussion about the relevancy of Hadoop code some time ago but I would
>> like to ask this again.
>> >>
>> >> Do you think Hadoop code (1) is still relevant in 5.0? Who in the
>> industry is still using that?
>> >>
>> >> We might drop a lot of code and some Hadoop dependencies too (3) (even
>> their scope is "provided"). The version of Hadoop we build upon is 1.0.3
>> which was released 10 years ago. This code does not have any tests nor
>> documentation on the website.
>> >>
>> >> There seems to be issues like this (2) and it seems like the solution
>> is to, basically, use Spark Cassandra connector instead which I would say
>> is quite reasonable.
>> >>
>> >> Regards
>> >>
>> >> (1)
>> https://github.com/apache/cassandra/tree/trunk/src/java/org/apache/cassandra/hadoop
>> >> (2) https://lists.apache.org/thread/jdy5hdc2l7l29h04dqol5ylroqos1y2p
>> >> (3)
>> https://github.com/apache/cassandra/blob/trunk/.build/parent-pom-template.xml#L507-L589
>> >>
>> >>
>> >> --
>> >> +---+
>> >> | Derek Chen-Becker |
>> >> | GPG Key available at https://keybase.io/dchenbecker and   |
>> >> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
>> >> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
>> >> +---+
>> >>
>>
>


Re: Role of Hadoop code in Cassandra 5.0

2023-03-09 Thread Jacek Lewandowski
Is it possible to deprecate it in the 4.1.x patch release? :)


- - -- --- -  -
Jacek Lewandowski


czw., 9 mar 2023 o 18:11 Brandon Williams  napisał(a):

> This is my feeling too, but I think we should accomplish this by
> deprecating it first.  I don't expect anything will change after the
> deprecation period.
>
> Kind Regards,
> Brandon
>
> On Thu, Mar 9, 2023 at 11:09 AM Jacek Lewandowski
>  wrote:
> >
> > I vote for removing it entirely.
> >
> > thanks
> > - - -- --- -  -
> > Jacek Lewandowski
> >
> >
> > czw., 9 mar 2023 o 18:07 Miklosovic, Stefan <
> stefan.mikloso...@netapp.com> napisał(a):
> >>
> >> Derek,
> >>
> >> I have couple more points ... I do not think that extracting it to a
> separate repository is "win". That code is on Hadoop 1.0.3. We would be
> spending a lot of work on extracting it just to extract 10 years old code
> with occasional updates (in my humble opinion just to make it compilable
> again if the code around changes). What good is in that? We would have one
> more place to take care of ... Now we at least have it all in one place.
> >>
> >> I believe we have four options:
> >>
> >> 1) leave it there so it will be like this is for next years with
> questionable and diminishing usage
> >> 2) update it to Hadoop 3.3 (I wonder who is going to do that)
> >> 3) 2) and extract it to a separate repository but if we do 2) we can
> just leave it there
> >> 4) remove it
> >>
> >> 
> >> From: Derek Chen-Becker 
> >> Sent: Thursday, March 9, 2023 15:55
> >> To: dev@cassandra.apache.org
> >> Subject: Re: Role of Hadoop code in Cassandra 5.0
> >>
> >> NetApp Security WARNING: This is an external email. Do not click links
> or open attachments unless you recognize the sender and know the content is
> safe.
> >>
> >>
> >>
> >> I think the question isn't "Who ... is still using that?" but more "are
> we actually going to support it?" If we're on a version that old it would
> appear that we've basically abandoned it, although there do appear to have
> been refactoring (for other things) commits in the last couple of years. I
> would be in favor of removal from 5.0, but at the very least, could it be
> moved into a separate repo/package so that it's not pulling a relatively
> large dependency subtree from Hadoop into our main codebase?
> >>
> >> Cheers,
> >>
> >> Derek
> >>
> >> On Thu, Mar 9, 2023 at 6:44 AM Miklosovic, Stefan <
> stefan.mikloso...@netapp.com> wrote:
> >> Hi list,
> >>
> >> I stumbled upon Hadoop package again. I think there was some discussion
> about the relevancy of Hadoop code some time ago but I would like to ask
> this again.
> >>
> >> Do you think Hadoop code (1) is still relevant in 5.0? Who in the
> industry is still using that?
> >>
> >> We might drop a lot of code and some Hadoop dependencies too (3) (even
> their scope is "provided"). The version of Hadoop we build upon is 1.0.3
> which was released 10 years ago. This code does not have any tests nor
> documentation on the website.
> >>
> >> There seems to be issues like this (2) and it seems like the solution
> is to, basically, use Spark Cassandra connector instead which I would say
> is quite reasonable.
> >>
> >> Regards
> >>
> >> (1)
> https://github.com/apache/cassandra/tree/trunk/src/java/org/apache/cassandra/hadoop
> >> (2) https://lists.apache.org/thread/jdy5hdc2l7l29h04dqol5ylroqos1y2p
> >> (3)
> https://github.com/apache/cassandra/blob/trunk/.build/parent-pom-template.xml#L507-L589
> >>
> >>
> >> --
> >> +---+
> >> | Derek Chen-Becker |
> >> | GPG Key available at https://keybase.io/dchenbecker and   |
> >> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
> >> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
> >> +---+
> >>
>


Re: Role of Hadoop code in Cassandra 5.0

2023-03-09 Thread Jacek Lewandowski
... because - why Hadoop? This is something that could be made a separate
project if there is a need for it, just like the Spark Cassandra
Connector. Why do we need to include Hadoop-specific classes but nothing
specific for other frameworks?

- - -- --- -  -
Jacek Lewandowski


czw., 9 mar 2023 o 18:08 Jacek Lewandowski 
napisał(a):

> I vote for removing it entirely.
>
> thanks
> - - -- --- -  -
> Jacek Lewandowski
>
>
> czw., 9 mar 2023 o 18:07 Miklosovic, Stefan 
> napisał(a):
>
>> Derek,
>>
>> I have couple more points ... I do not think that extracting it to a
>> separate repository is "win". That code is on Hadoop 1.0.3. We would be
>> spending a lot of work on extracting it just to extract 10 years old code
>> with occasional updates (in my humble opinion just to make it compilable
>> again if the code around changes). What good is in that? We would have one
>> more place to take care of ... Now we at least have it all in one place.
>>
>> I believe we have four options:
>>
>> 1) leave it there so it will be like this is for next years with
>> questionable and diminishing usage
>> 2) update it to Hadoop 3.3 (I wonder who is going to do that)
>> 3) 2) and extract it to a separate repository but if we do 2) we can just
>> leave it there
>> 4) remove it
>>
>> 
>> From: Derek Chen-Becker 
>> Sent: Thursday, March 9, 2023 15:55
>> To: dev@cassandra.apache.org
>> Subject: Re: Role of Hadoop code in Cassandra 5.0
>>
>> NetApp Security WARNING: This is an external email. Do not click links or
>> open attachments unless you recognize the sender and know the content is
>> safe.
>>
>>
>>
>> I think the question isn't "Who ... is still using that?" but more "are
>> we actually going to support it?" If we're on a version that old it would
>> appear that we've basically abandoned it, although there do appear to have
>> been refactoring (for other things) commits in the last couple of years. I
>> would be in favor of removal from 5.0, but at the very least, could it be
>> moved into a separate repo/package so that it's not pulling a relatively
>> large dependency subtree from Hadoop into our main codebase?
>>
>> Cheers,
>>
>> Derek
>>
>> On Thu, Mar 9, 2023 at 6:44 AM Miklosovic, Stefan <
>> stefan.mikloso...@netapp.com> wrote:
>> Hi list,
>>
>> I stumbled upon Hadoop package again. I think there was some discussion
>> about the relevancy of Hadoop code some time ago but I would like to ask
>> this again.
>>
>> Do you think Hadoop code (1) is still relevant in 5.0? Who in the
>> industry is still using that?
>>
>> We might drop a lot of code and some Hadoop dependencies too (3) (even
>> their scope is "provided"). The version of Hadoop we build upon is 1.0.3
>> which was released 10 years ago. This code does not have any tests nor
>> documentation on the website.
>>
>> There seems to be issues like this (2) and it seems like the solution is
>> to, basically, use Spark Cassandra connector instead which I would say is
>> quite reasonable.
>>
>> Regards
>>
>> (1)
>> https://github.com/apache/cassandra/tree/trunk/src/java/org/apache/cassandra/hadoop
>> (2) https://lists.apache.org/thread/jdy5hdc2l7l29h04dqol5ylroqos1y2p
>> (3)
>> https://github.com/apache/cassandra/blob/trunk/.build/parent-pom-template.xml#L507-L589
>>
>>
>> --
>> +---+
>> | Derek Chen-Becker |
>> | GPG Key available at https://keybase.io/dchenbecker and   |
>> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
>> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
>> +---+
>>
>>


Re: Role of Hadoop code in Cassandra 5.0

2023-03-09 Thread Brandon Williams
This is my feeling too, but I think we should accomplish this by
deprecating it first.  I don't expect anything will change after the
deprecation period.

Kind Regards,
Brandon

On Thu, Mar 9, 2023 at 11:09 AM Jacek Lewandowski
 wrote:
>
> I vote for removing it entirely.
>
> thanks
> - - -- --- -  -
> Jacek Lewandowski
>
>
> czw., 9 mar 2023 o 18:07 Miklosovic, Stefan  
> napisał(a):
>>
>> Derek,
>>
>> I have couple more points ... I do not think that extracting it to a 
>> separate repository is "win". That code is on Hadoop 1.0.3. We would be 
>> spending a lot of work on extracting it just to extract 10 years old code 
>> with occasional updates (in my humble opinion just to make it compilable 
>> again if the code around changes). What good is in that? We would have one 
>> more place to take care of ... Now we at least have it all in one place.
>>
>> I believe we have four options:
>>
>> 1) leave it there so it will be like this is for next years with 
>> questionable and diminishing usage
>> 2) update it to Hadoop 3.3 (I wonder who is going to do that)
>> 3) 2) and extract it to a separate repository but if we do 2) we can just 
>> leave it there
>> 4) remove it
>>
>> 
>> From: Derek Chen-Becker 
>> Sent: Thursday, March 9, 2023 15:55
>> To: dev@cassandra.apache.org
>> Subject: Re: Role of Hadoop code in Cassandra 5.0
>>
>> NetApp Security WARNING: This is an external email. Do not click links or 
>> open attachments unless you recognize the sender and know the content is 
>> safe.
>>
>>
>>
>> I think the question isn't "Who ... is still using that?" but more "are we 
>> actually going to support it?" If we're on a version that old it would 
>> appear that we've basically abandoned it, although there do appear to have 
>> been refactoring (for other things) commits in the last couple of years. I 
>> would be in favor of removal from 5.0, but at the very least, could it be 
>> moved into a separate repo/package so that it's not pulling a relatively 
>> large dependency subtree from Hadoop into our main codebase?
>>
>> Cheers,
>>
>> Derek
>>
>> On Thu, Mar 9, 2023 at 6:44 AM Miklosovic, Stefan 
>> mailto:stefan.mikloso...@netapp.com>> wrote:
>> Hi list,
>>
>> I stumbled upon Hadoop package again. I think there was some discussion 
>> about the relevancy of Hadoop code some time ago but I would like to ask 
>> this again.
>>
>> Do you think Hadoop code (1) is still relevant in 5.0? Who in the industry 
>> is still using that?
>>
>> We might drop a lot of code and some Hadoop dependencies too (3) (even their 
>> scope is "provided"). The version of Hadoop we build upon is 1.0.3 which was 
>> released 10 years ago. This code does not have any tests nor documentation 
>> on the website.
>>
>> There seems to be issues like this (2) and it seems like the solution is to, 
>> basically, use Spark Cassandra connector instead which I would say is quite 
>> reasonable.
>>
>> Regards
>>
>> (1) 
>> https://github.com/apache/cassandra/tree/trunk/src/java/org/apache/cassandra/hadoop
>> (2) https://lists.apache.org/thread/jdy5hdc2l7l29h04dqol5ylroqos1y2p
>> (3) 
>> https://github.com/apache/cassandra/blob/trunk/.build/parent-pom-template.xml#L507-L589
>>
>>
>> --
>> +---+
>> | Derek Chen-Becker |
>> | GPG Key available at https://keybase.io/dchenbecker and   |
>> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
>> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
>> +---+
>>


Re: Role of Hadoop code in Cassandra 5.0

2023-03-09 Thread Jacek Lewandowski
I vote for removing it entirely.

thanks
- - -- --- -  -
Jacek Lewandowski


czw., 9 mar 2023 o 18:07 Miklosovic, Stefan 
napisał(a):

> Derek,
>
> I have couple more points ... I do not think that extracting it to a
> separate repository is "win". That code is on Hadoop 1.0.3. We would be
> spending a lot of work on extracting it just to extract 10 years old code
> with occasional updates (in my humble opinion just to make it compilable
> again if the code around changes). What good is in that? We would have one
> more place to take care of ... Now we at least have it all in one place.
>
> I believe we have four options:
>
> 1) leave it there so it will be like this is for next years with
> questionable and diminishing usage
> 2) update it to Hadoop 3.3 (I wonder who is going to do that)
> 3) 2) and extract it to a separate repository but if we do 2) we can just
> leave it there
> 4) remove it
>
> 
> From: Derek Chen-Becker 
> Sent: Thursday, March 9, 2023 15:55
> To: dev@cassandra.apache.org
> Subject: Re: Role of Hadoop code in Cassandra 5.0
>
> NetApp Security WARNING: This is an external email. Do not click links or
> open attachments unless you recognize the sender and know the content is
> safe.
>
>
>
> I think the question isn't "Who ... is still using that?" but more "are we
> actually going to support it?" If we're on a version that old it would
> appear that we've basically abandoned it, although there do appear to have
> been refactoring (for other things) commits in the last couple of years. I
> would be in favor of removal from 5.0, but at the very least, could it be
> moved into a separate repo/package so that it's not pulling a relatively
> large dependency subtree from Hadoop into our main codebase?
>
> Cheers,
>
> Derek
>
> On Thu, Mar 9, 2023 at 6:44 AM Miklosovic, Stefan <
> stefan.mikloso...@netapp.com> wrote:
> Hi list,
>
> I stumbled upon Hadoop package again. I think there was some discussion
> about the relevancy of Hadoop code some time ago but I would like to ask
> this again.
>
> Do you think Hadoop code (1) is still relevant in 5.0? Who in the industry
> is still using that?
>
> We might drop a lot of code and some Hadoop dependencies too (3) (even
> their scope is "provided"). The version of Hadoop we build upon is 1.0.3
> which was released 10 years ago. This code does not have any tests nor
> documentation on the website.
>
> There seems to be issues like this (2) and it seems like the solution is
> to, basically, use Spark Cassandra connector instead which I would say is
> quite reasonable.
>
> Regards
>
> (1)
> https://github.com/apache/cassandra/tree/trunk/src/java/org/apache/cassandra/hadoop
> (2) https://lists.apache.org/thread/jdy5hdc2l7l29h04dqol5ylroqos1y2p
> (3)
> https://github.com/apache/cassandra/blob/trunk/.build/parent-pom-template.xml#L507-L589
>
>
> --
> +---+
> | Derek Chen-Becker |
> | GPG Key available at https://keybase.io/dchenbecker and   |
> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
> +---+
>
>


Re: Role of Hadoop code in Cassandra 5.0

2023-03-09 Thread Miklosovic, Stefan
Derek,

I have a couple more points ... I do not think that extracting it to a separate 
repository is a "win". That code is on Hadoop 1.0.3. We would be spending a lot 
of work on extracting it, just to extract 10-year-old code with occasional 
updates (in my humble opinion, just to make it compilable again if the code 
around it changes). What good is there in that? We would have one more place to 
take care of ... Now we at least have it all in one place.

I believe we have four options:

1) leave it there, so it stays like this for the next years with questionable 
and diminishing usage
2) update it to Hadoop 3.3 (I wonder who is going to do that)
3) do 2) and also extract it to a separate repository, although if we do 2) we 
can just leave it where it is
4) remove it


From: Derek Chen-Becker 
Sent: Thursday, March 9, 2023 15:55
To: dev@cassandra.apache.org
Subject: Re: Role of Hadoop code in Cassandra 5.0

NetApp Security WARNING: This is an external email. Do not click links or open 
attachments unless you recognize the sender and know the content is safe.



I think the question isn't "Who ... is still using that?" but more "are we 
actually going to support it?" If we're on a version that old it would appear 
that we've basically abandoned it, although there do appear to have been 
refactoring (for other things) commits in the last couple of years. I would be 
in favor of removal from 5.0, but at the very least, could it be moved into a 
separate repo/package so that it's not pulling a relatively large dependency 
subtree from Hadoop into our main codebase?

Cheers,

Derek

On Thu, Mar 9, 2023 at 6:44 AM Miklosovic, Stefan 
mailto:stefan.mikloso...@netapp.com>> wrote:
Hi list,

I stumbled upon Hadoop package again. I think there was some discussion about 
the relevancy of Hadoop code some time ago but I would like to ask this again.

Do you think Hadoop code (1) is still relevant in 5.0? Who in the industry is 
still using that?

We might drop a lot of code and some Hadoop dependencies too (3) (even their 
scope is "provided"). The version of Hadoop we build upon is 1.0.3 which was 
released 10 years ago. This code does not have any tests nor documentation on 
the website.

There seems to be issues like this (2) and it seems like the solution is to, 
basically, use Spark Cassandra connector instead which I would say is quite 
reasonable.

Regards

(1) 
https://github.com/apache/cassandra/tree/trunk/src/java/org/apache/cassandra/hadoop
(2) https://lists.apache.org/thread/jdy5hdc2l7l29h04dqol5ylroqos1y2p
(3) 
https://github.com/apache/cassandra/blob/trunk/.build/parent-pom-template.xml#L507-L589


--
+---+
| Derek Chen-Becker |
| GPG Key available at https://keybase.io/dchenbecker and   |
| https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
| Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
+---+



Re: [DISCUSS] Enhanced Disk Error Handling

2023-03-09 Thread Bowen Song via dev

Hi Jeremiah,

I'm fully aware of that, which is why I said that deleting the affected 
SSTable files is "less safe".


If the "bad blocks" logic is implemented and the node abort the current 
read query when hitting a bad block, it should remain safe, as the data 
in other SSTable files will not be used. The streamed data should 
contain the unexpired tombstones, and that's enough to keep the data 
consistent on the node.
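
For illustration only, that check could look roughly like the Java sketch below 
(all names are invented; this is not existing Cassandra code). A read that 
touches a token recorded as corrupted is aborted outright instead of being 
answered from the remaining SSTables:

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.NavigableMap;
    import java.util.TreeMap;

    final class BadBlockGuard
    {
        // sstable identifier -> corrupted token ranges, keyed by range start,
        // with the value being the (inclusive) range end
        private final Map<String, NavigableMap<Long, Long>> badRanges = new HashMap<>();

        void markCorrupted(String sstable, long startToken, long endToken)
        {
            badRanges.computeIfAbsent(sstable, s -> new TreeMap<>()).put(startToken, endToken);
        }

        /** Abort (throw) instead of answering from the remaining sstables when a bad block is hit. */
        void checkReadable(String sstable, long token)
        {
            NavigableMap<Long, Long> ranges = badRanges.getOrDefault(sstable, Collections.emptyNavigableMap());
            Map.Entry<Long, Long> range = ranges.floorEntry(token);
            if (range != null && token <= range.getValue())
                throw new IllegalStateException("Aborting read: token " + token
                                                + " falls in a corrupted range of " + sstable);
        }
    }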


Cheers,
Bowen


On 09/03/2023 15:58, Jeremiah D Jordan wrote:
It is actually more complicated than just removing the sstable and 
running repair.


In the face of expired tombstones that might be covering data in other 
sstables the only safe way to deal with a bad sstable is wipe the 
token range in the bad sstable and rebuild/bootstrap that range (or 
wipe/rebuild the whole node which is usually the easier way).  If 
there are expired tombstones in play, it means they could have already 
been compacted away on the other replicas, but may not have compacted 
away on the current replica, meaning the data they cover could still 
be present in other sstables on this node.  Removing the sstable will 
mean resurrecting that data.  And pulling the range from other nodes 
does not help because they can have already compacted away the 
tombstone, so you won’t get it back.


Tl;DR you can’t just remove the one sstable you have to remove all 
data in the token range covered by the sstable (aka all data that 
sstable may have had a tombstone covering).  Then you can stream from 
the other nodes to get the data back.


-Jeremiah

On Mar 8, 2023, at 7:24 AM, Bowen Song via dev 
 wrote:


At the moment, when a read error, such as unrecoverable bit error or 
data corruption, occurs in the SSTable data files, regardless of the 
disk_failure_policy configuration, manual (or to be precise, 
external) intervention is required to recover from the error.


Commonly, there's two approach to recover from such error:

 1. The safer, but slower recover strategy: replace the entire node.
 2. The less safe, but faster recover strategy: shut down the node,
delete the affected SSTable file(s), and then bring the node back
online and run repair.

Based on my understanding of Cassandra, it should be possible to 
recover from such error by marking the affected token range in the 
existing SSTable as "corrupted" and stop reading from them (e.g. 
creating a "bad block" file or in memory), and then streaming the 
affected token range from the healthy replicas. The corrupted SSTable 
file can then be removed upon the next successful compaction 
involving it, or alternatively an anti-compaction is performed on it 
to remove the corrupted data.


The advantage of this strategy is:

  * Reduced node down time - node restart or replacement is not needed
  * Less data streaming is required - only the affected token range
  * Faster recovery time - less streaming and delayed compaction or
anti-compaction
  * No less safe than replacing the entire node
  * This process can be automated internally, removing the need for
operator inputs

The disadvantage is added complexity on the SSTable read path and it 
may mask disk failures from the operator who is not paying attention 
to it.


What do you think about this?
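
For illustration only, the flow described above might be wired together roughly as 
in this Java sketch; BadBlockRegistry, RangeStreamer and all method names are 
invented for the example and are not Cassandra classes:

    final class CorruptRangeRecovery
    {
        interface BadBlockRegistry
        {
            void markCorrupted(String sstable, long startToken, long endToken); // reads in this range now abort
            void forget(String sstable);                                        // entry no longer needed
        }

        interface RangeStreamer
        {
            void streamFromHealthyReplicas(long startToken, long endToken);     // re-fetch only the affected range
        }

        private final BadBlockRegistry registry;
        private final RangeStreamer streamer;

        CorruptRangeRecovery(BadBlockRegistry registry, RangeStreamer streamer)
        {
            this.registry = registry;
            this.streamer = streamer;
        }

        /** Called when a read hits an unrecoverable bit error in an sstable. */
        void onCorruptionDetected(String sstable, long startToken, long endToken)
        {
            registry.markCorrupted(sstable, startToken, endToken);
            streamer.streamFromHealthyReplicas(startToken, endToken);
        }

        /** Called once compaction or anti-compaction has rewritten the corrupted data away. */
        void onCorruptedSSTableGone(String sstable)
        {
            registry.forget(sstable);
        }
    }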



Re: [DISCUSS] Enhanced Disk Error Handling

2023-03-09 Thread Jeremiah D Jordan
It is actually more complicated than just removing the sstable and running 
repair.

In the face of expired tombstones that might be covering data in other sstables 
the only safe way to deal with a bad sstable is wipe the token range in the bad 
sstable and rebuild/bootstrap that range (or wipe/rebuild the whole node which 
is usually the easier way).  If there are expired tombstones in play, it means 
they could have already been compacted away on the other replicas, but may not 
have compacted away on the current replica, meaning the data they cover could 
still be present in other sstables on this node.  Removing the sstable will 
mean resurrecting that data.  And pulling the range from other nodes does not 
help because they can have already compacted away the tombstone, so you won’t 
get it back.

Tl;DR you can’t just remove the one sstable you have to remove all data in the 
token range covered by the sstable (aka all data that sstable may have had a 
tombstone covering).  Then you can stream from the other nodes to get the data 
back.
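
A toy, self-contained Java example of the resurrection problem (not Cassandra 
code; the names and the simplified latest-timestamp-wins merge are invented purely 
for illustration):

    import java.util.Comparator;
    import java.util.List;
    import java.util.Optional;

    final class TombstoneResurrectionDemo
    {
        record Cell(String key, String value, long timestamp, boolean tombstone) {}

        // Latest-timestamp-wins merge across one replica's sstables.
        static Optional<String> read(String key, List<List<Cell>> sstables)
        {
            return sstables.stream()
                           .flatMap(List::stream)
                           .filter(c -> c.key().equals(key))
                           .max(Comparator.comparingLong(Cell::timestamp))
                           .filter(c -> !c.tombstone())
                           .map(Cell::value);
        }

        public static void main(String[] args)
        {
            List<Cell> dataSSTable      = List.of(new Cell("k", "old-value", 1, false));
            List<Cell> tombstoneSSTable = List.of(new Cell("k", null, 2, true)); // the delete, now past gc_grace

            // With both sstables present the tombstone shadows the data: nothing is returned.
            System.out.println(read("k", List.of(dataSSTable, tombstoneSSTable))); // Optional.empty

            // Other replicas may already have compacted both the tombstone and the data away.
            // If tombstoneSSTable is corrupted here and simply deleted, the old value
            // reappears locally -- and a subsequent repair would spread it back out.
            System.out.println(read("k", List.of(dataSSTable)));                   // Optional[old-value]
        }
    }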

-Jeremiah

> On Mar 8, 2023, at 7:24 AM, Bowen Song via dev  
> wrote:
> 
> At the moment, when a read error, such as unrecoverable bit error or data 
> corruption, occurs in the SSTable data files, regardless of the 
> disk_failure_policy configuration, manual (or to be precise, external) 
> intervention is required to recover from the error.
> 
> Commonly, there's two approach to recover from such error:
> 
> The safer, but slower recover strategy: replace the entire node.
> The less safe, but faster recover strategy: shut down the node, delete the 
> affected SSTable file(s), and then bring the node back online and run repair.
> Based on my understanding of Cassandra, it should be possible to recover from 
> such error by marking the affected token range in the existing SSTable as 
> "corrupted" and stop reading from them (e.g. creating a "bad block" file or 
> in memory), and then streaming the affected token range from the healthy 
> replicas. The corrupted SSTable file can then be removed upon the next 
> successful compaction involving it, or alternatively an anti-compaction is 
> performed on it to remove the corrupted data.
> 
> The advantage of this strategy is:
> 
> Reduced node down time - node restart or replacement is not needed
> Less data streaming is required - only the affected token range
> Faster recovery time - less streaming and delayed compaction or 
> anti-compaction
> No less safe than replacing the entire node
> This process can be automated internally, removing the need for operator 
> inputs
> The disadvantage is added complexity on the SSTable read path and it may mask 
> disk failures from the operator who is not paying attention to it.
> 
> What do you think about this?
> 



Re: Role of Hadoop code in Cassandra 5.0

2023-03-09 Thread Miklosovic, Stefan
What about asking somebody from the Hadoop project to update it directly in 
Cassandra? I think these people have loads of experience with integrations like 
this. If we bumped the version to something like 3.3.x, refreshed the code and 
put some tests on top, I think we could just leave it there for a couple more 
years again.


From: Derek Chen-Becker 
Sent: Thursday, March 9, 2023 15:55
To: dev@cassandra.apache.org
Subject: Re: Role of Hadoop code in Cassandra 5.0

NetApp Security WARNING: This is an external email. Do not click links or open 
attachments unless you recognize the sender and know the content is safe.



I think the question isn't "Who ... is still using that?" but more "are we 
actually going to support it?" If we're on a version that old it would appear 
that we've basically abandoned it, although there do appear to have been 
refactoring (for other things) commits in the last couple of years. I would be 
in favor of removal from 5.0, but at the very least, could it be moved into a 
separate repo/package so that it's not pulling a relatively large dependency 
subtree from Hadoop into our main codebase?

Cheers,

Derek

On Thu, Mar 9, 2023 at 6:44 AM Miklosovic, Stefan 
mailto:stefan.mikloso...@netapp.com>> wrote:
Hi list,

I stumbled upon Hadoop package again. I think there was some discussion about 
the relevancy of Hadoop code some time ago but I would like to ask this again.

Do you think Hadoop code (1) is still relevant in 5.0? Who in the industry is 
still using that?

We might drop a lot of code and some Hadoop dependencies too (3) (even their 
scope is "provided"). The version of Hadoop we build upon is 1.0.3 which was 
released 10 years ago. This code does not have any tests nor documentation on 
the website.

There seems to be issues like this (2) and it seems like the solution is to, 
basically, use Spark Cassandra connector instead which I would say is quite 
reasonable.

Regards

(1) 
https://github.com/apache/cassandra/tree/trunk/src/java/org/apache/cassandra/hadoop
(2) https://lists.apache.org/thread/jdy5hdc2l7l29h04dqol5ylroqos1y2p
(3) 
https://github.com/apache/cassandra/blob/trunk/.build/parent-pom-template.xml#L507-L589


--
+---+
| Derek Chen-Becker |
| GPG Key available at https://keybase.io/dchenbecker and   |
| https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
| Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
+---+



Re: Role of Hadoop code in Cassandra 5.0

2023-03-09 Thread Derek Chen-Becker
I think the question isn't "Who ... is still using that?" but more "are we
actually going to support it?" If we're on a version that old it would
appear that we've basically abandoned it, although there do appear to have
been refactoring (for other things) commits in the last couple of years. I
would be in favor of removal from 5.0, but at the very least, could it be
moved into a separate repo/package so that it's not pulling a relatively
large dependency subtree from Hadoop into our main codebase?

Cheers,

Derek

On Thu, Mar 9, 2023 at 6:44 AM Miklosovic, Stefan <
stefan.mikloso...@netapp.com> wrote:

> Hi list,
>
> I stumbled upon the Hadoop package again. I think there was some discussion
> about the relevancy of the Hadoop code some time ago, but I would like to ask
> this again.
>
> Do you think the Hadoop code (1) is still relevant in 5.0? Who in the industry
> is still using that?
>
> We might drop a lot of code and some Hadoop dependencies too (3) (even though
> their scope is "provided"). The version of Hadoop we build upon is 1.0.3,
> which was released 10 years ago. This code does not have any tests or
> documentation on the website.
>
> There seem to be issues like this (2), and it seems the solution is,
> basically, to use the Spark Cassandra connector instead, which I would say is
> quite reasonable.
>
> Regards
>
> (1)
> https://github.com/apache/cassandra/tree/trunk/src/java/org/apache/cassandra/hadoop
> (2) https://lists.apache.org/thread/jdy5hdc2l7l29h04dqol5ylroqos1y2p
> (3)
> https://github.com/apache/cassandra/blob/trunk/.build/parent-pom-template.xml#L507-L589



-- 
+---+
| Derek Chen-Becker |
| GPG Key available at https://keybase.io/dchenbecker and   |
| https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
| Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
+---+


Re: [EXTERNAL] Re: [DISCUSS] Next release date

2023-03-09 Thread Mick Semb Wever
>
> I've also found some useful Cassandra JIRA dashboards for previous
> releases to track progress and scope, but we don't have anything
> similar for the next release. Should we create one?
> Cassandra 4.0 GA Scope
> Cassandra 4.1 GA Scope
>


https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=484


Re: [EXTERNAL] Re: [DISCUSS] Next release date

2023-03-09 Thread Maxim Muzafarov
When I was a release manager for another Apache project, I found it
useful to create Confluence pages for the upcoming release, both for
transparency of release dates and for benchmarks. Of course, the dates
can be updated once we have a better understanding of the scope
of the release.
Do we want something similar?

Here is an example:
https://cwiki.apache.org/confluence/display/IGNITE/Apache+Ignite+2.10

I've also found some useful Cassandra JIRA dashboards for previous
releases to track progress and scope, but we don't have anything
similar for the next release. Should we create one?
Cassandra 4.0 GA Scope
Cassandra 4.1 GA Scope

Example:
https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=546

On Thu, 9 Mar 2023 at 10:13, Branimir Lambov  wrote:
>
> CEPs 25 (trie-indexed sstables) and 26 (unified compaction strategy) should 
> both be ready for review by mid-April.
>
> Both are around 10k LOC, fairly isolated, and in need of a committer to 
> review.
>
> Regards,
> Branimir
>
> On Mon, Mar 6, 2023 at 11:25 AM Benjamin Lerer  wrote:
>>
>> Sorry, I realized that when I started the discussion I probably did not
>> frame it well enough, as I see that it is now going in different directions.
>> The concerns I am seeing are:
>> 1) Too little time between releases is inefficient from both a development
>> perspective and a user perspective: from a development point of view because
>> we are missing time to deliver some features, and from a user perspective
>> because users cannot keep up with the upgrades.
>> 2) Some features are so anticipated (Accord being the one mentioned) that 
>> people would prefer to delay the release to make sure that it is available 
>> as soon as possible.
>> 3) We do not know how long we need to go from the freeze to GA. We hope for 
>> 2 months but our last experience was 6 months. So delaying the release could 
>> mean not releasing this year.
>> 4) For people doing marketing it is really hard to promote a product when 
>> you do not know when the release will come and what features might be there.
>>
>> All those concerns are probably made even worse by the fact that we do not
>> have clear visibility on where we are.
>>
>> Should we clarify that part first by getting an idea of the status of the 
>> different CEPs and other big pieces of work? From there we could agree on 
>> some timeline for the freeze. We could then discuss how to make predictable 
>> the time from freeze to GA.
>>
>>
>>
>>> On Sat, Mar 4, 2023 at 6:14 PM, Josh McKenzie wrote:
>>>
>>> (for convenience's sake, I'm referring to both Major and Minor semver
>>> releases as "major" in this email)
>>>
>>> The big feature from our perspective for 5.0 is ACCORD (CEP-15) and I would 
>>> advocate to delay until this has sufficient quality to be in production.
>>>
>>> This approach can be pretty unpredictable in this domain; often unforeseen 
>>> things come up in implementation that can give you a long tail on something 
>>> being production ready. For the record - I don't intend to single Accord 
>>> out at all on this front, quite the opposite given how much rigor's gone 
>>> into the design and implementation. I'm just thinking from my personal 
>>> experience: everything I've worked on, overseen, or followed closely on 
>>> this codebase always has a few tricks up its sleeve along the way to having 
>>> edge-cases stabilized.
>>>
>>> Much like on some other recent topics, I think there's a nuanced middle 
>>> ground where we take things on a case-by-case basis. Some factors that have 
>>> come up in this thread that resonated with me:
>>>
>>> For a given potential release date 'X':
>>> 1. How long has it been since the last release?
>>> 2. How long do we expect qualification to take from a "freeze" (i.e. no new 
>>> improvement or features, branch) point?
>>> 3. What body of merged production ready work is available?
>>> 4. What body of new work do we have high confidence will be ready within Y 
>>> time?
>>>
>>> I think it's worth defining a loose "minimum bound and upper bound" on 
>>> release cycles we want to try and stick with barring extenuating 
>>> circumstances. For instance: try not to release sooner than maybe 10 months 
>>> out from a prior major, and try not to release later than 18 months out 
>>> from a prior major. Make exceptions if truly exceptional things land, are 
>>> about to land, or bugs are discovered around those boundaries.
>>>
>>> Applying the above framework to what we have in flight, our last release 
>>> date, expectations on CI, etc - targeting an early fall freeze (pending CEP 
>>> status) and mid to late fall or December release "feels right" to me.
>>>
>>> With the exception, of course, that if something merges earlier, is stable, 
>>> and we feel is valuable enough to cut a major based on that, we do it.
>>>
>>> ~Josh
>>>
>>> On Fri, Mar 3, 2023, at 7:37 PM, German Eichberger via dev wrote:
>>>
>>> Hi,
>>>
>>> We shouldn't release just for release's sake ...

Role of Hadoop code in Cassandra 5.0

2023-03-09 Thread Miklosovic, Stefan
Hi list,

I stumbled upon the Hadoop package again. I think there was some discussion about
the relevancy of the Hadoop code some time ago, but I would like to ask this again.

Do you think the Hadoop code (1) is still relevant in 5.0? Who in the industry is
still using that?

We might drop a lot of code and some Hadoop dependencies too (3) (even though their
scope is "provided"). The version of Hadoop we build upon is 1.0.3, which was
released 10 years ago. This code does not have any tests or documentation on
the website.

There seem to be issues like this (2), and it seems the solution is, basically, to
use the Spark Cassandra connector instead, which I would say is quite reasonable.

Regards

(1) 
https://github.com/apache/cassandra/tree/trunk/src/java/org/apache/cassandra/hadoop
(2) https://lists.apache.org/thread/jdy5hdc2l7l29h04dqol5ylroqos1y2p
(3) 
https://github.com/apache/cassandra/blob/trunk/.build/parent-pom-template.xml#L507-L589
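
For readers unfamiliar with the Spark Cassandra connector alternative mentioned
above, the sketch below shows roughly what reading a table through the connector's
Java API looks like. The contact point, keyspace, and table names are placeholders,
and the exact connector version and API details should be checked against the
connector's own documentation:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;

    public class SparkConnectorReadExample
    {
        public static void main(String[] args)
        {
            // Placeholder contact point; point this at a real cluster.
            SparkConf conf = new SparkConf()
                    .setAppName("cassandra-read-example")
                    .setMaster("local[*]")
                    .set("spark.cassandra.connection.host", "127.0.0.1");

            JavaSparkContext sc = new JavaSparkContext(conf);

            // Read a table as an RDD of rows and count them.
            long rows = javaFunctions(sc)
                    .cassandraTable("my_keyspace", "my_table")
                    .count();

            System.out.println("rows read: " + rows);
            sc.stop();
        }
    }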

Re: [RELEASE] Apache Cassandra 4.0.8 released

2023-03-09 Thread Brandon Williams
It was reported in CASSANDRA-18307 that the Debian and Redhat packages
for 4.0.8 did not make it to the jfrog repository - this has now been
corrected, sorry for any inconvenience.

Kind Regards,
Brandon

On Tue, Feb 14, 2023 at 3:39 PM Miklosovic, Stefan wrote:
>
> The Cassandra team is pleased to announce the release of Apache Cassandra 
> version 4.0.8.
>
> Apache Cassandra is a fully distributed database. It is the right choice when 
> you need scalability and high availability without compromising performance.
>
>  http://cassandra.apache.org/
>
> Downloads of source and binary distributions are listed in our download 
> section:
>
>  http://cassandra.apache.org/download/
>
> This version is a bug fix release[1] on the 4.0 series. As always, please pay
> attention to the release notes[2] and let us know[3] if you encounter
> any problems.
>
> [WARNING] Debian and RedHat package repositories have moved! Debian 
> /etc/apt/sources.list.d/cassandra.sources.list and RedHat 
> /etc/yum.repos.d/cassandra.repo files must be updated to the new repository 
> URLs. For Debian it is now https://debian.cassandra.apache.org . For RedHat 
> it is now https://redhat.cassandra.apache.org/40x/ .
>
> Enjoy!
>
> [1]: CHANGES.txt 
> https://gitbox.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=CHANGES.txt;hb=refs/tags/cassandra-4.0.8
> [2]: NEWS.txt 
> https://gitbox.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=NEWS.txt;hb=refs/tags/cassandra-4.0.8
> [3]: https://issues.apache.org/jira/browse/CASSANDRA


Re: [EXTERNAL] Re: [DISCUSS] Next release date

2023-03-09 Thread Branimir Lambov
CEPs 25 (trie-indexed sstables) and 26 (unified compaction strategy) should
both be ready for review by mid-April.

Both are around 10k LOC, fairly isolated, and in need of a committer to
review.

Regards,
Branimir

On Mon, Mar 6, 2023 at 11:25 AM Benjamin Lerer  wrote:

> Sorry, I realized that when I started the discussion I probably did not
> frame it well enough, as I see that it is now going in different directions.
> The concerns I am seeing are:
> 1) Too little time between releases is inefficient from both a development
> perspective and a user perspective: from a development point of view because
> we are missing time to deliver some features, and from a user perspective
> because users cannot keep up with the upgrades.
> 2) Some features are so anticipated (Accord being the one mentioned) that
> people would prefer to delay the release to make sure that it is available
> as soon as possible.
> 3) We do not know how long we need to go from the freeze to GA. We hope
> for 2 months but our last experience was 6 months. So delaying the release
> could mean not releasing this year.
> 4) For people doing marketing it is really hard to promote a product when
> you do not know when the release will come and what features might be there.
>
> All those concerns are probably made even worse by the fact that we do not
> have clear visibility on where we are.
>
> Should we clarify that part first by getting an idea of the status of the
> different CEPs and other big pieces of work? From there we could agree on
> some timeline for the freeze. We could then discuss how to make predictable
> the time from freeze to GA.
>
>
>
> On Sat, Mar 4, 2023 at 6:14 PM, Josh McKenzie wrote:
>
>> (for convenience's sake, I'm referring to both Major and Minor semver
>> releases as "major" in this email)
>>
>> The big feature from our perspective for 5.0 is ACCORD (CEP-15) and I
>> would advocate to delay until this has sufficient quality to be in
>> production.
>>
>> This approach can be pretty unpredictable in this domain; often
>> unforeseen things come up in implementation that can give you a long tail
>> on something being production ready. For the record - I don't intend to
>> single Accord out *at all* on this front, quite the opposite given how
>> much rigor's gone into the design and implementation. I'm just thinking
>> from my personal experience: everything I've worked on, overseen, or
>> followed closely on this codebase always has a few tricks up its sleeve
>> along the way to having edge-cases stabilized.
>>
>> Much like on some other recent topics, I think there's a nuanced middle
>> ground where we take things on a case-by-case basis. Some factors that have
>> come up in this thread that resonated with me:
>>
>> For a given potential release date 'X':
>> 1. How long has it been since the last release?
>> 2. How long do we expect qualification to take from a "freeze" (i.e. no
>> new improvement or features, branch) point?
>> 3. What body of merged production ready work is available?
>> 4. What body of new work do we have high confidence will be ready within
>> Y time?
>>
>> I think it's worth defining a loose "minimum bound and upper bound" on
>> release cycles we want to try and stick with barring extenuating
>> circumstances. For instance: try not to release sooner than maybe 10 months
>> out from a prior major, and try not to release later than 18 months out
>> from a prior major. Make exceptions if truly exceptional things land, are
>> about to land, or bugs are discovered around those boundaries.
>>
>> Applying the above framework to what we have in flight, our last release
>> date, expectations on CI, etc - targeting an early fall freeze (pending CEP
>> status) and mid to late fall or December release "feels right" to me.
>>
>> With the exception, of course, that if something merges earlier, is
>> stable, and we feel is valuable enough to cut a major based on that, we do
>> it.
>>
>> ~Josh
>>
>> On Fri, Mar 3, 2023, at 7:37 PM, German Eichberger via dev wrote:
>>
>> Hi,
>>
>> We shouldn't release just for release's sake. Are there enough new
>> features and are they working well enough (quality!).
>>
>> The big feature from our perspective for 5.0 is ACCORD (CEP-15) and I
>> would advocate to delay until this has sufficient quality to be in
>> production.
>>
>> Just because something is released doesn't mean anyone is gonna use it.
>> To add some operator perspective: Every time there is a new release we need
>> to decide
>> 1) are we supporting it
>> 2) which other release can we deprecate
>>
>> and potentially migrate people - which is also a tough sell if there are
>> no significant features and/or breaking changes. So from my perspective,
>> less frequent releases are better - after all, we haven't gotten around to
>> supporting 4.1.
>>
>> The 5.0 release is also coupled with deprecating 3.11, which is what a
>> significant number of people are using - given 4.1 took longer, I am not
>> sure how many