Re: [DISCUSS] Deprecate and remove resumable bootstrap and decommission

Bowen Song via dev Wed, 03 Aug 2022 16:42:16 -0700

That was Cassandra 3.11, before the introduction of zero copy. But I must say I'm not certain whether the new zero copy streaming can prevent the long GC pauses, as I haven't tried it.


On 03/08/2022 23:37, Josh McKenzie wrote:

I had to resume the bootstrap once or twice in order to get these nodes to finish joining the cluster.

Was this before or after the addition of zero copy streaming? The premise is that the pain point resumable bootstrap targets is mitigated by the much faster bootstrapping times without the correctness risks.


On Wed, Aug 3, 2022, at 6:21 PM, Bowen Song via dev wrote:


That would have to be assessed on a case by case basis.

* When the code doesn't delete data, which means there's a zero probability of resurrecting deleted data, I will still use resumable bootstrap.

* When resurrected data doesn't pose a problem to the system, it often can still be an acceptable behaviour to save hours or days of bootstrapping time. I may use resumable bootstrap.

* In other cases, where data correctness is important and there's a chance for resurrecting deleted data, I would certainly not use it if I had known it in advance (which I don't).



On 03/08/2022 23:11, Jeff Jirsa wrote:

The hypothetical concern described is around potential data resurrection - would you still use resumable bootstrap if you knew that data deleted during those STW pauses was improperly resurrected?

On Wed, Aug 3, 2022 at 2:40 PM Bowen Song via dev wrote:


    I have benefited from the resumable bootstrap before, and I'm in
    favour of keeping the feature around.

    I've had streaming failures due to long STW GC pauses on some
    bootstrapping nodes, and I had to resume the bootstrap once or
    twice in order to get these nodes to finish joining the cluster.
    They have not experienced any more long STW GC pauses since they
    joined the cluster. I would imagine I will spend a lot of time
    tuning the GC parameters in order to get these nodes to join if the
    resumable bootstrapping feature is removed. Also, I'm not
    concerned about race conditions involving repairs, because we
    don't run repairs while we are adding new nodes (to minimize the
    additional load on the cluster).


    On 03/08/2022 19:46, Josh McKenzie wrote:

    Context: https://issues.apache.org/jira/browse/CASSANDRA-17679

    From the .yaml comment on the param I was working on adding:
    In certain environments, operators may want to disable resumable bootstrap
    in order to avoid potential correctness violations or data loss scenarios.
    Largely this centers around nodes going down during bootstrap, tombstones being
    written, and potential races with repair. By default we leave this on as it's
    been enabled for quite some time, however the option to disable it is more
    palatable now that we have zero copy streaming as that greatly accelerates


    Given zero copy streaming in the system and the general
    unexplored correctness concerns of
    https://issues.apache.org/jira/browse/CASSANDRA-8838,
    specifically pointed out by Jeff here:
    https://issues.apache.org/jira/browse/CASSANDRA-8838?focusedCommentId=16900234&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16900234,
    I've been chatting w/Paulo about this and we've both concluded we
    think the functionality should be made configurable, default
    off (?), deprecated in 4.2 and then completely removed next.

    - First: anyone have any concerns with the general arc of
    "remove resumable bootstrap and decommission"?
    - Second: Should we leave them enabled by default in 4.2 or
    disabled?
    - Third: Should we consider revisiting older branches with this
    functionality and making it toggle-able?

    ~Josh
