Thanks, Jeff, for the detailed steps and summary.
We will keep the community (this thread) up to date on how it plays out in
our fleet.

Jaydeep

On Fri, May 5, 2023 at 9:10 AM Jeff Jirsa <jji...@gmail.com> wrote:

> Lots of caveats on these suggestions, let me try to hit most of them.
>
> Running cleanup in parallel across nodes is fine and common. If you're using
> lots of vnodes, limit the number of cleanup threads so each node runs one
> cleanup job at a time and doesn't tie up all of its cores at once.
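>
> As a minimal sketch of throttling that (the keyspace name is a placeholder),
> nodetool cleanup takes a jobs flag:
>
>     nodetool cleanup -j 1 my_keyspace
>
> where -j limits how many SSTables are cleaned up at once on that node.
>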
> If a host is fully offline, you can ALSO use replace address first boot.
> It'll stream data right to that host with the same token assignments you
> had before, and no cleanup is needed then. Strictly speaking, to avoid
> resurrection here, you'd want to run repair on the replicas of the down
> host (for vnodes, probably the whole cluster), but your current process
> doesn't guarantee that either (decom + bootstrap may resurrect, strictly
> speaking).
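>
> A hedged sketch of that repair step (the keyspace name is a placeholder; with
> vnodes you'd typically run it on every node):
>
>     nodetool repair -pr my_keyspace
>
> -pr repairs only the primary ranges each node owns, so looping it over all
> nodes covers the cluster without repairing the same ranges multiple times.
>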
> Dropping vnodes will reduce the replicas that have to be cleaned up, but
> also potentially increase your imbalance on each replacement.
>
> Cassandra should still do this on its own, and I think once CEP-21 is
> committed, this should be one of the first enhancement tickets.
>
> Until then, LeveledCompactionStrategy really does make cleanup fast and
> cheap, at the cost of higher IO the rest of the time. If you can tolerate
> that higher IO, you'll probably appreciate LCS anyway (faster reads, faster
> data deletion than STCS). It's a lot of IO compared to STCS though.
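>
> For reference, LCS is a per-table compaction setting; a hedged example of
> switching a table over (keyspace/table names and the SSTable size are
> placeholders):
>
>     ALTER TABLE my_keyspace.my_table
>       WITH compaction = {'class': 'LeveledCompactionStrategy',
>                          'sstable_size_in_mb': 160};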
>
>
>
> On Fri, May 5, 2023 at 9:02 AM Jaydeep Chovatia <
> chovatia.jayd...@gmail.com> wrote:
>
>> Thanks all for your valuable inputs. We will try some of the suggested
>> methods in this thread, and see how it goes. We will keep you updated on
>> our progress.
>> Thanks a lot once again!
>>
>> Jaydeep
>>
>> On Fri, May 5, 2023 at 8:55 AM Bowen Song via user <
>> user@cassandra.apache.org> wrote:
>>
>>> Depending on the number of vnodes per server, the probability and
>>> severity (i.e. the size of the affected token ranges) of an availability
>>> degradation due to a server failure during node replacement may be small.
>>> You also have the choice of increasing the RF if that's still not
>>> acceptable.
>>>
>>> Also, reducing the number of vnodes per server can limit the number of
>>> servers affected by replacing a single server, thereby reducing the amount
>>> of time required to run "nodetool cleanup" if it is run sequentially.
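>>>
>>> For illustration only (16 is just an example value, and num_tokens can only
>>> be set on new or replacement nodes, not changed in place on an existing
>>> node), the vnode count is the num_tokens setting in cassandra.yaml:
>>>
>>>     num_tokens: 16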
>>>
>>> Finally, you may choose to run "nodetool cleanup" concurrently on
>>> multiple nodes to reduce the amount of time required to complete it.
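>>>
>>> A minimal sketch of doing that over SSH (hostnames, keyspace and log path
>>> are placeholders; in practice you may want to stagger nodes so the extra
>>> compaction I/O doesn't hit every replica of a range at the same time):
>>>
>>>     for h in node1 node2 node3; do
>>>       ssh "$h" 'nohup nodetool cleanup -j 1 my_keyspace >/tmp/cleanup.log 2>&1 &'
>>>     done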
>>>
>>>
>>> On 05/05/2023 16:26, Runtian Liu wrote:
>>>
>>> We are doing the "add a node, then decommission a node" approach to achieve
>>> better availability. Replacing a node requires shutting one node down
>>> first; if another node goes down during the replacement period, we will see
>>> an availability drop because most of our use cases are local_quorum with
>>> replication factor 3.
>>>
>>> On Fri, May 5, 2023 at 5:59 AM Bowen Song via user <
>>> user@cassandra.apache.org> wrote:
>>>
>>>> Have you thought of using "-Dcassandra.replace_address_first_boot=..."
>>>> (or "-Dcassandra.replace_address=..." if you are using an older version)?
>>>> This will not result in a topology change, which means "nodetool cleanup"
>>>> is not needed after the operation is completed.
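>>>>
>>>> A hedged example of wiring that in (the IP address is a placeholder for
>>>> the dead node's address, and the flag should be removed again once the
>>>> replacement has finished), e.g. appended in cassandra-env.sh:
>>>>
>>>>     JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=10.0.0.12"
>>>>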
>>>> On 05/05/2023 05:24, Jaydeep Chovatia wrote:
>>>>
>>>> Thanks, Jeff!
>>>> But in our environment we replace nodes quite often for various
>>>> optimization purposes, say almost 1 node per day (node *addition*
>>>> followed by node *decommission*, which of course changes the
>>>> topology), and we have a cluster of 100 nodes with 300GB per node. If
>>>> we have to run cleanup on all 100 nodes after every replacement, it could
>>>> take forever.
>>>> What is the recommendation until we get this fixed in Cassandra itself
>>>> as part of compaction (w/o externally triggering *cleanup*)?
>>>>
>>>> Jaydeep
>>>>
>>>> On Thu, May 4, 2023 at 8:14 PM Jeff Jirsa <jji...@gmail.com> wrote:
>>>>
>>>>> Cleanup is fast and cheap and basically a no-op if you haven’t changed
>>>>> the ring.
>>>>>
>>>>> After cassandra has transactional cluster metadata to make ring
>>>>> changes strongly consistent, cassandra should do this in every compaction.
>>>>> But until then it’s left for operators to run when they’re sure the state
>>>>> of the ring is correct.
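>>>>>
>>>>> As a purely illustrative sanity check before kicking cleanup off:
>>>>>
>>>>>     nodetool status           # every node should be UN, none joining/leaving
>>>>>     nodetool describecluster  # all nodes should agree on one schema version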
>>>>>
>>>>>
>>>>>
>>>>> On May 4, 2023, at 7:41 PM, Jaydeep Chovatia <
>>>>> chovatia.jayd...@gmail.com> wrote:
>>>>>
>>>>>
>>>>> Isn't this considered a kind of *bug* in Cassandra? As we know,
>>>>> *cleanup* is a lengthy and unreliable operation, so relying on
>>>>> *cleanup* means higher chances of data resurrection.
>>>>> Do you think we should discard the unowned token-ranges as part of
>>>>> regular compaction itself? What are the pitfalls of doing this as part
>>>>> of compaction?
>>>>>
>>>>> Jaydeep
>>>>>
>>>>> On Thu, May 4, 2023 at 7:25 PM guo Maxwell <cclive1...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Compaction will just merge duplicate data and remove deleted data on
>>>>>> this node. If you add or remove a node from the cluster, I think cleanup
>>>>>> is needed. If cleanup failed, I think we should look into the reason.
>>>>>> On Fri, May 5, 2023 at 06:37, Runtian Liu <curly...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> Is cleanup the sole method to remove data that does not belong to a
>>>>>>> specific node? In a cluster where nodes are added or decommissioned
>>>>>>> from time to time, failing to run cleanup may lead to data
>>>>>>> resurrection issues, as deleted data may remain on a node that lost
>>>>>>> ownership of certain partitions. Or is it true that normal compactions
>>>>>>> can also handle removing data that a node no longer owns?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Runtian
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> you are the apple of my eye !
>>>>>>
>>>>>
