Hello all,

I started a voting mail thread for this CEP with the subject line:
"[VOTE] CEP-40: Data Transfer Using Cassandra Sidecar for Live Migrating
Instances". It looks like it went unnoticed (no one has voted yet). Could
you please vote in that thread?

Thanks!
Hari


On Fri, Jun 21, 2024 at 8:33 PM Venkata Hari Krishna Nukala <
n.v.harikrishna.apa...@gmail.com> wrote:

> Hi all,
>
> I have not heard anything in the last 10+ days. I am taking that as a
> positive sign and proceeding to the voting stage for this CEP.
>
> Thanks!
> Hari
>
> On Fri, Jun 7, 2024 at 10:26 PM Venkata Hari Krishna Nukala <
> n.v.harikrishna.apa...@gmail.com> wrote:
>
>>
>> Summarizing the discussion so far:
>>
>>
>>
>> *Data copy using rsync vs Sidecar*
>> Copying data via rsync is an incomplete solution and has to be executed
>> outside of the Cassandra ecosystem. Copying data via the Sidecar gives
>> Cassandra an ecosystem-native approach outside the streaming path, one
>> that avoids repairs, decommissions and bootstraps. The proposed solution
>> poses fewer security concerns than rsync, and an ecosystem-native
>> approach is more instrumentable and measurable than rsync. Tooling can be
>> built on top of it.
>>
>> *File digest/checksum*
>>
>> The initial proposal mentioned that a combination of file path and size
>> is used to verify that the destination and source have the same set of
>> data. Scott, Jon and Dinesh expressed concerns about hitting corner cases
>> where just verifying path & size is not good enough. I have updated
>> CEP-40 to do binary-level file verification using a digest algorithm.
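>>
>> As a purely illustrative sketch (hypothetical helper, not CEP-40's actual
>> implementation, and assuming a Java 17+ runtime for HexFormat),
>> binary-level verification can be as simple as streaming each transferred
>> file through a standard JDK digest and comparing the resulting hex
>> strings on both sides:
>>
>> import java.io.InputStream;
>> import java.nio.file.Files;
>> import java.nio.file.Path;
>> import java.security.MessageDigest;
>> import java.util.HexFormat;
>>
>> // Hypothetical sketch: digest a file so source and destination can
>> // compare content, not just path and size. SHA-256 is one possible
>> // choice of algorithm, not necessarily the one the CEP settles on.
>> public final class FileDigest {
>>     public static String sha256(Path file) throws Exception {
>>         MessageDigest md = MessageDigest.getInstance("SHA-256");
>>         try (InputStream in = Files.newInputStream(file)) {
>>             byte[] buf = new byte[8192];
>>             for (int n = in.read(buf); n != -1; n = in.read(buf)) {
>>                 md.update(buf, 0, n);
>>             }
>>         }
>>         return HexFormat.of().formatHex(md.digest());
>>     }
>> }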
>>
>> *Managing C* lifecycle with Sidecar*
>>
>> The proposed migration process requires bringing the Cassandra instances
>> up and down. The CEP calls out that bringing the instances up/down is not
>> in scope. Jon and Jordan expressed that adding this ability, to make the
>> entire workflow self-managed, would be the biggest win.
>>
>> Managing the C* lifecycle (safely start, stop & restart) is already
>> considered in scope for CEP-1
>> <https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=95652224#CEP1:ApacheCassandraManagementProcess(es)-3.Lifecycle(safelystart,stop,restartC*)>.
>> It can be leveraged once implemented as part of CEP-1.
>>
>> *Abstraction of how files get moved, backup and restore*
>>
>> Jordan & German mentioned that having an abstraction of how files get
>> moved / put in place would allow others to plug in alternative means of
>> data movement, like pulling down from backups/S3/any other source. Jeff
>> added the following points. 1) If you think of it instead as “change the
>> backup/restore mechanism to be able to safely restore from a running
>> instance”, you may end up with a cleaner abstraction that’s easier to
>> think about (and may also be easier to generalize in clouds where you
>> have other tools available). 2) “Ensure the original source node isn’t
>> running”, “migrate the config”, “choose and copy a snapshot”, and maybe
>> “forcibly exclude the original instance from the cluster” are all things
>> the restore code is going to need to do anyway, and if restore doesn’t do
>> that today, it seems like we can solve it once. It accomplishes the
>> original goal, in largely the same fashion; it just makes the logic
>> reusable for other purposes.
>>
>> Jon and Jordan mentioned that framing the replacement of a node as
>> restoring a node and then kicking off a node replacement is an
>> interesting perspective. The data copy task mentioned in this CEP can be
>> viewed as a restore task/job which treats another running Sidecar as the
>> source. When it is generalised to support other sources like S3 or disk
>> snapshots, support for many use cases can be added, like restoring from
>> S3 or disk snapshots.
>>
>> I have updated CEP-40 with the details of how files get moved and put in
>> place, which can be treated as the default implementation for live
>> migration. Having a cleaner abstraction/interface for the source has been
>> added as one of the goals. The data copy task can be decoupled from live
>> migration so that it can be used to copy data from any (remote) source to
>> local storage; this way it can be leveraged across different use cases.
>> The data copy task endpoint can be tailored to accommodate different
>> plugins during implementation. Francisco mentioned that Sidecar now has
>> the ability to restore data from S3 (for the Analytics library) and that
>> it can be extended for live migration, backup and restore, and others.
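>>
>> To make the abstraction concrete, here is a minimal sketch of what a
>> pluggable source interface could look like; all names are illustrative
>> assumptions, not the CEP's actual API:
>>
>> import java.nio.file.Path;
>> import java.util.List;
>>
>> // Hypothetical shape for a pluggable source: a peer Sidecar, an S3
>> // bucket, or a disk snapshot could each implement it, so the same data
>> // copy task serves live migration, restore, and similar workflows.
>> public interface MigrationDataSource {
>>     // Per-file manifest data the destination verifies against.
>>     record FileEntry(String relativePath, long sizeBytes, String digest) {}
>>
>>     // Enumerate the files available at the source for a keyspace/table.
>>     List<FileEntry> listFiles(String keyspace, String table) throws Exception;
>>
>>     // Copy one file from the source into the given local path.
>>     void fetch(FileEntry entry, Path localDestination) throws Exception;
>> }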
>>
>> *Supporting live migration within the Cassandra process instead of the Sidecar*
>>
>> Paulo and Ariel raised a point about supporting migration in the main
>> process via entire-sstable streaming, which could also help people who
>> aren't running the Sidecar.
>>
>> Jon, Francisco, Jordan, Scott & Dinesh mentioned the following benefits
>> of doing live migration via the Sidecar. The Sidecar can be used for
>> coordination to start and stop instances, or to do things that require
>> something out of process. The Sidecar would be able to migrate from a
>> Cassandra instance that is already dead and cannot recover (for reasons
>> other than disk issues). If we are considering the main process, then we
>> have to do some additional work to ensure that it doesn’t put pressure on
>> the JVM and introduce latency. The host replacement process also puts a
>> lot of stress on gossip and is a great way to encounter all sorts of
>> painful races if you perform it hundreds or thousands of times (but that
>> shouldn’t be a problem in the TCM world). It is also valuable to have a
>> paved-path implementation of a safe migration/forklift state machine for
>> when you’re in a bind, or need to do this hundreds or thousands of times.
>>
>> *Migrating a specific keyspace to a dedicated cluster*
>>
>> Patrick brought up an interesting use case. In his words: In many cases,
>> multiple tenants present cause the cluster to overpressure. The best
>> solution in that case is to migrate the largest keyspace to a dedicated
>> cluster. Live migration but a bit more complicated. No chance of doing this
>> manually without some serious brain surgery on c* and downtime.
>>
>> With the proposed solution, keyspaces can be copied selectively by
>> supplying inclusion/exclusion filters to the data copy task API. If the
>> writes for these keyspaces can't be stopped at the source, then the
>> destination may never reach the point where 100% of the data of the
>> selected keyspaces matches the source. To me, it sounds doable assuming
>> that the writes can be stopped for some time. As Patrick called out, it
>> is a bit more complicated when downtime is not acceptable. As Josh
>> mentioned, it can be a stretch goal or v2 of this CEP.
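>>
>> For illustration, a filtered data copy request might look like the
>> following sketch. The endpoint path, JSON keys and addresses are made-up
>> assumptions; the actual API shape would be settled during implementation:
>>
>> import java.net.URI;
>> import java.net.http.HttpClient;
>> import java.net.http.HttpRequest;
>> import java.net.http.HttpResponse;
>>
>> // Hypothetical request: start a data copy task that includes only one
>> // keyspace, pulling from a peer Sidecar as the source.
>> public class StartKeyspaceCopy {
>>     public static void main(String[] args) throws Exception {
>>         String body = """
>>                 {"source": "sidecar://10.0.0.5:9043",
>>                  "includeKeyspaces": ["big_tenant_ks"],
>>                  "excludeKeyspaces": []}""";
>>         HttpRequest request = HttpRequest.newBuilder()
>>                 .uri(URI.create("http://localhost:9043/api/v1/data-copy-tasks"))
>>                 .header("Content-Type", "application/json")
>>                 .POST(HttpRequest.BodyPublishers.ofString(body))
>>                 .build();
>>         HttpResponse<String> response = HttpClient.newHttpClient()
>>                 .send(request, HttpResponse.BodyHandlers.ofString());
>>         System.out.println(response.statusCode() + " " + response.body());
>>     }
>> }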
>>
>> *Live migration + TCM*
>>
>> Alex raised a point that the cluster may lose both durability and
>> availability during the second phase of the data copy. This migration can
>> be done in a more durable manner with TCM by leaving the source node as
>> both a read and write target and allowing the new node to be a target for
>> writes; this can eliminate the second phase of data copying. Jordan also
>> agrees that it is a more durable approach. Alex also felt that it would
>> be good to have a clear understanding of the availability and durability
>> guarantees we want to provide, and have them stated explicitly, for both
>> the "source node down" and "source node up" cases. Alex and I had an
>> offline discussion and brought up the point that this may require extra
>> care to make sure that sstables from the initial data copy aren't
>> involved in the regular node sstable lifecycle, since in that case we may
>> inadvertently remove or compact them.
>>
>> Alex is fine with doing it either as part of the current CEP or as a
>> follow-up.
>>
>> ----
>>
>> Hope I have covered all the points and addressed them in the CEP. Please
>> call out if I missed anything. I feel that we have reached a point where
>> we have had a fair amount of discussion and the discussion has fairly
>> converged. Is it a good time to call for a vote? Do I have to wait or do
>> anything else before requesting votes?
>>
>> Thanks!
>> Hari
>>
>> On Thu, May 30, 2024 at 7:54 PM Alex Petrov <al...@coffeenco.de> wrote:
>>
>>> Alex, just want to make sure that I understand your point correctly. Are
>>> you suggesting this sequence of operations with TCM?
>>>
>>> * Make config changes
>>> * Do the initial data copy
>>> * Make destination part of write placements (same as source)
>>> * Start destination instance
>>> * Decommission the source
>>> * Enable reads for destination by making it part of read placements (as
>>> source)
>>>
>>>
>>> Almost. I am suggesting reusing the logic we have in TCM and already use
>>> for bootstraps and replacements. I think the sequencing will be
>>> something like:
>>>   * Make config changes
>>>   * Start destination instance
>>>   * Make destination part of write placements (same as source)
>>>   * Do the initial data copy
>>>   * Load sstables from the initial data copy
>>>   * Enable reads for destination by making it part of read placements
>>>   * Decommission the source
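>>>
>>> Purely as a sketch of that ordering (not TCM's actual API), the phases
>>> could be written down as an explicit sequence that an orchestrator steps
>>> through and can resume from:
>>>
>>> // Hypothetical names; one enum constant per step listed above.
>>> public enum LiveMigrationPhase {
>>>     APPLY_CONFIG,        // make config changes
>>>     START_DESTINATION,   // start destination instance
>>>     ADD_WRITE_PLACEMENT, // destination joins write placements (as source)
>>>     INITIAL_DATA_COPY,   // do the initial data copy
>>>     LOAD_SSTABLES,       // load sstables from the initial data copy
>>>     ADD_READ_PLACEMENT,  // destination joins read placements
>>>     DECOMMISSION_SOURCE  // decommission the source
>>> }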
>>>
>>> We've also had a short discussion offline, and brought up a good point
>>> that this may require extra care for making sure that initial data copy
>>> sstables aren't involved in the regular node sstable lifecycle, since in
>>> that case we may inadvertently remove or compact them.
>>>
>>> > It is a fair point. It would be good to have an understanding of the
>>> availability and durability guarantees during migration. I can create a
>>> JIRA for it later.
>>>
>>> Sounds good. As I mentioned, I'm fine either way: if we do it as a part
>>> of CEP, or as a follow-up.
>>>
>>> On Sun, May 12, 2024, at 8:18 PM, Venkata Hari Krishna Nukala wrote:
>>>
>>> Replies from my side for the other points of the discussion:
>>> *Managing C* lifecycle with Sidecar*
>>>
>>> >lifecycle / orchestration portion is the more challenging aspect. It
>>> would be nice to address that as well so we don’t end up with something
>>> like repair where the building blocks are there but the hard parts are left
>>> to the operator
>>>
>>> CEP-1 has lifecycle operations in scope:
>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=95652224#CEP1:ApacheCassandraManagementProcess(es)-3.Lifecycle(safelystart,stop,restartC*).
>>> I think it can be leveraged once implemented as part of CEP-1.
>>>
>>> *On backup & restore use case*
>>>
>>> I see similarities between backup/restore and this migration, but I feel
>>> there will be considerable differences in implementation, and we might
>>> need to tailor the API to make it usable for backup & restore. I think
>>> making the code/logic reusable can be an implicit goal. Does calling
>>> backup & restore a stretch goal, or creating a separate CEP, sound fair?
>>>
>>> *Migrate the largest keyspace to a dedicated cluster*
>>>
>>> Patrick, the proposed API can help copy a specific keyspace's data to
>>> another cluster. "No chance of doing this manually without some serious
>>> brain surgery on c* and downtime." sounds a bit tricky to me. Since the
>>> clusters are independent, doing it without any coordination between the
>>> clusters and without downtime sounds like a case this CEP is not
>>> targeting at the moment.
>>>
>>> *Live migration + TCM*
>>>
>>> >We can implement CEP-40 using a similar approach: we can leave the
>>> source node as both a read and write target, and allow the new node to be a
>>> target for (pending) writes. Unfortunately, this does not help with
>>> availability (in fact, it decreases write availability, since we will have
>>> to collect 2+1 mandatory write responses instead of just 2), but increases
>>> durability, and I think helps to fully eliminate the second phase. This
>>> also increases read availability when the source node is up, since we can
>>> still use the source node as a part of the read quorum.
>>>
>>> Alex, just want to make sure that I understand your point correctly. Are
>>> you suggesting this sequence of operations with TCM?
>>>
>>> * Make config changes
>>> * Do the initial data copy
>>> * Make destination part of write placements (same as source)
>>> * Start destination instance
>>> * Decommission the source
>>> * Enable reads for destination by making it part of read placements (as
>>> source)
>>>
>>> >I am also not against having this done post-factum, after the
>>> implementation of the CEP in its current form, but I think it would be
>>> good to have a good understanding of the availability and durability
>>> guarantees we want to provide with it, and have them stated explicitly,
>>> for both the "source node down" and "source node up" cases.
>>>
>>> It is a fair point. It would be good to have an understanding of the
>>> availability and durability guarantees during migration. I can create a
>>> JIRA for it later.
>>>
>>> Thanks!
>>> Hari
>>>
>>> On Thu, May 2, 2024 at 12:30 PM Alex Petrov <al...@coffeenco.de> wrote:
>>>
>>>
>>> Thank you for the input!
>>>
>>> > Would it be possible to create a new type of write target node?  The
>>> new write target node is notified of writes (like any other write node) but
>>> does not participate in the write availability calculation.
>>>
>>> We could make some kind of optional write, but unfortunately that way we
>>> cannot codify our consistency level. Since we already use a notion of
>>> pending ranges that requires 1 extra ack, and we as a community are OK
>>> with it, I think for simplicity we should stick to the same notion.
>>>
>>> If there is a lot of interest in this kind of availability/durability
>>> tradeoff, we should discuss all implications in a separate CEP, but then it
>>> probably would make sense to make it available for all operations.
>>>
>>> My personal opinion is that if we can't guarantee/rely on the number of
>>> acks, this may accidentally mislead people, as they would expect it to
>>> work, and lead to surprises when it does not.
>>>
>>> On Wed, May 1, 2024, at 4:38 PM, Claude Warren, Jr via dev wrote:
>>>
>>> Alex,
>>>
>>>  you write:
>>>
>>> We can implement CEP-40 using a similar approach: we can leave the
>>> source node as both a read and write target, and allow the new node to be a
>>> target for (pending) writes. Unfortunately, this does not help with
>>> availability (in fact, it decreases write availability, since we will have
>>> to collect 2+1 mandatory write responses instead of just 2), but increases
>>> durability, and I think helps to fully eliminate the second phase. This
>>> also increases read availability when the source node is up, since we can
>>> still use the source node as a part of the read quorum.
>>>
>>>
>>> Would it be possible to create a new type of write target node? The new
>>> write target node is notified of writes (like any other write node) but
>>> does not participate in the write availability calculation. In this way,
>>> a node that is being migrated to could receive writes while having
>>> minimal impact on the current operation of the cluster.
>>>
>>> Claude
>>>
>>>
>>>
>>> On Wed, May 1, 2024 at 12:33 PM Alex Petrov <al...@coffeenco.de> wrote:
>>>
>>>
>>> Thank you for submitting this CEP!
>>>
>>> Wanted to discuss this point from the description:
>>>
>>> > How to bring up/down Cassandra/Sidecar instances or making/applying
>>> config changes are outside the scope of this document.
>>>
>>> One advantage of doing the migration via the Sidecar is the fact that we
>>> can stream sstables to the target node from the source node while the
>>> source node is down. Also, if the source node is down, it does not matter
>>> that we can’t use it as a write target. However, if we are replacing a
>>> live node, we do lose both durability and availability during the second
>>> copy phase. There are copious other advantages described by others in the
>>> thread above.
>>>
>>> For example, we have three adjacent nodes A, B, C and simple RF 3. C
>>> (source) is up and is being replaced with live-migrated D (destination).
>>> According to the process described in CEP-40, we perform streaming in 2
>>> phases: the first one is a full copy (similar to bootstrap/replacement in
>>> Cassandra), and the second one is just a diff. The second phase is still
>>> going to take a non-trivial amount of time, and is likely to last at the
>>> very least minutes. During this time, we only have nodes A and B as both
>>> read and write targets, with no alternatives: we have to have both of
>>> them present for any operation, and losing either one of them leaves us
>>> with only one copy of the data.
>>>
>>> To contrast this, the TCM bootstrap process is 4-step: between the old
>>> owner being phased out and the new owner brought in, we always ensure r/w
>>> quorum consistency and the liveness of at least 2 nodes for the read
>>> quorum (3 nodes available for reads in the best case), and 2+1 pending
>>> replica for the write quorum, with 4 nodes (3 existing owners + 1
>>> pending) being available for writes in the best case. Replacement in TCM
>>> is implemented similarly, with the old node remaining an (unavailable)
>>> read target, but the new node already being the target for (pending)
>>> writes.
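>>>
>>> For concreteness, the quorum arithmetic behind the "2+1" above works out
>>> as in this illustrative snippet (RF = 3 assumed, as in the example):
>>>
>>> // Illustrative arithmetic only: a pending replica adds one mandatory
>>> // ack on top of the normal quorum.
>>> public class QuorumMath {
>>>     public static void main(String[] args) {
>>>         int rf = 3;
>>>         int quorum = rf / 2 + 1;  // = 2 acks for a normal QUORUM write
>>>         int pending = 1;          // the new node as a pending write target
>>>         System.out.println("normal quorum acks: " + quorum);             // 2
>>>         System.out.println("acks while pending: " + (quorum + pending)); // 3
>>>     }
>>> }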
>>>
>>> We can implement CEP-40 using a similar approach: we can leave the
>>> source node as both a read and write target, and allow the new node to be a
>>> target for (pending) writes. Unfortunately, this does not help with
>>> availability (in fact, it decreases write availability, since we will have
>>> to collect 2+1 mandatory write responses instead of just 2), but increases
>>> durability, and I think helps to fully eliminate the second phase. This
>>> also increases read availability when the source node is up, since we can
>>> still use the source node as a part of the read quorum.
>>>
>>> I think if we want to call this feature "live migration", we may want to
>>> provide similar guarantees, since this term is used in the hypervisor
>>> community to describe an instant and uninterrupted migration of an
>>> instance from one host to the other, without the guest instance being
>>> able to notice much more than the time jump.
>>>
>>> I am also not against having this done post-factum, after the
>>> implementation of the CEP in its current form, but I think it would be
>>> good to have a good understanding of the availability and durability
>>> guarantees we want to provide with it, and have them stated explicitly,
>>> for both the "source node down" and "source node up" cases. That said,
>>> since we will have to integrate CEP-40 with TCM, and will have to ensure
>>> the correctness of sstable diffing for the second phase, it might make
>>> sense to consider reusing some of the existing replacement logic from
>>> TCM. Just to make sure this is mentioned explicitly: my proposal is only
>>> concerned with the second copy phase, without any implications for the
>>> first.
>>>
>>> Thank you,
>>> --Alex
>>>
>>> On Fri, Apr 5, 2024, at 12:46 PM, Venkata Hari Krishna Nukala wrote:
>>>
>>> Hi all,
>>>
>>> I have filed CEP-40 [1] for live migrating Cassandra instances using the
>>> Cassandra Sidecar.
>>>
>>> When someone needs to move all or a portion of the Cassandra nodes
>>> belonging to a cluster to different hosts, the traditional approach of
>>> Cassandra node replacement can be time-consuming due to repairs and the
>>> bootstrapping of new nodes. Depending on the volume of stored data,
>>> replacements (repair + bootstrap) may take anywhere from a few hours to
>>> days.
>>>
>>> I am proposing a Sidecar-based solution to address these challenges. It
>>> transfers data from the old host (source) to the new host (destination)
>>> and then brings up the Cassandra process at the destination, enabling
>>> fast instance migration. This approach helps minimise node downtime, as
>>> it relies on the Sidecar for data transfer and avoids repairs and
>>> bootstrap.
>>>
>>> Looking forward to the discussions.
>>>
>>> [1]
>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-40%3A+Data+Transfer+Using+Cassandra+Sidecar+for+Live+Migrating+Instances
>>>
>>> Thanks!
>>> Hari
>>>
>>>
>>>
>>>
>>>
