Hi Patrick,

> attacking some of the same requirements that Graal and Quarkus are trying to solve

thanks for the support! Yes, Graal (and Leyden) are kind of competing solutions for the startup problem. We're trying to hit the sweet spot between not requiring significant redesign (as is sometimes the case with Graal AOT) and having more bang than what a fully transparent solution can give. Quarkus is also known for fast startup but it is more orthogonal to CRaC - in fact Quarkus already has some support for CRaC and the two can be aligned to get even better performance together.

> Topology information shouldn't be assumed

Is there already an automatic process that will update the topology information on reconnect? I guess what we should prevent is the 'manual' update (forcing a node back up) overriding a fresh topology update. Also, if some process invokes the driver concurrently with the checkpoint, the control connection might get established too early; that's not a big problem, since the checkpoint will fail and we can retry.

Thank you!

Radim

On 10. 03. 25 18:17, Patrick McFadin wrote:

        


Just speaking up as a supporter for considering this change. From a userland perspective, I've been reading up on CRaC, and I see this attacking some of the same requirements that Graal and Quarkus are trying to solve. This is a worthy direction to pursue.

The CqlSession will need to re-connect, and I think that's worth testing. Topology information shouldn't be assumed, especially with something like Token-Aware Routing. Some shortcuts could speed it up, but I can't think of any right now. I like the idea of making it optional and putting it through some scenarios.

Patrick

On Mon, Mar 10, 2025 at 8:03 AM Radim Vansa <rva...@azul.com> wrote:

    Hello Josh,

    thanks for reaching back; answers inline:
    On 10. 03. 25 13:03, Josh McKenzie wrote:

    From skimming the PR on the Spring side and the conversation
    there, it looks like the argument is to have this live inside the
    java driver for Cassandra instead of in the spring-boot lib which
    I can see the argument for.


    Yes; for us it does not really matter where the fix lives as long
    as it's available for the end users. Pushing it towards Cassandra
    has the advantage of providing the greatest fan-out to users, even
    those not consuming the driver through frameworks.


    If we distill this to speak to precisely the problem we're trying
    to address or improvement we're going for here, how would you
    phrase that? i.e. "Take application startup from Nms down to Mms"?


    Yes, optimizing startup time is the most common use-case for CRaC.
    It's rather hard to provide such general numbers, but the
    improvement should be order(s) of magnitude. If we speak about
    booting a hello-world style Spring Boot application, CRaC improves
    the startup from seconds to tens of milliseconds. That shouldn't
    differ too much from the expected times for a small micro-service,
    improving latency in scale-from-zero situations. This is not
    limited to microservices, though; we've been experimenting with
    real applications consuming hundreds of GB of memory. In that case
    the application boot can be rather complex, loading and
    pre-processing data from a DB etc., and the boot takes minutes or
    more. CRaC can restore such an instance in a few seconds.


    I ask because that's the "pro" we'll need to weigh against
    updating the driver's topology map of the cluster, resource
    handling and potential leaks on shutdown/startup, and the
    complexity of taking an implementation like this into the driver
    code. Nothing insurmountable of course, just worth weighing the two.

    Can you elaborate on other use cases where the nodes are forced
    down, and what risk that brings to the overall stability? Is
    there a difference between marking only a subset of nodes down and
    taking all of the nodes down? When we force-close the control
    connection (as the first step), is it possible to get a topology
    update at all and race on the cluster members?

    Thank you!

    Radim



    On Thu, Mar 6, 2025, at 3:34 PM, Radim Vansa wrote:
    Hi all,

    I would like to make applications using Cassandra Java Driver,
    particularly those built with Spring Boot, Quarkus or similar
    frameworks, work with the OpenJDK CRaC project [1]. I've already
    created a patch for Spring Boot [2], but the Spring folks think
    that these changes are too dependent on driver internals and
    suggest contributing the support to Cassandra directly.

    The patch involves closing all connections before checkpoint, and
    re-establishing them after restore. I have implemented that
    through sending a `NodeStateEvent -> FORCED_DOWN` on the bus for
    all connected nodes. As a follow-up I could develop some way to
    inform the session about a new topology, e.g. if the cluster
    addresses change.

    Before jumping into implementing a PR I would like to ask what you
    think is the best approach to do this. I can think of two ways:

    1) Native CRaC support

    The driver would have a dependency on `org.crac:crac` [3]; this is
    a small (13kB) library that provides the interfaces and a dummy
    no-op implementation if the target JVM does not support CRaC. Then
    `DefaultSession` would register an `org.crac.Resource`
    implementation that would handle the checkpoint. This has the
    advantage of providing the best fan-out into any project consuming
    the driver without any further work.

    2) Exposing neutral methods

    To save frameworks from relying on internals, `DefaultSession`
    would expose `.suspend()` and `.resume()` methods that would
    implement the connection cut-off without importing any dependency.
    After upgrading to the latest release, frameworks could use these
    methods in a way that suits them. I wouldn't add those methods to
    the `CqlSession` interface (as that would be a breaking change)
    but only to `DefaultSession`.
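
A minimal sketch of how the two options above could relate, under stated assumptions. Nothing here is driver code: `SketchSession`, `NodeConnection`, `CheckpointResource` and `asResource` are hypothetical names used only for illustration. `CheckpointResource` is a simplified stand-in for `org.crac.Resource`, whose real callbacks also take a context argument and declare `throws Exception`. The session exposes dependency-free `suspend()`/`resume()` (option 2), and a thin adapter maps the checkpoint callbacks onto them (roughly what option 1 would do inside the driver).

```java
import java.util.ArrayList;
import java.util.List;

public class SuspendResumeSketch {

    // Simplified stand-in for org.crac.Resource (assumption: the real
    // callbacks also receive a context and may throw Exception).
    interface CheckpointResource {
        void beforeCheckpoint();
        void afterRestore();
    }

    // Hypothetical placeholder for one pooled connection to a node.
    static final class NodeConnection {
        final String node;
        boolean open = true;

        NodeConnection(String node) {
            this.node = node;
        }

        void close() {
            open = false;
        }
    }

    // Hypothetical session; it does not model the real DefaultSession.
    static final class SketchSession {
        private final List<String> nodes;
        private final List<NodeConnection> connections = new ArrayList<>();
        private boolean suspended;

        SketchSession(List<String> nodes) {
            this.nodes = new ArrayList<>(nodes);
            openAll();
        }

        private void openAll() {
            for (String node : nodes) {
                connections.add(new NodeConnection(node));
            }
        }

        // Option 2: cut off every connection; calling it twice is harmless.
        synchronized void suspend() {
            if (suspended) {
                return;
            }
            for (NodeConnection c : connections) {
                c.close();
            }
            connections.clear();
            suspended = true;
        }

        // Option 2: re-establish connections to the known nodes.
        synchronized void resume() {
            if (!suspended) {
                return;
            }
            openAll();
            suspended = false;
        }

        synchronized int openConnections() {
            int count = 0;
            for (NodeConnection c : connections) {
                if (c.open) {
                    count++;
                }
            }
            return count;
        }
    }

    // Option 1 in miniature: adapt the checkpoint callbacks to the
    // neutral suspend()/resume() methods.
    static CheckpointResource asResource(final SketchSession session) {
        return new CheckpointResource() {
            @Override
            public void beforeCheckpoint() {
                session.suspend();
            }

            @Override
            public void afterRestore() {
                session.resume();
            }
        };
    }
}
```

With this shape a framework could either call `suspend()`/`resume()` from its own lifecycle hooks or register the adapter with whatever checkpoint mechanism it uses. The sketch reconnects to the same node list on resume and does not address a topology that changed in between.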

    Would Cassandra accept either of these, to let people checkpoint
    (snapshot) their applications and restore them within tens of
    milliseconds? Naturally it is possible to close the session object
    completely and create a new one, but the ideal solution would
    require no application changes beyond a dependency upgrade.

    Btw. I am aware that there is an inherent race between possible
    topology change and shutdown of current nodes (and I am listening
    for hints that would let us prevent that), but it is reasonable to
    expect that users will checkpoint the application in a quiescent
    state. And if the topology update breaks the checkpoint, it is
    always possible to try it again.
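
The "try it again" fallback can be sketched as a bounded retry loop. This is only an illustration: `checkpointWithRetry` is a hypothetical helper, and the attempt is modelled as a callback returning `false` on failure, whereas a real CRaC runtime would signal a failed checkpoint with an exception.

```java
import java.util.function.BooleanSupplier;

public class CheckpointRetrySketch {

    // Runs the attempt until it reports success or maxAttempts is used up.
    // Returns the number of attempts that were needed.
    static int checkpointWithRetry(BooleanSupplier attempt, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        for (int i = 1; i <= maxAttempts; i++) {
            if (attempt.getAsBoolean()) {
                return i;
            }
        }
        throw new IllegalStateException(
            "checkpoint failed after " + maxAttempts + " attempts");
    }
}
```

A caller would typically wait for the application to become quiescent again between attempts; the sketch leaves that out.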

    Thank you for your opinions and ideas!

    Radim Vansa


    [1] https://wiki.openjdk.org/display/crac

    [2] https://github.com/spring-projects/spring-boot/pull/44505

    [3] https://mvnrepository.com/artifact/org.crac/crac/1.5.0

