Hi Patrick,
> attacking some of the same requirements that Graal and Quarkus are
> trying to solve
thanks for the support! Yes, Graal (and Leyden) are kind of competing
solutions for the startup problem. We're trying to hit the sweet spot
between not requiring significant redesign (as is sometimes the case
with Graal AOT) and having more bang than what a fully transparent
solution can give. Quarkus is also known for fast startup but it is more
orthogonal to CRaC - in fact Quarkus already has some support for CRaC
and the two can be aligned to get even better performance together.
> Topology information shouldn't be assumed
Is there already an automatic process that will update the topology
information on reconnect? I guess what we should prevent is the
'manual' update (forcing a node back up) overriding a fresh topology
update. Also, if there's a process invoking the driver concurrently with
the checkpoint, we might get the control connection established too early;
that's not a big problem since the checkpoint will fail and we can retry.
Thank you!
Radim
On 10. 03. 25 18:17, Patrick McFadin wrote:
Just speaking up as a supporter for considering this change. From a
userland perspective, I've been reading up on CRaC, and I see this
attacking some of the same requirements that Graal and Quarkus are
trying to solve. This is a worthy direction to pursue.
The CqlSession will need to re-connect, and I think that's worth
testing. Topology information shouldn't be assumed, especially with
something like Token-Aware Routing. Some shortcuts could speed it up,
but I can't think of any right now. I like the idea of making it
optional and putting it through some scenarios.
Patrick
On Mon, Mar 10, 2025 at 8:03 AM Radim Vansa <rva...@azul.com> wrote:
Hello Josh,
thanks for reaching back; answers inline:
On 10. 03. 25 13:03, Josh McKenzie wrote:
From skimming the PR on the Spring side and the conversation
there, it looks like the argument is to have this live inside the
java driver for Cassandra instead of in the spring-boot lib which
I can see the argument for.
Yes; for us it does not really matter where the fix lives as long
as it's available for the end users. Pushing it towards Cassandra
has the advantage of providing the greatest fan-out to users, even
those not consuming through frameworks.
If we distill this to speak to precisely the problem we're trying
to address or improvement we're going for here, how would you
phrase that? i.e. "Take application startup from Nms down to Mms"?
Yes, optimizing startup time is the most common use-case for CRaC.
It's rather hard to provide such general numbers: it should be
order(s) of magnitude. If we speak about hello-world style Spring
Boot application booting, CRaC improves the startup from seconds
to tens of milliseconds. That shouldn't differ too much from the
expected times for a small micro-service, improving latency in
scale-from-zero situations. This is not limited to microservices,
though; we've been experimenting with real applications consuming
hundreds of GB of memory. In that case the application boot can be
rather complex, loading and pre-processing data from DB etc. where
the boot takes minutes or more. CRaC can restore such an instance in
a few seconds.
I ask because that's the "pro" we'll need to weigh against
updating the driver's topology map of the cluster, resource
handling and potential leaks on shutdown/startup, and the
complexity of taking an implementation like this into the driver
code. Nothing insurmountable of course, just worth weighing the two.
Can you elaborate on other use cases where the nodes are forced
down, and what risk that brings to the overall stability? Is
there a difference between marking only a subset of nodes down and
taking all of the nodes down? When we force-close the control
connection (as the first step), is it possible to get a topology
update at all and race on the cluster members?
Thank you!
Radim
On Thu, Mar 6, 2025, at 3:34 PM, Radim Vansa wrote:
Hi all,
I would like to make applications using the Cassandra Java Driver,
particularly those built with Spring Boot, Quarkus or similar
frameworks, work with the OpenJDK CRaC project [1]. I've already created
a patch for Spring Boot [2], but the Spring folks think that these
changes are too dependent on driver internals and suggest contributing
the support to Cassandra directly.
The patch involves closing all connections before checkpoint, and
re-establishing these after restore. I have implemented that through
sending a `NodeStateEvent -> FORCED_DOWN` on the bus for all connected
nodes. As a follow-up I could develop some way to inform the session
about a new topology, e.g. if the cluster addresses change.
Before jumping into implementing a PR I would like to ask what you
think is the best approach to do this. I can think of two ways:
1) Native CRaC support
The driver would have a dependency on `org.crac:crac` [3]; this is a
small (13kB) library that provides the interfaces and a dummy no-op
implementation if the target JVM does not support CRaC. Then
`DefaultSession` would register an `org.crac.Resource` implementation
that would handle the checkpoint. This has the advantage of providing
the best fan-out into any project consuming the driver without any
further work.
2) Exposing neutral methods
To save frameworks from relying on internals, `DefaultSession` would
expose `.suspend()` and `.resume()` methods that would implement the
connection cut-off without importing any dependency. After upgrading to
the latest release, frameworks could use these methods in a way that
suits them. I wouldn't add those methods to the `CqlSession` interface
(as that would be a breaking change) but only to `DefaultSession`.
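
To make the shape of the two options a bit more concrete, here is a
rough, self-contained sketch. Everything in it is hypothetical:
`SuspendableSession` is a stand-in and not the real `DefaultSession`,
connections are modelled as plain address strings, and `CheckpointHook`
is a local interface that only loosely mirrors the before-checkpoint and
after-restore callbacks of `org.crac.Resource` (without its context
parameter), so that the example runs without any dependency.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the driver session; NOT the real DefaultSession.
class SuspendableSession {
    private final List<String> knownNodes;                      // addresses the session knows about
    private final Set<String> openConnections = new LinkedHashSet<>();
    private boolean suspended;

    SuspendableSession(List<String> knownNodes) {
        this.knownNodes = new ArrayList<>(knownNodes);
        this.openConnections.addAll(knownNodes);                // pretend every node is connected
    }

    // Option 2: close every connection so nothing is open at checkpoint time.
    synchronized void suspend() {
        openConnections.clear();
        suspended = true;
    }

    // Option 2: reconnect after restore. A real driver would also have to
    // refresh its topology here instead of trusting the old node list.
    synchronized void resume() {
        if (!suspended) {
            return;                                             // nothing to do if never suspended
        }
        openConnections.addAll(knownNodes);
        suspended = false;
    }

    // Returns a copy so callers cannot modify the internal state.
    synchronized Set<String> openConnections() {
        return new LinkedHashSet<>(openConnections);
    }
}

// Hypothetical callback pair; a framework would map this onto its own
// lifecycle, and option 1 would use org.crac.Resource instead.
interface CheckpointHook {
    void beforeCheckpoint() throws Exception;
    void afterRestore() throws Exception;
}

// Adapter wiring the neutral methods into the checkpoint lifecycle.
class SessionCheckpointHook implements CheckpointHook {
    private final SuspendableSession session;

    SessionCheckpointHook(SuspendableSession session) {
        this.session = session;
    }

    @Override
    public void beforeCheckpoint() {
        session.suspend();
    }

    @Override
    public void afterRestore() {
        session.resume();
    }
}
```

With option 1 the adapter would live inside the driver and be registered
by the session itself; with option 2 only the two neutral methods would
be in the driver and each framework would write its own adapter.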
Would Cassandra accept either of these, to let people checkpoint
(snapshot) their applications and restore them within tens of
milliseconds? Naturally it is possible to close the session object
completely and create a new one, but the ideal solution would require
no application changes beyond a dependency upgrade.
Btw. I am aware that there is an inherent race between a possible
topology change and the shutdown of current nodes (and I am listening for
hints that would let us prevent that), but it is reasonable to expect
that users will checkpoint the application in a quiescent state. And if
a topology update breaks the checkpoint, it is always possible to try
it again.
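
As an illustration of the "try it again" part, a minimal retry loop
could look like the sketch below. The names `CheckpointRetry` and
`runWithRetry` are made up for this example, and the attempt is a plain
`Callable`; in a real application it would wrap whatever triggers the
checkpoint, which is assumed to throw if a connection was re-opened by a
racing topology update.

```java
import java.util.concurrent.Callable;

// Hypothetical helper, not part of the driver or of org.crac.
class CheckpointRetry {

    // Runs the attempt up to maxAttempts times and returns the number of the
    // attempt that succeeded; rethrows the last failure if all attempts fail.
    static int runWithRetry(Callable<Void> attempt, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int i = 1; i <= maxAttempts; i++) {
            try {
                attempt.call();
                return i;
            } catch (Exception e) {
                last = e;   // remember the failure and try again
            }
        }
        throw last;
    }
}
```

A bounded number of attempts seems preferable to looping forever, since
a checkpoint that keeps failing probably means the application is not
really quiescent.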
Thank you for your opinions and ideas!
Radim Vansa
[1] https://wiki.openjdk.org/display/crac
[2] https://github.com/spring-projects/spring-boot/pull/44505
[3] https://mvnrepository.com/artifact/org.crac/crac/1.5.0