Hello Josh,
thanks for reaching back; answers inline:
On 10. 03. 25 13:03, Josh McKenzie wrote:
From skimming the PR on the Spring side and the conversation there, it
looks like the argument is to have this live inside the java driver
for Cassandra instead of in the spring-boot lib which I can see the
argument for.
Yes; for us it does not really matter where the fix lives as long as
it's available to end users. Pushing it towards Cassandra has the
advantage of providing the greatest fan-out to users, even those not
consuming the driver through frameworks.
If we distill this to speak to precisely the problem we're trying to
address or improvement we're going for here, how would you phrase
that? i.e. "Take application startup from Nms down to Mms"?
Yes, optimizing startup time is the most common use-case for CRaC. It's
rather hard to provide general numbers, but the improvement should be
order(s) of magnitude. For a hello-world style Spring Boot application,
CRaC improves the startup from seconds to tens of milliseconds.
That shouldn't differ too much from the expected times for a small
micro-service, improving latency in scale-from-zero situations. This is
not limited to microservices, though; we've been experimenting with real
applications consuming hundreds of GB of memory. In that case the
application boot can be rather complex, loading and pre-processing data
from a DB etc., and the boot takes minutes or more. CRaC can restore such
an instance in a few seconds.
I ask because that's the "pro" we'll need to weigh against updating
the driver's topology map of the cluster, resource handling and
potential leaks on shutdown/startup, and the complexity of taking an
implementation like this into the driver code. Nothing insurmountable
of course, just worth weighing the two.
Can you elaborate on other use cases where the nodes are forced down,
and on what risk that brings to the overall stability? Is there a
difference between marking only a subset of nodes down and taking all of
the nodes down? When we force-close the control connection (as the first
step), is it possible to get a topology update at all and race on the
cluster members?
Thank you!
Radim
On Thu, Mar 6, 2025, at 3:34 PM, Radim Vansa wrote:
Hi all,
I would like to make applications using the Cassandra Java Driver,
particularly those built with Spring Boot, Quarkus or similar
frameworks, work with the OpenJDK CRaC project [1]. I've already created a
patch for Spring Boot [2], but the Spring folks think that these changes are
too dependent on driver internals and suggest contributing the support to
Cassandra directly.
The patch involves closing all connections before checkpoint, and
re-establishing these after restore. I have implemented that by
sending a `NodeStateEvent -> FORCED_DOWN` on the bus for all connected
nodes. As a follow-up I could develop some way to inform the session
about a new topology e.g. if the cluster addresses change.
Before jumping into implementing a PR I would like to ask what you think
is the best approach to do this. I can think of two ways:
1) Native CRaC support
The driver would have a dependency on `org.crac:crac` [3]; this is a
small (13kB) library that provides the interfaces and a dummy noop
implementation if the target JVM does not support CRaC. Then
`DefaultSession` would register a `org.crac.Resource` implementation
that would handle the checkpoint. This has the advantage of providing
the best fan-out into any project consuming the driver without any
further work.
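
To make option 1 more concrete, here is a rough sketch of the callback pattern. It is an illustration under assumptions, not the proposed patch: `CheckpointResource` and `SketchContext` are local stand-ins for `org.crac.Resource` and the CRaC global context so that the example compiles without the dependency (the real callbacks additionally take a `Context` argument and may throw, and registration goes through `Core.getGlobalContext().register(...)`). `SketchSession` and its connection counter are hypothetical and do not reflect how `DefaultSession` manages its pools.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for org.crac.Resource (assumption: simplified, no Context argument).
interface CheckpointResource {
    void beforeCheckpoint();
    void afterRestore();
}

// Stand-in for the CRaC global context: remembers resources and notifies them.
class SketchContext {
    private final List<CheckpointResource> resources = new ArrayList<>();

    void register(CheckpointResource resource) {
        resources.add(resource);
    }

    void checkpoint() {
        for (CheckpointResource r : resources) {
            r.beforeCheckpoint();
        }
    }

    void restore() {
        for (CheckpointResource r : resources) {
            r.afterRestore();
        }
    }
}

// Hypothetical session that drops its connections around a checkpoint.
class SketchSession implements CheckpointResource {
    private final int nodeCount;
    private int openConnections;

    private SketchSession(int nodeCount) {
        this.nodeCount = nodeCount;
        this.openConnections = nodeCount; // one connection per node, for simplicity
    }

    // Factory method so that registration happens after construction completes.
    static SketchSession connect(int nodeCount, SketchContext context) {
        SketchSession session = new SketchSession(nodeCount);
        context.register(session);
        return session;
    }

    @Override
    public void beforeCheckpoint() {
        openConnections = 0; // no open sockets may remain when the image is taken
    }

    @Override
    public void afterRestore() {
        openConnections = nodeCount; // re-establish connections to the known nodes
    }

    int openConnections() {
        return openConnections;
    }
}
```

The point of the pattern is that the application never calls these callbacks itself; the session registers once and the runtime drives the rest.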
2) Exposing neutral methods
To save frameworks from relying on internals, `DefaultSession` would
expose `.suspend()` and `.resume()` methods that would implement the
connection cut-off without importing any dependency. After upgrading to
the latest release, frameworks could use these methods in a way that suits
them. I wouldn't add those methods to the `CqlSession` interface (as
that would be a breaking change) but only to `DefaultSession`.
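
As a rough illustration of option 2, the neutral methods could behave like the sketch below. Everything here is an assumption made for the example: `SuspendableSessionSketch`, its `State` enum, the boolean return values and the `execute` method are hypothetical and are not part of the driver. The sketch only shows one possible contract, in which `suspend()` and `resume()` are idempotent and requests are rejected while the session is suspended.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical class for illustration; not the driver's DefaultSession.
class SuspendableSessionSketch {
    enum State { RUNNING, SUSPENDED }

    private final AtomicReference<State> state = new AtomicReference<>(State.RUNNING);
    private final int nodeCount;
    private volatile int openConnections;

    SuspendableSessionSketch(int nodeCount) {
        this.nodeCount = nodeCount;
        this.openConnections = nodeCount; // one connection per node, for simplicity
    }

    // Cuts all connections; returns true only if this call did the transition.
    boolean suspend() {
        if (!state.compareAndSet(State.RUNNING, State.SUSPENDED)) {
            return false; // already suspended, nothing to do
        }
        openConnections = 0;
        return true;
    }

    // Re-establishes connections; returns true only if this call did the transition.
    boolean resume() {
        if (!state.compareAndSet(State.SUSPENDED, State.RUNNING)) {
            return false; // already running, nothing to do
        }
        openConnections = nodeCount;
        return true;
    }

    // Rejects requests while suspended instead of queueing them.
    String execute(String query) {
        if (state.get() != State.RUNNING) {
            throw new IllegalStateException("session is suspended");
        }
        return "executed: " + query;
    }

    int openConnections() {
        return openConnections;
    }
}
```

A framework could then call `suspend()` from whatever checkpoint hook it already has and `resume()` after restore, without the driver knowing about CRaC at all.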
Would Cassandra accept either of these, to let people checkpoint
(snapshot) their applications and restore them within tens of
milliseconds? Naturally it is possible to close the session object
completely and create a new one, but the ideal solution would require no
application changes beyond dependency upgrade.
Btw. I am aware that there is an inherent race between possible topology
change and shutdown of current nodes (and I am listening for hints that
would let us prevent that), but it is reasonable to expect that users
will checkpoint the application in a quiescent state. And if the
topology update breaks the checkpoint, it is always possible to try it
again.
Thank you for your opinions and ideas!
Radim Vansa
[1] https://wiki.openjdk.org/display/crac
[2] https://github.com/spring-projects/spring-boot/pull/44505
[3] https://mvnrepository.com/artifact/org.crac/crac/1.5.0