Hey all,

With the uptick in discussion around Cassandra operability, and after discussing potential solutions with various members of the community, we would like to propose the addition of a management process/sub-project to Apache Cassandra. The process would be responsible for common operational tasks like bulk execution of nodetool commands, backup/restore, and health checks, among others. We feel we have a proposal that will garner some discussion and debate but is likely to reach consensus.

While the community, in large part, agrees that these features should exist “in the database”, there is debate on how they should be implemented: primarily, whether to use an external process or to build on CassandraDaemon. This is an important architectural decision, but we feel the most critical aspect is not where the code runs but that the operator still interacts with the notion of a single database. Multi-process databases are as old as Postgres and continue to be common in newer systems like Druid. As such, we propose a separate management process for the following reasons:

- Resource isolation & Safety: Features in the management process will not affect C*'s read/write path, which is critical for stability. An isolated process has several technical advantages, including preventing use of unnecessary dependencies in CassandraDaemon, separation of JVM resources like thread pools and heap, and preventing bugs from adversely affecting the main process. In particular, GC tuning can be done separately for the two processes, hopefully helping to improve, or at least not adversely affect, tail latencies of the main process.
- Health Checks & Recovery: Currently users implement health checks in their own sidecar process. Implementing them in the serving process does not make sense because if the JVM running the CassandraDaemon goes south, the health checks and potentially any recovery code may not be able to run. Having a management process running in isolation opens up the possibility to not only report on the health of the C* process, such as long GC pauses or a stuck JVM, but also to recover from it. Having a list of basic health checks that are tested with every C* release and officially supported will help boost confidence in C* quality and make it easier to operate.
- Reduced Risk: By having a separate daemon we open the possibility to contribute features that otherwise would not have been considered, e.g. a UI. A library that starts many background threads and is operated completely differently would likely be considered too risky for CassandraDaemon but is a good candidate for the management process.

What can go into the management process?

- Features that are non-essential for serving reads & writes, e.g. Backup/Restore or running health checks against the CassandraDaemon.
- Features that do not make the management process critical for the functioning of the serving process. In other words, if someone does not wish to use this management process, they are free to disable it.

We would like to initially build a minimal set of features, such as health checks and bulk commands, into the first iteration of the management process. We would use the same software stack that is used to build the current CassandraDaemon binary. This would be critical for sharing code between the CassandraDaemon and management processes. The code should live in-tree to make this easy.

With regard to more in-depth features like repair scheduling and discussions around compaction in or out of CassandraDaemon, while the management process may be a suitable host, it is not our goal to decide that at this time.
The management process could be used in these cases, as they meet the criteria above, but other technical/architectural reasons may exist for why it should not be.

We are looking forward to your comments on our proposal,

Dinesh Joshi and Jordan West