Ariel Weisberg created CASSANDRA-10245:
------------------------------------------
Summary: Provide after the fact visibility into the reliability of
the environment C* operates in
Key: CASSANDRA-10245
URL: https://issues.apache.org/jira/browse/CASSANDRA-10245
Project: Cassandra
Issue Type: New Feature
Components: Core
Reporter: Ariel Weisberg
Fix For: 3.x
I think that, by default, databases should not be completely dependent on
operator-provided tools for monitoring node and network health.
The database should be able to detect and report on several dimensions of
performance in its environment, and more specifically report on deviations from
acceptable performance:
* Node-wide pauses
* JVM-wide pauses
* Latency and round-trip time to all endpoints
* Block device I/O latency
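As a rough illustration of the first two items, a jHiccup-style probe can sleep for a fixed interval and record how far past that interval it actually woke up. The sketch below is hypothetical: the class name HiccupMonitor, its methods, and the threshold are assumptions for illustration, not existing Cassandra code. A probe inside the server process cannot tell a JVM-wide pause from a node-wide one on its own, which is why a second probe outside the process is useful.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.LockSupport;

/** Hypothetical jHiccup-style pause probe; not part of Cassandra. */
final class HiccupMonitor implements Runnable {
    private final long intervalNanos;   // how long each probe asks to sleep
    private final long thresholdNanos;  // overshoot treated as a reportable pause (assumed value)
    private final AtomicLong maxOvershootNanos = new AtomicLong();
    private final AtomicLong pausesOverThreshold = new AtomicLong();
    private volatile boolean running = true;

    HiccupMonitor(long intervalNanos, long thresholdNanos) {
        this.intervalNanos = intervalNanos;
        this.thresholdNanos = thresholdNanos;
    }

    /** Records one probe that started sleeping at startNanos and woke at endNanos. */
    long record(long startNanos, long endNanos) {
        // Time spent beyond the requested sleep; early wakeups count as zero.
        long overshoot = Math.max(0L, (endNanos - startNanos) - intervalNanos);
        maxOvershootNanos.accumulateAndGet(overshoot, Math::max);
        if (overshoot >= thresholdNanos) {
            pausesOverThreshold.incrementAndGet();
        }
        return overshoot;
    }

    @Override
    public void run() {
        while (running) {
            long start = System.nanoTime();
            LockSupport.parkNanos(intervalNanos);
            record(start, System.nanoTime());
        }
    }

    void stop() { running = false; }
    long maxOvershootNanos() { return maxOvershootNanos.get(); }
    long pausesOverThreshold() { return pausesOverThreshold.get(); }
}
```

Running it on a daemon thread and logging the counters periodically would be one way to keep the data available for later analysis without alerting on it.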
If Flight Recorder (FR) were available for use in production, I would say, as a
start, just turn that on, add jHiccup (inside and outside the server process),
and add a daemon inside the server to measure network performance between
endpoints.
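The network side of this could look roughly like the sketch below: a tracker that records when a probe was sent to an endpoint, computes the round-trip time when the reply arrives, and flags samples well above a smoothed per-endpoint baseline. Everything here is an assumption for illustration (the RoundTripTracker name, string endpoint keys, the smoothing weight, the deviation factor, and one outstanding probe per endpoint); it does not describe Cassandra's messaging code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical per-endpoint round-trip tracker; names and constants are illustrative. */
final class RoundTripTracker {
    private static final double ALPHA = 0.2;  // EWMA smoothing weight (assumed)
    private final double deviationFactor;     // sample > baseline * factor counts as a deviation
    private final Map<String, Long> pendingSentNanos = new ConcurrentHashMap<>();
    private final Map<String, Double> baselines = new ConcurrentHashMap<>();

    RoundTripTracker(double deviationFactor) {
        this.deviationFactor = deviationFactor;
    }

    /** Notes that a probe was sent; assumes at most one outstanding probe per endpoint. */
    void onPingSent(String endpoint, long nowNanos) {
        pendingSentNanos.put(endpoint, nowNanos);
    }

    /** Returns true if this round trip deviates from the endpoint's baseline. */
    boolean onPongReceived(String endpoint, long nowNanos) {
        Long sent = pendingSentNanos.remove(endpoint);
        if (sent == null) {
            return false; // unsolicited or duplicate reply, nothing to measure
        }
        double rtt = nowNanos - sent;
        Double baseline = baselines.get(endpoint);
        boolean deviates = baseline != null && rtt > baseline * deviationFactor;
        // Fold the sample into the exponentially weighted moving average.
        baselines.merge(endpoint, rtt, (old, sample) -> old + ALPHA * (sample - old));
        return deviates;
    }

    /** Current smoothed round-trip time, or NaN if the endpoint has no samples yet. */
    double baselineNanos(String endpoint) {
        return baselines.getOrDefault(endpoint, Double.NaN);
    }
}
```

The first sample for an endpoint only seeds the baseline and is never flagged, which avoids reporting a deviation before there is anything to compare against.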
FR is not available (it requires a license in production), so instead we should
focus on adding instrumentation for the facets of Flight Recorder that are most
useful in diagnosing performance issues. I think we can get pretty far, because
what we need to do is not quite as undirected as the exploration that FR and
JMC (Java Mission Control) facilitate.
Until we dial in how we measure and how to signal without false positives, I
would expect this kind of logging to stay in the background for post-hoc
analysis.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)