I got some feedback last week that I should try this on Monday morning, so
let's see if we can nudge a few people into action this week.

3.0.15 and 3.11.1 are released. This is a dev list, so that shouldn't be a
surprise to anyone here - you should have seen the votes and release
notifications. The people working directly ON Cassandra every day are
probably very aware of the number and nature of fixes in those versions -
if you're not aware, the Change lists are HUGE, and some of the fixes are
VERY IMPORTANT. So this week's wrap-up is really a reflection on the size
of those two release changelogs.

One of the advantages of the Cassandra project is the size of the user base
- I don't know if we have accurate counts (and some of the "surveys" are
laughable), but we know it's on the order of thousands (probably tens of
thousands) of companies, and some huge number of instances (not willing to
speculate here, we know it's at least in the hundreds of thousands, may be
well into the millions). Historically, the best stabilizer of a release was
people upgrading their unusual use cases, finding bugs that the developers
hadn't anticipated (and therefore tests didn't exist for those edge cases),
reporting them, and the next release would be slightly better than the one
before it. The chicken/egg problem here is pretty obvious, and while a lot
of us are spending a lot of time making things better, I want to use this
email to ask a favor (in 3 parts):

1) If you haven't tried 3.0 or 3.11 yet, please spin it up on a test
cluster. 3.11 would be better, 3.0 is ok too. It doesn't need to be a
thousand node cluster, most of the weird stuff we've seen in the post-3.0
world deals with data, not cluster size. Grab some of your prod data if you
can, throw it into a test cluster, add a node/remove a node, tell us if it
doesn't work (there's a rough sketch of what I mean after this list).
2) Please run a stress workload against that test cluster, even if it's
only 5-10 minutes. The purpose here is twofold: like #1, it'll help us find
some edge cases we haven't seen before, but it'll also help us identify
holes in stress coverage. We have some tickets to add UDTs to stress (
https://issues.apache.org/jira/browse/CASSANDRA-13260 ) and LWT (
https://issues.apache.org/jira/browse/CASSANDRA-7960 ). Ideally your stress
profile should be more than "80% reads 20% writes" - try to actually model
your schema and query behavior (there's a sketch of what that can look like
after this list). Do you use static columns? Do you use collections? If
you're unable to model your use case because of a deficiency in stress,
open a JIRA. If things break, open a JIRA. If it works perfectly, I'm
interested in seeing your stress yaml and results (please send them to me
privately, don't spam the list).
3) If you're somehow not able to run stress because you don't have hardware
for a spare cluster, profiling your live cluster is also incredibly useful.
TLP has some notes on how to generate flame graphs -
https://github.com/thelastpickle/lightweight-java-profiler - and I saw one
example from a cluster that really surprised me. There are versions and use
cases that we know have been heavily profiled, but there are probably
versions and use cases where nobody's ever run much in the way of
profiling. If you're running OpenJDK in prod, and you're able to SAFELY
attach a profiler to generate some flame graphs (rough outline after this
list), please send those to me (again, privately please, I don't think the
whole list needs a copy).

My hope in all of this is to build up a corpus of real-world use cases (and
real current state via profiling) that we can leverage to make testing and
performance better going forward. If I get much in the way of response to
any of these, I'll try to send out a summary in next week's email.

- Jeff
