This does create a strange incentive structure, though: we currently (even with 1-3m patch cycles) don't fix every bug with every release. So if we cannot commit a bug fix without corresponding sufficient test coverage, then we are potentially reducing the number of bug fixes we incorporate in a release, even when we could have fixed the bug easily (via a volunteer, or community provision, or whatever). Of course, without good coverage we arguably don't know whether those bug fixes achieve a net benefit, but the disincentive is there.
> I have never seen bugs as something you volunteer for. They typically belong somewhere and if it is with you then so be it.

The thing is, we have a very self-organising approach to ticket delivery. So, while major show-stopping bugs get routed to the best person and prioritized, for the rest there is a sliding scale of perceived importance, and the time available to the best placed individual affects the likelihood of the bug being addressed by them in the (or any) release window. I'm not sure how much others have done this, but I have found myself volunteering for bug fixes: sometimes just as a distraction, other times for a sense of progress when working on something lengthier or frustrating, or sometimes just because I have seen someone languishing in want of a quick fix.

Which leads me on to the other strange incentive of this aspect of the new system: the *dis*incentive to widen the scope of your perceived competencies. If bug fixing is exclusively a chore (and a significant time sink), and bugs are routed to people with competencies in a certain area, then there is a strong disincentive to be seen as able in areas outside your "core" competencies, since it is going to eat into the time you have to deliver in the areas you are largely measured on. It is simply exposing yourself to a downside risk, without much upside.

Spreading competencies around the community is a really positive thing, though, in ensuring the long term health of the project. Diving into quick fixes has given me a much wider view of the codebase than I would otherwise likely have had, and I would hate to see that discouraged.

On Mon, Apr 13, 2015 at 5:37 PM, Ariel Weisberg <ariel.weisb...@datastax.com> wrote:

> Hi Benedict,
>
> > This only requires unit testing or dtests to be run this way. However for the kitchen sink tests this is just another dimension in the configuration state space, which IMO should be addressed as a whole methodically. Perhaps we should file a central JIRA, or the Google doc you suggested, for tracking all of these data points?
>
> I created a doc <https://docs.google.com/a/datastax.com/document/d/1kccPqxEAoYQpT0gXnp20MYQUDmjOrakAeQhf6vkqjGo/edit?usp=sharing> that is requirements, but not implementation. I want to list things we would like it to test in the general sense, as well as enumerating specific bugs that it should have been able to catch.
>
> > This does raise an interesting, but probably not significant, downside to the new approach: I fixed this ticket because somebody mentioned to me that it was hurting them, and I saw a quick and easy fix. The testing would not be quick and easy, so I am unlikely to volunteer to patch quick fixes in the new world order. This will certainly lead to higher quality bug fixes, but it may lead to fewer of them, and fewer instances of volunteer work to help people out, because the overhead eats too much into the work you're actually responsible for. This may lead to bug fixing being seen as much more of a chore than it already can be. I don't say this to discourage the new approach; it is just a thought that occurs to me off the back of this specific discussion.
>
> It's a real problem. People doing bug fixes can be stuck spending months doing nothing but that and writing tests to fill in coverage. Then they get unhappy and unproductive.
>
> One of the reasons I leave the option of filing a JIRA open, instead of saying that they have to do something, is that it gives assignees and reviewers the option to have the work done later or by someone else. The person who is scheduling releases can see the test issues before release (you would set fix version for the next release). It's still not done and the release is not done. That puts pressure on the person who wants to release to make sure it is in someone's queue.
>
> If you are hardcore agile and doing one or two week sprints, what happens is that there are no tickets left in the sprint other than what was agreed on at the planning meeting, and people will have no choice but to work on test tasks. How we manage and prioritize tasks right now is "magic" to me, and maybe not something that scales down to monthly releases.
>
> For monthly releases, on at least a weekly basis you need to know what stands between you and the release being done, and you need a plan for who is going to take care of the blockers that crop up.
>
> > The testing would not be quick and easy, so I am unlikely to volunteer to patch quick fixes in the new world order.
>
> I think this gets into how we load balance bug fixes. There is a clear benefit to routing the bug to the person who will know how to fix and test it. I have never seen bugs as something you volunteer for. They typically belong somewhere and if it is with you then so be it.
>
> > because the overhead eats too much into the work you're actually responsible for.
>
> We need to make sure that bug fixing isn't seen that way. I think it's important to make sure bugs find their way home. The work you're actually responsible for is not done, so you can't claim that bug fixes are eating into it. It already done been ate.
>
> We shouldn't prioritize new work over past work that was never finished. With monthly releases, and breaking things down into much smaller chunks, you have the option to let new work slip to accommodate, without moving tasks between people.
>
> Ariel
>
> On Fri, Apr 10, 2015 at 7:07 PM, Benedict Elliott Smith <belliottsm...@datastax.com> wrote:
>
> > > CASSANDRA-8459 <https://issues.apache.org/jira/browse/CASSANDRA-8459> "autocompaction" on reads can prevent memtable space reclaimation
> > >
> > > Can you link a ticket to CASSANDRA-9012 and characterize, in a way we can try and implement, how to make sufficiently large partitions over sufficiently large periods of time? Maybe also enumerate the other permutations where this matters, like secondary indexes and the access patterns (scans).
> >
> > Does this really qualify for its own ticket? This should just be one of many configurations for stress' part in the new tests. We should perhaps have an aggregation ticket where we ensure we enumerate the configuration data points we've met that need to be covered. But, IMO at least, a methodical exhaustive approach should be undertaken separately, and only be corroborated against such a list to ensure it was done sufficiently well.
> >
> > > CASSANDRA-8619 <https://issues.apache.org/jira/browse/CASSANDRA-8619> - using CQLSSTableWriter gives ConcurrentModificationException
> > >
> > > OK. I don't think the original fix meets our new definition of done, since there was insufficient coverage, and in this case no regression test. To be done you would have to either implement the coverage or file a JIRA to add it.
> > > Can you file a ticket with as much detail as you can on what the test might look like, and link it to CASSANDRA-9012?
> >
> > Well, the goal posts have shifted a smidgen since then :)
> >
> > I've already filed CASSANDRA-9163 and CASSANDRA-9164 (the former I have linked to CASSANDRA-9012). These problems would trivially be caught by basically any kind of randomized long testing of these utilities.
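To make the randomized long testing idea concrete, something along these lines against the CQLSSTableWriter builder API would likely have surfaced these; the schema, directory, iteration count, and helper below are illustrative only, not the actual tests filed:

    import java.io.File;
    import java.util.Random;
    import org.apache.cassandra.io.sstable.CQLSSTableWriter;

    public class CQLSSTableWriterLongTest
    {
        public static void main(String[] args) throws Exception
        {
            // Log the seed so any failure can be replayed deterministically.
            long seed = args.length > 0 ? Long.parseLong(args[0]) : System.nanoTime();
            System.out.println("seed=" + seed);
            Random random = new Random(seed);

            CQLSSTableWriter writer = CQLSSTableWriter.builder()
                .inDirectory(new File("/tmp/ks/tbl")) // assumed to exist
                .forTable("CREATE TABLE ks.tbl (k int PRIMARY KEY, v text)")
                .using("INSERT INTO ks.tbl (k, v) VALUES (?, ?)")
                .build();

            // Grind for a long time on random data; intermittent bugs like
            // the CME in 8619 eventually surface under enough iterations.
            for (int i = 0; i < 10_000_000; i++)
                writer.addRow(random.nextInt(), randomText(random));
            writer.close();
        }

        private static String randomText(Random random)
        {
            char[] chars = new char[1 + random.nextInt(100)];
            for (int i = 0; i < chars.length; i++)
                chars[i] = (char) ('a' + random.nextInt(26));
            return String.valueOf(chars);
        }
    }

The important property is the logged seed: a failure after hours of grinding then becomes a reproducible test case rather than an anecdote.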
> >
> > This does raise an interesting, but probably not significant, downside to the new approach: I fixed this ticket because somebody mentioned to me that it was hurting them, and I saw a quick and easy fix. The testing would not be quick and easy, so I am unlikely to volunteer to patch quick fixes in the new world order. This will certainly lead to higher quality bug fixes, but it may lead to fewer of them, and fewer instances of volunteer work to help people out, because the overhead eats too much into the work you're actually responsible for. This may lead to bug fixing being seen as much more of a chore than it already can be. I don't say this to discourage the new approach; it is just a thought that occurs to me off the back of this specific discussion.
> >
> > > CASSANDRA-8668 <https://issues.apache.org/jira/browse/CASSANDRA-8668> We don't enforce offheap memory constraints; regression introduced by 7882
> > >
> > > We need to note somewhere that the kitchen sink test needs to insert large columns. How would it detect that the constraint was violated?
> >
> > It would fall over with an OOM.
> >
> > > I am starting to think we need a google doc for kitchen sink test wish listing and design discussion rather than scattering bits about it in JIRA.
> >
> > Agreed.
> >
> > > CASSANDRA-8719 <https://issues.apache.org/jira/browse/CASSANDRA-8719> Using thrift HSHA with offheap_objects appears to corrupt data
> > >
> > > Can you file a ticket for having the kitchen sink tests be configurable to run against all client access paths? Linked to 9012 for now?
> >
> > This only requires unit testing or dtests to be run this way. However for the kitchen sink tests this is just another dimension in the configuration state space, which IMO should be addressed as a whole methodically. Perhaps we should file a central JIRA, or the Google doc you suggested, for tracking all of these data points?
> >
> > > CASSANDRA-8726 <https://issues.apache.org/jira/browse/CASSANDRA-8726> throw OOM in Memory if we fail to allocate OOM
> > >
> > > Can you create a ticket for this? I think that testing each allocation is not realistic, in the sense that they don't fail in isolation. The JVM itself can ruin our day in OOM conditions as well. There is also heap OOM vs native memory OOM. It's worth some thought as to what the best bang for the buck testing strategy is going to be.
> >
> > That's a bit of a different scope to the original problem, since in those instances the VM explicitly throws an OOM. We can fault injection test both of these scenarios, though, and I've already filed CASSANDRA-9165 for this. I have commented on the ticket so that these scenarios are amongst those explicitly considered when we address it, but I expect the scope of that ticket to be very broad, and probably to introduce its own entire class of subtickets.
> >
> > > Thanks,
> > > Ariel
> > >
> > > On Fri, Apr 10, 2015 at 8:04 AM, Benedict Elliott Smith <belliottsm...@datastax.com> wrote:
> > >
> > > > TL;DR: "Kitchen sink" (aggressive randomised stress with subsystem correctness) tests; commitlog/memtable isolated correctness stress testing; improved tool/utility testing; internal structural changes to prevent occurrence (delivered); fault injection testing. Filed #916[1-5]
> > > >
> > > > CASSANDRA-7704 <https://issues.apache.org/jira/browse/CASSANDRA-7704> Benedict FileNotFoundException during STREAM-OUT triggers 100% CPU usage / Streaming
> > > >
> > > > This particular class of bug should be near impossible, due to structural changes beginning with 7705. For testing such an uncommon race condition, we would hope it to be exhibited eventually by our kitchen sink aggressive testing, but it would be a very uncommon event.
> > > >
> > > > CASSANDRA-8383 <https://issues.apache.org/jira/browse/CASSANDRA-8383> Benedict Memtable flush may expire records from the commit log that are in a later memtable / No regression test, no follow up ticket. Could/should this have been reproducible as an actual bug?
> > > >
> > > > As stated on the ticket, we need to introduce rigorous randomized testing of the commit log's correctness, both in isolation and in conjunction with memtable flushing. This is not a trivial undertaking. Whether or not it integrates with our kitchen sink tests is an open question, but I think that might be difficult. I've filed #9162 to track this.
> > > >
> > > > CASSANDRA-8429 <https://issues.apache.org/jira/browse/CASSANDRA-8429> Benedict Some keys unreadable during compaction
> > > >
> > > > Running stress in CI would have caught this, and we're going to do that.
> > > >
> > > > CASSANDRA-8459 <https://issues.apache.org/jira/browse/CASSANDRA-8459> Benedict "autocompaction" on reads can prevent memtable space reclaimation
> > > >
> > > > Kitchen sink tests with sufficiently large partitions, written over a sufficiently large period of time. The same risk is present for e.g. secondary indexes, so aggressive coverage of these, including scans etc, is important.
> > > >
> > > > CASSANDRA-8499 <https://issues.apache.org/jira/browse/CASSANDRA-8499> Benedict Ensure SSTableWriter cleans up properly after failure / Testing error paths? Any way to test things in a loop to detect leaks?
> > > >
> > > > This kind of leak is now reported, and autocorrected for, so detection is much easier. However, fault injection testing (if we can find a good way for license compliance), as I started in CASSANDRA-8568, would also help a lot.
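As a sketch of what "reported, and autocorrected for" means here: each resource handle registers a phantom reference, and a reaper thread both reports and closes anything that gets GC'd without being released. The class below is a simplified, hypothetical analogue of that pattern, not the actual Ref/RefCounted implementation:

    import java.lang.ref.PhantomReference;
    import java.lang.ref.ReferenceQueue;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    public final class Ref<T>
    {
        private static final ReferenceQueue<Object> QUEUE = new ReferenceQueue<>();
        // Keeps each State strongly reachable until released or reaped.
        private static final Set<State> LIVE = ConcurrentHashMap.newKeySet();

        // Lives independently of the Ref, so cleanup can run after it is GC'd.
        private static final class State extends PhantomReference<Ref<?>>
        {
            final Runnable onClose; // captures and closes the resource
            volatile boolean released;

            State(Ref<?> ref, Runnable onClose)
            {
                super(ref, QUEUE);
                this.onClose = onClose;
            }
        }

        private final State state;

        public Ref(T resource, Runnable onClose)
        {
            this.state = new State(this, onClose);
            LIVE.add(state);
        }

        public void release()
        {
            state.released = true;
            LIVE.remove(state);
            state.onClose.run();
        }

        static
        {
            Thread reaper = new Thread(() ->
            {
                while (true)
                {
                    try
                    {
                        State state = (State) QUEUE.remove();
                        if (!state.released)
                        {
                            // GC'd without release(): report the leak, then
                            // correct for it by closing the resource anyway.
                            System.err.println("LEAK DETECTED");
                            LIVE.remove(state);
                            state.onClose.run();
                        }
                    }
                    catch (InterruptedException e)
                    {
                        return;
                    }
                }
            }, "ref-leak-reaper");
            reaper.setDaemon(true);
            reaper.start();
        }
    }

A leak test can then run allocation in a loop and fail if any LEAK DETECTED line appears, which is what makes this class of bug so much easier to catch now.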
> > > >
> > > > CASSANDRA-8513 <https://issues.apache.org/jira/browse/CASSANDRA-8513> Benedict SSTableScanner may not acquire reference, but will still release it when closed / This had a user visible component; what test could have caught it before release?
> > > >
> > > > Again, this cannot happen now, due to internal structural changes to prevent it.
> > > >
> > > > CASSANDRA-8619 <https://issues.apache.org/jira/browse/CASSANDRA-8619> Benedict using CQLSSTableWriter gives ConcurrentModificationException
> > > >
> > > > Some better testing of our tools and utilities. The fix for this introduced its own bug, by the looks of it, which we also did not catch. Better (randomized long testing) coverage of these tools would help both in fixing it and in ensuring it doesn't return again.
> > > >
> > > > CASSANDRA-8632 <https://issues.apache.org/jira/browse/CASSANDRA-8632> Benedict cassandra-stress only generating a single unique row
> > > >
> > > > This was caught prior to release by developer use, which is currently the only QA we have for stress. Some basic testing would certainly be helpful, but there is a tension between getting stress to do useful things and testing that it does so, since there are finite resources available to us. The utility is currently probably more pressing, given the eyes it gets when it is used. With more complex validation arriving, in conjunction with performance profile histories and its generally being employed as a dev tool, it should somewhat self-test (major changes in performance profiles should be explicable or else investigated, and critical mistakes should often lead to failed validation, or to users noticing a problem), and I expect this will have to suffice for the interim.
> > > >
> > > > CASSANDRA-8668 <https://issues.apache.org/jira/browse/CASSANDRA-8668> Benedict We don't enforce offheap memory constraints; regression introduced by 7882
> > > >
> > > > This would have been easily found with a kitchen sink test that was inserting large columns. We should probably also have some specific tests ensuring the allocation tracking is exactly correct (by inspecting the whole object graph independently, and reconciling the values), but this is fiddly and of low immediate yield.
> > > >
> > > > CASSANDRA-8719 <https://issues.apache.org/jira/browse/CASSANDRA-8719> Benedict Using thrift HSHA with offheap_objects appears to corrupt data
> > > >
> > > > *Untested configuration before release, this would be straightforward if we ran with it?*
> > > >
> > > > Spot on.
> > > >
> > > > CASSANDRA-8726 <https://issues.apache.org/jira/browse/CASSANDRA-8726> Benedict throw OOM in Memory if we fail to allocate OOM
> > > >
> > > > Kind of tricky to induce an OOM; in general we consider an OOM to put C* into an unstable state as well, so correct behaviour is just to shut down, making it potentially tricky to test all avenues that could throw OOM. Possibly the best route is to modify the byte code to corrupt the return value to zero for each possible avenue we can reach it by, and confirm that shutdown occurs safely.
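For a picture of the check the bytecode route would perform, here is a seam-based approximation of the same idea: force the allocation path to return zero (malloc's failure value) and assert that it surfaces as an OOM rather than a zero peer being used. All names below are hypothetical, not the actual Memory API:

    import java.util.concurrent.atomic.AtomicBoolean;

    public class AllocationFaultInjectionTest
    {
        // Test hook: when set, the next allocation behaves as a failed malloc.
        static final AtomicBoolean FAIL_NEXT_ALLOCATION = new AtomicBoolean();

        // Stand-in for the real allocation path (e.g. a native malloc call).
        static long allocate(long bytes)
        {
            long peer = FAIL_NEXT_ALLOCATION.getAndSet(false)
                      ? 0                        // injected failure
                      : fakeNativeMalloc(bytes); // normally the native call
            if (peer == 0)
                throw new OutOfMemoryError("native allocation of " + bytes + " bytes failed");
            return peer;
        }

        static long fakeNativeMalloc(long bytes)
        {
            return 0xDEADBEEFL; // pretend success, for the sketch only
        }

        public static void main(String[] args)
        {
            FAIL_NEXT_ALLOCATION.set(true);
            try
            {
                allocate(1024);
                throw new AssertionError("expected OutOfMemoryError on injected failure");
            }
            catch (OutOfMemoryError e)
            {
                // Expected: the failure surfaced as an OOM. The real test would
                // go on to verify that a clean shutdown follows from here.
                System.out.println("OK: " + e.getMessage());
            }
        }
    }

The bytecode route described above would perform the same zeroing at every reachable call site (e.g. via an agent), covering paths a hand-placed hook cannot, and then confirm the node shuts down safely rather than limping on with a corrupt peer.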