[jira] [Commented] (CASSANDRA-15564) Refactor repair coordinator so errors are consistent

David Capwell (Jira) Thu, 27 Feb 2020 15:54:23 -0800


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-15564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17047065#comment-17047065
 ]


David Capwell commented on CASSANDRA-15564:
-------------------------------------------

Thanks [~bdeggleston]!

* I am good with exception over Either, ill make that change soon.
* Retry has come up for me in the context of tests, so I don't mind moving it 
into a test package.  I see test/unit and test/distributed are in the same 
compile context (unit classes invoke distributed classes), so I should be able 
to put this in test/unit
* "Usage of final" yeah, I added them because [~djoshi] asked me to. [~djoshi] 
thoughts?

Clarification:

bq. For example, initial table filtering could have its own method, calculating 
neighbors and common ranges could have it's own method, etc, etc. This would 
make runMayThrow a lot easier to understand.

so we are on the same page, you are saying something like the below example?

{code}
void run() 
{
  try 
  {
    // all methods rely on mutable state, so update fields
    // setup
    validateInput();
    calculateNeighbors();
    groupCommonRanges();
   ...
    // start
    repair().join();
  }
  catch (Exception e)
 {
    failure(e);
 }
}
{code}

I am ok with that.  I mostly did it this way for two reasons

1) attempt to remove mutable state (tracing was a problem, didn't go as well as 
I hoped...)
2) group based off "life-cycle", the core reason for this was to make it clear 
when we do a state change so the parent JIRA could track it.

bq. I don't understand your comment "Why is this before start when its 
publishing the start event? For backwards compatability (...)" what's incorrect 
about logging that we're starting the repair command there?

Sorry, its bace to the "life-cycle" bit.  I personally classified all the logic 
there as "setup" and have start happening when we call 
SystemDistributedKeyspace.startParentRepair and 
ActiveRepairService.instance.prepareForRepair.  In the other JIRA (this code is 
extracted from that one to make it smaller), I keep track of each time we 
transition between life cycle phases and I classified those two methods as the 
actual time we do "start".

I moved the event/log into the start method, but found out that python dtest 
was failing because it expected a different order, by moving it back here the 
tests were passing again. So it was more "why do we send the START notification 
before we are in the START phase?", so was trying to call out to any future 
reader that this is there at that point in the code for compatibility; mostly 
python dtests.



> Refactor repair coordinator so errors are consistent
> ----------------------------------------------------
>
>                 Key: CASSANDRA-15564
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15564
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Consistency/Repair
>            Reporter: David Capwell
>            Assignee: David Capwell
>            Priority: Normal
>              Labels: pull-request-available
>          Time Spent: 10h 50m
>  Remaining Estimate: 0h
>
> This is to split the change in CASSANDRA-15399 so the refactor is isolated 
> out.
> Currently the repair coordinator special cases the exit cases at each call 
> site; this makes it so that errors can be inconsistent and there are cases 
> where proper complete isn't done (proper notifications, and forgetting to 
> update ActiveRepairService).
> [Circle 
> CI|https://circleci.com/gh/dcapwell/cassandra/tree/bug%2FrepairCoordinatorJmxConsistency]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (CASSANDRA-15564) Refactor repair coordinator so errors are consistent

Reply via email to