Hi Alex,

I'm afraid I find the updated doc quite confusing - I think it needs work to clarify this new distinction between replayable and resumable, which I don't understand. What's the difference between those concepts?

The first mention of "resumable" in the doc (at the start of section "Common Replay Settings") is
"Most of the time, there are just a few tweaks to `resumable` and `replayable` needed to let Apache Brooklyn do the right thing to replay correctly."

But there has been no mention of "resumable" up to that point, and no explanation of what it means or where and how it is used.

Additionally, in the section "Advanced: Replay/Resume Settings", under the bullet for Replayable is the option "automatically"; but in the 2nd bullet, where it documents "from here", it talks about "resumable: automatically". Which is it? Or is it both? In the example on p. 31 it comments "step type is not resumable so replay will not be permitted here" and so sets "replayable: no". Is a separate concept needed here?

There are also a couple of mentions of something called "retry replay". The first (p. 26) talks about "a `retry replay` step" without really defining it; the second, in the explanation of (step-level) "replayable: from here", talks about "any 'retry replay' or 'resumable: automatically' handler", again without explanation.

Up until now I felt I sort of understood the proposals, but I now feel I'm lost, and that the document has become quite hard to follow. At the very least we are going to need to make sure that the documentation that we put into brooklyn-docs for this is very carefully and thoughtfully written. Workflows are going to be very powerful, but if they're not well documented they're going to be very hard to understand. It might turn out you're the only person who knows how to write them!

Cheers,
Geoff

On Mon, 21 Nov 2022 at 15:48, Alex Heneveld <a...@cloudsoft.io> wrote:
>
> Following some feedback re "replayable" I've rejigged that section [1]. It changes the concepts to consider whether steps are "resumable" as a separate idea to noting explicit "replayable" waypoints, with in most cases either able to allow a workflow to be replayed, e.g. if a resumable step is interrupted (such as a sleep, but not for instance an http call) or if the workflow author indicated that a completed step was "replayable here".
>
> Thanks to those who gave their input! I am much happier with this rejigged plan. More comments are welcome!
>
> Best
> Alex
>
> [1] https://docs.google.com/document/d/1u02Bi6sS8Fkf1s7UzRRMnvLhA477bqcyxGa0nJesqkI/edit#heading=h.63aibdplmze
>
> On Fri, 18 Nov 2022 at 13:44, Alex Heneveld <a...@cloudsoft.io> wrote:
> >
> > Hi team,
> >
> > I've got most of the proposed "lock" implementation completed, as discussed in the previous mail (below); PR to come shortly. It works well, though there are a huge number of subtleties, so test cases were a challenge; it makes it all the better that we provide this, however, so workflow authors have a much easier time. The biggest challenge was to make sure that if an author writes e.g. `{ type: workflow, lock: my-lock, on-error: [ retry ], steps: [ ... ] }`, if Brooklyn does a failover the retry (at the new server) preserves the lock ... but if there is no retry, or the retry fails, the lock is cleared.
> >
> > As part of this I've been mulling over "replayable"; as I mentioned below it was one aspect I'm not entirely sure of, and it's quite closely related to "expiration", which I think might be better described as "retention". I think I've got a better way to handle those, and a tweak to error handling. It's described in this section:
> >
> > https://docs.google.com/document/d/1u02Bi6sS8Fkf1s7UzRRMnvLhA477bqcyxGa0nJesqkI/edit#heading=h.63aibdplmze
> >
> > There are two questions towards the end that I'd especially value input on.
> >
> > Thanks
> > Alex
> >
> > On Fri, 11 Nov 2022 at 13:40, Alex Heneveld <a...@cloudsoft.io> wrote:
> >>
> >> Hi team,
> >>
> >> Workflow is in a pretty good state -- nearly done mostly as per proposal, with nearly all step types, retries, docs [1], and integrated into the activities view in the inspector. My feeling is that it radically transforms how easy it is to write sensors and effectors.
> >>
> >> Thanks to everyone for their reviews and feedback!
> >>
> >> The biggest change from the original proposal is a switch to the list syntax from the init.d syntax. Extra thanks to those who agitated for that.
> >>
> >> The other really nice aspect is how the shorthand step notation functions as a DSL in simple cases, so extra thanks for the suggestion to make it DSL-like.
> >>
> >> The two items remaining are nested workflows and controlling how long workflows are remembered.
> >>
> >> There is one new feature which seems to be needed, which I wanted to raise. As the subject suggests, this is mutexes. I had hoped we wouldn't need this, but actually quite often you want to ensure no other workflows are conflicting. Consider the simple case where you want to atomically increment a sensor:
> >>
> >> ```
> >> - let i = ${entity.sensor.count} ?? 0
> >> - let i = ${i} + 1
> >> - set-sensor count = ${i}
> >> ```
> >>
> >> Running this twice we'd hope to get count=2. But if the runs are concurrent you won't. So how can we ensure no other instances of the workflow are running concurrently?
> >>
> >> There are three options I see.
> >>
> >> (1) set-sensor allows arithmetic
> >>
> >> We could support arithmetic on set-sensor and require it to run atomically against that sensor. For instance `set-sensor count = (${entity.sensor.count} ?? 0) + 1`. We could fairly easily ensure the RHS is evaluated with the caller holding the lock on the sensor count. However our arithmetic support is quite limited (we don't support grouping at present, so either you'd have to write `${entity.sensor.count} + 1 ?? 1` or we'd beef that up), and I think there is something nice about the present arrangement, where arithmetic is only allowed on `let`, so it is more inspectable.
> >>
> >> (2) set-sensor with mutex check
> >>
> >> We could introduce a check which is done while the lock on the sensor is held. So you could check the sensor is unset before setting it, and fail if it isn't, or check the value is as expected. You can then set up whatever retry behaviour you want in the usual way. For instance:
> >>
> >> ```
> >> # pessimistic locking
> >> - let i = ${entity.sensor.count} ?? 0
> >> - let j = ${i} + 1
> >> - step: set-sensor count = ${j}
> >>   require: ${i}
> >>   on-error:
> >>   - goto start
> >> - clear-sensor lock-for-count
> >> ```
> >>
> >> ```
> >> # mutex lock acquisition
> >> - step: set-sensor lock-for-count = ${workflow.id}
> >>   require:
> >>     when: absent
> >>   on-error:
> >>   - retry backoff 50ms increasing 2x up to 5s
> >> - let i = ${entity.sensor.count} ?? 0
> >> - let i = ${i} + 1
> >> - set-sensor count = ${i}
> >> - clear-sensor lock-for-count
> >> ```
> >>
> >> (A couple of subtleties for those of you new to the workflow conditions: they always have an implicit target depending on context, which for `require` we would make be the old sensor value. "when: absent" is a special predicate DSL keyword to say that a sensor is unavailable (you could also use `when: absent_or_null` or `when: falsy` or `not: { when: truthy }`). Finally `require: ${i}` uses the fact that conditions default to being an equality check. That call is equivalent to `require: { target: ${entity.sensor.count}, equals: ${i} }`.)
> >>
> >> The retry with backoff is pretty handy here.
> >> But there is still one problem: in the lock acquisition case, if the workflow fails after step 1, what will clear the lock? (Pessimistic locking doesn't have this problem.) Happily we have an easy solution, because workflows were designed with recovery in mind. If Brooklyn detects an interrupted workflow on startup, it will fail it with a "dangling workflow" exception, and you can attach recovery to it, specifying replay points and making steps idempotent.
> >>
> >> ```
> >> # rock-solid mutex lock acquisition
> >> steps:
> >> - step: set-sensor lock-for-count = ${workflow.id}
> >>   require:
> >>     any:
> >>     - when: absent
> >>     - equals: ${workflow.id}
> >>   on-error:
> >>   - retry backoff 50ms increasing 2x up to 5s
> >> - let i = ${entity.sensor.count} ?? 0
> >> - let i = ${i} + 1
> >> - step: set-sensor count = ${i}
> >>   replayable: yes
> >> - clear-sensor lock-for-count
> >> on-error:
> >> - condition:
> >>     error-cause:
> >>       glob: *DanglingWorkflowException*
> >>   step: retry replay
> >>   replayable: yes
> >> ```
> >>
> >> The `require` block now allows re-entrancy. We rely on the fact that Brooklyn gives workflow instances a unique ID, and on failover Dangling is thrown from an interrupted step preserving the workflow ID (but giving it a different task ID so replays can be distinguished, with support for this in the UI), and Brooklyn persistence handles election of a single primary, with any demoted instance interrupting its tasks. The workflow is replayable from the start, and on Dangling it replays. Additionally we can replay from the `set-sensor` step, which will use local copies of the workflow variables, so if that step is interrupted and runs twice it will be idempotent.
> >>
> >> (3) explicit `lock` keyword on `workflow` step
> >>
> >> My feeling is that (2) is really powerful, and the pessimistic locking case is easy enough, but the mutex implementation is hard. We could make it easy to opt in to by allowing sub-workflow blocks to specify a mutex.
> >>
> >> ```
> >> - step: workflow lock incrementing-count
> >>   steps:
> >>   - let i = ${entity.sensor.count} ?? 0
> >>   - let i = ${i} + 1
> >>   - step: set-sensor count = ${i}
> >>     replayable: yes
> >> ```
> >>
> >> This would act exactly as the previous example: setting the workflow.id into a sensor ahead of time, allowing absent or the current workflow id as the value for re-entrancy, retrying if locked, clearing it after, and handling the Dangling exception. The only tiny differences are that the sensor it atomically sets and clears would be called something like `workflow-lock-incrementing-count`, and the steps would be running in a sub-workflow.
> >>
> >> In this example we still need to say that the final `set-sensor count` is a replay point, otherwise if Brooklyn were interrupted in the split-second after setting the sensor but before recording the fact that the step completed, it would retry from the start, causing a slight chance the sensor increments by 2. This isn't to do with the fact a mutex is wanted, however; it's because fundamentally adding one is not an idempotent operation! Provided the steps are idempotent there would be no need for it.
> >>
> >> For instance, to make sure that `apt` (or `dpkg` or `terraform` or any command which requires a lock) isn't invoked concurrently, you could use it like this:
> >>
> >> ```
> >> type: workflow-effector
> >> name: apt-get
> >> parameters:
> >>   package:
> >>     description: package(s) to install
> >> steps:
> >> - step: workflow lock apt
> >>   steps:
> >>   - ssh sudo apt-get install ${package}
> >> ```
> >>
> >> Parallel invocations of effector `apt-get { package: xxx }` will share a mutex on sensor `workflow-lock-apt`, ensuring they don't conflict.
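[Editor's note: the acquire-if-absent-or-ours, retry-with-backoff, always-clear pattern that the `workflow lock` step would automate can be sketched outside Brooklyn as a compare-and-set loop. The following is an illustrative Python sketch under that assumption; the sensor map and helper names are invented, not Brooklyn APIs.]

```python
import threading
import time

sensors = {}               # stands in for the entity's sensor map
_guard = threading.Lock()  # Brooklyn would hold a per-sensor lock internally


def compare_and_set(name, expected, value):
    """Set the sensor only if its current value matches `expected`
    (None meaning absent) -- loosely mirroring `require:` on set-sensor."""
    with _guard:
        if sensors.get(name) != expected:
            return False
        sensors[name] = value
        return True


def with_lock(lock_name, workflow_id, body):
    # Acquire: succeed if the lock sensor is absent, or already holds our
    # workflow id (re-entrancy); otherwise back off 50ms doubling up to 5s,
    # as in `retry backoff 50ms increasing 2x up to 5s`.
    delay = 0.05
    while not (compare_and_set(lock_name, None, workflow_id)
               or sensors.get(lock_name) == workflow_id):
        time.sleep(delay)
        delay = min(delay * 2, 5.0)
    try:
        return body()
    finally:
        # Always release, like the trailing `clear-sensor` step.
        with _guard:
            sensors.pop(lock_name, None)


def increment():
    sensors["count"] = sensors.get("count", 0) + 1


with_lock("workflow-lock-incrementing-count", "wf-1", increment)
print(sensors["count"])  # prints 1, with the lock sensor cleared
```

This sketch does not model the failover/Dangling recovery the thread discusses; it only shows the happy-path acquire/release semantics.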
> >>
> >> You could even define a new workflow step type (assuming "lock xxx" in the shorthand is a key `lock` on the workflow step):
> >>
> >> ```
> >> id: apt-get
> >> type: workflow
> >> lock: apt
> >> shorthand: ${package}
> >> parameters:
> >>   package:
> >>     description: package(s) to install
> >> steps:
> >> - step: workflow lock apt
> >>   steps:
> >>   - ssh sudo apt-get install ${package}
> >> ```
> >>
> >> With this in the catalog, you could in workflow have steps `apt-get xxx` which acquire a lock from Brooklyn before running, so they don't fail if invoked in parallel.
> >>
> >> ---
> >>
> >> I lean towards providing (2) and (3). It's actually fairly easy to implement given what we already have, and allows low-level control for special cases via (2) and a fairly simple high-level mechanism (3).
> >>
> >> Thoughts?
> >>
> >> On a related note, I'm not 100% happy with `replayable` as a keyword and its options. The essential thing is to indicate which points in a workflow are safe to replay from if it is interrupted or fails. We currently support "yes" and "no" to indicate if something is a replayable point, "true" and "false" as synonyms, and "on" and "off" to indicate that a step and all following ones are or are not replayable (unless overridden with replayable yes/no, or until changed with another replayable on/off). It defaults to "off". I think it is logically a good solution, but it doesn't feel intuitive. Alternative suggestions welcome!
> >>
> >> Cheers for your input!
> >> Alex
> >>
> >> [1] https://github.com/apache/brooklyn-docs/blob/master/guide/blueprints/workflow/index.md
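[Editor's note: the lost-update race that motivates the whole thread (the non-atomic read/increment/write in the counter example from the 11 Nov mail) can be shown deterministically in a few lines. This is an illustrative Python sketch; the sensor map is a stand-in, not a Brooklyn API.]

```python
# Two "workflows" each read the counter, add one, then write back.
# With this interleaving (both read before either writes), one update
# is lost -- exactly why an atomic set-sensor or a mutex is needed.
sensors = {}


def read(name, default=0):
    return sensors.get(name, default)


def write(name, value):
    sensors[name] = value


a = read("count") + 1  # workflow A computes 1
b = read("count") + 1  # workflow B also computes 1, having read the old value
write("count", a)
write("count", b)

print(sensors["count"])  # prints 1, not the hoped-for 2
```

Real concurrent runs hit the same interleaving only sometimes, which is what makes the bug easy to miss in testing.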