Hi Alex et al,

First of all, congrats on the implementation of the workflows so far,
I think you've made amazing progress in a short time.

The discussion of the mutex issues is pretty complex - hardly
surprising though, as dealing with concurrency is always hard. Here
are some thoughts.

Referring to your options:

(1) I have a feeling that just making set-sensor atomic in some
sense is probably not enough to support a wide enough range of
use-cases. I'm also not sure that we would want to do any sort of
concurrency control "implicitly" - I think it is preferable to have
explicit declarations in the workflow, whatever form they take. That
will both make it easier to implement and, probably more importantly,
easier for developers reading the workflow code to understand. Hence,
I would avoid this option.

(2) I don't feel I fully understand the examples here, but anyway -
I don't think I like the idea of pessimistic locking for the same sort
of reasons as (1) above. Better to be explicit - if there is no
declaration of concurrency control, then there is no concurrency
control. This also lets workflows be more lightweight when they don't
need locking. (Then again, won't they mostly need it? Hm, not sure.)

I don't really see how the "# mutex lock acquisition" example works.
Where does the lock arise? To put it another way, why does

set-sensor lock-for-count = ${workflow.id}

not simply declare an ordinary string-type sensor? Does the
"lock-for-" prefix implicitly create a lock type sensor?

In the "rock-solid" example, the whole "require" section on the
set-sensor step seems to make the code author responsible for the
low-level details of the control flow. This could be an easy
way for bugs to arise, say if the author forgot one of the 'any'
conditions. I could imagine that normally the requirement will match
the conditions you gave - the saved sensor lock should match the
workflow ID (or be absent, on first execution). Could this not be
enforced as the standard behaviour, so that users wouldn't have to
write the condition at all? Similarly the whole "on-error" clause
seems perhaps unnecessary - doesn't it just describe what will always
be the case if "replayable: yes"?
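To make the suggestion concrete, this is roughly what I'd hope the
step could shrink to if that re-entrancy check were applied as the
implied default (hypothetical - none of this is implemented, I'm just
sketching the shape):

```
# hypothetical: the "absent or equals ${workflow.id}" require
# condition is assumed implicit, so only the retry policy remains
- step: set-sensor lock-for-count = ${workflow.id}
  on-error:
  - retry backoff 50ms increasing 2x up to 5s
```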

Another point relates to "clear-sensor". Could we avoid requiring this
from users? What if they forget to add it, will that leave the mutex
erroneously locked? Would there be any possibility of regarding the
workflow as a "scope", such that from the point of acquiring the lock,
it remained locked for the rest of that scope, and was automatically
released at the end of that scope? (A kind of RAII, if you'll forgive
an allusion to C++.) That actually seems kind of like what you have
written in the first code example in point (3), so let's move on to
that.
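For clarity, the kind of scope-based behaviour I mean (purely
hypothetical - this borrows the shape of your option (3) but
emphasises the automatic release):

```
# hypothetical scoped lock: acquired on entering the sub-workflow
- step: workflow lock incrementing-count
  steps:
    - let i = ${entity.sensor.count} ?? 0
    - let i = ${i} + 1
    - set-sensor count = ${i}
  # lock released automatically when this scope exits, whether
  # normally or on error - no explicit clear-sensor needed
```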

(3) is what I like best, because it has that very explicit quality,
with the formal statement of a "lock". Declaring it like this is bound
to make implementation easier for us. It certainly makes it easier to
understand for users. The variant I like best is the one with the
workflow step type "lock:".

So I'm going to suggest that for the moment you just proceed with (3),
the "fairly simple high level mechanism". Let's provide that and give
ourselves a bit of time to gain experience with it. If it turns out
not to be a rich enough model then something lower level like (2) may
be needed; but I do think that would be a more frequent source of
errors.

About "replayable". I agree this doesn't feel intuitive. Again, it
seems like mixing in a low level concept, and that it would be good if
there was some convention we could adopt such that users wouldn't have
to be explicit about what points are replayable and what not. However,
in this case I don't have any clear idea that any such convention is
possible; I rather fear not. Perhaps whether something should be
replayable is a higher-order consideration, not something we could
decide, and will have to be left up to the code author? I don't have
a good grasp of the issues here.

Whether any of this is of use I don't know, hopefully there are some
points worth a thought.

Cheers
Geoff

On Fri, 11 Nov 2022 at 13:41, Alex Heneveld <a...@cloudsoft.io> wrote:
>
> Hi team,
>
> Workflow is in a pretty good state -- nearly done mostly as per
> proposal, with nearly all step types, retries, docs [1], and integrated
> into the activities view in the inspector.  My feeling is that it radically
> transforms how easy it is to write sensors and effectors.
>
> Thanks to everyone for their reviews and feedback!
>
> The biggest change from the original proposal is a switch to the list
> syntax from the init.d syntax.  Extra thanks to those who agitated for that.
>
> The other really nice aspect is how the shorthand step notation functions
> as a DSL in simple cases so extra thanks for the suggestion to make it
> DSL-like.
>
> The two items remaining are nested workflows and controlling how long
> workflows are remembered.
>
>
> There is one new feature which seems to be needed, which I wanted to
> raise.  As the subject suggests, this is mutexes.  I had hoped we wouldn't
> need this but actually quite often you want to ensure no other workflows
> are conflicting.  Consider the simple case where you want to atomically
> increment a sensor:
>
> ```
> - let i = ${entity.sensor.count} ?? 0
> - let i = ${i} + 1
> - set-sensor count = ${i}
> ```
>
> Running this twice we'd hope to get count=2.  But if the runs are
> concurrent you won't.  So how can we ensure no other instances of the
> workflow are running concurrently?
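> For concreteness, an interleaving that loses an update:
>
> ```
> # workflow A: let i = ${entity.sensor.count} ?? 0   -> i = 0
> # workflow B: let i = ${entity.sensor.count} ?? 0   -> i = 0
> # workflow A: set-sensor count = 1
> # workflow B: set-sensor count = 1   -> count = 1, not 2
> ```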
>
> There are three options I see.
>
>
> (1) set-sensor allows arithmetic
>
> We could support arithmetic on set-sensor and require it to run atomically
> against that sensor.  For instance `set-sensor count =
> (${entity.sensor.count} ?? 0) + 1`.  We could fairly easily ensure the RHS
> is evaluated with the caller holding the lock on the sensor count.  However
> our arithmetic support is quite limited (we don't support grouping at
> present, so either you'd have to write `${entity.sensor.count} + 1 ?? 1` or
> we'd beef that up), and I think there is something nice about at present
> where arithmetic is only allowed on `let` so it is more inspectable.
>
>
> (2) set-sensor with mutex check
>
> We could introduce a check which is done while the lock on the sensor is
> held.  So you could check the sensor is unset before setting it, and fail
> if it isn't, or check the value is as expected.  You can then set up
> whatever retry behaviour you want in the usual way.  For instance:
>
> ```
> # pessimistic locking
> - let i = ${entity.sensor.count} ?? 0
> - let j = ${i} + 1
> - step: set-sensor count = ${j}
>   require: ${i}
>   on-error:
>   - goto start
> - clear-sensor lock-for-count
> ```
>
> ```
> # mutex lock acquisition
> - step: set-sensor lock-for-count = ${workflow.id}
>   require:
>     when: absent
>   on-error:
>   - retry backoff 50ms increasing 2x up to 5s
> - let i = ${entity.sensor.count} ?? 0
> - let i = ${i} + 1
> - set-sensor count = ${i}
> - clear-sensor lock-for-count
> ```
>
> (A couple subtleties for those of you new to the workflow conditions; they
> always have an implicit target depending on context, which for `require` we
> would make be the old sensor value; "when: absent" is a special predicate
> DSL keyword to say that a sensor is unavailable (you could also use `when:
> absent_or_null` or `when: falsy` or `not: { when: truthy }`).  Finally
> `require: ${i}` uses the fact that conditions default to being an equality
> check.  That call is equivalent to `require: { target:
> ${entity.sensor.count}, equals: ${i} }`.)
>
> The retry with backoff is pretty handy here.  But there is still one
> problem, in the lock acquisition case, if the workflow fails after step 1
> what will clear the lock?  (Pessimistic locking doesn't have this
> problem.)  Happily we have an easy solution, because workflows were
> designed with recovery in mind.  If Brooklyn detects an interrupted
> workflow on startup, it will fail it with a "dangling workflow" exception,
> and you can attach recovery to it; specifying replay points and making
> steps idempotent.
>
> ```
> # rock-solid mutex lock acquisition
> steps:
> - step: set-sensor lock-for-count = ${workflow.id}
>   require:
>     any:
>     - when: absent
>     - equals: ${workflow.id}
>   on-error:
>   - retry backoff 50ms increasing 2x up to 5s
> - let i = ${entity.sensor.count} ?? 0
> - let i = ${i} + 1
> - step: set-sensor count = ${i}
>   replayable: yes
> - clear-sensor lock-for-count
> on-error:
>   - condition:
>       error-cause:
>         glob: "*DanglingWorkflowException*"
>     step: retry replay
> replayable: yes
> ```
>
> The `require` block now allows re-entrancy.  We rely on the fact that
> Brooklyn gives workflow instances a unique ID and on failover Dangling is
> thrown from an interrupted step preserving the workflow ID (but giving it a
> different task ID so replays can be distinguished, with support for this in
> the UI), and Brooklyn persistence handles election of a single primary with
> any demoted instance interrupting its tasks.  The workflow is replayable
> from the start, and on Dangling it replays.  Additionally we can replay
> from the `set-sensor` step which will use local copies of the workflow
> variable so if that step is interrupted and runs twice it will be
> idempotent.
>
>
> (3) explicit `lock` keyword on `workflow` step
>
> My feeling is that (2) is really powerful, and the pessimistic locking case
> is easy enough, but the mutex implementation is hard.  We could make it
> easy to opt-in to by allowing sub-workflow blocks to specify a mutex.
>
> ```
> - step: workflow lock incrementing-count
>   steps:
>     - let i = ${entity.sensor.count} ?? 0
>     - let i = ${i} + 1
>     - step: set-sensor count = ${i}
>       replayable: yes
> ```
>
> This would act exactly as the previous example, setting the workflow.id
> into a sensor ahead of time, allowing absent or current workflow id as the
> value for re-entrancy, retrying if locked, clearing it after, and handling
> the Dangling exception.  The only tiny differences are that the sensor it
> atomically sets and clears would be called something like
> `workflow-lock-incrementing-count`, and the steps would be running in a
> sub-workflow.
>
> In this example we still need to say that the final `set-sensor count` is a
> replay point, otherwise if Brooklyn were interrupted in the split-second
> after setting the sensor but before recording the fact that the step
> completed, it would retry from the start causing a slight chance the sensor
> increments by 2.  This isn't to do with the fact a mutex is wanted however,
> it's because fundamentally adding one is not an idempotent operation!
> Provided the steps are idempotent there would be no need for it.
>
> For instance to make sure that `apt` (or `dpkg` or `terraform` or any
> command which requires a lock) isn't invoked concurrently you could use it
> like this:
>
> ```
> type: workflow-effector
> name: apt-get
> parameters:
>   package:
>     description: package(s) to install
> steps:
> - step: workflow lock apt
>   steps:
>     - ssh sudo apt-get install ${package}
> ```
>
> Parallel invocations of effector `apt-get { package: xxx }` will share a
> mutex on sensor `workflow-lock-apt` ensuring they don't conflict.
>
> You could even define a new workflow step type (assuming "lock xxx" in the
> shorthand is a key `lock` on the workflow step):
>
> ```
> id: apt-get
> type: workflow
> lock: apt
> shorthand: ${package}
> parameters:
>   package:
>     description: package(s) to install
> steps:
> - step: workflow lock apt
>   steps:
>     - ssh sudo apt-get install ${package}
> ```
>
> With this in the catalog, you could in workflow have steps `apt-get xxx`
> which acquire a lock from Brooklyn before running so they don't fail if
> invoked in parallel.
>
> ---
>
> I lean towards providing (2) and (3).  It's actually fairly easy to
> implement given what we already have, and allows low-level control for
> special cases via (2) and a fairly simple high-level mechanism (3).
>
> Thoughts?
>
> On a related note, I'm not 100% happy with `replayable` as a keyword and
> its options.  The essential thing is to indicate which points in a workflow
> are safe to replay from if it is interrupted or fails.  We currently
> support "yes" and "no" to indicate if something is a replayable point,
> "true" and "false" as synonyms, and "on" and "off" to indicate that a step
> and all following ones are or are not replayable (unless overridden with
> replayable yes/no or until changed with another replayable on/off).  It
> defaults to "off".  I think it is logically a good solution, but it doesn't
> feel intuitive.  Alternative suggestions welcome!
>
>
> Cheers for your input!
> Alex
>
>
> [1]
> https://github.com/apache/brooklyn-docs/blob/master/guide/blueprints/workflow/index.md
