This is an automated email from the ASF dual-hosted git repository.

heneveld pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/brooklyn-docs.git


The following commit(s) were added to refs/heads/master by this push:
     new 50061ff1 more workflow documentation - now i think it's complete!
50061ff1 is described below

commit 50061ff1e386c0c3a9ed2a1c93b3e3b74d32b974
Author: Alex Heneveld <[email protected]>
AuthorDate: Thu Nov 24 17:40:38 2022 +0000

    more workflow documentation - now i think it's complete!
---
 guide/blueprints/workflow/common.md          | 179 +++++-----
 guide/blueprints/workflow/defining.md        |  13 +-
 guide/blueprints/workflow/index.md           |   3 +-
 guide/blueprints/workflow/nested-workflow.md | 123 +++++--
 guide/blueprints/workflow/settings.md        | 471 +++++++++++++++++++++++++++
 guide/blueprints/workflow/steps/steps.yaml   | 114 +++++--
 guide/blueprints/workflow/variables.md       | 117 +++++--
 7 files changed, 867 insertions(+), 153 deletions(-)

diff --git a/guide/blueprints/workflow/common.md 
b/guide/blueprints/workflow/common.md
index 3edb99a6..c790833b 100644
--- a/guide/blueprints/workflow/common.md
+++ b/guide/blueprints/workflow/common.md
@@ -1,5 +1,5 @@
 ---
-title: Common Workflow Properties
+title: Common Step Properties
 layout: website-normal
 ---
 
@@ -52,46 +52,9 @@ steps:
 
 All steps support a number of common properties, described below.
 
-
-### Conditions
-
-The previous example shows the use of conditions, as mentioned as one of the 
properties common to all steps.  
-This makes use of the recent "Predicate DSL" conditions framework 
-(https://github.com/apache/brooklyn-docs/blob/master/guide/blueprints/yaml-reference.md#predicate-dsl).
-
-It is normally necessary to supply a `target`, unless one of the 
entity-specific target keys (e.g. `sensor` or `config`)
-is used.  The target and arguments here can use the [workflow expression 
syntax](variables.md).  
-
-The condition is evaluated when the step is about to run, and if the condition 
is not satisfied, 
-the workflow moves to the following step in the sequence, or ends if that was 
the last step.
-(The `next` keyword, described next, is _not_ considered.) 
-
-### Jumping with "Next"
-
-The common property `next` allows overriding the workflow sequencing, 
indicating that a different step should 
-be gone to next.
-
-These can be used with the step type `no-op` to create "if x then goto" 
behavior, so as an alternative to the 
-condition on the ssh step in the previous section, one could write:
-
-```
-steps:
-- type: no-op
-  next: end
-  condition:
-    target: ${scratch.skip_date}
-    equals: true
-- ssh echo today is `date`
-```
-
-The special `next` target `end` can be used to indicate that a workflow should 
complete and not proceed to any further steps.  
-This avoids the need to introduce an unnecessary last step simply to have a 
jump target, 
-e.g. `{ id: very-end, type: no-op }`.  Similarly `next: start` will go to the 
start again.
-
-
 ### Explicit IDs and Name
 
-Steps can define an explicit ID for use with `next`, for correlation in the 
UI, 
+Steps can define an explicit ID for use with `next`, for correlation in the UI,
 and to be able to reference the output or input from a specific step using the 
[workflow expression syntax](variables.md).
 They can also include a `name` used in the UI.
 
@@ -113,6 +76,45 @@ steps:
 ```
 
 
+### Conditions
+
+The previous example shows the use of conditions, one of the properties common to all steps.
+This makes use of the recent
+["Predicate DSL" conditions framework](https://github.com/apache/brooklyn-docs/blob/master/guide/blueprints/yaml-reference.md#predicate-dsl).
+
+It is normally necessary to supply a `target`, unless one of the 
entity-specific target keys (e.g. `sensor` or `config`)
+is used.  The target and arguments here can use the [workflow expression 
syntax](variables.md).  
+
+The condition is evaluated when the step is about to run; if it is not satisfied,
+the workflow moves to the following step in the sequence, or ends if that was the last step.
+If a step's `condition` is unmet, all properties set on the step other than
+the `name` and `id` described above are ignored.
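+
+For example, a condition can make a step run only when a workflow variable allows it
+(a minimal sketch, using a hypothetical `scratch.skip_date` variable):
+
+```
+steps:
+- step: ssh echo today is `date`
+  condition:
+    target: ${scratch.skip_date}
+    not:
+      equals: true
+```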
+
+
+### Jumping with "Next" or "Goto"
+
+The common property `next` overrides the normal workflow sequencing,
+indicating that execution should jump to a different step.
+This does not apply if a step's condition is not satisfied, as noted at the end of the previous section.
+
+The value of the `next` property should be the ID of the step to go to
+or one of the following reserved words:
+
+* `start`: return to the start of the workflow
+* `end`: exit the workflow (or if in a block where this doesn't make sense, 
such as `retry`, go to the last executed step)
+* `exit`: if in an error handler, exit that error handler
+
+The `goto` step type is equivalent to a `no-op` step with `next` set,
+and is a simpler idiom for controlling the flow of a workflow.
+While `goto` is "considered harmful" in many programming environments,
+it is fairly common in declarative workflows, because it can simplify what
+might otherwise involve multiple nested workflows.
+That said, the confusion that `goto` can cause should be kept in mind,
+and its availability not abused: in particular, where a task can be done better
+in a proper high-level programming language, consider putting that program
+into a container and calling it from your workflow.
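+
+For example, `goto` can be combined with a condition to give "if x then goto" behavior
+(a minimal sketch, again using the hypothetical `scratch.skip_date` variable):
+
+```
+steps:
+- step: goto end
+  condition:
+    target: ${scratch.skip_date}
+    equals: true
+- ssh echo today is `date`
+```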
+
+
 ### Input and Output
 
 Most steps take input parameters and return output. 
@@ -152,50 +154,79 @@ This example also shows the expression syntax. More on 
inputs, outputs, variable
 is covered [here](variables.md). 
 
 
-### TODO Other keys
+### Timeout
 
-timeout, on-error, retry (or new section)
-- timeout implement for workflow steps
-- on-error implement for workflow steps
+Any step and/or an entire workflow can define a `timeout: <duration>`,
+where the `<duration>` is of the form `1 day` or `1h 30m`.
+If the step or workflow where this is present takes longer than this duration,
+it will be interrupted and will throw a `TimeoutException`.
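+
+For example, to abort a long-running command after 30 minutes (a minimal sketch,
+with a hypothetical download URL):
+
+```
+- step: ssh curl -O https://downloads.example.com/big-image.iso
+  timeout: 30m
+```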
 
 
-- first matching on-error block applies, usually applied a condition, must 
apply a step (e.g. retry), and
-  can apply output or next which replaces that defined by the original step
-- 
-- on-error not permitted to have id or replayable mode; the error step remains 
the replay target;
-- where error handlers use nested workflow, these are not persisted or 
replayable
+### Error Handling with `on-error`
 
-- ui support for error handlers (deferred, see notes in WorkflowErrorHandling)
- 
+Errors on a step and/or a workflow can use the `on-error: <handler>` property to determine how
+an error should be handled.  The `<handler>` can be:
 
+* a single step as a string, for instance `on-error: retry`, or, to prevent infinite loops
+  and introduce exponential backoff, `on-error: retry limit 4 backoff 5s increasing 2x`
 
-TODO also include this in defining:
+* a single step as a map, possibly with a condition; if the condition is not 
met,
+  the error is rethrown; for example:
 
-If interrupted, steps are replayed, so care must be taken for actions which 
are not idempotent,
-i.e. if an individual step is replayed we should be wary if they cause a 
different result.
-For example, if the following were allowed:
+  ```
+  - step: ssh systemctl restart my-service
+    on-error:
+      step: goto my-service-restart-error
+      condition:
+        target: ${exit_code}
+        greater-than: 0
+  ```
 
-```
-- set-sensor count = ${entity.sensor.count} + 1   # NOT supported
-```
+  If the `ssh` command returns an `exit_code` greater than zero (which the `ssh` step treats as an error),
+  this will go to the step with `id: my-service-restart-error`.
+  Any other error, such as a network connectivity failure, will be rethrown and could be addressed by a workflow-level
+  `on-error` or could cause the workflow to fail.
 
-if it were interrupted, Brooklyn would have no way of knowing whether
-the sensor `count` contains the old value or the new value.
-For this reason, arithmetic operations are permitted only in `let`,
-and because workflow variables are stored per step,
-we ensure that the arithmetic is idempotent, so either of these can be
-safely replayed from the point where they are interrupted
-(in addition to handling the case where the sensor is not yet published):
+* a list of steps, some or all with conditions and some or all with `next` 
indicated, as follows:
 
-```
-- let integer count_local = ${entity.sensor.count} + 1 ?? 1",
-- set_sensor count = ${count_local}
-```
+The list of steps will run sequentially, applying conditions to each.
+The target of conditions in an error handler is the error itself, so
+the DSL `error-cause` predicate can be used, for example
+`error-cause: { java-instance-of: TimeoutException }` or
+`error-cause: { glob: "*server timeout*" }`.
 
-or
+The error handler will complete and be considered to have handled the error at the first step
+whose condition is satisfied and which indicates a `next` step (either a `goto` or `retry` step, or a `next` property),
+and subsequent steps in the error handler will not run.
+If all steps have conditions and none are met, the error handler will rethrow 
the error,
+but otherwise, if one or more steps run and none of them throw errors or 
indicate a `next`,
+it will consider the error to be handled and go to the next step in the 
non-error-handler workflow.
+Where the handler combines non-conditional steps (such as `log`) with conditional ones,
+all expected terminal conditions should indicate a `next`; to avoid confusion, it is not recommended that
+the last step have a condition that might not apply. Consider adding a final step
+`fail rethrow message None of the error handler conditions were met` to make sure the handler does not
+accidentally succeed because a `log` step ran when none of the "real" conditions applied.
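+
+For example, a handler might retry timeouts but rethrow anything else; a minimal sketch,
+assuming a hypothetical `./run-backup.sh` script on the box:
+
+```
+- step: ssh ./run-backup.sh
+  on-error:
+    - step: retry limit 2 backoff 10s
+      condition:
+        error-cause:
+          java-instance-of: TimeoutException
+    - fail rethrow message None of the error handler conditions were met
+```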
+
+The `next` target `exit` can be used in an error handler to indicate to go to 
the next step in the containing
+workflow sequence. Nested error handlers are permitted, and `exit` will return 
to the containing error handler
+without indicating that it should exit. Any other `next` target from a nested 
workflow jumps out of all nested
+error handlers and goes to that target in the non-error-handler workflow.
+
+Error handlers run in the same context as the original workflow, not a new context as a nested workflow does,
+but with some restrictions. This has significant benefits, but also some special-case behaviors which
+might require care:
+
+* You can read and write workflow variables within error handlers
+* You can set the `output` for use in the outer workflow
+* Error handlers are not persisted; any replay will revert to the outer step 
or earlier
+* IDs are not permitted in error handlers; any `next` refers to the containing workflow
+* The workflow UI does not show error handling steps; their activity can only 
be seen in the tasks view
+  and in the logs
+
+
+### Workflow Settings
+
+There are a few other settings allowed on all or some steps which configure 
workflow behavior.
+These are `replayable`, `idempotent`, and `retention`,
+and are described under [Workflow Settings](settings.md).
 
-```
-- let integer count_local = ${entity.sensor.count} ?? 0",
-- let count_local = ${count_local} + 1
-- set_sensor count = ${count_local}
-```
diff --git a/guide/blueprints/workflow/defining.md 
b/guide/blueprints/workflow/defining.md
index d67775be..c674fb31 100644
--- a/guide/blueprints/workflow/defining.md
+++ b/guide/blueprints/workflow/defining.md
@@ -41,19 +41,26 @@ which uses workflow to do just that.  The config to define 
the effector is:
 
 * `name`: the name of the effector to define (required)
 * `parameters`: an optional map of parameters to advertise for the effector,
-  keyed by the parameter name against the definition as a map including 
optionally `type`, `description`, and `defaultValue`
+  keyed by the parameter name against the definition as a map, optionally including
+  `type`, `description`, and `defaultValue`; see [nested workflow](nested-workflow.md) for more details
 
 To define the workflow, this requires:
 
 * `steps`: to supply the list of steps defining the workflow (required)
 
-And the following optional [common](command.md) configuration keys are 
supported with the same semantics as for individual steps:
+And the following optional common configuration keys are supported,
+with the same semantics as for individual steps as described under [Common 
Step Properties](common.md):
 
 * `condition`: an optional condition on the effector which if set and 
evaluating to false,
   prevents the effector workflow from running; this does _not_ support 
interpolated variables
+
 * `input`: a map of keys to values to make accessible in the workflow, in 
addition to effector `parameters`
 
-TODO on-error, timeout, retry
+* `output`: defines the output of the workflow, often referring to workflow 
[variables](variables.md) or
+  the output of the last step in the workflow
+
+* other [common step properties](common.md) such as `timeout`, `on-error`, and 
`next`
+* other [workflow settings properties](settings.md) such as `lock`, 
`retention`, `replayable`, and `idempotent`
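+
+A minimal sketch combining several of these keys (the effector name, parameter, and step are illustrative):
+
+```
+brooklyn.initializers:
+- type: workflow-effector
+  name: say-hi
+  parameters:
+    person:
+      type: string
+      defaultValue: world
+  timeout: 1m
+  on-error: retry limit 2
+  steps:
+    - log hello ${person}
+  output: ${person}
+```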
 
 
 ### Sensors
diff --git a/guide/blueprints/workflow/index.md 
b/guide/blueprints/workflow/index.md
index 9683020b..0903963c 100644
--- a/guide/blueprints/workflow/index.md
+++ b/guide/blueprints/workflow/index.md
@@ -5,9 +5,10 @@ layout: website-normal
 children:
 - defining.md
 - common.md
-- steps/
 - variables.md
+- steps/
 - nested-workflow.md
+- settings.md
 - examples/
 ---
 
diff --git a/guide/blueprints/workflow/nested-workflow.md 
b/guide/blueprints/workflow/nested-workflow.md
index 77c11482..f961a014 100644
--- a/guide/blueprints/workflow/nested-workflow.md
+++ b/guide/blueprints/workflow/nested-workflow.md
@@ -15,7 +15,9 @@ or to apply `on-error` behavior to a group of steps.
 
 Nested and custom workflows are not permitted to access data from their 
containing workflow;
 instead, they accept an `input` block like other steps.
-When defining a new step type in the catalog, `parameters` and a `shorthand` 
template can be defined. 
+
+This type permits all the [common step properties](common.md) and all the [workflow settings properties](settings.md),
+plus a few others -- `target`, `concurrency`, `parameters`, and `shorthand` -- as described below.
 
 
 ### Basic Usage in a Workflow
@@ -33,19 +35,74 @@ For example:
   input:
     x: ${x}
   steps:
-    - log This is a nested workflow, able to see x=${x} but not y from the 
output workflow.
+    - log This is a nested workflow, able to see x=${x} from input but not y from the outer workflow.
   on-error:
     # error handler which runs if the nested workflow fails (i.e. if any step 
therein fails and does not correct it) 
 ```
 
+### Loops and Parallelization
+
+The `workflow` type can also be used to run a sequence of steps on multiple 
targets.
+If given a `target` value, Brooklyn will run the workflow against that target 
or targets,
+as follows:
+
+* If the target is a managed entity, e.g. `$brooklyn:entity("some-child")`, 
the nested workflow
+  will run in the scope of that entity. It will be visible in the UI under 
that entity,
+  and references to sensors and effectors will be against that entity.
+* If the target is any value which resolves to a list, it will be run against 
every entry in the list,
+  with the variable expression `${target}` available in the sub-workflow to 
refer to the relevant entry
+* If the target is `children` or `members` it will run against each entity in 
the relevant list
+* If the target is of the form `M..N` for integers `M` and `N` it will run for 
all integers in that range,
+  inclusive (so the string `1..4` is equivalent to the list `[1,2,3,4]`)
+
+Where a list is supplied, the result of the step is the list collecting the 
output of each sub-workflow.
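+
+For example, a range target gives a simple loop, with `${target}` as the loop variable (a minimal sketch):
+
+```
+- type: workflow
+  target: 1..3
+  steps:
+    - log iteration ${target}
+```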
+
+If a `condition` is supplied when a list is being used, the `workflow` step 
will always run,
+and the `condition` will be applied to entries in the list.
+An example of this is included below.
+
+By default nested workflows with list targets run sequentially over the 
entries,
+but this can be varied by setting `concurrency`.
+The following values are allowed:
+
+* a number to indicate the maximum number of simultaneous executions (with `1` 
being the default, for no concurrency)
+* the string `unlimited` to allow all to run in parallel
+* a negative number to indicate all but a certain number
+* a percentage to indicate a percentage of the targets
+* the string `min(...)` or `max(...)`, where `...` is a comma separated list 
of valid values
+
+This concisely allows complicated -- but important in the real world -- logic such as
+`max(1, min(50%, -10))` to express running concurrently over up to half the targets if there are more than twenty,
+otherwise over all but 10 of them, and always allowing at least 1.
+This might be used, for example, to upgrade a cluster in situ, leaving the larger of 10 instances or half the cluster alone, where possible.
+If the concurrency expression evaluates to 0, or to a negative number whose absolute value is larger than the number of
+values, the step will fail before executing, to ensure that if e.g. "-10" is specified when there are fewer than
+10 items in the target list, the workflow does not run.  (Use "max(1, -10)" to allow it to run 1 at a time if there are 10 or fewer.)
+
+#### Example
+
+This example invokes an effector on all children for which `service.isUp` is true,
+running in batches of up to 5, but not more than a third of the children at once:
+
+```
+- type: workflow
+  target: children
+  concurrency: max(1, min(33%, 5))
+  condition:
+    sensor: service.isUp
+    equals: true
+  steps:
+    - invoke-effector effector-on-children
+```
+
 
 ### Defining Custom Workflow Steps
 
 This type can be used to define new step types and add them as new types in 
the type registry.
 The definition must specify the `steps`, and may in addition specify:
 
-* `parameters`: a map of parameters accepted by the workflow, TODO link to 
config params
-* `shorthand`: a template
+* `parameters`: a map of parameters accepted by the workflow, with the key being the parameter name,
+  and the value a map, possibly empty, optionally providing `type` (default `string`), `defaultValue` (default none),
+  `required` (default `false`), `description` (default none), and/or `constraints`
+* `shorthand`: a template, as described below
 * `output`: an output map or value to be returned by the step, evaluated in 
the context of the nested workflow
 
 When this type is used to define a new workflow step, the newly defined step 
does _not_ allow the
@@ -55,23 +112,51 @@ It also accepts the standard step keys such as `input`, 
`timeout` on `on-error`.
 A user of the defined step type can also supply `output` which, as per other 
steps,
 is evaluated in the context of the outer workflow, with visibility of the 
output from the current step.
 
-For example:
 
-TODO
+#### Shorthand Template Syntax
+
+A custom workflow step can define a `shorthand` template which permits a user
+to use the workflow step as a string rather than a map, even with parameters.
+The shorthand template syntax consists of a sequence of the following tokens:
 
+* `${VAR}` - to set VAR, which should match the regex `[A-Za-z0-9_-]+(\.[A-Za-z0-9_-]+)*`,
+  with dot separation used to set nested maps
+* `${VAR...}` - as `${VAR}` but allowing it to match multiple words
+* `"LITERAL"` - to expect the user to supply the exact token `LITERAL`;
+  this should include spaces if spaces are required
+* `[ <TOKENS> ]` - to indicate that a sequence of `<TOKENS>` is optional; 
+  parsing is attempted first with this block, then without it
 
-#### Shorthand Template
+#### Example
+
+A simple example to say hello is as follows:
+
+```
+id: greet
+type: workflow
+shorthand: ${name...} [ " with " ${greeting} ]
+parameters:
+  name:
+    required: true
+  greeting:
+    defaultValue: Hello
+steps:
+- log ${greeting} ${name}
+```
+
+With this added as a registered type, workflows can write:
+
+```
+- type: greet
+  input:
+    name: Angela
+```
 
-TODO
+The result will be the same as `log Hello Angela`.
+The shorthand template also allows the step to be written as `greet Angela` for the same result,
+or, exercising the optional block, `greet Zachary Jones with Howdy` for `log Howdy Zachary Jones`.
 
-* Accepts a shorthand template, and converts it to a map of values,
-* e.g. given template "[ ${sensor.type} ] ${sensor.name} \"=\" ${value}"
-* and input "integer foo=3", this will return
-* { sensor: { type: integer, name: foo }, value: 3 }.
-*
-* Expects space separated TOKEN where TOKEN is either:
-*
-* [ TOKEN ] - to indicate TOKEN is optional. parsing is attempted first with 
it, then without it.
-* ${VAR} - to set VAR, which should be of the regex 
[A-Za-z0-9_-]+(\.[A-Za-z0-9_-]+)*, with dot separation used to set nested maps;
-*   will match a quoted string if supplied, else up to the next literal if the 
next token is a literal, else the next work.
-* "LITERAL" - to expect a literal expression. this should include spaces if 
spaces are required.
\ No newline at end of file
+This is a trivial single-step example but shows the power of creating custom 
workflows,
+especially with parameters and shorthand templates.
+The [examples](examples/) and the [workflow settings](settings.md) include 
more realistic
+illustrations of custom workflow steps.
diff --git a/guide/blueprints/workflow/settings.md 
b/guide/blueprints/workflow/settings.md
new file mode 100644
index 00000000..421323be
--- /dev/null
+++ b/guide/blueprints/workflow/settings.md
@@ -0,0 +1,471 @@
+---
+title: Workflow Settings
+layout: website-normal
+---
+
+Wherever workflows are defined, such as in [effectors, sensors and policies](defining.md) and
+in [nested workflows](nested-workflow.md), there are a number of properties which can be set.
+The most common of these -- `input`, `output`, and `parameters` -- are described in the preceding sections.
+Some of the common properties permitted on [steps](common.md) also apply to workflow definitions,
+including `condition`, `timeout`, and `on-error`.
+
+The rest of this section describes the remaining properties, for more advanced use cases
+including mutex locking and resilient workflows with replay points.
+
+
+## Locks and Mutual Exclusion Behavior
+
+In some cases, it is important to ensure that the same workflow does not run 
concurrently 
+with itself, or more generally to assign a mutual exclusion "mutex" lock to 
make sure
+that at most one executing instance from a group can run at any point in time.
+
+This can be done in Apache Brooklyn by specifying `lock: LOCK-NAME` on the 
workflow.
+The lock is scoped to the entity, and means that if a workflow instance 
running at the entity
+enters this block, it acquires that "lock",
+and no other workflow instance at the entity can enter that block
+until the first one exits the block and releases the lock.
+Workflow instances at the entity that seek to `lock` the same `LOCK-NAME`
+will block until the lock becomes available.
+
+For example, to ensure that `start` and `stop` do not run simultaneously, we could write:
+
+```
+brooklyn.initializers:
+- type: workflow-effector
+  name: start
+  lock: start-stop
+  steps:
+    - ...
+- type: workflow-effector
+  name: stop
+  lock: start-stop
+  steps:
+    - ...
+```
+
+If `stop` is run while `start` is still running, or a second `start` is run,
+they will not run until the first `start` completes and releases the lock.
+An operator with appropriate access permissions could also manually cancel the 
`start`.
+Details of why the effector is blocked are shown in the UI and available via 
the API,
+as part of the workflow data.
+
+
+### Example: Thread-Safe Package Management
+
+Locks can also be used in workflow steps saved as registered types.
+A good example where this is useful is when working with on-box package 
managers,
+most of which do not allow concurrent operation.
+For instance, an `apt-get` workflow step might use a lock
+to ensure that multiple parallel effectors do not try to `apt-get`
+on a server at the same time:
+
+```
+id: apt-get
+type: workflow
+lock: apt
+shorthand: ${package}
+parameters:
+  package:
+    description: package(s) to install
+steps:
+- type: workflow
+  lock: apt-package-management
+  steps:
+    - ssh sudo apt-get install -y ${package}
+```
+
+A workflow can then do `apt-get iputils-ping` as a step and Brooklyn will
+ensure it interacts nicely with any other workflow at the same entity.
+
+
+### Advanced Implementation Details
+
+Brooklyn guarantees that if a workflow is interrupted by server shutdown,
+it will resume with that lock after startup, so it works well with 
`replayable: automatically`
+described below.
+Brooklyn does not guarantee that waiters will acquire the lock in the same order
+they requested it, although this behavior can be constructed using a sensor that
+acts as a queue.
+
+Any `on-error` handler on the workflow with `lock` will run with the lock 
still acquired.
+
+Any `timeout` period starts once the lock is acquired.
+
+Internally, the lock is acquired by setting a sensor `lock-...` 
+equal to the `${workflow.id}`, where `...` is the `LOCK-NAME`.
+If a different workflow ID is indicated, the workflow will block.
+The sensor will always be cleared after the workflow with the `lock` 
+completes.
+
+Thus if a workflow needs to test whether it can acquire the lock,
+it can do exactly what the internal lock mechanism does:
+set that sensor to its `${workflow.id}` 
+with a `require` condition testing that it is blank or already held. 
+This technique can also be used to specify a timeout on the lock
+with a `retry`.
+
+This can also be used to force-clear a lock, to allow another workflow to run:
+either interactively, using "Run workflow" in the App Inspector
+with a step `clear-sensor lock-LOCK-NAME`, or as in the example below if the lock isn't available after 1 minute.
+
+These techniques are all illustrated in the following example:
+
+```
+- step: set-sensor lock-checking-first-example = ${workflow.id}
+  require:
+    any:
+      - when: absent
+      - equals: ${workflow.id}
+        # allowing = ${workflow.id} is recommended in the edge case where the 
workflow is interrupted 
+        # in the split-second between setting the sensor and moving to the 
next step;
+        # a "replay-resuming" as described below will proceed in case this 
step has already run
+  on-error:
+    # retry every 10 milliseconds for up to 1 minute
+    - step: retry backoff 10ms timeout 1m
+      on-error: goto lock-not-available
+- type: workflow
+  lock: checking-first-example
+  steps:
+    - log we got the lock
+    # ... other steps to be performed while holding the lock
+  # any steps to be performed after clearing the lock would follow this step,
+  # with this `next: end` moved to the last of them
+  next: end
+
+- id: lock-not-available
+  step: log Lock not available after one minute, force-clearing it and 
continuing
+- clear-sensor lock-checking-first-example
+- goto start
+```
+
+
+## Resilience: Replaying and Retention/Expiry
+
+Workflows have a small number of settings that determine how Brooklyn handles 
workflow metadata.  
+These allow workflow details to be accessible via the API and in the UI (in 
addition to whatever 
+is persisted in the logs) and optionally for a user to "replay" a workflow.  
These are:
+
+* **retention**: how long after completion details of workflow steps, input, and output should be kept
+  for inspection or manual replaying
+* **idempotent** (for steps):  whether a step is safe to re-run if it is interrupted or fails; if true, implying it
+  can be recovered by a replay resuming at that step
+* **replayable**: from where a retained workflow can be safely replayed, such as from the start
+  or from other explicitly defined replay points
+
+### Common Replay Settings
+
+Most of the time, there are just a few tweaks to `idempotent` and `replayable` 
needed to let Apache Brooklyn 
+do the right thing to replay correctly.  These simple settings are covered 
first.  The other settings, 
+including changing the retention, are intended for advanced use cases only.
+
+Brooklyn workflows are designed so that most steps are automatically 
"idempotent":  they can safely be run multiple
+times in a row, and so long as the last run is successful, the workflow can 
proceed. This means if a workflow is
+interrupted or fails, it is safe to attempt a recovery by replaying resuming 
at that step. This can be used for
+transient problems (e.g. a flaky network) or where an operator corrects a 
problem (e.g. fixes the network). It means
+uncertainty about whether the step completed or not can be ignored, and the 
step re-run if in doubt. Instructions such
+as "sleep", "set-config", or "wait for sensor X to be true" are obviously 
idempotent; it also applies to `let` because
+Brooklyn records a copy of the value of workflow variables on entry to each 
step and will restore them on a replay.
+
+However, for some step types, it is impossible for Brooklyn to infer whether 
they are idempotent:  this applies to "
+external" steps such as `http` and `ssh`. It can also be the case that even 
where individual steps are idempotent, a
+sequence of steps is not. In either of these cases the workflow author should 
give instructions to Brooklyn about how
+to "replay".
+
+There are two common ways for an author to let Apache Brooklyn know how to 
replay:
+
+* individual steps that are idempotent but not obviously so can be marked 
explicitly as such with **`idempotent: yes`**; 
+  for example a read-only http or container command
+
+* explicit "replayable waypoints" can be set with a step **`workflow 
replayable from here`** to indicate the workflow can be
+  replayed from that point, either manually or automatically; if any step is 
not idempotent, a "replay resuming" request
+  will replay from the last such waypoint; this might be a `retry` step in the 
workflow, on failover with
+  a `replayable: automatically` instruction, or a manual request from an 
operator; if waypoints are defined, operators
+  will also have the option to select a waypoint to replay from
+
+An example of a non-idempotent step is a container which calls `aws ec2 
run-instances`; this might fail after the
+command has been received by AWS, but before the response is received by 
Brooklyn, and simply re-running it in this case
+would cause further new instances to be created. The solution for this 
situation is to have a sequence of steps which
+creates a unique identifier for the request (setting this as a tag), then 
scans to discover any instances with a
+matching tag, calling `run-instances` only if none are found, and then waiting on the instances just created or
+discovered. An author can specify `replayable from here` just after the unique 
identifier is created, so if the workflow
+is subsequently interrupted on `run-instances` it will replay from the 
discovery.
+
+This is also an example where a sequence of individually idempotent steps is 
not altogether idempotent; 
+once a unique identifier has been used in a subsequent step, it would be 
invalid to create a new unique identifier. 
+Defining the replay point immediately after this step is a good solution, 
because Brooklyn's "replay resuming" 
+will only ever run from the last executed step if that step is idempotent, or 
from the last explicit `replayable from here` point. 
+(Alternatively the unique identifier step
+could use `${entity.id}` rather than something random, or store the random 
value in a sensor with a `require`
+instruction there to ensure it is only ever created once per entity.)
+
+Where an external step is known to be idempotent -- such as a `describe-instances` step that does the discovery, or any
+read-only step -- the step can be marked `idempotent: yes` and Brooklyn will support replay resuming at that step.
+(However, here and often, this is unnecessary if the nearest "replay point" is good enough.)
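+
+For example, such a read-only step can be marked explicitly (a sketch, with a hypothetical
+endpoint and `run_id` variable):
+
+```
+- step: http cloud.acme.com/instances?tag=${run_id}
+  idempotent: yes
+```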
+
+In some cases, it can be convenient to indicate default replayable/idempotency 
instructions when defining a workflow. As
+part of any workflow definition, such as `workflow-effector` or a nested 
`type: workflow` step, the
+entry `idempotent: all` indicates that all external steps in the workflow are 
idempotent; `replayable: automatically`
+indicates that an automatic `on-error` handler should resume any workflow 
interrupted by a Brooklyn server restart or
+failover; and `replayable: from start` indicates that the start of the 
workflow is a replay point.
+
+Thus by default, most steps will allow a manual "replay resuming" picking up at the step that was last run. However,
+without a `retry replay` step (such as in an error handler), this will not happen automatically; and some steps --
+external ones, which are often the most likely to fail -- cannot safely permit a "replay resuming" and so require
+extra attention. The following is a summary of the common settings used:
+
+* as a step
+  * **`workflow replayable from here`** to indicate that a step is a valid replay point,
+    with the option of appending the word **`only`** at the end to clear other replay points
+* on a step
+  * **`idempotent: yes`** to indicate that if the workflow is interrupted or fails at that step, it can be resumed at that step (only needed for external steps which are not automatically inferrable as idempotent)
+* when defining a workflow
+  * **`replayable: from start`** to indicate that the start of the workflow is a valid replay point
+  * **`replayable: automatically`** to indicate that on an unhandled Brooklyn failover (DanglingWorkflowException), the workflow should attempt to "replay resuming", either from the last executed step if it is resumable, or from the last replay point
+  * **`idempotent: all`** to indicate that external steps such as `http` and `container` in the workflow are resumable unless explicitly indicated otherwise (by default these are not; only internal steps known to be safely re-runnable are resumable)
+
+Finally, it is worth elaborating the differences between the three types of 
retry behavior, as described on the `retry` step:
+
+* A "replay resuming" attempts to resume flow at the last incomplete executed 
step. If that step is idempotent, it is
+  replayed with special arguments to resume where it left off:  this means 
skipping any `condition` check, using the
+  workflow variables as at that point, using any previous resolved values for 
input, and if the step launched
+  sub-workflows (such as `workflow` or `switch`, or an `invoke-effector` 
calling directly to a workflow effector), those
+  sub-workflows are resumed if possible. If the step is not idempotent, it 
attempts to "replay from" the last replayable
+  step, which might be the same step, but the condition and inputs will be 
re-evaluated. If there is no last replayable
+  step, it will fail.
+
+* A "replay from" looks at a given step to see if it is replayable, and if 
not, searches the flow backwards until it
+  finds one. If there are none, or it cannot backtrack, it will fail. If it 
finds one, execution replays from that step,
+  using the workflow variables as they were at that point, but re-checking any 
condition, re-evaluating inputs, and
+  re-creating any sub-workflows.
+
+* A `retry` step can specify `replay` and/or a `next` step. If a `next` step 
is specified without `replay`, it will do a
+  simple `goto` to that step and so will use the most recent value of workflow 
variables. In all other cases it will do
+  some form of replay:  if `replay` with `next` is specified, it will replay 
from that point, with `last` an alias for
+  the last replay point and `end` an alias for replay resuming; if `replay` is 
specified without `next`, it will replay
+  from the last replay point; if neither `next` nor `replay` is specified, it 
will replay resuming where that makes
+  sense (in an error handler) and otherwise replay from `last`. In all cases, 
`retry` options including `limit`
+  and `backoff` are respected.
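+
+For example, an error handler might attempt a bounded replay with backoff (a minimal sketch;
+the limits are illustrative):
+
+```
+on-error:
+- retry replay from last limit 5 backoff 10s increasing 2x
+```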
+
+Only `replay` is permitted through the API or UI, either "from" a specific 
step, from "start", from the "last"
+replayable step, or resuming from the "end". Users with suitable entitlement 
also have the ability to `force` a replay
+from a given step or `resuming`, which will proceed irrespective of the 
`idempotent` or `replayable` status of the step.
+
+
+#### Example: Mutex Atomic Increment
+
+Consider an atomic increment:
+
+```
+- let x = ${entity.sensor.count}
+- step: let x = ${x} + 1
+  replayable: from here only
+- set-sensor count = ${x}
+```
+
+If this is interrupted on step three after setting the sensor, a replay from the start would
+retrieve the new sensor value and increment it again. By saying `here only` on step two,
+we remove all previous replay points and ensure the workflow is only ever replayed
+from the one safe place.
+
+The above assumes no other workflow instances might use the sensor;
+if two workflows run concurrently they both might get the same initial value 
of `count`,
+so the result would be a single increment.
+Wrapping this in a workflow with a `lock` block as described above will 
prevent this problem,
+with Apache Brooklyn ensuring that on interruption the workflow with the lock
+is replayed first. Again we need to set a replay point before the incremented 
value is written,
+and for good measure we put a replay point after the sensor is updated.
+
+```
+- type: workflow
+  lock: single-entry-workflow-incrementing-count
+  replayable: from start
+  steps:
+    # ... various steps
+    - let x = ${entity.sensor.count} ?? 0
+    - step: let x = ${x} + 1
+      replayable: from here only       # do not allow replays from the start
+    - set-sensor count = ${x}
+    - workflow replayable from here    # could say only, but previous replay 
point also okay
+    # ... various steps
+  on-error:
+    - retry limit 2 in 5s              # allow it to retry replaying on any 
error
+                                       # (if we just wanted Brooklyn server 
failover errors to recover,
+                                       # we could have said `replayable: 
automatically from start`)
+```
+
+
+### Advanced:  Replay/Resume Settings
+
+There are additional options for `idempotent` and `replayable` useful in 
special cases, and the `retention` can be
+configured. As noted above, this section and the next can be skipped on first 
reading and returned to if there are
+complicated replay or retention needs.
+
+* **`idempotent`**
+  * when defining a workflow, as `idempotent: <value>`
+    * **`all`**: means that all external steps in this workflow will be 
resumable unless explicitly declared otherwise (
+      by default, per below, external steps are not resumable); this includes 
steps in explicit sub-workflows (where the
+      workflow definition has a `workflow` with `steps`) but not sub-workflows 
which are references (effectors or
+      registered workflow types)
+  * on a step as a key, as `idempotent: <value>`
+    * **`yes`**:  the step is idempotent and the workflow can replay resuming 
at this step if interrupted there
+    * **`no`**: the step is not idempotent and should not be resumed at this step; if interrupted there, replay resuming
+      will start from the previous replay point
+    * **`default`** (the default):  `no` for `fail` (because there is no point 
in resuming from a `fail` step), `no` for
+      external steps (eg http, ssh) except where the surrounding workflow 
definition is `all`, computed based on the
+      state of sub-workflows if at a workflow step, and `yes` otherwise
+
+* `replayable`
+  * when defining a workflow, as `replayable: <value>`
+    * **`enabled`** (the default):  it is permitted to replay resuming wherever the workflow fails on idempotent steps
+      or where there are explicit replay points
+    * **`disabled`**:  it is not permitted for callers to replay the workflow, 
whether operator-driven or automatic;
+      resumable steps and replay points in the workflow are not externally 
visible (but may still be used by replays
+      triggered within the workflow)
+    * **`from start`**:  the workflow start is a replay point
+    * **`automatically`**: indicates that on an unhandled Brooklyn failover 
(DanglingWorkflowException), the workflow
+      should attempt to replay resuming; implies `enabled`, can be combined 
with `from start`
+  * as a step, as `workflow replayable <value>` (or `{ type: workflow, 
replayable: <value> }`)
+    * **`reset`**:  to invalidate all previous replay points in the workflow
+    * **`from here`**:  this step is a valid replay point; on workflow failure, any "retry replay" or "replayable: automatically" handler will replay from this point if the workflow is non-resumable; operators will have the option to replay from this point
+    * **`from here only`**:  like a `reset` followed by a `from here`
+  * on a step, as a key
+    * **`from here`** or `from here only`:  as for `workflow replayable`
+  * on workflow step with sub-steps, as a key
+    * any of the key values for defining a workflow, with their semantics 
applied to the nested workflow(s)
+      only (`replayable: disabled` is equivalent to `idempotent: no` and will 
override an `idempotent: yes` there)
+    * any of the key values for a step, with semantics applied to the 
containing workflow
+
+
+### Advanced:  Retention Settings
+
+Apache Brooklyn stores details of workflows as part of its persistence while a 
workflow is executing and for a
+configurable period of time after completion. This allows workflows to be 
resumed even in the case of a Brooklyn server
+restart or failover, and it allows operators to manually explore or replay 
workflows for as long as they are retained.
+
+If needed, it is possible to specify that a workflow should be kept for a 
period of time (including `forever`) or up to
+a maximum number of invocations. The specification can also refer to the 
loosest ("max") or tightest ("min") of a list
+of values. This can be set as part of a workflow's definition, where some 
workflows are more interesting than others,
+and/or as part of a workflow step, if the retention period should be changed 
depending how far the workflow progresses.
+
+Where not explicitly set, a system-wide retention default is used. This can be configured in `brooklyn.properties` using
+the key `workflow.retention.default`. If not supplied, Brooklyn defaults to `3`, meaning it will keep the three most
+recent invocations of a workflow, with no time limit.
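+
+For example, to keep the five most recent invocations instead, one might set (a minimal sketch,
+assuming a plain number is accepted as for `retention`):
+
+```
+workflow.retention.default=5
+```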
+
+Workflow retention is done on a per-entity basis, keyed by default on a hash of the workflow name. Typically workflow
+definitions for effectors, sensors, and policies all get unique names for that definition, so the retention applies
+separately to each of the different defined workflows on an entity. However each definition typically assigns the same
+name to each instance, so any retention count limit applies to completed runs in that group of workflows.
+Thus `max(2, 24h)` on an effector will keep all runs for 24 hours but only the 2 most recent completed invocations for
+longer, in addition to ongoing instances.
+
+A custom `hash` can be specified for a workflow to use a key different to the name. This can be used to apply the
+retention limit to instances across multiple workflow definitions; for instance, if only the last 2 of any start, stop,
+or restart command should be kept, the instruction `retention: 2 hash start-stop` can be included in the definition for
+each of the start, stop, and restart workflows. This can also be used to specify that a workflow might go into different
+retention classes depending where it is in its execution; if workflow failures should be kept for longer, the `fail`
+step might say `retention: forever hash ${workflow-name} failed`, causing the workflow to be retained with a different
+hash (`<name> failed`) and for it to apply a different period (`forever`) when it checks expiry on that hash.
+
+Formally, the syntax for `retention` is:
+
+* when defining a workflow, as `retention: <value>`
+* as a step, as `workflow retention <value>` (or `{ type: workflow, retention: 
<value> }`)
+
+Permitted `<value>` expressions in either case are:
+
+* a number, to indicate how many instances of a workflow should be kept
+* a duration, to indicate for how long workflows should be kept
+* `forever`, to never expire
+* `context`, to use the previous retention values (often used together with 
`max`)
+* `parent`, to use the value of any parent workflow or else the system default; this is the default for workflows:
+  they inherit their parent workflow's retention if nested, and otherwise take the system default
+* `system`, to use the system default (from `brooklyn.properties`)
+* `min(<value>, <value>, ...)` or `max(<value>, <value>, ...)` where `<value>` is any of the expressions on this line or above
+  (but not `disabled` or `hash`); in particular a `max` within a `min`, or vice versa, is useful, as is referring to the `parent` value
+  * `min` means completed workflow instances must only be retained if they 
meet all the constraints implied by
+    the `<value>` arguments, i.e. `min(2, 3, 1h, 2h)` means only the most 
recent two instances need to be kept and only
+    if it has been less than an hour since they completed
+  * `max` means completed workflow instances must be retained if they meet any 
of the constraints implied by
+    the `<value>` arguments, i.e. `max(2, 3, 1h, 2h)` means to keep the 3 most 
recent instances irrespective of when
+    they run, and to keep all instances for up to two hours
+* `disabled`, to prevent persistence of a workflow, causing less work for the 
system where workflows don't need to be
+  stored; such workflows will not be replayable by an operator or recoverable 
on failover;
+  this should not be used with workflows that acquire a `lock` unless the 
entity has special handlers to clear locks
+* `hash <hash>` to change the retention key; useful if some instances of a 
workflow should be kept for a longer duration
+  than others; unlike most values, this can be a `${...}` variable expression;
+  this can optionally be preceded by any of the other `<value>` expressions listed
+
+
+### Advanced Example
+
+This defines an effector with an idempotent workflow that can be replayed from most steps, and from the beginning if
+it fails on a step which isn't resumable; details of the last 5 invocations will always be kept, as will all invocations
+from the past 24 hours:
+
+```
+brooklyn.initializers:
+- type: workflow-effector
+  retention: max(24h,5)
+  replayable: yes
+  steps:
+    - ...
+```
+
+As a more interesting example, consider provisioning a VM where approval is needed and where, unlike the `aws` case
+above, tags cannot be used to make the actual call idempotent. The call to the actual provisioner needs to fail hard so
+an operator can review it, but the rest of the workflow should be as robust as 
possible.  (Of course it is recommended
+to try to make workflows idempotent, as discussed in this section, but in some 
cases that may be difficult.)
+Specifically here, any cancellation or failure prior to sending the request 
might be uninteresting for operators and
+fine for a user to replay; however once provisioning begins, all details 
should be kept, and the provisioning step
+itself should not be replayable; finally once the machine details are known 
locally it is no longer important to keep
+workflow details. In this case the workflow might look like:
+
+```
+type: workflow
+retention: max(context,6h)   # keep all for at least 6h, and longer/more if 
parent or system workflow says so
+replayable: from start       # allow replay from the start (until changed)
+on-error:
+- retry limit 10    # automatically replay on any error (unless no replay 
points)
+steps:
+
+# get and wait for approval
+- http approvals.acme.com/request/infrastructure/vm?<details_omitted>
+- let request_id = ${content.request_id}
+- id: wait_for_approval
+  step: http approvals.acme.com/check?request_id=${request_id}
+  # assume returns a map { completed: boolean, approved: boolean, details: 
string }
+- step: retry from wait_for_approval limit 7d backoff 10s increasing 2x up to 
5m
+  condition:
+    target: ${content.completed}
+    when: falsy
+- step: "fail message Provisioning request denied by approvals system: ${content.details}"
+  # the 'fail' step type is not resumable so replay will not be permitted here,
+  # but it would be allowed from the start, so we have to disable it
+  replayable: reset
+  condition:
+    target: ${content.approved}
+    not:
+      equals: true
+
+# now provision, don't allow replays and keep details for cleanup
+- workflow replayable reset
+- workflow retention forever
+- http cloud.acme.com/provision/vm?<details_omitted>
+# once the request is made we can allow replays again
+# but continue to keep details for cleanup
+- workflow replayable from here
+- let provision_id = ${content.provision_id}
+- http cloud.acme.com/check?provision_id=${provision_id}
+  # assume returns a map with { completed: boolean, id: string, ip_address: 
string }
+- step: retry limit 1h backoff 10s increasing 2x up to 1m
+  condition:
+    target: ${content.completed}
+    equals: false
+- set-sensor vm_id = ${content.id}
+- set-sensor ip_address = ${content.ip_address}
+
+# finally restore default retention per parent or system, as details are now 
stored on the entity
+- workflow retention parent
+```
+
diff --git a/guide/blueprints/workflow/steps/steps.yaml 
b/guide/blueprints/workflow/steps/steps.yaml
index 0bcbeae5..ad327ad3 100644
--- a/guide/blueprints/workflow/steps/steps.yaml
+++ b/guide/blueprints/workflow/steps/steps.yaml
@@ -23,35 +23,49 @@
   steps:
     - name: let
       summary: An alias for `set-workflow-variable`.
-      shorthand: '`let [ "trimmed" ] [ TYPE ] VARIABLE_NAME = VALUE`'
+      shorthand: '`let ["merge" ["deep"]] ["trim"] ["yaml"|"json"|"bash" 
["encode"|"parse"]] ["wait"] [ TYPE ] VARIABLE_NAME [ = VALUE ]`'
 
     - name: set-workflow-variable
       summary: Sets the value of a workflow internal variable. The step `let` 
is an alias for this.
-      shorthand: '`set-workflow-variable ["merge" ["deep"]] ["trim"] 
["yaml"|"json"] ["wait"] [TYPE] VARIABLE_NAME = VALUE`'
+      shorthand: '`set-workflow-variable ["merge" ["deep"]] ["trim"] 
["yaml"|"json"|"bash" ["encode"|"parse"]] ["wait"] [TYPE] VARIABLE_NAME [ = 
VALUE ]`'
       input: |
         * `variable`: either a string, being the workflow variable name, or a 
map, containing the `name` and optionally the `type`;
           the value will be coerced to the given type, e.g. to force 
conversion to an integer or to a bean registered type;
           the special types `yaml` and `json` can be specified here to force 
conversion to a valid YAML or JSON string;
           the `name` here can be of the form `x.key` where `x` is an existing 
map variable, to set a specific `key` within it
         * `value`: the value to set, with some limited evaluation as described 
[here](../variables.md)
-        * `merge`: indicates the value will be a space-separated list of 
values, with quotes significant, usually including `$(vars)`, 
-          which will be merged and set as the indicated variable; `type` must 
be specified; 
-          for maps, all values must be maps; 
-          for lists and sets, any value is permitted and collection values are 
flattened (to insert a list as an entry in a list, it must be wrapped in 
another list using longhand syntax);
-          if `trim` is specified, null or unresolvable values and entries 
(including either key or value for maps) are ignored;
-          if `wait` is specified, unavailable values are waited on
-        * `merge_deep`: whether to merge maps deeply, i.e. where the keys are 
identical the values should be maps and those maps are themselves merged 
deeply; 
-          the value should be as per `merge` but containing maps only;
-          the values to merge are not otherwise coerced so must be 
pre-transformed to the appropriate type 
-        * `trim`: whether the value should be trimmed after evaluation and 
prior to setting; 
-          this applies to string input and to an explicit string type result
-        * `yaml`: indicates the input should be converted to YAML if not 
already a string,
-          and if a type other than string is specified, the YAML will be 
parsed as YAML to make that type;
-          if no type is specified then the YAML will be parsed and the 
resulting primitive/map/list/string returned;
-          if the type `string` is specified, the encoded YAML will be returned;
-          in all cases apart from the last (`yaml string`), any content in a 
string input before a YAML document separator `---` is dropped
-        * `json`: as for `yaml`, but using JSON for any conversion and parse; 
in particular `json string` can be used to get a JSON-encoded string for 
passing to scripts
-        * `wait`: whether to block on expressions such as `entity.sensor.x` 
which are well-formed but not yet available
+      output: the output from the previous step, or null if this is the first 
step
+
+    - name: transform
+      summary: Applies a transformation to a variable or expression.
+      shorthand: '`transform [TYPE] VARIABLE_NAME [ [ = VALUE ] | TRANSFORM ]`'
+      input: |
+        * `variable`: either a string, being the workflow variable name, or a 
map, containing the `name` and optionally the `type`;
+          the value will be coerced to the given type, e.g. to force 
conversion to an integer or to a bean registered type;
+          the special types `yaml` and `json` can be specified here to force 
conversion to a valid YAML or JSON string;
+          the `name` here can be of the form `x.key` where `x` is an existing 
map variable, to set a specific `key` within it
+        * `value`: the value to set, with some limited evaluation as described 
[here](../variables.md)
+        * `transform`: a string indicating a transform, or multiple transforms 
separated by `|`, where a transform can be
+          * `trim` to remove leading and trailing whitespace on strings, null 
values from lists or sets, and any entry with a null key or null value in a map
+          * `merge [list|set|map] [deep] [wait] [lax]` to treat the words in 
the value as expressions to be combined;
+            indicates the value will be a space-separated list of values, with 
quotes significant, usually including `$(vars)`;
+            if a type is specified, all values must be compatible with that type, otherwise collection vs map merge is inferred from the types;
+            if `deep`, any maps containing mergeable values (lists or maps) 
are themselves merged deeply;
+            if `wait`, unavailable sensor values are blocked on before merging;
+            if `lax` then nulls (and, if not waiting, unavailable sensors) are silently removed
+          * `wait` to wait on values (such as `${entity.attributeWhenReady...}`); must precede any transform that has to resolve values, such as `trim`
+          * `type TYPE` to coerce to registered type `TYPE`
+          * `json [string|parse|encode]`: indicates the input should be 
converted as needed using JSON;
+            if `string` is specified, the value is returned if it is a string, 
or serialized as a JSON string if it is anything else;
+            if `parse` is specified, the result of `string` is then parsed to 
yield maps/lists/strings/primitives (any string value must be valid json);
+            if `encode` is specified, the value is serialized as a JSON string 
as per JS "stringify", including encoding string values wrapped in explicit `"`;
+            if nothing is specified the behavior is as per `parse`, but any 
value which is already a map/list/primitive is passed-through unchanged
+        * `yaml [string|parse|encode]`: as per `json` but using YAML serialization and,
+          for anything other than `encode`, stripping any text before a `---` YAML document separator
+        * `bash [json|yaml] [string|encode]`: equivalent to the corresponding 
`json` or `yaml` transform, with `json` being the default, followed by 
bash-escaping and wrapping in double quotes;
+          ideally suited for passing to scripts; `string` is the default and 
suitable for most purposes, but `encode` is needed where passing to something 
that expects JSON such as `jq`
+        * `first`, `last`, `min`, `max`, and `average` are accepted for 
collections
+        * any other word is looked up as a registered type of type 
`org.apache.brooklyn.core.workflow.steps.transform.WorkflowTransform` (or 
`WorkflowTransformWithContext` to get the context or arguments)
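+
+        For example, the following sketch (assuming a workflow variable `x` 
+        holding a map, and a previous step whose output includes `stdout`) 
+        creates a JSON string of `x` and a trimmed copy of the output:
+
+        ```
+        - transform x_json = ${x} | json string
+        - transform stdout_trimmed = ${stdout} | trim
+        ```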
       output: the output from the previous step, or null if this is the first 
step
 
     - name: clear-workflow-variable
@@ -108,20 +122,35 @@
 
     - name: retry
       summary: Retries from an appropriate point with a configurable delay and 
back-off
-      shorthand: '`retry [ "replay" ] [ "from" NEXT ] [ "limit" LIMIT_COUNT [ 
"in" LIMIT_TIME ] ] [ "backoff" BACKOFF ] [ "timeout" TIMEOUT ]`'
+      shorthand: '`retry [ "replay" ] [ "from" NEXT ] [ "limit" LIMIT_COUNT [ 
"in" LIMIT_TIME ] ] [ "backoff" BACKOFF ]`'
       input: |
-        * `replay`: whether to replay or not; XXX 
-        * `limit`: a list of limit definitions, e.g. `[ "10", "2 in 1 min" ]` 
to restrict to 10 retries total with
-          an additional limit of 2 in any 1 minute period; as shorthand, per 
the syntax above, it takes one such limit
-        * `backoff`: a specification of how to delay, of the form `INITIAL [ 
"increasing" FACTOR ] [ "up to" MAX ]`,
+        * `replay`: whether to do the retry as a replay or not; 
+          if done as a replay, it reverts to the workflow variables as known 
at the target step;
+          if a `next` step is specified the default is `false`, but if there 
is no `next` step
+          the default is `true` and `next` is taken as `end` if in an error 
handler and `last` otherwise,
+          so blank/default `retry` behavior in an error handler is to try to 
replay resuming 
+          and in a normal workflow to replay from the last valid replay point;
+          if `replay` is specified without next, the default `next` is the 
`last` replay point
+          (always excluding the retry step itself, even if it is a valid 
replay point)
+        * `limit`: a list of limit definitions, e.g. `[ "10", "1h", "2 in 1 min" ]` 
+          to restrict to a maximum of 10 retries in total, a maximum elapsed time of 1 hour, 
+          and an additional limit of 2 retries in any 1 minute period; the retry fails 
+          if any of these limits is exceeded; as shorthand, per the syntax above, 
+          it takes one such limit definition
+        * `backoff`: a specification of how to delay, of the form `INITIAL [ 
"increasing" FACTOR ] [ "jitter" JITTER ] [ "up to" MAX ]`,
           where `INITIAL` is one or more durations applied to the respective 
iteration, and the last value applying to all
-          unless `FACTOR` is supplied; `FACTOR` as either a number followed by 
`x` to multiply, a number followed by by `%` to add a percentage each time, 
-          or a duration to increase linearly; and `MAX` to specify a maximum 
duration; for example,
+          unless an `increasing FACTOR` is supplied; 
+          `FACTOR` as either a number followed by `x` to multiply, a number followed by `%` to add a percentage each time, 
+          or a duration to increase linearly; 
+          `JITTER` as either a percentage or real factor by which to jitter 
the result with randomness, 0 being none, 100% being up to double, and -100% 
being down to zero;
+          and `MAX` to specify a maximum duration; for example,
          `backoff 1s 1s 10s` will retry after 1 second for the first two retries, 
          then after 10 seconds for each subsequent retry; 
          and `backoff 1s 2s increasing 3x up to 1m` will retry after 1 second, then 2s, 
          then 6s, 18s, 54s, and 60s each subsequent time 
-        * `key`: an optional key to identify related retries in a workflow; 
all retry blocks with the same `key` will share counts
-          for the purpose of counting limits, although the limits themselves 
are as per the definition in each retry block
-        * `next` and `timeout` per the [common](../common.md) step properties
+        * `hash`: an optional hash expression to identify related retries in a 
workflow instance; all retry blocks with the same `hash` will share counts
+          for the purpose of counting limits (although the limits themselves 
are as per the definition in each retry block),
+          which can be useful if there are different steps which might fail in 
+          different ways but where the overall retry behavior should be preserved
+        * `next` per the [common](../common.md) step properties, with special 
targets 
+          `last` for the last replayable step
+          (the default if not in an error handler or if `replay` is specified) 
and 
+          `end` to replay resuming (only permitted in an error handler where 
it is the default if `replay` is not specified)
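+
+        For example, the following sketch (the script name is illustrative) 
+        retries a failing step up to 5 times within 10 minutes, doubling the 
+        delay from 2 seconds up to a 1 minute maximum:
+
+        ```
+        - step: ssh ./provision.sh
+          on-error:
+          - retry limit 5 in 10m backoff 2s increasing 2x up to 1m
+        ```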
       output: the output from the previous step, or null if this is the first 
step
 
     - name: goto
@@ -131,14 +160,33 @@
         * `next` per the [common](../common.md) step properties
       output: the output from the previous step, or null if this is the first 
step
 
+    - name: switch
+      summary: Concisely selects and runs the first matching step case, like 
`if ... else if ... else`
+      shorthand: '`switch [ VALUE ]`'
+      input: |
+        * `cases`: a list of steps, each with a `condition` (optionally omitted on the last); 
+          the first step whose condition is satisfied is run and the others are skipped; 
+          it is an error if no case matches (possible only if the last case also has a `condition`)
+        * `value`: a value passed to each of the conditions as the `target`
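+
+        For example, the following sketch (assuming a workflow variable `color`, 
+        and that a bare condition value is an equality check) logs one line 
+        depending on the value:
+
+        ```
+        - step: switch ${color}
+          cases:
+          - condition: red
+            step: log color is red
+          - condition: blue
+            step: log color is blue
+          - step: log color is something else
+        ```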
+      output: the output from the matched case step
+
     - name: workflow
       summary: |
        Runs a nested workflow, optionally over an indicated target.
         This step type is described in more detail 
[here](../nested-workflow.md).
+        It can also run over a list of targets, optionally in parallel.
+      shorthand: '`workflow [ replayable REPLAYABLE ] [ retention RETENTION ]` or custom'
       input: |
        * `steps`: a list of steps, run in a separate context
-        * `target`: an optional target specifier (see below)
-      output: the output from the last step in the nested workflow
+        * `target`: an optional target specifier, an entity or input to the 
steps for the sub-workflow,
+          or if a list, a list of entities or inputs to pass to multiple 
sub-workflows
+        * `concurrency`: a specification for how many of the sub-workflows can 
be run in parallel, if given a list target;
+          defaulting to one at a time, supporting a DSL as described 
[here](../nested-workflow.md)
+        * `condition`: the usual step condition, as described [here](../common.md), can be used, 
+          with one difference: if a target is specified, the condition is applied to it, 
+          or to each entry in a list target, to conditionally allow the sub-workflows, 
+          and the workflow step itself will always run (i.e. the condition does not 
+          apply to the step itself when there is a target)
+        * `replayable`: instructions to add or modify replay points, as 
described [here](../settings.md), for example `workflow replayable from here`
+        * `retention`: instructions to modify workflow retention, as described 
[here](../settings.md)
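+
+        For example, the following sketch runs a nested workflow over each 
+        child entity (using the `children` target specifier), at most three 
+        at a time:
+
+        ```
+        - step: workflow
+          target: children
+          concurrency: 3
+          steps:
+          - log processing ${entity.name}
+        ```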
+      output: the output from the last step in the nested workflow, or a list 
of such outputs if supplied a `target` list
 
 -
   section_name: External Actions
@@ -296,6 +344,8 @@
           and/or the `entity` where the sensor should be set (defaulting to 
the entity where the workflow is running);
           the value will be coerced to the given type, e.g. to force 
conversion to an integer or to a bean registered type
         * `value`: the value to set
+        * `require`: a condition to evaluate against the sensor value while 
holding the lock on the sensor,
+          used to enable atomic operations similar to the `lock` workflow 
setting
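+
+        For example, the following sketch is an atomic test-and-set, setting a 
+        marker sensor only if it has not already been set:
+
+        ```
+        - step: set-sensor boolean initialized = true
+          require:
+            when: absent_or_null
+        ```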
       output: the output from the previous step, or null if this is the first 
step
 
     - name: clear-config
diff --git a/guide/blueprints/workflow/variables.md 
b/guide/blueprints/workflow/variables.md
index 41d5d515..f79aad5e 100644
--- a/guide/blueprints/workflow/variables.md
+++ b/guide/blueprints/workflow/variables.md
@@ -58,7 +58,7 @@ The interpolated reference `${workflow.<KEY>}` can be used to 
access workflow in
 * `step.<ID>.<KEY>` - info on the last invocation of the step with declared 
`id` matching `<ID>`, as above
 * `var.<VAR>` - return the value of `<VAR>` which should be a workflow-scoped 
variable (set with `let`) 
 
-In the step contexts, the following is also supported:
+In the step contexts, the following is also supported after `workflow.`:
 
 * `step_id` -- the declared `id` of the step
 * `step_index` -- the index of the current step in the workflow definition, 
starting at 0
@@ -83,6 +83,20 @@ workflow is running, where `<KEY>` can be:
 * `application.<KEY>` - returns the value of `<KEY>` per any of the above in 
the context of the application
 
 
+### Output Access
+
+The token `${output}` refers to the nearest output in scope:
+in a step's `output:` block, it refers to the default output from a step, thus
+`output: ${output.stdout}` can be used on a `container` step to change the 
output from being the default map including `stdout`
+to being just the `stdout` (alternatively just `${stdout}` can be used, per 
the next section).
+With a nested workflow running over a list, e.g. of children, `output: 
${output[0]}`
+can be used to refer to the output from the first element in the list.
+If used in a step _prior_ to the resolution of an `output` block, such as in 
the inputs,
+it refers to the output from the previous step.
+If used in the `output` block of a workflow, it refers to the default output 
of the workflow
+which is the output of the last step.
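+
+For example, the following sketch (the image name is hypothetical) makes a 
+`container` step return just its `stdout` rather than the default output map:
+
+```
+- type: container
+  image: my-image
+  command: echo hello
+  output: ${output.stdout}
+```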
+
+
 ### Simple Expressions for Input, Output, and Variable
 
 Where `${<VAR>}` is supplied, assuming it doesn't match one of the models 
above, the following search order is used:
@@ -100,7 +114,7 @@ and then workflow vars and inputs. It will be an error if 
`x` is not defined in
 (The `output` of the `current_step` is only defined when processing an 
explicit `output` block defined on a step,
 and the `error_handler` is only defined when running in an `on-error` step.)
 
-Note that the `workflow` and `entity` models take priority over workflow 
variables,
+Note that the `workflow`, `entity`, and `output` models take priority over 
workflow variables,
 so it is a bad idea to call a workflow variable `workflow`, as an attempt to 
evaluate
 `${workflow}` refers to the model above which is not a valid result for an 
expression.
 (You could still access such a variable, using `${workflow.var.workflow}`.)
@@ -114,6 +128,43 @@ This only applies in very specific edge cases, and so can 
generally be ignored.
 If resolution behavior is ever surprising, it is recommended to use the full 
syntax including scope (prefixed by `workflow.`).
 
 
+
+### Arithmetic and Idempotency
+
+The `let` step allows mathematical operations, such as:
+
+```
+- let x = ${x} * 3 + 1
+```
+
+This is the only place arithmetic is supported. It sets workflow variables whose values 
+at each step are recorded, so they can be restored if a workflow is "replayed" from that step.
+
+This ensures that all the internal steps (excluding steps that act externally, 
+such as `ssh` or `container`) are individually idempotent: 
+if a workflow is interrupted at a step, it can safely be resumed from that step.
+
+For example, if the following were allowed:
+
+```
+- set-sensor count = ${entity.sensor.count} + 1   # NOT supported
+```
+
+then if it were interrupted, Brooklyn would have no way of knowing whether
+the sensor `count` contains the old value or the new value,
+and a replay might cause it to be incremented twice.
+The following sequence of steps (which is permitted) can always safely be 
replayed from any interrupted state:
+
+```
+- let integer count_local = ${entity.sensor.count} ?? 0
+- let count_local = ${count_local} + 1
+- set-sensor count = ${count_local}
+```
+
+Where workflows need to be resumed on interruption or might replay steps to 
recover from other errors,
+idempotency is an important part of reliable workflow design.
+Good practice and the settings available for resilient workflows are covered 
in [Workflow Settings](settings.md).
+
+
 ### Unavailable Variables and the `let` Nullish Check
 
 To guard against mistakes in variable names or state, workflow execution will 
typically throw an error if
@@ -121,8 +172,9 @@ a referenced variable is unavailable or null, including 
access to a sensor which
 There are three exceptions:
 
* the `let` step supports the "nullish coalescing" operator `??` for this case, as described below
-* the `wait` step will block until a sensor becomes available
-* `condition` blocks can reference a null or unavailable value in the `target`,
+* the `wait` step or `transform ... | wait` will block until a value becomes 
available,
+  such as using `${entity.attributeWhenReady.SENSOR_NAME}`.
+* `condition` entries can reference a null or unavailable value in the 
`target`,
   and check for absence using `when: absent_or_null`
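+
+For example, the following sketch logs a message only when a sensor `ready` 
+has not yet been set:
+
+```
+- step: log the ready sensor is not yet set
+  condition:
+    target: ${entity.sensor.ready}
+    when: absent_or_null
+```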
 
 Where it is necessary to consider "nullish" values -- variables which are 
either null or not yet available --
@@ -156,46 +208,63 @@ Shorthand form is designed primarily for simple strings 
as the data. To pass mor
 the quotes, longhand form (map) is recommended, and it may be helpful to 
convert complex objects to strings
 in a previous step using e.g. `let string map_s = ${map}`.
 
-It is possible to embed most quoted expressions and some complex types as 
shorthand, but care must be taken
-and it is helpful to understand the parsing process.
-Shorthand will groups things using quotation marks, single or double, provided 
the quoted string is separated
-by whitespace or an end-of-line. Thus it is technically possible to set a 
workflow variable `a b` using
+It is possible to embed strings with spaces, quotes, and complex types in shorthand, 
+but care must be taken, and if doing this it is helpful to understand the parsing process.
+Shorthand normally groups things using quotation marks, single or double, 
+provided the quoted string is surrounded by whitespace or an end-of-line; 
+it removes these outermost quotes and standardizes whitespace outside the quotes.
+Thus it is technically possible to set a workflow variable `a b` using
 `let "a b" = 1`, although it is not recommended, and because the expression 
syntax doesn't allow spaces,
 there is no way to access such a variable!
-
-Expressions within strings are evaluated as strings, and where a shorthand 
step can match multiple words
-(most final arguments, e.g. everything after `=` in `set-sensor`),
-if there are multiple words, they will also be treated as a string.
-Simple types such as numbers and booleans can be converted to strings, but 
complex types will be an error,
-as will null or absent variables. If an argument is a single word which is a 
single expression then its type will be
-preserved.  Thus if we run `let integer val = 1` then `set-sensor s1 = val is 
${val}` or `set-sensor s1 = "val is ${val}"`,
+The one case where quotes are not stripped by the shorthand processor is when 
the step's
+final argument accepts multiple words, such as after the `=` in `set-sensor` 
or `ssh <command>`;
+if the final multi-word argument is one entire quoted string, it is unwrapped, 
+but otherwise its quotes are respected.  This allows `ssh bash -c 'echo hi'` to pass the quotes,
+and also allows it to be written `ssh 'bash -c "echo hi"'` or `ssh 'bash -c 
\'echo hi\''`,
+with the outer quotes removed.
+The syntax is optimized to be as intuitive as possible in common cases,
+although it does get complicated at the margins; for example `log "hello 
world"` prints `hello world` (quotes unwrapped) but 
+`log "hello" "world"` prints `"hello" "world"` (quotes preserved).
+If in doubt, you can always write `log "\"hello\" \"world\""` or `log '"hello" 
"world"'`.
+It is suggested to follow the examples, test anything non-trivial, prefer longhand where in doubt, 
+and review these notes only if particularly interested in or uncertain about quoting.
+
+Variable expansion occurs whenever `${var}` is used, expanding to the value of 
`var` as described above.
+If an expression makes up a whole shorthand word on its own, the type of its value is preserved, 
+but if it is embedded in a larger word, simple values (numbers and booleans) will be 
+converted to strings, while complex types will give an error, as will null or absent variables.
+Thus if we run `let integer val = 1` then `set-sensor s1 = val is ${val}` or 
`set-sensor s1 = "val is ${val}"`,
 the string `val is 1` will be set as the sensor `s1`; however `set-sensor s1 = 
${val}` will emit the integer `1`.
 If `val` is a map, then the last form will preserve the map, but the other 
two, including it in `val is ${val}`,
 will throw an error.
 
It can be helpful to use the `let` step to coerce data to the right format in a new variable
 or to handle potentially unset values; for example `let json string val_json = 
${val}` to create a string
-`val_json` representing the JSON encoding of `val`, or even `let map x = { a: 
1 }` for simple unambiguous map expressions,
-where the string `{ a: 1 }` is converted to a map.
-The longhand form, e.g. `{ step: "let x", value: { a: 1 } }`, should be used 
for potentially ambiguous values.
+`val_json` representing the JSON encoding of `val`, or even `"let map x = { a: 
1 }"` for simple unambiguous map expressions,
+where the string `{ a: 1 }` is converted to a map; note, however, that YAML 
+requires any such string containing `:` to be quoted in its entirety, 
+and the YAML parser will unwrap it before passing it to the shorthand processor.
+The longhand form, e.g. `{ step: "let x", value: { a: 1 } }`, can be used for 
potentially ambiguous values or for clarity.
 
 The `let` step has some special behavior. It can accept `yaml` and, when 
converting to complex types,
 will strip everything before a `---` document separator.  Thus a script can 
have any output, so long as it
ends with `---\n` followed by the YAML to read in, then `let yaml fancy-bean = ${stdout}` will convert it to
 a registered type `fancy-bean`. It will be an error if `stdout` is not 
coercible to `fancy-bean`.
 
-Another special behavior of `let` is that its shorthand form preserves spaces 
and quotes when tokenizing its final argument, `value`.
-Any tokens which are quoted will be unescaped but _not_ evaluated as 
expressions,
-and all other tokens are evaluated and then coerced to strings (except where 
the value is a single expression).
-This allows embedding literal quotation marks, multiple spaces, and `${...}` 
literals into variables.
-Thus given the steps:
+Another special behavior of `let` is that its `value` is reprocessed, 
supporting arithmetic as described elsewhere,
+and also unwrapping quoted words in the value (removing quotes) _without_ 
evaluating expressions within them.
+This is the only way to embed a literal `${...}` expression in a value, and it can 
+simplify other places where quotation marks and spaces are needed. Thus, given the steps:
 
 ```
 - let msg = "${person}" is ${person}
 - log ${msg}
 ```
 
-Brooklyn will log `"${person}" is { name: "Bob", age: 42 }`.
+Brooklyn will log `${person} is { name: "Bob", age: 42 }`.
+
+It is also possible to use the longhand syntax `{ type: set-sensor, sensor: x, value: value }` 
+or a hybrid syntax `{ step: set-sensor x, value: value }`, 
+which can be useful for complex types and for bypassing the shorthand processor's 
+unquoting strategy, although note that in this case YAML processing will still unwrap quotes.
 
 
 ### Advanced Details
