Hi Carsten

There are two things to consider:

(1) moving the implementation of the repository based signalling into
the installer (essentially encapsulation)
(2) implementation of a robust protocol for signalling a block to
installers on other cluster nodes

So far I have talked about (1) but didn't go into the details of (2).
What I have in mind for (2) is a content structure that records three
pieces of information:

- Sling ID in order to identify on which cluster node the block was triggered
- Service PID (i.e. the fully qualified class name of the
implementation) in order to know which service triggered the block
- Creation timestamp for information/debugging

The content structure would look like the following:

/system/sling/installer/jcr/pauseInstallation
    <sling-id>/
        <service-pid>/
            <random-uuid>/
                jcr:created = <datetime>

It is important that, as a general rule, any node without children is
eagerly deleted. This means that the installer is blocked if
"pauseInstallation" has at least one child node and unblocked if it
has none (or does not exist itself).
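
For illustration, a minimal sketch (plain JCR API) of how the
installer could evaluate this rule; the class and method names are my
own, not an existing API:

    import javax.jcr.Node;
    import javax.jcr.RepositoryException;
    import javax.jcr.Session;

    class BlockStateCheck {

        private static final String PAUSE_PATH =
                "/system/sling/installer/jcr/pauseInstallation";

        // the installer is blocked iff the pauseInstallation node
        // exists and has at least one child node
        static boolean isBlocked(Session session) throws RepositoryException {
            if (!session.nodeExists(PAUSE_PATH)) {
                return false; // no marker node at all -> unblocked
            }
            Node pause = session.getNode(PAUSE_PATH);
            return pause.hasNodes(); // any child node means blocked
        }
    }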

The structure would allow a single service to hold multiple blocks
(each <random-uuid> node representing one).

Normally we would assume that a service blocks the installer and later
unblocks it again, ideally using a try/finally block (a sketch follows
the list below). However, it gets interesting when edge cases are
considered:
- the repository service may get stopped (or restarted), in which case
the unblock can fail
- a cluster node can be killed or disconnected before the unblock can be done
- I have seen log files where the "blocking" service was restarted
while it blocked the installer, because the installer was
asynchronously processing a batch from a previous block. However, it
is unclear why the unblock did not happen in this case: there were no
exceptions in the log and I don't believe they were swallowed, because
 when I provoked a similar scenario exceptions were written to the
log.
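
Here is the sketch of the intended happy path: create a block entry as
described above and remove it again in a finally block. The helper
names (createBlockEntry/removeBlockEntry) are illustrative only, not
an existing Sling API:

    import java.util.Calendar;
    import java.util.UUID;
    import javax.jcr.Node;
    import javax.jcr.RepositoryException;
    import javax.jcr.Session;

    class InstallerBlockExample {

        void deployWithBlock(Session session, String slingId,
                String servicePid) throws RepositoryException {
            String blockPath = createBlockEntry(session, slingId, servicePid);
            try {
                // ... perform the deployment while the installer is paused ...
            } finally {
                removeBlockEntry(session, blockPath);
            }
        }

        private String createBlockEntry(Session session, String slingId,
                String servicePid) throws RepositoryException {
            // build .../pauseInstallation/<sling-id>/<service-pid>/<random-uuid>
            Node node = session.getRootNode();
            for (String name : new String[] { "system", "sling", "installer",
                    "jcr", "pauseInstallation", slingId, servicePid,
                    UUID.randomUUID().toString() }) {
                node = node.hasNode(name) ? node.getNode(name) : node.addNode(name);
            }
            // with nt:unstructured nodes jcr:created is not auto-created,
            // so the timestamp is recorded explicitly here
            node.setProperty("jcr:created", Calendar.getInstance());
            session.save();
            return node.getPath();
        }

        private void removeBlockEntry(Session session, String blockPath)
                throws RepositoryException {
            if (session.nodeExists(blockPath)) {
                session.removeItem(blockPath);
                session.save();
            }
            // eager deletion of now-empty ancestor nodes omitted for brevity
        }
    }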

To recover from such failure scenarios, the installer needs to be unblocked:
- if a blocking service is stopped. A stopped service may still exist
in the JVM and finish execution, therefore this could be solved using
weak references to a block token and a reference queue (see the sketch
below this list), or alternatively by using a timeout in such cases.
- if a cluster node disappears from a topology, its <sling-id> node
should be removed after a timeout
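
To illustrate the weak-reference idea from the first point: the
blocking service receives an opaque token, the installer keeps only a
weak reference to it, and a periodic cleanup removes block entries
whose token has been garbage collected. All names here are assumptions:

    import java.lang.ref.Reference;
    import java.lang.ref.ReferenceQueue;
    import java.lang.ref.WeakReference;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    class BlockTokenRegistry {

        // opaque token handed to the blocking service
        static final class BlockToken {}

        private final ReferenceQueue<BlockToken> queue = new ReferenceQueue<>();
        // maps each weak reference to the repository path of its block entry
        private final Map<Reference<BlockToken>, String> paths =
                new ConcurrentHashMap<>();

        BlockToken block(String blockPath) {
            BlockToken token = new BlockToken();
            paths.put(new WeakReference<>(token, queue), blockPath);
            return token;
        }

        // called periodically: a token that was garbage collected without
        // an explicit unblock indicates a stopped or vanished service
        void cleanupAbandoned() {
            Reference<? extends BlockToken> ref;
            while ((ref = queue.poll()) != null) {
                String path = paths.remove(ref);
                if (path != null) {
                    // delete the corresponding block node in the repository
                    // (repository access omitted in this sketch)
                }
            }
        }
    }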

There is a danger, however, that unblocking the installer due to
recovery causes a partial deployment to be installed. This may put the
system into an unusable state (e.g. bundles may not be resolvable,
because their dependencies were not updated/installed). I don't know
how we could address this.

Maybe an entirely different approach would be to provide a list of
deployables (e.g. repository paths?) to the installer, which then only
installs the deployables if all are available (ignoring deployables
with extensions it does not handle). This list would need to be
communicated in a cluster as well, however.
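
Purely as a thought experiment, such an API could look like the
following; this interface does not exist, it only illustrates the
shape such an approach could take:

    import java.util.Set;

    interface AtomicDeploymentInstaller {

        // Announce a deployment consisting of the given repository paths.
        // The installer defers processing until every listed path exists,
        // ignoring paths with extensions it does not handle.
        void announceDeployment(String deploymentId, Set<String> repositoryPaths);
    }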

Regards
Julian


On Sun, Jan 31, 2016 at 10:34 AM, Carsten Ziegeler <cziege...@apache.org> wrote:
> Julian Sedding wrote
>> Hi Carsten
>>
>>> Offline discussions don't make it transparent why you came to this
>>> conclusion.
>>> Please enclose the relevant information either here or in the issue.
>>
>> Sure, I thought that I included some information both in the email and
>> in the issue. But it is probably worthwhile to expand on it a bit
>> more.
>>
>> The current implementation is based on a convention rather than a
>> contract: place a node under a specific parent node and the JCR
>> installer will pause its activities.
>>
>> It turns out that this convention in this simple form has limitations
>> when things go wrong:
>>
>> - If a "deployer" implementation fails to delete the pause-marker node
>> (no matter what the reasons are), whose responsibility is it to delete
>> this node to recover the system?
>> - If a "deployer" on cluster node A creates a pause-marker node and
>> then cluster node A is shut down/crashes/disconnects, whose
>> responsibility is it to delete this node to recover the system?
>>
>> Both these questions require a more sophisticated convention IMHO.
>> This becomes a burden for implementers, makes fixing the convention
>> nearly impossible (in case we miss an edge case) and is brittle,
>> because "deployers" may have bugs in their implementations.
>>
>> So the logical conclusion is to move the implementation of this
>> "convention" into Sling and expose it via a simple API.
>>
>> The convention basically becomes an implementation detail, which is
>> needed to distribute the information about blocking the installer
>> within a cluster.
>>
>> Does this answer your questions?
>>
>
> Thanks, yes, however :) I can't follow the conclusion. How is having an
> API which for example has a pause/resume method to be called
> different/easier/avoiding the problems than adding/removing a node?
>
> Carsten
>
>
>
> --
> Carsten Ziegeler
> Adobe Research Switzerland
> cziege...@apache.org
