Re: SLING-5421 - Allow JCR installer to recover from being paused indefinitely

Julian Sedding Fri, 29 Jan 2016 08:02:00 -0800

Hi Carsten

> Offline discussions don't make it transparent why you came to this
> conclusion.
> Please enclose the relevant information either here or in the issue.

Sure, I thought that I included some information both in the email and
in the issue. But it is probably worthwhile to expand on it a bit
more.

The current implementation is based on a convention rather than a
contract: place a node under a specific parent node and the JCR
installer will pause its activities.

It turns out that this convention in this simple form has limitations
when things go wrong:

- If a "deployer" implementation fails to delete the pause-marker node
(no matter what the reasons are), whose responsibility is it to delete
this node to recover the system?
- If a "deployer" on cluster node A creates a pause-marker node and
then cluster node A is shut down/crashes/disconnects, whose
responsibility is it to delete this node to recover the system?

Answering both of these questions requires a more sophisticated
convention IMHO. That burdens implementers, makes the convention
nearly impossible to fix later (in case we miss an edge case), and is
brittle because "deployers" may have bugs in their implementations.

So the logical conclusion is to move the implementation of this
"convention" into Sling and expose it via a simple API.

The convention basically becomes an implementation detail, which is
needed to distribute the information about blocking the installer
within a cluster.
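
To make this a bit more concrete, here is a rough sketch of what such an
API could look like. Nothing here exists in Sling today: the names
InstallerPause, InstallerPauseService and removeStaleMarkers are made up
for illustration, and a shared in-memory map stands in for the
pause-marker nodes in the repository. Which instances count as "live" is
also just assumed to be known.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical handle: closing it resumes the installer. */
interface InstallerPause extends AutoCloseable {
    @Override
    void close(); // narrowed to throw no checked exception
}

/** Hypothetical service; one instance per cluster node, all sharing the same markers. */
final class InstallerPauseService {
    private final String localSlingId;
    // Stand-in for the repository: marker name -> Sling-ID of the instance that created it.
    private final Map<String, String> markers;
    private final AtomicLong sequence = new AtomicLong();

    InstallerPauseService(String localSlingId, Map<String, String> markers) {
        this.localSlingId = localSlingId;
        this.markers = markers;
    }

    /** Creates a marker annotated with the local Sling-ID and returns a handle that removes it. */
    InstallerPause pause() {
        String name = localSlingId + "-" + sequence.incrementAndGet();
        markers.put(name, localSlingId);
        return () -> markers.remove(name);
    }

    /** The installer is paused as long as any marker exists, no matter which instance owns it. */
    boolean isPaused() {
        return !markers.isEmpty();
    }

    /** Recovery: removes markers whose owner is not among the live instances; returns the count. */
    int removeStaleMarkers(Set<String> liveSlingIds) {
        int before = markers.size();
        markers.values().removeIf(owner -> !liveSlingIds.contains(owner));
        return before - markers.size();
    }
}
```

A "deployer" would then wrap its work in try (InstallerPause p =
service.pause()) { ... } and never touch the marker nodes directly. If
the instance dies before close() runs, another instance can clean up via
removeStaleMarkers without risking the removal of a valid marker, because
each marker records who owns it.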

Does this answer your questions?

Regards
Julian

On Fri, Jan 29, 2016 at 7:36 AM, Carsten Ziegeler <[email protected]> wrote:
> Julian Sedding wrote
>> Hi all
>>
>> The JCR installer was enhanced with a feature to pause it for a while
>> in SLING-3747. By pausing and later resuming the JCR installer a
>> "deployer" can signal to the installer that a set of installable
>> resources should be processed together.
>>
>> The mechanism to pause the JCR installer is based on the presence of a
>> node in a particular location in the repository. This is a requirement
>> to allow the feature to work in a cluster, where the installers on all
>> instances need to be paused.
>>
>> In SLING-5421 it is reported that this mechanism can lead to a
>> permanently paused JCR installer, most likely due to a crash/kill or a
>> premature shutdown/failure of the repository. The possibility of a
>> programming error was ruled out by inspecting the code of the
>> "deployer" (try/finally is used consistently).
>>
>> Additional robustness comes with the cost of added complexity. E.g. to
>> allow deletion of a pause-marker, the marker needs to be annotated
>> with the Sling-ID of an instance. Otherwise another instance might
>> remove a valid pause-marker.
>>
>> In order not to burden "deployer" implementations with this
>> complexity, I suggest encapsulating the logic within the installer
>> itself, and instead expose an API to pause the installer. This was the
>> consensus we found in some offline discussions.
>>
>> (see also https://issues.apache.org/jira/browse/SLING-5421)
>>
>> Any thoughts or objections?
>>
> Offline discussions don't make it transparent why you came to this
> conclusion.
> Please enclose the relevant information either here or in the issue.
>
> Thanks
> Carsten
>
>
> --
> Carsten Ziegeler
> Adobe Research Switzerland
> [email protected]
