Julian Sedding wrote
> Hi Carsten
>
> There are two things to consider:
>
> (1) moving the implementation of the repository based signalling into
> the installer (essentially encapsulation)

When you say "installer", do you mean "jcr installer" or "osgi
installer"? From below I assume the jcr installer.
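
And just to make sure we mean the same thing by "encapsulation" in (1):
I picture roughly an interface like the sketch below (purely
illustrative - none of these names exist in Sling today):

    // Purely a sketch - neither this interface nor these names exist in
    // Sling; it only illustrates what "encapsulation" of the repository
    // based signalling could look like from a deployer's point of view.
    public interface InstallerPauseSupport {

        /**
         * Asks the installer to pause processing and returns a token for
         * this block. Marker nodes, cluster distribution and crash
         * recovery would all become implementation details behind this.
         */
        PauseToken pause();

        /** Handle for one block; closing it resumes installation. */
        interface PauseToken extends AutoCloseable {
            @Override
            void close();
        }
    }

A deployer could then simply use try-with-resources around the token
and would never need to know about marker nodes.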

> (2) implementation of a robust protocol for signalling a block to
> installers on other cluster nodes
>
> So far I have talked about (1) but didn't go into the details of (2).
> What I have in mind for (2) is a content structure that records three
> pieces of information:
>
> - Sling ID in order to identify on which cluster node the block was triggered
> - Service PID (i.e. the fully qualified class name of the
> implementation) in order to know which service triggered the block
> - Creation timestamp for information/debugging
>
> The content structure would look like the following:
>
> /system/sling/installer/jcr/pauseInstallation
>   <sling-id>/
>     <service-pid>/
>       <random-uuid>/
>         jcr:created = <datetime>
>
> It is important that, as a general rule, any node without children is
> eagerly deleted. This means that the installer is blocked if
> "pauseInstallation" has at least one child node and unblocked if it
> has none (or does not exist itself).
>
> The structure would allow a single service to hold multiple blocks
> (each <random-uuid> node representing one).
>
> Normally we would assume that a service blocks the installer and later
> unblocks it again, ideally using a try/finally block. However, it gets
> interesting when edge cases are considered:
> - the repository service may get stopped (or restarted), in which case
> the unblock can fail
> - a cluster node can be killed or disconnected before the unblock can be done
> - I have seen log files where the "blocking" service was restarted
> while it blocked the installer, because the installer was
> asynchronously processing a batch from a previous block. However, it
> is unclear why the unblock did not happen in this case: there were no
> exceptions in the log and I don't believe they were swallowed, because
> when I provoked a similar scenario exceptions were written to the
> log.
>
> To recover from such failure scenarios, the installer needs to be unblocked:
> - if a blocking service is stopped. A stopped service may still exist
> in the JVM and finish execution, therefore this could be solved using
> weak references to a block-token and a reference queue, or
> alternatively by using a timeout in such cases.
> - if a cluster node disappears from a topology, its <sling-id> node
> should be removed after a timeout
>
> There is a danger, however, that unblocking the installer due to
> recovery causes a partial deployment to be installed. This may put the
> system into an unusable state (e.g. bundles may not be resolvable
> because their dependencies were not updated/installed). I don't know
> how we could address this.
>
> Maybe an entirely different approach would be to provide a list of
> deployables (e.g. repository paths?) to the installer, which then only
> installs the deployables if all are available (ignoring deployables
> with extensions it does not handle). This list would need to be
> communicated in a cluster as well, however.

Thanks Julian, now I understand your idea.
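
If I read the proposal correctly, a "deployer" would block and unblock
the installer roughly like the sketch below (only an illustration to
check my understanding - the class, the node type and the
SlingSettingsService lookup are my assumptions, not an agreed API):

    // Sketch only: create and remove a pause marker following the
    // structure proposed above. Node type and error handling simplified.
    import java.util.UUID;

    import javax.jcr.Node;
    import javax.jcr.RepositoryException;
    import javax.jcr.Session;

    import org.apache.jackrabbit.commons.JcrUtils;
    import org.apache.sling.settings.SlingSettingsService;

    public class PausingDeployer {

        private static final String ROOT =
                "/system/sling/installer/jcr/pauseInstallation";

        public void deploy(Session session, SlingSettingsService settings)
                throws RepositoryException {
            final String markerPath = ROOT
                    + "/" + settings.getSlingId()    // cluster node that blocks
                    + "/" + getClass().getName()     // service (PID) that blocks
                    + "/" + UUID.randomUUID();       // one child node per block
            // sling:Folder extends nt:folder, so jcr:created is set for us
            final Node marker =
                    JcrUtils.getOrCreateByPath(markerPath, "sling:Folder", session);
            session.save();                          // installer is now blocked
            try {
                // ... install bundles/configs/content ...
            } finally {
                // remove this block and eagerly delete empty parent nodes
                Node parent = marker.getParent();
                marker.remove();
                while (!parent.getPath().equals(ROOT) && !parent.hasNodes()) {
                    final Node grandParent = parent.getParent();
                    parent.remove();
                    parent = grandParent;
                }
                session.save();                      // installer may resume
            }
        }
    }

The finally block of course only covers the normal case - the edge
cases you list (repository stopped, cluster node gone) are exactly
where this would need the recovery you describe.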

This might work, however it sounds a little bit complex to me. Now,
obviously, the easier solution is that the content package which
installs the bundles in the first place is itself installed through
the OSGi installer - as the OSGi installer is single-threaded, bundles
installed through content packages would be installed after all
content is installed. And no pausing would be needed either. So, I
think pausing, trying to recover, etc. is a self-inflicted problem
which could be avoided if the root cause were solved.

Carsten

> Regards
> Julian
>
>
> On Sun, Jan 31, 2016 at 10:34 AM, Carsten Ziegeler <[email protected]>
> wrote:
>> Julian Sedding wrote
>>> Hi Carsten
>>>
>>>> Offline discussions don't make it transparent why you came to this
>>>> conclusion.
>>>> Please enclose the relevant information either here or in the issue.
>>>
>>> Sure, I thought that I included some information both in the email and
>>> in the issue. But it is probably worthwhile to expand on it a bit
>>> more.
>>>
>>> The current implementation is based on a convention rather than a
>>> contract: place a node under a specific parent node and the JCR
>>> installer will pause its activities.
>>>
>>> It turns out that this convention in this simple form has limitations
>>> when things go wrong:
>>>
>>> - If a "deployer" implementation fails to delete the pause-marker node
>>> (no matter what the reasons are), whose responsibility is it to delete
>>> this node to recover the system?
>>> - If a "deployer" on cluster node A creates a pause-marker node and
>>> then cluster node A is shut down/crashes/disconnects, whose
>>> responsibility is it to delete this node to recover the system?
>>>
>>> Both these questions require a more sophisticated convention IMHO.
>>> This becomes a burden for implementers, makes fixing the convention
>>> nearly impossible (in case we miss an edge case) and is brittle,
>>> because "deployers" may have bugs in their implementations.
>>>
>>> So the logical conclusion is to move the implementation of this
>>> "convention" into Sling and expose it via a simple API.
>>>
>>> The convention basically becomes an implementation detail, which is
>>> needed to distribute the information about blocking the installer
>>> within a cluster.
>>>
>>> Does this answer your questions?
>>>
>>
>> Thanks, yes, however :) I can't follow the conclusion. How is having an
>> API which for example has a pause/resume method to be called
>> different/easier/avoiding the problems than adding/removing a node?
>>
>> Carsten
>>
>>
>>
>> --
>> Carsten Ziegeler
>> Adobe Research Switzerland
>> [email protected]
>
--
Carsten Ziegeler
Adobe Research Switzerland
[email protected]
