Julian Sedding wrote
> Hi Carsten
>
> There are two things to consider:
>
> (1) moving the implementation of the repository based signalling into
> the installer (essentially encapsulation)

When you say "installer", do you mean "jcr installer" or "osgi
installer"? From below I assume the jcr installer.
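
And just to make sure we mean the same thing by "encapsulation" in (1):
I picture roughly an interface like the sketch below (purely
illustrative - none of these names exist in Sling today):

    // Purely a sketch - neither this interface nor these names exist in
    // Sling; it only illustrates what "encapsulation" of the repository
    // based signalling could look like from a deployer's point of view.
    public interface InstallerPauseSupport {

        /**
         * Asks the installer to pause processing and returns a token for
         * this block. Marker nodes, cluster distribution and crash
         * recovery would all become implementation details behind this.
         */
        PauseToken pause();

        /** Handle for one block; closing it resumes installation. */
        interface PauseToken extends AutoCloseable {
            @Override
            void close();
        }
    }

A deployer could then simply use try-with-resources around the token
and would never need to know about marker nodes.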

> (2) implementation of a robust protocol for signalling a block to
> installers on other cluster nodes
>
> So far I have talked about (1) but didn't go into the details of (2).
> What I have in mind for (2) is a content structure that records three
> pieces of information:
>
> - Sling ID in order to identify on which cluster node the block was triggered
> - Service PID (i.e. the fully qualified class name of the
> implementation) in order to know which service triggered the block
> - Creation timestamp for information/debugging
>
> The content structure would look like the following:
>
> /system/sling/installer/jcr/pauseInstallation
>   <sling-id>/
>     <service-pid>/
>       <random-uuid>/
>         jcr:created = <datetime>
>
> It is important that, as a general rule, any node without children is
> eagerly deleted. This means that the installer is blocked if
> "pauseInstallation" has at least one child node and unblocked if it
> has none (or does not exist itself).
>
> The structure would allow a single service to hold multiple blocks
> (each <random-uuid> node representing one).
>
> Normally we would assume that a service blocks the installer and later
> unblocks it again, ideally using a try/finally block. However, it gets
> interesting when edge cases are considered:
> - the repository service may get stopped (or restarted), in which case
> the unblock can fail
> - a cluster node can be killed or disconnected before the unblock can be done
> - I have seen log files where the "blocking" service was restarted
> while it blocked the installer, because the installer was
> asynchronously processing a batch from a previous block. However, it
> is unclear why the unblock did not happen in this case: there were no
> exceptions in the log and I don't believe they were swallowed, because
> when I provoked a similar scenario exceptions were written to the
> log.
>
> To recover from such failure scenarios, the installer needs to be unblocked:
> - if a blocking service is stopped. A stopped service may still exist
> in the JVM and finish execution, therefore this could be solved using
> weak references to a block-token and a reference queue, or
> alternatively by using a timeout in such cases.
> - if a cluster node disappears from a topology, its <sling-id> node
> should be removed after a timeout
>
> There is a danger, however, that unblocking the installer due to
> recovery causes a partial deployment to be installed. This may put the
> system into an unusable state (e.g. bundles may not be resolvable
> because their dependencies were not updated/installed). I don't know
> how we could address this.
>
> Maybe an entirely different approach would be to provide a list of
> deployables (e.g. repository paths?) to the installer, which then only
> installs the deployables if all are available (ignoring deployables
> with extensions it does not handle). This list would need to be
> communicated in a cluster as well, however.

Thanks Julian, now I understand your idea.
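
If I read the proposal correctly, a "deployer" would block and unblock
the installer roughly like the sketch below (only an illustration to
check my understanding - the class, the node type and the
SlingSettingsService lookup are my assumptions, not an agreed API):

    // Sketch only: create and remove a pause marker following the
    // structure proposed above. Node type and error handling simplified.
    import java.util.UUID;

    import javax.jcr.Node;
    import javax.jcr.RepositoryException;
    import javax.jcr.Session;

    import org.apache.jackrabbit.commons.JcrUtils;
    import org.apache.sling.settings.SlingSettingsService;

    public class PausingDeployer {

        private static final String ROOT =
                "/system/sling/installer/jcr/pauseInstallation";

        public void deploy(Session session, SlingSettingsService settings)
                throws RepositoryException {
            final String markerPath = ROOT
                    + "/" + settings.getSlingId()    // cluster node that blocks
                    + "/" + getClass().getName()     // service (PID) that blocks
                    + "/" + UUID.randomUUID();       // one child node per block
            // sling:Folder extends nt:folder, so jcr:created is set for us
            final Node marker =
                    JcrUtils.getOrCreateByPath(markerPath, "sling:Folder", session);
            session.save();                          // installer is now blocked
            try {
                // ... install bundles/configs/content ...
            } finally {
                // remove this block and eagerly delete empty parent nodes
                Node parent = marker.getParent();
                marker.remove();
                while (!parent.getPath().equals(ROOT) && !parent.hasNodes()) {
                    final Node grandParent = parent.getParent();
                    parent.remove();
                    parent = grandParent;
                }
                session.save();                      // installer may resume
            }
        }
    }

The finally block of course only covers the normal case - the edge
cases you list (repository stopped, cluster node gone) are exactly
where this would need the recovery you describe.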

This might work, however it sounds a little bit complex to me. Now,
obviously, the easier solution is that the content package which
installs the bundles in the first place is itself installed through
the OSGi installer - as the OSGi installer is single-threaded, bundles
installed through content packages would be installed after all
content is installed. And no pausing would be needed either. So, I
think pausing, trying to recover, etc. is a self-inflicted problem
which could be avoided if the root cause were solved.

Carsten

> Regards
> Julian
>
>
> On Sun, Jan 31, 2016 at 10:34 AM, Carsten Ziegeler <[email protected]>
> wrote:
>> Julian Sedding wrote
>>> Hi Carsten
>>>
>>>> Offline discussions don't make it transparent why you came to this
>>>> conclusion.
>>>> Please enclose the relevant information either here or in the issue.
>>>
>>> Sure, I thought that I included some information both in the email and
>>> in the issue. But it is probably worthwhile to expand on it a bit
>>> more.
>>>
>>> The current implementation is based on a convention rather than a
>>> contract: place a node under a specific parent node and the JCR
>>> installer will pause its activities.
>>>
>>> It turns out that this convention in this simple form has limitations
>>> when things go wrong:
>>>
>>> - If a "deployer" implementation fails to delete the pause-marker node
>>> (no matter what the reasons are), whose responsibility is it to delete
>>> this node to recover the system?
>>> - If a "deployer" on cluster node A creates a pause-marker node and
>>> then cluster node A is shut down/crashes/disconnects, whose
>>> responsibility is it to delete this node to recover the system?
>>>
>>> Both these questions require a more sophisticated convention IMHO.
>>> This becomes a burden for implementers, makes fixing the convention
>>> nearly impossible (in case we miss an edge case) and is brittle,
>>> because "deployers" may have bugs in their implementations.
>>>
>>> So the logical conclusion is to move the implementation of this
>>> "convention" into Sling and expose it via a simple API.
>>>
>>> The convention basically becomes an implementation detail, which is
>>> needed to distribute the information about blocking the installer
>>> within a cluster.
>>>
>>> Does this answer your questions?
>>>
>>
>> Thanks, yes, however :) I can't follow the conclusion. How is having an
>> API which for example has a pause/resume method to be called
>> different/easier/avoiding the problems than adding/removing a node?
>>
>> Carsten
>>
>>
>>
>> --
>> Carsten Ziegeler
>> Adobe Research Switzerland
>> [email protected]
>
--
Carsten Ziegeler
Adobe Research Switzerland
[email protected]
