My preference is (b), even though I think stopping the NodeStore service should be sufficient (it may not currently be sufficient, I don't know).
In particular, I believe that "trying harder" is detrimental to the overall stability of a cluster/topology. We are dealing with a possibly faulty instance, so who can decide that it is OK again after trying harder? The faulty instance itself? "Read-only" doesn't sound too useful either, because it may fool clients into thinking they are dealing with a "healthy" instance for longer than necessary, and thus can lead to bigger issues downstream. I believe that "fail early and fail often" is the path to a stable cluster.

Regards,
Julian

On Thu, Sep 10, 2015 at 6:43 PM, Stefan Egli <[email protected]> wrote:
> On 09/09/15 18:11, "Stefan Egli" <[email protected]> wrote:
>
>> On 09/09/15 18:01, "Stefan Egli" <[email protected]> wrote:
>>
>>> I think if the observers were all 'OSGi-ified' then this could be
>>> achieved. But currently e.g. the BackgroundObserver is just a POJO and
>>> not an OSGi component (thus doesn't support any activate/deactivate
>>> method hooks).
>>
>> .. I take that back - going via OsgiWhiteboard should work as desired -
>> so perhaps implementing deactivate/activate methods in the
>> (Background)Observer(s) would do the trick .. I'll give it a try ..
>
> Out of the box this won't work, as the BackgroundObserver, as one example,
> is not an OSGi component, so it won't get any deactivate/activate calls at
> the moment. To achieve this, it would have to be properly OSGi-ified -
> something which sounds like a bigger task and is not limited to this one
> class - which means making DocumentNodeStore 'restart capable' sounds like
> a bigger task too, and the question is indeed whether it is worthwhile
> ('will it work?') or whether there are alternatives..
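To illustrate what "OSGi-ifying" an observer would mean, here is a minimal sketch of a component with explicit activate/deactivate lifecycle hooks. The class and method names are hypothetical, not actual Oak code; in a real OSGi Declarative Services component, activate() and deactivate() would be annotated with @Activate and @Deactivate and invoked by the container.

```java
// Hypothetical sketch: a lifecycle-aware observer that drops events once
// deactivated. In a real DS component the container would call these hooks.
class LifecycleAwareObserver {
    // volatile so the flag is visible to the observation thread
    private volatile boolean active = false;

    // Would carry @Activate in an OSGi Declarative Services component.
    public void activate() {
        active = true;
    }

    // Would carry @Deactivate; after this, background processing must stop.
    public void deactivate() {
        active = false;
    }

    // Observation callback: returns true if the change was processed,
    // false if it was dropped because the component is not active.
    public boolean contentChanged(String changeInfo) {
        if (!active) {
            return false; // component shut down (or not yet started): drop
        }
        // ... process the change here ...
        return true;
    }
}
```

The point of the sketch is that without such hooks (the current POJO situation described above), there is no place for the framework to tell the observer to stop working, which is why a restart-capable DocumentNodeStore would require OSGi-ifying more than one class.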
>
> Which brings me back to the original question as to what should be done
> in case of a lease failure - to recap, the options left (if System.exit
> is not one of them) are:
>
> a) 'go read-only': prevent writes by throwing exceptions from this
> moment until eternity
>
> b) 'stop oak': stop the oak-core bundle (and prevent writes by throwing
> exceptions for those still reaching out for the nodeStore)
>
> c) 'try harder': try to reactivate the lease - continue allowing writes -
> and make sure the next backgroundWrite has correctly updated
> 'unsavedLastRevisions' (because others could have done a recovery of this
> node, so unsavedLastRevisions contains superfluous entries that must no
> longer be written). This would open the door for edge cases ('changes over
> a longer time window with multiple leaders') but is perhaps not entirely
> impossible...
>
> Additionally/independently:
>
> * In all cases the discovery-lite descriptor should expose this lease
> failure/partitioning situation - so that anyone who would like to can
> react, and especially so that no one continues to assume that the local
> instance is the leader or part of the cluster - and to support that
> optional Sling Health Check which still does a System.exit :)
>
> * Also, we should probably increase the lease thread's priority to reduce
> the likelihood of the lease timing out (the same would be true for
> discovery.impl's heartbeat thread).
>
> * Plus, increasing the default lease time from 1 min to perhaps 5 min
> would also dramatically reduce the number of cases that hit problems.
>
> wdyt?
>
> Cheers,
> Stefan
>
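The fail-fast behaviour common to options (a) and (b) above can be sketched as follows. This is a hypothetical illustration, not Oak's actual implementation: once the lease is marked as lost, every subsequent write attempt throws, so callers cannot silently keep modifying a store whose cluster membership is no longer guaranteed.

```java
// Hypothetical sketch of fail-fast writes after a lease failure.
// Class and method names are illustrative only.
class LeaseGuardedStore {
    // volatile: the lease-check background thread flips this flag,
    // writer threads read it
    private volatile boolean leaseValid = true;

    // Called by the (hypothetical) lease-monitoring thread on timeout.
    public void leaseFailed() {
        leaseValid = false;
    }

    // Every write path funnels through this check.
    private void checkLeaseForWrite() {
        if (!leaseValid) {
            throw new IllegalStateException(
                "lease failure detected: writes are no longer permitted");
        }
    }

    public void write(String path) {
        checkLeaseForWrite();
        // ... perform the actual write ...
    }
}
```

Under option (a) reads would continue to be served by such a store; under option (b) the whole bundle is stopped and the guard only covers stragglers that still hold a reference to the nodeStore. Option (c) would instead reset the flag after reacquiring the lease, which is exactly where the multiple-leader edge cases mentioned above come in.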
