My vote would also be (b) for the short term. If we figure out a way to properly restart the NodeStore (c), we can still come back to that at a later time.
Hence I've created https://issues.apache.org/jira/browse/OAK-3397 and, unless the list vetoes, I'll follow up on that next.

Cheers,
Stefan

On 11/09/15 11:38, "Julian Sedding" <[email protected]> wrote:

>My preference is (b), even though I think stopping the NodeStore
>service should be sufficient (it may not currently be sufficient, I
>don't know).
>
>In particular, I believe that "trying harder" is detrimental to the
>overall stability of a cluster/topology. We are dealing with a
>possibly faulty instance, so who can decide that it is ok again after
>trying harder? The faulty instance itself?
>
>"Read-only" doesn't sound too useful either, because it may fool
>clients into thinking they are dealing with a "healthy" instance for
>longer than necessary and can thus lead to bigger issues downstream.
>
>I believe that "fail early and fail often" is the path to a stable
>cluster.
>
>Regards
>Julian
>
>On Thu, Sep 10, 2015 at 6:43 PM, Stefan Egli <[email protected]> wrote:
>> On 09/09/15 18:11, "Stefan Egli" <[email protected]> wrote:
>>
>>>On 09/09/15 18:01, "Stefan Egli" <[email protected]> wrote:
>>>
>>>>I think if the observers were all 'OSGi-ified' then this could be
>>>>achieved. But currently e.g. the BackgroundObserver is just a pojo and
>>>>not an OSGi component (thus it doesn't support any activate/deactivate
>>>>method hooks).
>>>
>>>.. I take that back - going via OsgiWhiteboard should work as desired -
>>>so perhaps implementing deactivate/activate methods in the
>>>(Background)Observer(s) would do the trick .. I'll give it a try ..
>>
>> ootb this won't work, as the BackgroundObserver, as one example, is not
>> an OSGi component, so it won't get any deactivate/activate calls atm.
>> To achieve this it would have to be properly OSGi-ified - something
>> that sounds like a bigger task and is not limited to this one class -
>> which means making the DocumentNodeStore 'restart capable' sounds like
>> a bigger task too, and the question is indeed whether it is worthwhile
>> ('will it work?') or whether there are alternatives..
>>
>> Which brings me back to the original question of what should be done in
>> case of a lease failure. To recap, the options left (if System.exit is
>> not one of them) are:
>>
>> a) 'go read-only': prevent writes by throwing exceptions from this
>> moment until eternity
>>
>> b) 'stop oak': stop the oak-core bundle (and prevent writes by throwing
>> exceptions for those still reaching out for the nodeStore)
>>
>> c) 'try harder': try to reactivate the lease, continue allowing writes,
>> and make sure the next backgroundWrite has correctly updated
>> 'unsavedLastRevisions' (because others could have done a recovery of
>> this node, so unsavedLastRevisions contains superfluous entries that
>> must no longer be written). This would open the door to edge cases
>> ('changes over a longer time window with multiple leaders'), but
>> perhaps it is not entirely impossible...
>>
>> Additionally/independently:
>>
>> * In all cases the discovery-lite descriptor should expose this lease
>> failure/partitioning situation, so that anyone who wants to can react -
>> in particular, nobody should keep assuming that the local instance is
>> the leader or even part of the cluster - and so that the optional Sling
>> Health Check (which still does a System.exit :) ) is supported.
>>
>> * Also, we should probably increase the lease thread's priority to
>> reduce the likelihood of the lease timing out (the same would be true
>> for discovery.impl's heartbeat thread).
>>
>> * Plus, increasing the default lease time from 1min to perhaps 5min
>> would also dramatically reduce the number of cases that hit problems.
>>
>> wdyt?
>>
>> Cheers,
>> Stefan
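To make the activate/deactivate idea discussed in the thread concrete, here is a plain-Java stand-in for what lifecycle hooks on an observer could look like. This is a hypothetical sketch, not Oak's actual BackgroundObserver: in a real OSGi Declarative Services component the hooks would be methods annotated with @Activate/@Deactivate, invoked by the container.

```java
// Hypothetical sketch: an observer with OSGi-style lifecycle hooks.
// Plain-Java stand-in; names and structure are illustrative only.
class RestartableObserver {
    private volatile boolean active;

    void activate() {        // would be the @Activate hook in a DS component
        active = true;
    }

    void deactivate() {      // would be the @Deactivate hook in a DS component
        active = false;
    }

    /** Drop incoming change events while the component is deactivated. */
    boolean contentChanged(String change) {
        if (!active) {
            return false;    // event ignored while deactivated
        }
        // ... dispatch the change to downstream listeners ...
        return true;
    }
}
```

The point of the sketch is merely that a restart-capable observer needs a well-defined "stopped" state; wiring that into every observer is what makes the task bigger than one class.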
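Options (a) and (b) above share the same core mechanism: once the lease is lost, a flag flips and every subsequent write fails fast. A minimal sketch of that behavior follows; the class and method names are made up for illustration and are not Oak's actual API.

```java
// Hypothetical sketch of the write-blocking behavior shared by
// options (a) and (b). Names are illustrative, not Oak API.
class LeaseGuard {
    private volatile boolean leaseValid = true;

    /** Called by the background lease-update thread when the lease times out. */
    void leaseFailed() {
        leaseValid = false;
    }

    /** Invoked at the start of every write operation. */
    void checkWriteAllowed() {
        if (!leaseValid) {
            // from this moment on, all writes fail fast
            throw new IllegalStateException("lease failure: writes are blocked");
        }
    }
}
```

Under option (a) the guard stays in place "until eternity"; under option (b) stopping the bundle additionally tears the service down, and the exception only covers clients still holding a reference to the nodeStore.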