On 09/09/15 18:11, "Stefan Egli" <[email protected]> wrote:
>On 09/09/15 18:01, "Stefan Egli" <[email protected]> wrote:
>
>>I think if the observers would all be 'OSGi-ified' then this could be
>>achieved. But currently eg the BackgroundObserver is just a pojo and not
>>an osgi component (thus doesn't support any activate/deactivate method
>>hooks).
>
>.. I take that back - going via OsgiWhiteboard should work as desired - so
>perhaps implementing deactivate/activate methods in the
>(Background)Observer(s) would do the trick .. I'll give it a try ..

ootb this won't work: the BackgroundObserver, as one example, is not an OSGi component, so it won't get any deactivate/activate calls atm. To achieve this it would have to be properly OSGi-ified - something which sounds like a bigger task and is not limited to this one class. That means making DocumentNodeStore 'restart capable' sounds like a bigger task too, and the question is indeed whether it is worthwhile ('will it work?') or whether there are alternatives.

Which brings me back to the original question as to what should be done in case of a lease failure. To recap, the options left (if System.exit is not one of them) are:

a) 'go read-only': prevent writes by throwing exceptions from this moment until eternity

b) 'stop oak': stop the oak-core bundle (prevent writes by throwing exceptions for those still reaching out for the nodeStore)

c) 'try harder': try to reactivate the lease, continue allowing writes, and make sure the next backgroundWrite has correctly updated the 'unsavedLastRevisions' (because others could have done a recovery of this node, so unsavedLastRevisions contains superfluous stuff that must no longer be written). This would open the door for edge cases ('chance of a longer time window with multiple leaders') but is perhaps not entirely impossible...

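To make option a) a bit more concrete, here is a rough sketch of what 'go read-only' could look like. Everything in it is an assumption for illustration only: LeaseGuard, renew(), checkWritable() and the injected clock are made-up names, not existing Oak classes or APIs, and a real implementation would have to hook into the actual lease update and write paths.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.LongSupplier;

/** Hypothetical guard (not an Oak class): once the lease is missed, writes fail forever. */
final class LeaseGuard {
    private final long leaseDurationMillis;
    private final LongSupplier clock;          // injected so the sketch can be tested
    private volatile long leaseEndMillis;
    private final AtomicBoolean failed = new AtomicBoolean(false);

    LeaseGuard(long leaseDurationMillis, LongSupplier clock) {
        this.leaseDurationMillis = leaseDurationMillis;
        this.clock = clock;
        this.leaseEndMillis = clock.getAsLong() + leaseDurationMillis;
    }

    /** Called by the (hypothetical) lease thread; returns false if the renewal came too late. */
    boolean renew() {
        long now = clock.getAsLong();
        if (failed.get() || now >= leaseEndMillis) {
            failed.set(true);                  // sticky: no recovery in option a)
            return false;
        }
        leaseEndMillis = now + leaseDurationMillis;
        return true;
    }

    /** Called before every write; throws once the lease has been lost. */
    void checkWritable() {
        if (failed.get() || clock.getAsLong() >= leaseEndMillis) {
            failed.set(true);
            throw new IllegalStateException("lease expired: instance is read-only");
        }
    }

    boolean isFailed() {
        return failed.get();
    }
}
```

The failed flag is deliberately sticky: even if the clock check would pass again later, the guard stays read-only, which matches 'until eternity'. Option c) would instead need a path that clears the flag after a successful reactivation and a cleanup of unsavedLastRevisions.
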
additionally/independently:

* in all cases the discovery-lite descriptor should expose this lease failure/partitioning situation, so that anyone who would like to can react. In particular, nobody should then assume any longer that the local instance is leader or part of the cluster. This would also support that optional Sling Health Check which still does a System.exit :)

* also, we should probably increase the lease thread's priority to reduce the likelihood of the lease timing out (the same would be true for discovery.impl's heartbeat thread)

* plus, increasing the default lease time from 1min to perhaps 5min would also dramatically reduce the number of cases that hit problems

wdyt?

Cheers,
Stefan
