On 09/09/15 18:11, "Stefan Egli" <[email protected]> wrote:

>On 09/09/15 18:01, "Stefan Egli" <[email protected]> wrote:
>
>>I think if the observers would all be 'OSGi-ified' then this could be
>>achieved. But currently eg the BackgroundObserver is just a pojo and not
>>an osgi component (thus doesn't support any activate/deactivate method
>>hooks).
>
>.. I take that back - going via OsgiWhiteboard should work as desired - so
>perhaps implementing deactivate/activate methods in the
>(Background)Observer(s) would do the trick .. I'll give it a try ..

ootb this won't work, as the BackgroundObserver, as one example, is not an
OSGi component and so won't get any deactivate/activate calls atm. to
achieve this, it would have to be properly OSGi-ified - something which
sounds like a bigger task and is not limited to this one class - which
means making DocumentNodeStore 'restart capable' sounds like a bigger task
too, and the question is indeed whether it is worthwhile ('will it work?')
or whether there are alternatives..

which brings me back to the original question as to what should be done in
case of a lease failure - to recap the options left (if System.exit is not
one of them) are:

a) 'go read-only': prevent writes by throwing exceptions from this moment
until eternity

b) 'stop oak': stop the oak-core bundle (prevent writes by throwing
exceptions for those still reaching out for the nodeStore)

c) 'try harder': try to reactivate the lease - continue allowing writes -
and make sure the next backgroundWrite has correctly updated the
'unsavedLastRevisions' (because others could have done a recovery of this
node, so unsavedLastRevisions contains superfluous stuff that must no longer
be written). this would open the door for edge cases ('chance of a longer
time window with multiple leaders') but perhaps is not entirely impossible...
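
to make option a) a bit more concrete, here is a minimal sketch of what
'go read-only' could look like. all names in it (LeaseGuard,
onLeaseFailure, checkWritable) are made up for illustration and are not
existing Oak APIs; it only assumes that the lease-update code can call a
hook once it notices the failure and that every write path can call a
check first:

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch of option a); none of these names exist in Oak. */
public class LeaseGuard {

    // Set once when the lease failure is detected and never reset
    // ("until eternity"), so the instance stays read-only.
    private final AtomicBoolean leaseFailed = new AtomicBoolean(false);

    /** Assumed hook: called by the lease-update code when the lease is lost. */
    public void onLeaseFailure() {
        leaseFailed.set(true);
    }

    /** Assumed guard: every write path calls this before doing any work. */
    public void checkWritable() {
        if (leaseFailed.get()) {
            throw new IllegalStateException(
                    "lease failed - this instance no longer accepts writes");
        }
    }

    /** Lets other parts (eg a descriptor or health check) read the state. */
    public boolean isLeaseFailed() {
        return leaseFailed.get();
    }
}
```

option b) would presumably look the same from a caller's point of view
(writes fail with an exception), while c) would need a way to reset the
flag plus the unsavedLastRevisions cleanup, which this sketch does not
attempt.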

additionally/independently:

* in all cases the discovery-lite descriptor should expose this lease
failure/partitioning situation - so that anyone who would like to can
react; in particular, nobody should then still assume that the local
instance is leader or part of the cluster - and to support that optional
Sling Health Check which still does a System.exit :)

* also, we should probably increase the lease thread's priority to reduce
the likelihood of the lease timing out (same would be true for
discovery.impl's heartbeat thread)


* plus increasing the lease time from 1min to perhaps 5min as the default
would also reduce the number of cases that hit problems dramatically
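
the last two points could be sketched roughly as below. the class,
constant and thread names are hypothetical and not taken from the Oak
code; only Thread.setPriority and TimeUnit are real JDK calls. note that
thread priority is just a hint to the scheduler, so this lowers the
likelihood of a missed lease update but cannot rule it out:

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch; names are illustrative, not Oak's. */
public class LeaseThreadConfig {

    // Proposed default: 5 minutes instead of the current 1 minute.
    static final long LEASE_DURATION_MS = TimeUnit.MINUTES.toMillis(5);

    /** Creates the thread that would run the periodic lease update. */
    static Thread newLeaseThread(Runnable leaseUpdate) {
        Thread t = new Thread(leaseUpdate, "lease-update-sketch");
        t.setDaemon(true);
        // Ask for the highest priority; the JVM/OS may or may not honor it.
        t.setPriority(Thread.MAX_PRIORITY);
        return t;
    }
}
```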

wdyt?

Cheers,
Stefan

