My vote would also be (b) for the short term. If we figure out a way to
properly restart the NodeStore, we can still come back to (c) at a later
time.

Hence I've created https://issues.apache.org/jira/browse/OAK-3397 and,
unless the list vetoes it, I'll follow up on that next.

Cheers,
Stefan

On 11/09/15 11:38, "Julian Sedding" <[email protected]> wrote:

>My preference is (b), even though I think stopping the NodeStore
>service should be sufficient (it may not currently be sufficient, I
>don't know).
>
>Particularly, I believe that "trying harder" is detrimental to the
>overall stability of a cluster/topology. We are dealing with a
>possibly faulty instance, so who can decide that it is ok again after
>trying harder? The faulty instance itself?
>
>"Read-only" doesn't sound too useful either, because that may fool
>clients into thinking they are dealing with a "healthy" instance for
>longer than necessary and thus can lead to bigger issues downstream.
>
>I believe that "fail early and fail often" is the path to a stable
>cluster.
>
>Regards
>Julian
>
>On Thu, Sep 10, 2015 at 6:43 PM, Stefan Egli <[email protected]>
>wrote:
>> On 09/09/15 18:11, "Stefan Egli" <[email protected]> wrote:
>>
>>>On 09/09/15 18:01, "Stefan Egli" <[email protected]> wrote:
>>>
>>>>I think if the observers were all 'OSGi-ified' then this could be
>>>>achieved. But currently, e.g., the BackgroundObserver is just a POJO
>>>>and not an OSGi component (thus it doesn't support any
>>>>activate/deactivate method hooks).
>>>
>>>... I take that back: going via OsgiWhiteboard should work as desired,
>>>so perhaps implementing deactivate/activate methods in the
>>>(Background)Observer(s) would do the trick. I'll give it a try ...
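[For illustration: a minimal plain-Java sketch of the activate/deactivate
lifecycle hooks discussed above. No OSGi dependencies are used; the class
and method names are hypothetical stand-ins, not Oak or OSGi API. In a
real Declarative Services component, activate/deactivate would be the
methods invoked by the DS runtime.]

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of a BackgroundObserver-style component that
// supports the activate/deactivate lifecycle it currently lacks.
class RestartableObserver {
    private final AtomicBoolean active = new AtomicBoolean(false);

    // would be the @Activate hook in a real DS component
    void activate() {
        active.set(true);
    }

    // would be the @Deactivate hook in a real DS component
    void deactivate() {
        active.set(false);
    }

    // dispatch of content changes would be guarded by this flag
    boolean isActive() {
        return active.get();
    }
}
```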
>>
>> Out of the box this won't work: the BackgroundObserver, as one example,
>> is not an OSGi component, so it won't get any deactivate/activate calls
>> at the moment. To achieve this it would have to be properly OSGi-ified,
>> which sounds like a bigger task and is not limited to this one class.
>> That means making the DocumentNodeStore 'restart capable' sounds like a
>> bigger task too, and the question is indeed whether it is worthwhile
>> ('will it work?') or whether there are alternatives...
>>
>> Which brings me back to the original question of what should be done
>> in case of a lease failure. To recap, the remaining options (if
>> System.exit is not one of them) are:
>>
>> a) 'go read-only': prevent writes by throwing exceptions from this
>> moment until eternity
>>
>> b) 'stop oak': stop the oak-core bundle (prevent writes by throwing
>> exceptions for those still reaching out for the nodeStore)
>>
>> c) 'try harder': try to reactivate the lease, continue allowing writes,
>> and make sure the next backgroundWrite has correctly updated the
>> 'unsavedLastRevisions' (because others could have done a recovery of
>> this node, so unsavedLastRevisions contains superfluous entries that
>> must no longer be written). This would open the door for edge cases
>> ('a longer time window with multiple leaders') but is perhaps not
>> entirely impossible...
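[For illustration: a hypothetical sketch of the fail-fast behavior shared
by options (a) and (b): once the lease is lost, every subsequent write
attempt fails with an exception. Class and method names are invented for
this sketch and are not Oak API.]

```java
// Sketch: once onLeaseFailure() fires, writes are blocked permanently,
// as in options (a) 'go read-only' and (b) 'stop oak'.
class LeaseGuardedStore {
    private volatile boolean leaseValid = true;

    void onLeaseFailure() {
        leaseValid = false; // flipped once; never reset in (a)/(b)
    }

    void write(String path) {
        if (!leaseValid) {
            throw new IllegalStateException(
                "lease failed: writes are blocked on this instance");
        }
        // ... the actual write would happen here ...
    }
}
```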
>>
>> additionally/independently:
>>
>> * in all cases the discovery-lite descriptor should expose this lease
>> failure/partitioning situation, so that anyone who would like to can
>> react; especially, no one should any longer assume that the local
>> instance is the leader or part of the cluster. It would also support
>> that optional Sling Health Check which still does a System.exit :)
>>
>> * also, we should probably increase the lease thread's priority to
>> reduce the likelihood of the lease timing out (the same would be true
>> for discovery.impl's heartbeat thread)
>>
>>
>> * plus, increasing the default lease time from 1 min to perhaps 5 min
>> would also dramatically reduce the number of cases that hit problems
>>
>> wdyt?
>>
>> Cheers,
>> Stefan
>>
>>

