Whatever the long-term solution turns out to be, we need a short-term solution that doesn't kill an entire application server, so +1.
On 09.09.15 14:12, Stefan Egli wrote:
> Hi all,
>
> I'd like to follow up on the idea to restart DocumentNodeStore as a result
> of a lease failure [0]: I suggest we don't do that and instead just stop
> the oak-core bundle.
>
> After some prototyping and running into OAK-3373 [1] I'm no longer sure if
> restarting the DocumentNodeStore is a feasible path to go, esp in the
> short term. The problem encountered so far is that Observers cannot be
> easily switched from old to (restarted/)new store due to:
>
> * as pointed out by MichaelD they could have a backlog yet to process
> towards the old store - which they cannot access anymore as that one would
> be forcibly closed
> * there is not yet a proper way to switch from old to new ('reset') - esp
> is there a risk that there could be a gap (this part we might be able to
> fix though, not sure)
> * both above carry the risk that Observers miss some changes - something
> which would be unacceptable I guess.
>
> I think the more kiss approach would be to just forcibly close the
> DocumentNodeStore - or actually to stop the entire oak-core bundle - with
> appropriate errors logged so that the issue becomes clear. The instance
> would basically become unusable, mostly, but at least it would not be a
> System.exit.
>
> What do ppl think?
>
> Cheers,
> Stefan
> --
> [0] https://issues.apache.org/jira/browse/OAK-3250
> [1] https://issues.apache.org/jira/browse/OAK-3373
>
> On 18/08/15 16:45, "Stefan Egli" <e...@adobe.com> wrote:
>
>> I've created OAK-3250 to follow up on the DocumentNodeStore-restart idea.
>>
>> Cheers,
>> Stefan
>> --
>> https://issues.apache.org/jira/browse/OAK-3250
>>
>> On 18/08/15 15:59, "Marcel Reutegger" <mreut...@adobe.com> wrote:
>>
>>> On 18/08/15 15:38, "Stefan Egli" wrote:
>>>> On 18/08/15 13:43, "Marcel Reutegger" <mreut...@adobe.com> wrote:
>>>>> On 18/08/15 11:14, "Stefan Egli" wrote:
>>>>>> b) Oak does not do the System.exit but refuses to update anything towards
>>>>>> the document store (thus just throws exceptions on each invocation) - and
>>>>>> upper level code detects this situation (eg a Sling Health Check) and
>>>>>> would do a System.exit based on how it is configured
>>>>>>
>>>>>> c) same as b) but upper level code does not do a System.exit (I'm not sure
>>>>>> if that makes sense - the instance is useless in such a situation)
>>>>> either b) or c) sounds reasonable to me.
>>>>>
>>>>> but if possible I'd like to avoid a System.exit(). would it be possible
>>>>> to detect this situation in the DocumentNodeStoreService and restart
>>>>> the DocumentNodeStore without the need to restart the JVM
>>>> Good point. Perhaps restarting DocumentNodeStore is a valid alternative
>>>> indeed. Is that feasible from a DocumentNodeStore point of view?
>>> it probably requires some changes to the DocumentNodeStore, because
>>> we want it to tear down without doing any of the cleanup it
>>> may otherwise perform. it must not release the cluster node info
>>> nor update pending _lastRevs, etc.
>>>
>>>> What would be the consequences of a restarted DocumentNodeStore?
>>> to the DocumentNodeStore it will look like it was killed and it will
>>> perform recovery (e.g. for the pending _lastRevs).
>>>
>>> Regards
>>> Marcel
>>>
>
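
To make the "refuse updates instead of System.exit" behaviour from option b) in the quoted thread a bit more concrete, here is a minimal sketch in plain Java. It is not Oak code: LeaseGuard, checkLease(), renew() and the onLeaseFailure callback are hypothetical names made up for illustration. The assumption is that every write path would call checkLease() first, and that the callback is where the errors get logged and where something like stopping the oak-core bundle could be triggered.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.LongSupplier;

/**
 * Hypothetical sketch, not an Oak API: once the lease is seen to have
 * expired, run a failure handler exactly once and reject every further
 * update with an exception instead of calling System.exit().
 */
final class LeaseGuard {
    private final LongSupplier clock;        // current time in millis (injectable)
    private final Runnable onLeaseFailure;   // assumed hook: log errors, stop the bundle, ...
    private final AtomicBoolean failed = new AtomicBoolean(false);
    private volatile long leaseEndMillis;

    LeaseGuard(LongSupplier clock, long leaseEndMillis, Runnable onLeaseFailure) {
        this.clock = clock;
        this.leaseEndMillis = leaseEndMillis;
        this.onLeaseFailure = onLeaseFailure;
    }

    /** Extends the lease, but only if it has not already been lost. */
    void renew(long newLeaseEndMillis) {
        checkLease();
        this.leaseEndMillis = newLeaseEndMillis;
    }

    /** Call before every update; throws once the lease has been lost. */
    void checkLease() {
        if (!failed.get() && clock.getAsLong() >= leaseEndMillis) {
            // compareAndSet guarantees the handler runs at most once,
            // even if several threads notice the expiry concurrently.
            if (failed.compareAndSet(false, true)) {
                onLeaseFailure.run();
            }
        }
        if (failed.get()) {
            throw new IllegalStateException(
                    "lease expired; store is closed for updates");
        }
    }

    boolean hasFailed() {
        return failed.get();
    }
}
```

A caller would construct it with something like new LeaseGuard(System::currentTimeMillis, leaseEnd, handler). Note that a late renew() does not revive the guard: once the lease has been lost, the instance stays unusable until it is restarted, which matches the "basically unusable, but not a System.exit" outcome described above.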