[ 
https://issues.apache.org/jira/browse/OAK-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcel Reutegger updated OAK-3488:
----------------------------------
    Attachment: OAK-3488.patch

Attached a WIP patch. With those changes, a DocumentNodeStore waits at most 60 
seconds for an ongoing recovery and then fails the startup if recovery is still 
not finished. It does not yet check if the recovering cluster node is still 
alive.
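
For illustration only, the bounded wait described above could be sketched as follows. This is not the attached patch: the class name RecoveryWait, the method names, and the BooleanSupplier that reports whether recovery is still running are all hypothetical. Only the 60 second limit and the fail-on-timeout behaviour are taken from the comment.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical helper, not part of Oak: waits a bounded time for a recovery
// that another cluster node is running on behalf of this node.
final class RecoveryWait {

    // Timeout taken from the comment: wait at most 60 seconds.
    static final long DEFAULT_TIMEOUT_MILLIS = TimeUnit.SECONDS.toMillis(60);

    /**
     * Polls recoveryInProgress until it reports false or the timeout elapses.
     * Returns true if recovery finished in time, false on timeout or interrupt.
     */
    public static boolean waitForRecovery(BooleanSupplier recoveryInProgress,
                                          long timeoutMillis,
                                          long pollMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (recoveryInProgress.getAsBoolean()) {
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                return false; // recovery still running after the timeout
            }
            long remainingMillis = TimeUnit.NANOSECONDS.toMillis(remainingNanos);
            try {
                // Sleep for the poll interval, but never past the deadline.
                Thread.sleep(Math.max(1, Math.min(pollMillis, remainingMillis)));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt status
                return false;
            }
        }
        return true;
    }

    /** Fails startup with an exception if recovery does not finish in time. */
    public static void awaitRecoveryOrFail(BooleanSupplier recoveryInProgress,
                                           long timeoutMillis,
                                           long pollMillis) {
        if (!waitForRecovery(recoveryInProgress, timeoutMillis, pollMillis)) {
            throw new IllegalStateException(
                    "Recovery by another cluster node did not finish within "
                            + timeoutMillis + " ms");
        }
    }
}
```

A caller would invoke awaitRecoveryOrFail(check, RecoveryWait.DEFAULT_TIMEOUT_MILLIS, 1000) during startup, where check is whatever test the store uses to see that the recovery lock is still held. Like the WIP patch, this sketch does not check whether the recovering cluster node is still alive.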

> LastRevRecovery for self async?
> -------------------------------
>
>                 Key: OAK-3488
>                 URL: https://issues.apache.org/jira/browse/OAK-3488
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: documentmk
>            Reporter: Julian Reschke
>            Assignee: Marcel Reutegger
>              Labels: resilience
>             Fix For: 1.4
>
>         Attachments: OAK-3488.patch
>
>
> Currently, when a cluster node starts and discovers that it wasn't properly 
> shut down, it first runs the complete LastRevRecovery and only continues 
> startup when done.
> However, when it fails to acquire the recovery lock, which implies that a 
> different cluster node is already running the recovery on its behalf, it 
> simply skips recovery and continues startup.
> So what is it? Is running the recovery before proceeding critical or not? If 
> it is, this code in {{LastRevRecoveryAgent}} needs to change:
> {code}
>         //TODO What if recovery is being performed for current clusterNode by some other node
>         //should we halt the startup
>         if(!lockAcquired){
>             log.info("Last revision recovery already being performed by some other node. " +
>                     "Would not attempt recovery");
>             return 0;
>         }
> {code}
> If it's not critical, we may want to run the recovery always asynchronously. 
> cc [~mreutegg]  and [~chetanm]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
