[ https://issues.apache.org/jira/browse/LIVY-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17013498#comment-17013498 ]
Bikas Saha commented on LIVY-718:
---------------------------------

Clearly, I am not aligned with the above or else I would not start and push on this discussion :) In my experience, the cost of code refactoring is paid once, up front, and refactored code is easier to test than operational complexity and runtime correctness/reliability are to get right.

However, if others are on board with the current proposal then I will not pursue this discussion further.

On the voting thread, IIRC the ask had been to add more details to the design doc and align on the parts where no conclusion has been reached yet. Is that done? I ask because a couple of PRs have already been committed, indicating that coding has started.

Even if we go with the proposal in the document, having the details watertight and converged is super important for a feature like this, which involves distributed state and coordination. These things are notoriously difficult to get right. So the more we can solidify the design up front, the safer it will be to implement.

> Support multi-active high availability in Livy
> ----------------------------------------------
>
>                 Key: LIVY-718
>                 URL: https://issues.apache.org/jira/browse/LIVY-718
>             Project: Livy
>          Issue Type: Epic
>          Components: RSC, Server
>            Reporter: Yiheng Wang
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> In this JIRA we want to discuss how to implement multi-active high
> availability in Livy.
> Currently, Livy only supports single node recovery. This is not sufficient in
> some production environments. In our scenario, the Livy server serves many
> notebook and JDBC services. We want to make Livy service more fault-tolerant
> and scalable.
> There're already some proposals in the community for high availability. But
> they're not so complete or just for active-standby high availability. So we
> propose a multi-active high availability design to achieve the following
> goals:
> # One or more servers will serve the client requests at the same time.
> # Sessions are allocated among different servers.
> # When one node crashes, the affected sessions will be moved to other active
> services.
> Here's our design document, please review and comment:
> https://docs.google.com/document/d/1bD3qYZpw14_NuCcSGUOfqQ0pqvSbCQsOLFuZp26Ohjc/edit?usp=sharing
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)