[ https://issues.apache.org/jira/browse/LIVY-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17014723#comment-17014723 ]
shanyu zhao commented on LIVY-718:
----------------------------------

[~jerryshao] the active-standby HA for Livy server is meant to handle hardware/networking failures and upgrade scenarios on the active Livy server. When the active Livy server goes offline, the standby Livy server becomes active, reads the state from ZooKeeper, and starts serving requests. This aims at high availability rather than scalability. The active-active proposal in this PR seems to be geared more towards scalability.

The designated server proposal by [~yihengw] is simpler and more realistic to implement. As far as I know, HiveServer2 HA also uses the designated server approach. The stateless proposal by [~bikassaha] is more desirable but much harder to implement: many in-memory states, such as access times, would need to be moved to a persistent store, and some variables may need locks.

I think it would be beneficial to first have active-standby HA (LIVY-11) checked in while this PR is being worked on, especially since it satisfies users who need HA rather than scalability.

> Support multi-active high availability in Livy
> ----------------------------------------------
>
> Key: LIVY-718
> URL: https://issues.apache.org/jira/browse/LIVY-718
> Project: Livy
> Issue Type: Epic
> Components: RSC, Server
> Reporter: Yiheng Wang
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> In this JIRA we want to discuss how to implement multi-active high
> availability in Livy.
> Currently, Livy only supports single node recovery. This is not sufficient in
> some production environments. In our scenario, the Livy server serves many
> notebook and JDBC services. We want to make Livy service more fault-tolerant
> and scalable.
> There're already some proposals in the community for high availability. But
> they're not so complete or just for active-standby high availability. So we
> propose a multi-active high availability design to achieve the following
> goals:
> # One or more servers will serve the client requests at the same time.
> # Sessions are allocated among different servers.
> # When one node crashes, the affected sessions will be moved to other active
> services.
> Here's our design document, please review and comment:
> https://docs.google.com/document/d/1bD3qYZpw14_NuCcSGUOfqQ0pqvSbCQsOLFuZp26Ohjc/edit?usp=sharing
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
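
For illustration only, here is a minimal sketch of how goals 2 and 3 in the quoted description (allocating sessions among servers, and moving only the affected sessions when a node crashes) could behave. It uses rendezvous hashing, which is an assumption made for this example and is not taken from the linked design document or from Livy's code. The server names and the `owner` and `allocate` helpers are hypothetical.

```python
import hashlib


def owner(session_id, servers):
    """Return the server that owns a session (rendezvous hashing).

    Each server gets a score derived from hashing the (server, session)
    pair; the highest score wins. Hypothetical helper, not a Livy API.
    """
    def score(server):
        digest = hashlib.sha256(f"{server}:{session_id}".encode()).hexdigest()
        return int(digest, 16)

    return max(servers, key=score)


def allocate(session_ids, servers):
    """Map every session id to its owning server."""
    return {sid: owner(sid, servers) for sid in session_ids}


# Hypothetical cluster of three active servers and 100 sessions.
servers = ["livy-a", "livy-b", "livy-c"]
sessions = list(range(100))
before = allocate(sessions, servers)

# Simulate a crash of one server: only its sessions get a new owner.
survivors = [s for s in servers if s != "livy-b"]
after = allocate(sessions, survivors)
moved = [sid for sid in sessions if before[sid] != after[sid]]
```

With this scheme, a session changes owner only if its previous owner was the crashed server, so sessions on healthy servers are left untouched. A real implementation would also need shared membership tracking and session state recovery, for example via ZooKeeper as discussed in the comment above.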