[ https://issues.apache.org/jira/browse/LIVY-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17005203#comment-17005203 ]
Yiheng Wang commented on LIVY-718:
----------------------------------

Thanks for your comments, [~bikassaha] and [~meisam]. I have summarized the discussion points and put my comments below (please point out anything I missed).

h4. Designated Server - Is it because there are issues with multiple servers handling multiple clients to the same session?

The issues include:
1. The Livy server needs to monitor Spark sessions. If we remove the designated server, each server may need to monitor all sessions. That is wasteful, and the servers may end up with inconsistent views of session state.
2. Besides service data, the Livy server also keeps other data in memory, such as the application log and the last active time. This information is updated frequently, so it is not suitable for a state-store backend like ZooKeeper.
3. If multiple servers serve one session, we need some kind of locking mechanism to handle concurrent state-change requests (e.g. stop session).

h4. Session id - I would strongly suggest deprecating the integral session id

I agree that the incremental integral session ID is not necessary for Livy. For changing it to a UUID, my biggest concern is compatibility with the earlier API (the session-id data type may need to change from Int to String). It's quite a big move, so we chose a conservative approach in the design.

h4. Dependency on ZK

ZK is introduced to resolve the above two problems (server status change notification and unique ID generation). If we don't need the designated server and the incremental session ID, I think we can remove ZK.

h4. Service discovery, Ease of use of the API and the number of ports in the firewall that needs to be opened for Livy HA can become a security concern

I think the point here is a single URL for the Livy cluster. It depends on the first question. If we can remove the designated server, we just need to put a load balancer in front of all the servers. If we keep the designated server design, we can use an HTTP load balancer that is aware of 307 responses.
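
To make the 307-based routing concrete, here is a minimal sketch of the decision a server could make for an incoming session request. It is only an illustration under stated assumptions: the `Response` type, the `route` function, the in-memory `owners` map and the `livy-N:8998` host names are all hypothetical and are not taken from the Livy code base or the design document. Only the `/sessions/{id}` path follows the existing REST API.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Response:
    """Hypothetical, simplified HTTP response: a status and an optional Location."""
    status: int
    location: Optional[str] = None


def route(session_id: int, this_server: str, owners: Dict[int, str]) -> Response:
    """Decide how this_server should answer a request for session_id.

    owners is an assumed mapping from session id to the designated
    server ("host:port") for that session.
    """
    owner = owners.get(session_id)
    if owner is None:
        # No server is recorded for this session.
        return Response(404)
    if owner == this_server:
        # This server is the designated one, so it handles the request itself.
        return Response(200)
    # Another server is designated: point the client (or a 307-aware
    # load balancer) at it. 307 keeps the original method and body.
    return Response(307, f"http://{owner}/sessions/{session_id}")


# Assumed example data: session 7 is owned by the second server.
owners = {7: "livy-2:8998"}
redirect = route(7, "livy-1:8998", owners)
```

In this sketch a request for session 7 that arrives at `livy-1:8998` is answered with a 307 whose `Location` points at `livy-2:8998`, while the same request arriving at `livy-2:8998` is served locally.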
Currently, we use a 307 response to route the request to the designated server on the client side. Some load balancers can handle this response automatically.

I think the key point of the discussion is the designated server. Please let me know your suggestions on the issues I listed, and let's see if we can improve the design proposal.

> Support multi-active high availability in Livy
> ----------------------------------------------
>
>                 Key: LIVY-718
>                 URL: https://issues.apache.org/jira/browse/LIVY-718
>             Project: Livy
>          Issue Type: Epic
>          Components: RSC, Server
>            Reporter: Yiheng Wang
>            Priority: Major
>
> In this JIRA we want to discuss how to implement multi-active high
> availability in Livy.
> Currently, Livy only supports single node recovery. This is not sufficient in
> some production environments. In our scenario, the Livy server serves many
> notebook and JDBC services. We want to make Livy service more fault-tolerant
> and scalable.
> There're already some proposals in the community for high availability. But
> they're not so complete or just for active-standby high availability. So we
> propose a multi-active high availability design to achieve the following
> goals:
> # One or more servers will serve the client requests at the same time.
> # Sessions are allocated among different servers.
> # When one node crashes, the affected sessions will be moved to other active
> services.
> Here's our design document, please review and comment:
> https://docs.google.com/document/d/1bD3qYZpw14_NuCcSGUOfqQ0pqvSbCQsOLFuZp26Ohjc/edit?usp=sharing
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)