[ 
https://issues.apache.org/jira/browse/LIVY-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17005203#comment-17005203
 ] 

Yiheng Wang commented on LIVY-718:
----------------------------------

Thanks for your comments, [~bikassaha] and [~meisam]. I have summarized the 
discussion points and added my comments below (please point out anything I 
missed).

h4. Designated Server - Is it because there are issues with multiple servers 
handling multiple clients to the same session?
The issues include:
1. The Livy server needs to monitor Spark sessions. If we remove the designated 
server, each server may need to monitor all sessions. That is wasteful, and the 
servers' views of a session may become inconsistent.
2. Besides service data, the Livy server also keeps other data in memory, such 
as the application log and the last active time. This information is updated 
frequently, so it is not suitable for a state-store backend like ZooKeeper.
3. If multiple servers serve one session, we need to add some kind of lock 
mechanism to handle concurrent state-change requests (e.g. stop session).
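
To illustrate point 3, here is a minimal sketch of the kind of guard that would 
be needed if any server could change a session's state. It uses an in-memory, 
version-checked compare-and-set. The `SessionStore` class, its method names, and 
the state strings are all hypothetical and are not Livy code; a real deployment 
would need the same check to be atomic in the shared state store.

```python
import threading


class SessionStore:
    """Hypothetical shared store: session id -> (state, version)."""

    def __init__(self):
        self._lock = threading.Lock()  # stands in for the store's atomicity
        self._data = {}

    def put(self, session_id, state):
        with self._lock:
            self._data[session_id] = (state, 0)

    def get(self, session_id):
        with self._lock:
            return self._data[session_id]

    def compare_and_set(self, session_id, expected_version, new_state):
        """Apply new_state only if nobody changed the session since it was read."""
        with self._lock:
            state, version = self._data[session_id]
            if version != expected_version:
                return False  # another server won the race
            self._data[session_id] = (new_state, version + 1)
            return True


store = SessionStore()
store.put("s1", "idle")

# Two servers read the same version, then both try to change the state.
_, seen_by_a = store.get("s1")
_, seen_by_b = store.get("s1")
a_won = store.compare_and_set("s1", seen_by_a, "shutting_down")
b_won = store.compare_and_set("s1", seen_by_b, "busy")
```

With a designated server, only one process ever issues such writes for a given 
session, so this check is not needed.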

h4. Session id - I would strongly suggest deprecating the integral session id
I agree that the incremental integral session ID is not necessary for Livy. My 
biggest concern with changing it to a UUID is compatibility with the earlier 
API (the session-id data type may need to change from Int to String). That is 
quite a big move, so we chose a conservative approach in the design.
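
As a rough illustration of the compatibility concern, the sketch below shows 
one way a server could accept both legacy integer IDs and UUID strings during 
a transition. The function name `parse_session_id` and the tagged-tuple return 
value are assumptions made for this example, not part of the Livy API or the 
design document.

```python
import uuid


def parse_session_id(raw):
    """Return ("int", n) for legacy ids or ("uuid", u) for new-style ids."""
    # Legacy ids are plain ASCII decimal numbers.
    if raw.isascii() and raw.isdigit():
        return ("int", int(raw))
    try:
        return ("uuid", uuid.UUID(raw))
    except ValueError:
        raise ValueError(f"not a valid session id: {raw!r}")
```

Clients that treat the ID as an Int would still break once UUIDs are returned, 
which is why the type change affects the public API.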

h4. Dependency on ZK
ZK is introduced to resolve the above two problems (server status change 
notification and unique ID generation). If we don't need the designated server 
and the incremental session ID, I think we can remove ZK.

h4. Service discovery, Ease of use of the API and the number of ports in the 
firewall that needs to be opened for Livy HA can become a security concern
I think the point here is a single URL for the Livy cluster. It depends on the 
first question. If we can remove the designated server, we just need to put a 
load balancer in front of all the servers.

If we keep the designated server design, we can use an HTTP load balancer that 
is aware of 307 responses. Currently, we use a 307 response to route the 
request to the designated server on the client side. Some load balancers can 
handle this response automatically.
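
A minimal sketch of the 307 routing described above, assuming each server can 
look up the designated server for a session. The `route` function, the `owners` 
mapping, and the host names are hypothetical; only the `/sessions/{id}` path 
and the 307 status come from Livy's REST API and HTTP respectively.

```python
def route(session_id, this_server, owners):
    """Return (status, headers) for a request that arrived at this_server.

    owners maps session id -> base URL of its designated server.
    """
    owner = owners[session_id]
    if owner == this_server:
        return 200, {}
    # 307 keeps the HTTP method and body, so POST/DELETE are replayed unchanged.
    return 307, {"Location": f"{owner}/sessions/{session_id}"}


owners = {7: "http://livy-2.example:8998"}
local = route(7, "http://livy-2.example:8998", owners)
remote = route(7, "http://livy-1.example:8998", owners)
```

A client or load balancer that follows the `Location` header reaches the 
designated server without knowing the session-to-server mapping itself.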

I think the key point of the discussion is the designated server. Please let me 
know your suggestions on the issues I listed, and let's see if we can improve 
the design proposal.


> Support multi-active high availability in Livy
> ----------------------------------------------
>
>                 Key: LIVY-718
>                 URL: https://issues.apache.org/jira/browse/LIVY-718
>             Project: Livy
>          Issue Type: Epic
>          Components: RSC, Server
>            Reporter: Yiheng Wang
>            Priority: Major
>
> In this JIRA we want to discuss how to implement multi-active high 
> availability in Livy.
> Currently, Livy only supports single node recovery. This is not sufficient in 
> some production environments. In our scenario, the Livy server serves many 
> notebook and JDBC services. We want to make Livy service more fault-tolerant 
> and scalable.
> There're already some proposals in the community for high availability. But 
> they're not so complete or just for active-standby high availability. So we 
> propose a multi-active high availability design to achieve the following 
> goals:
> # One or more servers will serve the client requests at the same time.
> # Sessions are allocated among different servers.
> # When one node crashes, the affected sessions will be moved to other active 
> services.
> Here's our design document, please review and comment:
> https://docs.google.com/document/d/1bD3qYZpw14_NuCcSGUOfqQ0pqvSbCQsOLFuZp26Ohjc/edit?usp=sharing
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
