liuxun created ZEPPELIN-3612:
--------------------------------

             Summary: Cluster High availability module design
                 Key: ZEPPELIN-3612
                 URL: https://issues.apache.org/jira/browse/ZEPPELIN-3612
             Project: Zeppelin
          Issue Type: Sub-task
          Components: zeppelin-server
    Affects Versions: 0.9.0
            Reporter: liuxun
             Fix For: 0.9.0


h3. If some Zeppelin-Server processes or servers fail, the cluster can 
continue to serve requests, and the Zeppelin-Server service can sense the 
availability of all Interpreter processes in the server cluster.
 # *Raft protocol*

With the Raft protocol, the service is unaffected as long as a majority of the 
N servers in the cluster (N/2+1, with N/2 rounded down) are in a normal state.
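
The majority rule above can be checked with a few lines of arithmetic. The 
following is only a sketch: the class RaftQuorum and its method names are 
hypothetical, introduced here for illustration, and are not part of Zeppelin 
or of any Raft library.

```java
// Hypothetical helper, not part of Zeppelin: majority-quorum arithmetic
// for a cluster of N servers.
final class RaftQuorum {
    private RaftQuorum() {}

    /** Smallest number of healthy servers that still forms a majority. */
    static int quorumSize(int clusterSize) {
        if (clusterSize < 1) {
            throw new IllegalArgumentException("clusterSize must be >= 1");
        }
        return clusterSize / 2 + 1; // integer division, i.e. floor(N/2) + 1
    }

    /** Number of servers that may fail while a majority remains. */
    static int tolerableFailures(int clusterSize) {
        return clusterSize - quorumSize(clusterSize);
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 5; n++) {
            System.out.printf("N=%d quorum=%d tolerable failures=%d%n",
                    n, quorumSize(n), tolerableFailures(n));
        }
    }
}
```

Note that an even cluster size does not tolerate more failures than the odd 
size below it: both 3 and 4 servers tolerate one failure.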

 # *Interpreter process monitoring*

The Interpreter process creates a heartbeat thread through the ClusterMonitor 
class. This thread periodically sends the process heartbeat, together with the 
IP address and port of the process's Thrift interface, to the cluster.

When the Interpreter process is closed, it sends a request to the cluster to 
delete its process metadata.

A process health check thread is created in the Zeppelin-Server through the 
ClusterMonitor class. It periodically checks the heartbeat of every Interpreter 
process recorded in the cluster metadata. If a heartbeat has timed out, the 
metadata of that process is deleted, so that an abnormal Interpreter process is 
no longer treated as available.
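
The heartbeat, explicit delete and timeout check described above can be 
sketched as follows. This is a minimal illustration under assumptions, not the 
ClusterMonitor implementation: the class HeartbeatRegistry, its methods and 
the ProcessMeta fields are hypothetical, and a local in-memory map stands in 
for the cluster metadata that the design replicates through Raft. Timestamps 
are passed in explicitly so the behaviour is easy to follow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the cluster metadata; not Zeppelin code.
final class HeartbeatRegistry {

    /** Metadata reported by one Interpreter process. */
    static final class ProcessMeta {
        final String host;               // IP of the Thrift interface
        final int port;                  // port of the Thrift interface
        final long lastHeartbeatMillis;  // time of the latest heartbeat

        ProcessMeta(String host, int port, long lastHeartbeatMillis) {
            this.host = host;
            this.port = port;
            this.lastHeartbeatMillis = lastHeartbeatMillis;
        }
    }

    private final Map<String, ProcessMeta> metadata = new ConcurrentHashMap<>();
    private final long timeoutMillis;

    HeartbeatRegistry(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Called periodically by the Interpreter's heartbeat thread. */
    void heartbeat(String processId, String host, int port, long nowMillis) {
        metadata.put(processId, new ProcessMeta(host, port, nowMillis));
    }

    /** Called when the Interpreter process is closed normally. */
    void unregister(String processId) {
        metadata.remove(processId);
    }

    boolean isAlive(String processId) {
        return metadata.containsKey(processId);
    }

    /** Called periodically by the server's health check thread. */
    List<String> evictExpired(long nowMillis) {
        List<String> evicted = new ArrayList<>();
        for (Map.Entry<String, ProcessMeta> e : metadata.entrySet()) {
            boolean expired =
                    nowMillis - e.getValue().lastHeartbeatMillis > timeoutMillis;
            // remove(key, value) skips entries refreshed since we read them
            if (expired && metadata.remove(e.getKey(), e.getValue())) {
                evicted.add(e.getKey());
            }
        }
        return evicted;
    }
}
```

In a real deployment the two periodic calls could be driven by a 
ScheduledExecutorService, with System.currentTimeMillis() as the time source.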

 # *Interpreter process rebuild*

When an Interpreter process is created, the Zeppelin-Server first looks up the 
session information of the Interpreter process and checks whether the process 
is still valid. If it is not available, the corresponding session is cleared 
and the Interpreter process is re-created. This prevents an abnormal 
Interpreter process, or an exception on the server hosting it, from leaving 
the Interpreter unavailable.
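
The check-then-rebuild flow above can be sketched like this. It is an 
illustration only: the class InterpreterSessions, the method getOrRebuild, and 
the two injected callbacks (a liveness check and a process launcher) are all 
hypothetical names, not Zeppelin APIs.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical sketch of session validation and rebuild; not Zeppelin code.
final class InterpreterSessions {
    // session id -> id of the Interpreter process serving that session
    private final Map<String, String> sessionToProcess = new HashMap<>();
    private final Predicate<String> isProcessAlive;       // assumed liveness check
    private final Function<String, String> launchProcess; // assumed launcher

    InterpreterSessions(Predicate<String> isProcessAlive,
                        Function<String, String> launchProcess) {
        this.isProcessAlive = isProcessAlive;
        this.launchProcess = launchProcess;
    }

    /** Returns a valid process for the session, rebuilding it if needed. */
    synchronized String getOrRebuild(String sessionId) {
        String processId = sessionToProcess.get(sessionId);
        if (processId != null && isProcessAlive.test(processId)) {
            return processId; // existing process is still valid
        }
        sessionToProcess.remove(sessionId);                 // clear stale session
        String fresh = launchProcess.apply(sessionId);      // re-create process
        sessionToProcess.put(sessionId, fresh);
        return fresh;
    }
}
```

The liveness callback would typically be backed by the cluster metadata, so 
that a process whose heartbeat has timed out is treated as unavailable here.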



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
