Hi All,

This is an update on the WSO2 Stream Processor's 2 node minimum HA deployment. While some of the design elements changed during the implementation, this email covers the overall design.

Two SP nodes work in an active-passive configuration where both nodes receive and process the same events, but only the active node publishes the output events. If the active node goes down, the passive node changes its role and starts publishing events as the active node. When the node that went down is restarted, it acts as the passive node, ready to change roles when needed. For this to work, both nodes must continuously receive the same events.
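
To make the role behaviour concrete, here is a minimal sketch in Python. It is only an illustration: `Role`, `HaNode`, `on_event` and `on_peer_down` are hypothetical names, not SP classes or APIs, and failure detection is reduced to a direct method call.

```python
from enum import Enum


class Role(Enum):
    ACTIVE = "active"
    PASSIVE = "passive"


class HaNode:
    """Hypothetical stand-in for one SP node (not the actual SP classes)."""

    def __init__(self, name, role):
        self.name = name
        self.role = role
        self.processed = []  # every event this node has processed
        self.published = []  # events this node actually sent to the sink

    def on_event(self, event):
        # Both nodes process every event so that their state stays aligned.
        self.processed.append(event)
        # Only the active node publishes the output.
        if self.role is Role.ACTIVE:
            self.published.append(event)

    def on_peer_down(self):
        # The passive node takes over publishing when the active node fails.
        if self.role is Role.PASSIVE:
            self.role = Role.ACTIVE


node_a = HaNode("node-a", Role.ACTIVE)
node_b = HaNode("node-b", Role.PASSIVE)

for event in ["e1", "e2"]:
    node_a.on_event(event)
    node_b.on_event(event)

node_b.on_peer_down()  # node-a fails; node-b becomes active
node_b.on_event("e3")

print(node_b.processed)  # ['e1', 'e2', 'e3']
print(node_b.published)  # ['e3']
```

Note that in this naive form node-b publishes nothing it processed before the switch, so anything node-a processed but had not yet published when it failed would be lost.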
To ensure "at least once" publishing, the following techniques were adopted.

*1. Double State Syncing of Passive Node*

The basis of the HA implementation is that both nodes have the same state at any given time. For this, when the passive node starts up it must sync with the active node. This syncing is done in one of two user-configurable ways: live sync enabled or live sync disabled.

When live sync is enabled and a Siddhi application is deployed in the passive node, a REST call is made to the active node to get the snapshot of the deployed Siddhi application. Once the snapshot is received and restored on the passive node, the Siddhi application is deployed. If no such snapshot is found, the passive node defers the deployment of the Siddhi application for a user-configurable time period, after which another state sync is attempted. If a snapshot is still not received, the Siddhi application remains in an inactive state.

When live sync is disabled, the snapshot of the Siddhi application is taken from the active node's last persisted state, which is read from either the database or the file system.

After the initial syncing of snapshots, the passive node's sources may not connect in time to process events, which means the passive node is not 100% in sync with the active node. Hence a second state sync happens after a user-configured time period from the server start time. The snapshot of all Siddhi applications is taken and restored in the passive node. Since the state may be large, the snapshot takes time to reach the passive node and be restored, so an event queue is implemented in the passive node to cover that interval. During the syncing, the passive node queues all events and starts processing from the point where the active node stopped processing to take the snapshot. This guarantees that the active and passive nodes process the same number of events.
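
The queueing during the second sync can be sketched roughly as below. This is an illustration under assumptions, not the SP implementation: it assumes every event carries a monotonically increasing sequence number and that the snapshot records the last sequence number the active node processed (`last_seq`). The `PassiveNode` class, its methods and the toy `count` state are all hypothetical.

```python
from collections import deque


class PassiveNode:
    """Hypothetical passive node that buffers events while a snapshot is in transit."""

    def __init__(self):
        # Toy application state: number of events seen and the last sequence number.
        self.state = {"count": 0, "last_seq": 0}
        self.syncing = False
        self.buffer = deque()

    def begin_sync(self):
        # From now on, incoming events are queued instead of processed.
        self.syncing = True

    def on_event(self, seq, payload):
        if self.syncing:
            self.buffer.append((seq, payload))
        else:
            self._process(seq, payload)

    def _process(self, seq, payload):
        self.state["count"] += 1
        self.state["last_seq"] = seq

    def complete_sync(self, snapshot):
        # Restore the active node's state, then replay only the buffered
        # events that the snapshot does not already cover.
        self.state = dict(snapshot)
        while self.buffer:
            seq, payload = self.buffer.popleft()
            if seq > snapshot["last_seq"]:
                self._process(seq, payload)
        self.syncing = False


passive = PassiveNode()
passive.begin_sync()

# Events 4..7 arrive while the snapshot is being transferred and restored.
for seq in (4, 5, 6, 7):
    passive.on_event(seq, f"event-{seq}")

# The active node took its snapshot right after processing event 5.
passive.complete_sync({"count": 5, "last_seq": 5})

print(passive.state)  # {'count': 7, 'last_seq': 7}
```

Events 4 and 5 are discarded because the restored snapshot already reflects them, while 6 and 7 are replayed, so the passive node ends up having processed the same number of events as the active node.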
When snapshots are transferred using live sync, they are compressed with gzip to reduce their size and the time taken to reach the passive node.

*2. Periodic output syncing of Passive Node*

Although both active and passive nodes process the same events in a 100% synced manner, once the active node goes down the passive node may take some time to identify this and start publishing events. This means some events may not be published. As a solution, the passive node queues all the events that it processes (per event sink). Periodically, the passive node trims this queue according to the last event published by the active node. When the passive node identifies that the active node is down, it first publishes the events remaining in the queue. This guarantees that no events are dropped.

With live sync enabled, the passive node periodically pings the active node directly to get the timestamp of the last published event and uses it to trim the queue. When live sync is disabled, the active node periodically saves the same information in the database and the passive node periodically reads this value.

Find the user story of the implementation at [1], which has details about the configuration of the 2 node minimum HA.

[1] https://redmine.wso2.com/issues/6724

Thanks
Anoukh

On Fri, Aug 18, 2017 at 2:01 PM, Anoukh Jayawardena <[email protected]> wrote:

> +architecture
>
> On Thu, Aug 17, 2017 at 3:24 PM, Anoukh Jayawardena <[email protected]> wrote:
>
>> Hi All,
>>
>> This is a high level overview of the 2 node minimum high availability (HA) deployment feature for the Stream Processor (SP). The implementation would adopt an active-passive approach with periodic state persistence. The process flow of how this feature would work is as follows.
>>
>> *Prerequisites*
>>
>> - 2 SP workers: one would be the active worker while the other would be the passive. Both nodes should have the same Siddhi applications deployed.
>> - A specified RDBMS or file location for periodic state persistence of Siddhi app states.
>> - A running ZooKeeper service or RDBMS instance for coordination among the two nodes (carbon-coordination [1] will be used for distributed coordination).
>> - A Siddhi client that publishes events to both active and passive nodes in a synced manner, where a message is published to the passive node once the active node acknowledges that it received the message.
>>
>> [image: 2 Node HA Overview.jpg]
>>
>> *Process*
>>
>> - Both nodes will receive events and process them, but only the active node will publish the output events. This ensures that both nodes are in sync.
>> - The active node will periodically persist the Siddhi app states so that the state can be retrieved in a failover scenario.
>> - A user-defined “Live State Sync” option would determine how the states are synced between active and passive in a failover scenario (explained below).
>> - Following is how the implementation works in different system states:
>>
>> 1. When a new node is starting up and the active node is available
>>    - "Live State Sync Enabled" - The new node will detect that the active node is available, so it will register itself as the passive node, call the active node and borrow the current state. While state borrowing happens, the processing of events is paused in both nodes so that data is not lost. After that, both nodes will process events in sync.
>>    - "Live State Sync Disabled" - The new node will detect that the active node is available, so it will register itself as the passive node and access the database to get the last persisted state. This option may lead to some data loss, since the database will not contain a real-time persisted state.
>> 2. When a new node is starting up and the active node is unavailable
>>    - The new node will detect that the active node is unavailable, so it will register itself as the active node and access the database to get the last persisted state. (For a fresh restart, an API should be called beforehand to clean the DB.)
>> 3. When the active node goes down
>>    - The passive node will detect that the active node is unavailable, switch states and start publishing the output events.
>>
>> *Data loss minimizing strategies*
>>
>> 1. When the active node goes down, the passive node does not know which event was last published by the active node. Therefore the passive node might start publishing events from a later point. For example, consider that both nodes have processed 5 messages but only 2 messages have been published by the active node before failing. The passive node will start publishing from the 6th message onwards, since it does not know what has not been published.
>>
>>    As a solution, a queue is implemented in the passive node. The active node knows the last event it published. The passive node would periodically ping the active node, get the last published event and dequeue the buffer accordingly, so that events are not lost but might be duplicated. (This might be a problem in live state sync.)
>>
>> [1] https://github.com/wso2/carbon-coordination
>>
>> Thank You
>> --
>> *Anoukh Jayawardena*
>> *Software Engineer*
>>
>> *WSO2 Lanka (Private) Limited: http://wso2.com <http://wso2.com/> lean.enterprise.middle-ware*
>>
>> *phone: (+94) 77 99 28932*
>> <https://wso2.com/signature>
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
