The image is missing. On Sun, Oct 22, 2017 at 3:48 PM, Anoukh Jayawardena <[email protected]> wrote:
> Hi All > > This is an update on the WSO2 Stream Processor’s 2 Node minimum HA > deployment. While some of the design elements changed during the > implementation, this email will cover the overall design > Two SP nodes work in an active-passive configuration where both nodes > receive and process the same events. But only the active node will publish > the events. If the active node goes down, the passive node would change its > role and start publishing events as the active node. When the node that > went down is restarted, it would act as the passive node ready to change > roles when needed. For this to work it should be ensured that both nodes > receive the same events continuously. > > > > To ensure "at least once publishing" following techniques were adopted > > *1. Double State Syncing of Passive Node* > > The base of the HA implementation is that both nodes have the same state > at a given time. For this, when the Passive node starts up it should sync > up with the Active node. This syncing is done in two user configurable > ways. i.e. Live Sync enabled or disabled. > > When live sync is enabled and a Siddhi application is deployed in the > passive node, a REST call is made to the Active node to get the snapshot of > the deployed Siddhi application. Once the snapshot is received and restored > on the passive node the Siddhi application will be deployed. If such a > snapshot is not found, the passive node would defer the deployment of the > Siddhi application for a user configurable time period after which another > state sync occurs. If yet a snapshot is not received, the Siddhi > application will be in an inactive state. When live sync is disabled, the > snapshot of the Siddhi application will be taken from the Active nodes last > persisted state which is either from the database or the file system. > > After the initial syncing of snapshots, the passive node’s sources may not > connect on time to process events. This means that passive node is not 100% > in sync with the active node. Hence a second syncing of states happen after > a user configured time period from the server start time. The snapshot off > all Siddhi applications is taken and restored in the passive node. Since > the size of the state may be large, an event queue is implemented in the > passive node as a solution for the time taken for the snapshot to reach the > passive node and restore. During the syncing, passive node would queue all > events and start processing events from where the active node stopped > processing to take the snapshot. This guarantees that the active and > passive node process the same amount of events. > In transferring snapshots using live sync, the snapshots are compressed > using gzip to reduce the size and time taken for the snapshot to reach the > passive node. > > > *2. Periodic output syncing of Passive Node* > > Although both active and passive nodes process the same events in a 100% > synced manner, once the active node goes down, the passive node may take > some time to identify this and start publishing events. This means some > events may not be published. As a solution, the passive node would queue > all the events that are processed (per event sink). Periodically passive > node would trim this queue according to the last processed event of the > active node. When passive node identifies that the active node is down, it > would start publishing the events in the queue first. This guarantees that > no events are dropped. > > Enabling live sync would allow the passive node to directly ping the > active node periodically to get the timestamp of the last published event > and use it to trim the queue. When live sync is disabled, the active node > would periodically save the same information in the database and the > passive node would periodically read this value. > > Find the user story of the implementation at [1] which has details about > the configuration of the 2 node minimum HA. > > [1] https://redmine.wso2.com/issues/6724 > > Thanks > Anoukh > > > On Fri, Aug 18, 2017 at 2:01 PM, Anoukh Jayawardena <[email protected]> > wrote: > >> +architecture >> >> >> On Thu, Aug 17, 2017 at 3:24 PM, Anoukh Jayawardena <[email protected]> >> wrote: >> >>> Hi All, >>> >>> This is a high level overview of the 2 node minimum high availability >>> (HA) deployment feature for the Stream processor (SP). The implementation >>> would adopt an Active Passive approach with periodic state persistence. The >>> process flow of how this feature would work is as follows >>> >>> *Prerequisites* >>> >>> - 2 SP workers, one would be the Active worker while the other would >>> be the passive. Both nodes should include the same Siddhi Applications >>> deployed. >>> - A specified RDBMS or file location for periodic state persistence >>> of Siddhi App states. >>> - A running zookeeper service or RDBMS instance for coordination >>> among the two nodes (Will be using carbon-coordination [1] for the >>> purpose >>> of distributed coordination). >>> - Siddhi Client that would publish events to both Active and Passive >>> nodes in a synced manner where message is published to passive node when >>> the active node acknowledges message received. >>> >>> >>> [image: 2 Node HA Overview.jpg] >>> >>> >>> *Process* >>> >>> - Both nodes will receive events and process them. But only the >>> active node will publish the output events. This ensures that both nodes >>> are in sync. >>> - Active node will periodically persist the siddhi app states so >>> that the state can be retrieved in a failover scenario. >>> - A user defined “Live State Sync” option would determine how the >>> states are synced between active and passive in a failover scenario >>> (Explained below). >>> - Following is how the implementation works in different system >>> states >>> >>> >>> 1. When a new node is starting up when Active node is available >>> - "Live State Sync Enabled" - The new node will detect the Active >>> node is available, so it will register itself as the passive node and >>> call >>> the Active node and borrow the current state. When state borrowing >>> happens >>> the processing of events is paused in both nodes so that data is not >>> lost. >>> After that both nodes will process events in sync. >>> - "Live State Sync Disabled" - The new node will detect the >>> Active node is available, so it will register itself as the passive >>> node >>> and access the Database to get the last persisted state. This option >>> may >>> lead to few data loss since the Database will not contain a real time >>> persisted state. >>> 2. When a new node is starting up when Active node is unavailable >>> - The new node will detect the Active node is unavailable, so it >>> will register itself as the Active node and access the Database to >>> get the >>> last persisted state. (For a fresh restart an API should be called >>> before >>> hand to clean the DB) >>> 3. When Active node goes down >>> - The Passive node will detect that Active node is unavailable >>> and would switch states and start publishing the output events. >>> >>> >>> *Data loss minimizing strategies* >>> >>> >>> 1. When active node goes down, the passive node does not know what >>> the last event that was published by the active node. Therefore passive >>> node might start publishing events from a later time. For example >>> consider >>> both nodes have processed 5 messages but only 2 messages have been >>> published by Active node before failing. Passive node will start >>> publishing >>> from the 6th message onwards since it does not know what has not been >>> published. >>> >>> As a solution a queue implemented in the passive node. The active node >>> will know the last event it published. Passive node would periodically ping >>> the active node and get the last published event and dequeue the buffer >>> accordingly, so that events are not lost, but might be duplicated. (This >>> might be a problem in live state sync) >>> >>> >>> [1] https://github.com/wso2/carbon-coordination >>> >>> >>> Thank You >>> -- >>> *Anoukh Jayawardena* >>> *Software Engineer* >>> >>> *WSO2 Lanka (Private) Limited: http://wso2.com >>> <http://wso2.com/>lean.enterprise.middle-ware* >>> >>> >>> *phone: (+94) 77 99 28932* >>> <https://wso2.com/signature> >>> >> >> >> >> -- >> *Anoukh Jayawardena* >> *Software Engineer* >> >> *WSO2 Lanka (Private) Limited: http://wso2.com >> <http://wso2.com/>lean.enterprise.middle-ware* >> >> >> *phone: (+94) 77 99 28932* >> <https://wso2.com/signature> >> > > > > -- > *Anoukh Jayawardena* > *Software Engineer* > > *WSO2 Lanka (Private) Limited: http://wso2.com > <http://wso2.com/>lean.enterprise.middle-ware* > > > *phone: (+94) 77 99 28932* > <https://wso2.com/signature> > -- *S. Suhothayan* Associate Director / Architect *WSO2 Inc. *http://wso2.com * <http://wso2.com/>* lean . enterprise . middleware *cell: (+94) 779 756 757 | blog: http://suhothayan.blogspot.com/ <http://suhothayan.blogspot.com/>twitter: http://twitter.com/suhothayan <http://twitter.com/suhothayan> | linked-in: http://lk.linkedin.com/in/suhothayan <http://lk.linkedin.com/in/suhothayan>*
_______________________________________________ Architecture mailing list [email protected] https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
