Sorry. I have attached the image with this.

On Wed, Oct 25, 2017 at 10:33 AM, Sriskandarajah Suhothayan <[email protected]> wrote:
The image is missing.

On Sun, Oct 22, 2017 at 3:48 PM, Anoukh Jayawardena <[email protected]> wrote:

Hi All,

This is an update on the WSO2 Stream Processor's 2-node minimum HA deployment. While some of the design elements changed during the implementation, this email covers the overall design.

Two SP nodes work in an active-passive configuration where both nodes receive and process the same events, but only the active node publishes the events. If the active node goes down, the passive node changes its role and starts publishing events as the active node. When the node that went down is restarted, it acts as the passive node, ready to change roles when needed. For this to work, it must be ensured that both nodes receive the same events continuously.

To ensure at-least-once publishing, the following techniques were adopted.

1. Double State Syncing of the Passive Node

The basis of the HA implementation is that both nodes have the same state at a given time. For this, when the passive node starts up it should sync with the active node. This syncing is done in one of two user-configurable ways, i.e. live sync enabled or disabled.

When live sync is enabled and a Siddhi application is deployed on the passive node, a REST call is made to the active node to get a snapshot of the deployed Siddhi application. Once the snapshot is received and restored on the passive node, the Siddhi application is deployed. If such a snapshot is not found, the passive node defers the deployment of the Siddhi application for a user-configurable time period, after which another state sync occurs. If a snapshot is still not received, the Siddhi application is left in an inactive state. When live sync is disabled, the snapshot of the Siddhi application is taken from the active node's last persisted state, which is read from either the database or the file system.
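The live-sync startup flow described above (fetch snapshot, restore, deploy; defer and retry once if no snapshot exists; otherwise leave the app inactive) could be sketched as follows. This is a minimal illustration, not the actual SP code: the callback-injection style and all names here are assumptions.

```python
import time


def deploy_with_live_sync(fetch_snapshot, restore, deploy,
                          retry_after_seconds=30):
    """Sketch of the passive node's live-sync deployment of one Siddhi app.

    fetch_snapshot: callable returning the active node's snapshot bytes,
                    or None if the active node has no snapshot for the app
                    (in the real system this would be the REST call).
    restore, deploy: callables restoring state and deploying the app.
    retry_after_seconds: user-configurable deferral before the second sync.
    """
    for attempt in range(2):
        snapshot = fetch_snapshot()
        if snapshot is not None:
            restore(snapshot)      # restore state before deployment
            deploy()               # only deploy once the state is in place
            return "deployed"
        if attempt == 0:
            # Defer deployment for a configurable period, then sync again.
            time.sleep(retry_after_seconds)
    # No snapshot after the second sync: the app stays inactive.
    return "inactive"
```

For example, a fetch that always fails leaves the app in the "inactive" state after one deferred retry, matching the behaviour described above.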
After the initial syncing of snapshots, the passive node's sources may not connect in time to process events, which means the passive node is not 100% in sync with the active node. Hence a second syncing of states happens after a user-configured time period from server start. A snapshot of all Siddhi applications is taken and restored on the passive node. Since the state may be large, an event queue is implemented on the passive node to cover the time taken for the snapshot to reach the passive node and be restored. During the syncing, the passive node queues all events and then starts processing from the point where the active node stopped processing to take the snapshot. This guarantees that the active and passive nodes process the same set of events.

When transferring snapshots using live sync, the snapshots are compressed with gzip to reduce their size and the time taken to reach the passive node.

2. Periodic Output Syncing of the Passive Node

Although both the active and passive nodes process the same events in a fully synced manner, once the active node goes down the passive node may take some time to detect this and start publishing events, so some events might otherwise go unpublished. As a solution, the passive node queues all processed events (per event sink). Periodically, the passive node trims this queue according to the last event published by the active node. When the passive node detects that the active node is down, it first publishes the events remaining in the queue. This guarantees that no events are dropped.

Enabling live sync allows the passive node to ping the active node periodically, get the timestamp of the last published event, and use it to trim the queue.
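The per-sink output queue just described (enqueue every processed event, trim by the active node's last-published timestamp, flush on failover) could be sketched like this. The class and method names are assumptions for illustration, not the actual SP implementation.

```python
from collections import deque


class OutputBuffer:
    """Sketch of the passive node's per-sink buffer of processed but
    unpublished events."""

    def __init__(self):
        self.queue = deque()  # (timestamp, event) pairs, oldest first

    def enqueue(self, timestamp, event):
        """Called for every event the passive node processes for this sink."""
        self.queue.append((timestamp, event))

    def trim(self, last_published_ts):
        """Periodically drop events the active node has already published,
        using the last-published timestamp obtained via live sync (or read
        from the database when live sync is disabled)."""
        while self.queue and self.queue[0][0] <= last_published_ts:
            self.queue.popleft()

    def flush(self, publish):
        """On failover, publish everything still buffered, oldest first.
        This gives at-least-once delivery: events the failed active node
        published after its last reported timestamp may be sent twice."""
        while self.queue:
            _, event = self.queue.popleft()
            publish(event)
```

For instance, if events with timestamps 1 through 5 are buffered and the active node last reported publishing timestamp 2, a trim followed by a failover flush publishes events 3, 4, and 5.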
When live sync is disabled, the active node periodically saves the same information in the database, and the passive node periodically reads this value.

Find the user story of the implementation at [1], which has details about the configuration of the 2-node minimum HA.

[1] https://redmine.wso2.com/issues/6724

Thanks
Anoukh

On Fri, Aug 18, 2017 at 2:01 PM, Anoukh Jayawardena <[email protected]> wrote:

+architecture

On Thu, Aug 17, 2017 at 3:24 PM, Anoukh Jayawardena <[email protected]> wrote:

Hi All,

This is a high-level overview of the 2-node minimum high availability (HA) deployment feature for the Stream Processor (SP). The implementation adopts an active-passive approach with periodic state persistence. The process flow of this feature is as follows.

Prerequisites

- 2 SP workers, one acting as the active worker and the other as the passive. Both nodes should have the same Siddhi applications deployed.
- A specified RDBMS or file location for periodic persistence of Siddhi app states.
- A running ZooKeeper service or RDBMS instance for coordination between the two nodes (carbon-coordination [1] will be used for distributed coordination).
- A Siddhi client that publishes events to both the active and passive nodes in a synced manner, where a message is published to the passive node only after the active node acknowledges receiving it.

[image: 2 Node HA Overview.jpg]

Process

- Both nodes receive events and process them, but only the active node publishes the output events. This ensures that both nodes stay in sync.
- The active node periodically persists the Siddhi app states so that the state can be retrieved in a failover scenario.
- A user-defined "Live State Sync" option determines how the states are synced between the active and passive nodes in a failover scenario (explained below).
- The implementation behaves as follows in different system states:

1. A new node starts up while the active node is available
   - "Live State Sync" enabled: The new node detects that the active node is available, registers itself as the passive node, and calls the active node to borrow the current state. While state borrowing happens, event processing is paused on both nodes so that data is not lost. After that, both nodes process events in sync.
   - "Live State Sync" disabled: The new node detects that the active node is available, registers itself as the passive node, and reads the last persisted state from the database. This option may lead to some data loss, since the database will not contain a real-time persisted state.
2. A new node starts up while the active node is unavailable
   - The new node detects that the active node is unavailable, registers itself as the active node, and reads the last persisted state from the database. (For a fresh restart, an API should be called beforehand to clean the DB.)
3. The active node goes down
   - The passive node detects that the active node is unavailable, switches roles, and starts publishing the output events.

Data loss minimizing strategies

1. When the active node goes down, the passive node does not know which event was last published by the active node. Therefore the passive node might start publishing events from a later point. For example, consider that both nodes have processed 5 messages but only 2 messages have been published by the active node before failing.
The passive node would then start publishing from the 6th message onwards, since it does not know what has not been published.

As a solution, a queue is implemented on the passive node. The active node knows the last event it published. The passive node periodically pings the active node to get the last published event and dequeues the buffer accordingly, so that events are not lost, but might be duplicated. (This might be a problem in live state sync.)

[1] https://github.com/wso2/carbon-coordination

Thank You
--
Anoukh Jayawardena
Software Engineer
WSO2 Lanka (Private) Limited: http://wso2.com
lean . enterprise . middleware
phone: (+94) 77 99 28932

--
S. Suhothayan
Associate Director / Architect
WSO2 Inc. http://wso2.com
lean . enterprise . middleware
cell: (+94) 779 756 757 | blog: http://suhothayan.blogspot.com/ | twitter: http://twitter.com/suhothayan | linked-in: http://lk.linkedin.com/in/suhothayan
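Putting the thread's core rule together (both nodes process every event; only the active node publishes; on failover the new active node first drains its buffer so nothing is dropped), a minimal model might look like the following. This is a sketch of the behaviour described above under assumed names; coordination and sources/sinks are stubbed out.

```python
class HANode:
    """Minimal model of one SP worker's active/passive publishing rule."""

    def __init__(self, publish):
        self.role = "passive"      # a restarted node always joins as passive
        self.publish = publish     # sink callback (stub for the event sink)
        self.pending = []          # events processed but not yet published

    def on_event(self, event):
        # Both nodes process every event; only the active node publishes.
        if self.role == "active":
            self.publish(event)
        else:
            self.pending.append(event)

    def on_active_lost(self):
        # The passive node detects (via coordination) that the active node
        # is down: switch roles and first publish the buffered events so
        # none are dropped, then continue publishing live events.
        self.role = "active"
        for event in self.pending:
            self.publish(event)
        self.pending.clear()
```

In this model, events buffered while passive are replayed in order at failover, before any new events, which is the at-least-once guarantee the thread describes (duplicates are possible, dropped events are not).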
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
