Re: [Architecture] 2 Node Minimum HA Pattern for Stream Processor

Sriskandarajah Suhothayan Tue, 24 Oct 2017 22:05:21 -0700

The image is missing.

On Sun, Oct 22, 2017 at 3:48 PM, Anoukh Jayawardena <[email protected]> wrote:


> Hi All
>
> This is an update on the WSO2 Stream Processor’s 2 Node minimum HA
> deployment. While some of the design elements changed during the
> implementation, this email covers the overall design.
>
> Two SP nodes work in an active-passive configuration where both nodes
> receive and process the same events, but only the active node publishes
> them. If the active node goes down, the passive node changes its role
> and starts publishing events as the active node. When the node that
> went down is restarted, it acts as the passive node, ready to change
> roles when needed. For this to work, both nodes must continuously
> receive the same events.
>
>
> 
> To ensure "at least once publishing", the following techniques were adopted:
>
> *1. Double State Syncing of Passive Node*
>
> The basis of the HA implementation is that both nodes have the same state
> at any given time. To achieve this, the passive node syncs with the active
> node when it starts up. This syncing works in one of two user-configurable
> modes: live sync enabled or live sync disabled.
>
> When live sync is enabled and a Siddhi application is deployed in the
> passive node, a REST call is made to the active node to get the snapshot of
> the deployed Siddhi application. Once the snapshot is received and restored
> on the passive node, the Siddhi application is deployed. If no such
> snapshot is found, the passive node defers the deployment of the Siddhi
> application for a user-configurable time period, after which another
> state sync occurs. If a snapshot is still not received, the Siddhi
> application remains in an inactive state. When live sync is disabled, the
> snapshot of the Siddhi application is taken from the active node's last
> persisted state, which is either in the database or the file system.
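
The startup sync described above can be sketched roughly as follows. This is only an illustration in Python, not the SP implementation: `initial_sync`, its callback parameters and the status strings are hypothetical names, and the thread does not say what happens when live sync is disabled and nothing has been persisted yet, so the sketch simply deploys with whatever the store returns.

```python
from typing import Callable, Optional, Tuple

Snapshot = Optional[bytes]


def initial_sync(app_name: str,
                 live_sync: bool,
                 fetch_from_active: Callable[[str], Snapshot],
                 fetch_last_persisted: Callable[[str], Snapshot],
                 wait: Callable[[], None]) -> Tuple[str, Snapshot]:
    """Return (status, snapshot) for one Siddhi app on the passive node.

    All names here are hypothetical; the callbacks stand in for the REST
    call to the active node and for the database / file system lookup.
    """
    if not live_sync:
        # Live sync disabled: use the active node's last persisted state.
        # Assumption: deploy even if nothing was persisted (snapshot is None).
        return ("deployed", fetch_last_persisted(app_name))

    # Live sync enabled: first attempt via the active node.
    snapshot = fetch_from_active(app_name)
    if snapshot is None:
        # Defer deployment for the user-configurable period, then retry once.
        wait()
        snapshot = fetch_from_active(app_name)
    if snapshot is None:
        # Still nothing: the app stays inactive.
        return ("inactive", None)
    return ("deployed", snapshot)
```

Passing the deferral as a `wait` callback keeps the sketch free of real timers, so the configured period can be supplied by the caller.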
>
> After the initial syncing of snapshots, the passive node’s sources may not
> connect in time to process events. This means that the passive node is not
> 100% in sync with the active node. Hence a second syncing of states happens
> a user-configured time period after the server start time. The snapshot of
> all Siddhi applications is taken and restored in the passive node. Since
> the state may be large, the snapshot takes time to reach the passive node
> and be restored, so an event queue is implemented in the passive node.
> During the syncing, the passive node queues all incoming events and then
> starts processing from the point where the active node stopped processing
> to take the snapshot. This guarantees that the active and passive nodes
> process the same number of events.
>
> When transferring snapshots using live sync, the snapshots are compressed
> using gzip to reduce their size and the time taken to reach the passive
> node.
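
A small Python sketch of the second sync, under some loudly stated assumptions: events carry a sequence number, the snapshot records the last sequence number the active node processed before taking it, and the state is a toy running total serialized as JSON. None of these details come from the thread; only the buffering idea and the gzip compression do. `pack_snapshot` and `PassiveNode` are hypothetical names.

```python
import gzip
import json
from collections import deque


def pack_snapshot(state: dict, last_seq: int) -> bytes:
    """Active side: serialize the state and gzip it before transfer."""
    payload = json.dumps({"state": state, "last_seq": last_seq})
    return gzip.compress(payload.encode("utf-8"))


class PassiveNode:
    """Toy passive node whose state is a running total of event values."""

    def __init__(self):
        self.state = {"total": 0}
        self.syncing = False
        self.buffer = deque()  # events received while a sync is in flight

    def begin_sync(self):
        self.syncing = True

    def on_event(self, seq: int, value: int):
        if self.syncing:
            # Queue instead of processing while the snapshot is in transit.
            self.buffer.append((seq, value))
        else:
            self._process(value)

    def _process(self, value: int):
        self.state["total"] += value

    def finish_sync(self, packed: bytes):
        snap = json.loads(gzip.decompress(packed).decode("utf-8"))
        self.state = snap["state"]
        # Replay only what the active node had not yet processed when it
        # took the snapshot; older buffered events are already in the state.
        while self.buffer:
            seq, value = self.buffer.popleft()
            if seq > snap["last_seq"]:
                self._process(value)
        self.syncing = False
```

For example, if the active node took its snapshot after event 3 and the passive node buffered events 2, 3 and 4 during the transfer, only event 4 is replayed on top of the restored state.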
>
>
> *2. Periodic output syncing of Passive Node*
>
> Although both active and passive nodes process the same events in a 100%
> synced manner, once the active node goes down, the passive node may take
> some time to identify this and start publishing events. This means some
> events may not be published. As a solution, the passive node queues
> all the events that are processed (per event sink). Periodically, the
> passive node trims this queue according to the last event published by the
> active node. When the passive node identifies that the active node is down,
> it starts by publishing the events in the queue. This guarantees that
> no events are dropped.
>
> Enabling live sync would allow the passive node to directly ping the
> active node periodically to get the timestamp of the last published event
> and use it to trim the queue. When live sync is disabled, the active node
> would periodically save the same information in the database and the
> passive node would periodically read this value.
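
The per-sink queue and its trimming can be pictured with the Python sketch below. It is only an illustration: `PassiveSinkQueue` and its method names are hypothetical, and how the last published timestamp is obtained (pinging the active node or reading the database) is left to the caller.

```python
from collections import deque


class PassiveSinkQueue:
    """Output queue held by the passive node for one event sink."""

    def __init__(self, publish):
        self._publish = publish   # callable that actually emits an event
        self._pending = deque()   # (timestamp, event) pairs, oldest first
        self.active = False

    def on_output(self, timestamp, event):
        if self.active:
            self._publish(event)
        else:
            # Passive: hold the event instead of publishing it.
            self._pending.append((timestamp, event))

    def trim(self, last_published_ts):
        """Drop everything the active node has already published."""
        while self._pending and self._pending[0][0] <= last_published_ts:
            self._pending.popleft()

    def become_active(self):
        """On failover, flush the queue first, then publish directly."""
        while self._pending:
            _, event = self._pending.popleft()
            self._publish(event)
        self.active = True
```

If the last trim is older than the active node's final publish, the flush re-sends some events, which is where the "at least once" behaviour comes from.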
>
> Find the user story of the implementation at [1] which has details about
> the configuration of the 2 node minimum HA.
>
> [1] https://redmine.wso2.com/issues/6724
>
> Thanks
> Anoukh
>
>
> On Fri, Aug 18, 2017 at 2:01 PM, Anoukh Jayawardena <[email protected]>
> wrote:
>
>> +architecture
>>
>>
>> On Thu, Aug 17, 2017 at 3:24 PM, Anoukh Jayawardena <[email protected]>
>> wrote:
>>
>>> Hi All,
>>>
>>> This is a high level overview of the 2 node minimum high availability
>>> (HA) deployment feature for the Stream Processor (SP). The implementation
>>> adopts an active-passive approach with periodic state persistence. The
>>> process flow of this feature is as follows.
>>>
>>> *Prerequisites*
>>>
>>>    - 2 SP workers: one is the active worker while the other is the
>>>    passive worker. Both nodes should have the same Siddhi applications
>>>    deployed.
>>>    - A specified RDBMS or file location for periodic persistence of
>>>    Siddhi app states.
>>>    - A running ZooKeeper service or RDBMS instance for coordination
>>>    between the two nodes (carbon-coordination [1] will be used for
>>>    distributed coordination).
>>>    - A Siddhi client that publishes events to both the active and passive
>>>    nodes in a synced manner, where a message is published to the passive
>>>    node once the active node acknowledges that it received the message.
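
The synced publishing in the last prerequisite might look like the Python sketch below. `publish_synced` and the two sender callbacks are hypothetical; the proposal does not say what the client does when the active node fails to acknowledge, so the sketch just reports failure and leaves any retry to the caller.

```python
from typing import Callable


def publish_synced(event: object,
                   send_to_active: Callable[[object], bool],
                   send_to_passive: Callable[[object], bool]) -> bool:
    """Send to the active node first, then to the passive node.

    Each sender returns True when the node acknowledges the message.
    """
    if not send_to_active(event):
        # No acknowledgement from the active node: do not forward.
        return False
    # The active node has the event, so the passive node gets it next.
    return bool(send_to_passive(event))
```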
>>>
>>>
>>> [image: 2 Node HA Overview.jpg]
>>>
>>>
>>> *Process*
>>>
>>>    - Both nodes will receive events and process them. But only the
>>>    active node will publish the output events. This ensures that both nodes
>>>    are in sync.
>>>    - The active node will periodically persist the Siddhi app states so
>>>    that the state can be retrieved in a failover scenario.
>>>    - A user-defined “Live State Sync” option determines how the
>>>    states are synced between active and passive in a failover scenario
>>>    (explained below).
>>>    - The implementation works as follows in the different system
>>>    states:
>>>
>>>
>>>    1. When a new node starts up while the Active node is available
>>>       - "Live State Sync Enabled" - The new node detects that the Active
>>>       node is available, so it registers itself as the passive node,
>>>       calls the Active node and borrows the current state. While the
>>>       state is being borrowed, event processing is paused in both nodes
>>>       so that data is not lost. After that both nodes process events in
>>>       sync.
>>>       - "Live State Sync Disabled" - The new node detects that the
>>>       Active node is available, so it registers itself as the passive
>>>       node and accesses the database to get the last persisted state.
>>>       This option may lead to some data loss since the database will not
>>>       contain a real-time persisted state.
>>>    2. When a new node starts up while the Active node is unavailable
>>>       - The new node detects that the Active node is unavailable, so it
>>>       registers itself as the Active node and accesses the database to
>>>       get the last persisted state. (For a fresh restart, an API should
>>>       be called beforehand to clean the DB.)
>>>    3. When the Active node goes down
>>>       - The Passive node detects that the Active node is unavailable,
>>>       switches roles and starts publishing the output events.
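
The three system states above reduce to a small decision table. The Python sketch below is illustrative only; `startup_plan`, `on_active_lost` and the string labels are hypothetical and not part of carbon-coordination or SP.

```python
from typing import Tuple


def startup_plan(active_available: bool, live_state_sync: bool) -> Tuple[str, str]:
    """Return (role, state_source) for a node that is starting up."""
    if not active_available:
        # Case 2: no active node, so take the active role and restore
        # the last persisted state.
        return ("active", "database")
    if live_state_sync:
        # Case 1, live state sync enabled: borrow state from the active node.
        return ("passive", "active-node")
    # Case 1, live state sync disabled: fall back to the persisted state.
    return ("passive", "database")


def on_active_lost(role: str) -> str:
    """Case 3: a passive node takes over when the active node disappears."""
    return "active" if role == "passive" else role
```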
>>>
>>>
>>> *Data loss minimizing strategies*
>>>
>>>
>>>    1. When the active node goes down, the passive node does not know
>>>    which event the active node published last. Therefore the passive
>>>    node might start publishing events from a later point. For example,
>>>    suppose both nodes have processed 5 messages but only 2 messages have
>>>    been published by the Active node before it fails. The Passive node
>>>    will start publishing from the 6th message onwards since it does not
>>>    know what has not been published.
>>>
>>> As a solution, a queue is implemented in the passive node. The active node
>>> knows the last event it published. The passive node periodically pings
>>> the active node, gets the last published event and trims the queue
>>> accordingly, so that events are not lost but might be duplicated. (This
>>> might be a problem in live state sync.)
>>>
>>>
>>> [1] https://github.com/wso2/carbon-coordination
>>>
>>>
>>> Thank You
>>> --
>>> *Anoukh Jayawardena*
>>> *Software Engineer*
>>>
>>> *WSO2 Lanka (Private) Limited: http://wso2.com
>>> lean.enterprise.middle-ware*
>>>
>>>
>>> *phone: (+94) 77 99 28932*
>>> <https://wso2.com/signature>
>>>
>>
>>
>>
>>
>
>
>
>



-- 

*S. Suhothayan*
Associate Director / Architect
*WSO2 Inc.* http://wso2.com
lean . enterprise . middleware


*cell: (+94) 779 756 757 | blog: http://suhothayan.blogspot.com/ |
twitter: http://twitter.com/suhothayan | linked-in:
http://lk.linkedin.com/in/suhothayan*

_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
