[ https://issues.apache.org/jira/browse/SLING-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13696829#comment-13696829 ]

Stefan Egli commented on SLING-2939:
------------------------------------

Earlier feedback from a private discussion:
--
Feedback from [~fmeschbe]: I am all for having a scalable implementation 
out-of-the-box. And, honestly, if the repository is not able to be the basis 
for such an implementation, so be it. So, yes, improve the implementation and 
use proven technology, refactoring the existing implementation if that helps.
--
Feedback from [~ianeboston]: 

JGroups is embedded by default and adds cluster awareness to whatever it is 
embedded into, without the need for additional coordinating servers. Compared 
to ZooKeeper, which normally expects dedicated servers, JGroups is very low 
impact. JGroups is also site-, rack- and machine-aware and is used to provide 
topology information in a number of apps (Infinispan [1], JBoss Application 
Server [2]) without (IIUC) the need to deploy redundant replicated servers. 
IIUC the Infinispan data grid scales reasonably well over multiple racks and 
sites. The only downside of the JBoss-originated components is that you have 
to read the license carefully, as the higher-level components are often under 
a variant of the GPL. JGroups itself is LGPL 2.1 licensed, which may be an 
issue for Adobe.
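
As a rough illustration of the embedded model, here is a minimal sketch 
against the JGroups 3.x API (the cluster name "sling-discovery" is made up): 
each member joins the cluster directly and is told about topology changes via 
view callbacks, with no coordinating server involved:

    import org.jgroups.JChannel;
    import org.jgroups.ReceiverAdapter;
    import org.jgroups.View;

    public class JGroupsDiscoverySketch {
        public static void main(String[] args) throws Exception {
            // Default protocol stack (UDP multicast); a TCP-based stack would
            // be configured instead for cross-data-center deployments.
            JChannel channel = new JChannel();
            channel.setReceiver(new ReceiverAdapter() {
                @Override
                public void viewAccepted(View view) {
                    // Every member receives the same ordered member list;
                    // by convention the first member acts as coordinator.
                    System.out.println("topology changed: " + view.getMembers());
                }
            });
            channel.connect("sling-discovery"); // joins or forms the cluster
        }
    }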

On the other hand, ZooKeeper and the layers that sit on top of it provide a 
richer pre-built cluster and topology management environment, at the expense 
of deployment complexity and dedicated server resources. I don't believe using 
ZooKeeper as an embedded component is viable at scale. ZooKeeper is Apache 
licensed, so no issue.
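
For comparison, liveliness detection with the plain ZooKeeper client API might 
look like the following minimal sketch (the connect string and znode paths are 
made up, and the parent path is assumed to already exist); the ephemeral node 
vanishes automatically when the instance's session dies, so peers learn about 
crashes through watches rather than polling:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkLivelinessSketch {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000,
                    new Watcher() {
                        public void process(WatchedEvent event) {
                            // Fires when the watched children change, i.e. an
                            // instance joined or its ephemeral node vanished.
                            System.out.println("event: " + event);
                        }
                    });
            // Ephemeral node: deleted by the ensemble when this instance's
            // session ends, so a crash is detected without any polling.
            zk.create("/discovery/instances/instance-", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
            // Register a watch on the parent to learn about joins/leaves.
            zk.getChildren("/discovery/instances", true);
        }
    }

Note the dedicated-server trade-off here: this sketch assumes a running 
three-node ZooKeeper ensemble, whereas the JGroups sketch above needs nothing 
besides the instances themselves.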

Perhaps seeing how these two subsystems behave in reality on a 100-node 
cluster of tiny EC2 instances would be a good way of gathering evidence to 
base a decision on?

It might also be worth having a quick look at how ElasticSearch manages 
topology. It is site-, rack- and machine-aware and also AWS/EC2-aware. The 
topology management is embedded, and ES has been used in large clusters, 
e.g. [3].

[1] https://docs.jboss.org/author/display/ISPN/Getting+Started+Guide+-+Clustered+Cache+in+Java+SE?_sscc=t
[2] https://issues.jboss.org/browse/AS7-3023
[3] http://architects.dzone.com/articles/our-experience-creating-large
--
Feedback from [~rombert]: 
+1 for JGroups. I've worked with it previously and it's small, embeddable, and 
does the job. My cluster was about 20 machines, but the primary author has 
reported a JGroups cluster of 536 nodes [1].

As for the licensing, JGroups is investigating a move to the Apache License 
2.0, but that move has not yet been finalized [2].

[1] http://belaban.blogspot.ro/2011/04/largest-jgroups-cluster-ever-536-nodes.html
[2] http://belaban.blogspot.ro/2013/05/jgroups-to-investigate-adopting-apache.html
                
> 3rd-party based implementation of discovery.api
> -----------------------------------------------
>
>                 Key: SLING-2939
>                 URL: https://issues.apache.org/jira/browse/SLING-2939
>             Project: Sling
>          Issue Type: Task
>          Components: Extensions
>    Affects Versions: Discovery API 1.0.0
>            Reporter: Stefan Egli
>            Assignee: Stefan Egli
>
> The Sling Discovery API introduces the abstraction of a topology which 
> contains (Sling) clusters and instances, supports liveliness-detection, 
> leader-election within a cluster and property-propagation between the 
> instances. As a default and reference implementation, a resource-based, OOTB 
> implementation was created (org.apache.sling.discovery.impl).
> Pros and cons of the discovery.impl
> Although the discovery.impl supports everything required in discovery.api, it 
> has a few limitations. Here's a list of pros and cons:
> Pros
>     No additional software required (leverages repository for intra-cluster 
> communication/storage and HTTP-REST calls for cross-cluster communication)
>     Very small footprint
>     Perfectly suited for a single cluster or instance and for small, rather 
> stable hub-based topologies
> Cons
>     Config-/deployment-limitations (aka embedded-limitation): connections 
> between clusters are peer-to-peer and explicit. To span a topology, a number 
> of instances must be made known to each other, and changes in the topology 
> typically require config adjustments to guarantee high availability of the 
> discovery service
>         Except when a natural "hub cluster" exists that can serve as the 
> connection point for all "satellite clusters"
>         Other than that, it is less suited for large and/or dynamic topologies
>     Change propagation (for topology parts reported via connectors) is 
> non-atomic, slow, and hop-by-hop based
>     No guarantee on the order of TopologyEvents sent on individual instances, 
> i.e. different instances might see different orders of TopologyEvents (i.e. 
> changes in the topology), but eventually the topology is guaranteed to be 
> consistent
>     Robustness of discovery.impl wrt storm situations depends on the 
> robustness of the underlying cluster (not a real negative, but discovery.impl 
> might in theory unveil repository bugs which would otherwise not have been a 
> problem)
>     Rather new, little-tested code which might have issues with edge cases 
> wrt network problems
>         Although partitioning-support is not a requirement per se, similar 
> edge cases might exist wrt network delays/timing/crashes
> Reusing a suitable 3rd party library
> To provide an additional option as an implementation of the discovery.api, 
> one idea is to use a suitable 3rd-party library.
> Requirements
> The following is a list of requirements a 3rd party library must support:
>     liveliness detection: detect whether an instance is up and running
>     stable leader election within a cluster: stable describes the fact that a 
> leader will remain leader until it leaves/crashes, and no new, joining 
> instance shall take over while a leader exists (one possible way to meet this 
> is sketched after this list)
>     stable instance ordering: the list of instances within a cluster is 
> ordered and stable; new, joining instances are appended at the end of the list
>     property propagation: propagate the properties provided by one 
> instance to everybody in the topology. There are no timing requirements bound 
> to this, but the intention is not to provide messaging but to 
> announce config parameters to the topology
>     support large, dynamic clusters: configuration of the new discovery 
> implementation should be easy and support frequent changes in the (large) 
> topology
>     no single point of failure: this is obvious, there should of course be no 
> single point of failure in the setup
>     embedded or dedicated: this might be a hot topic: embedding a library has 
> the advantage of not having to install anything additional. A dedicated 
> service, on the other hand, requires additional handling in deployment. 
> Embedding implies a peer-to-peer setup: nodes communicate peer-to-peer rather 
> than via a centralized service. This IMHO is a negative for large topologies, 
> which would typically span data centers. Hence a dedicated service could be 
> seen as an advantage in the end.
>     due to the need for cross-data-center deployments, the transport protocol 
> must be TCP (or HTTP for that matter)
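
To make the leader-election and ordering requirements concrete, here is a 
minimal sketch (one possible recipe, not itself a requirement) using 
ZooKeeper's sequential ephemeral nodes; the connect string and paths are made 
up and the parent path is assumed to exist. The lowest sequence number is the 
leader and keeps that role until its session ends, and later joiners always 
receive higher numbers, i.e. are appended to the ordered list:

    import java.util.Collections;
    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class LeaderElectionSketch {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("zk1:2181", 15000, event -> {});
            // Each instance creates one sequential ephemeral node; the
            // assigned sequence number fixes its position in the ordering.
            String me = zk.create("/discovery/election/n-", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
            List<String> nodes = zk.getChildren("/discovery/election", false);
            Collections.sort(nodes); // stable ordering: joiners sort last
            // The lowest sequence number leads and stays leader until it
            // leaves, since later joiners never get a smaller number.
            boolean leader = me.endsWith(nodes.get(0));
            System.out.println(leader ? "leader" : "follower of " + nodes.get(0));
        }
    }

In practice each follower would watch only its immediate predecessor node (as 
in the standard ZooKeeper election recipe) to avoid herd effects on leader 
changes.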

