[ https://issues.apache.org/jira/browse/SLING-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13696876#comment-13696876 ]
Stefan Egli commented on SLING-2939:
------------------------------------

[~ianeboston], [~rombert]: regarding JGroups: I think JGroups is quite a good fit, except for two aspects:

* Large installations would typically be point-to-point rather than UDP (the 536-machine cluster, for example, used UDP multicast). I believe we would like to support Sling deployments across data-centers and to use discovery between those data-centers for certain admin operations. My concern is how feasible UDP is across data-centers.
* I think the decision comes down to two deployment models: embedded or dedicated servers. Embedding has the advantage of requiring no additional services, but would ideally use multicast (thus running into the concern above). A dedicated service has the downside of being an additional component, but the scalability of the point-to-point setup, also across data-centers, seems better. (Scalability not in terms of pure performance - there multicast is best - but in terms of ease of configuration/setup.)
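For illustration, a minimal sketch of what a point-to-point (TCP) JGroups member could look like, assuming a JGroups 3.x-style API; the stack file name "sling-tcp.xml" and the cluster name are assumptions, not a worked-out proposal:

{code:java}
import org.jgroups.JChannel;
import org.jgroups.ReceiverAdapter;
import org.jgroups.View;

public class TcpDiscoverySketch {

    public static void main(String[] args) throws Exception {
        // The protocol stack comes from an XML file on the classpath; for a
        // cross data-center setup this would be a TCP-based stack (e.g. TCP +
        // TCPPING) rather than UDP multicast. "sling-tcp.xml" is hypothetical.
        JChannel channel = new JChannel("sling-tcp.xml");

        channel.setReceiver(new ReceiverAdapter() {
            @Override
            public void viewAccepted(View view) {
                // JGroups delivers a new membership view on join/leave/crash;
                // the first member of the view acts as the coordinator, which
                // could back a stable leader-election scheme.
                System.out.println("members: " + view.getMembers()
                        + ", coordinator: " + view.getMembers().get(0));
            }
        });

        channel.connect("sling-discovery"); // hypothetical cluster name
        Thread.sleep(60000);                // keep the member alive for a while
        channel.close();
    }
}
{code}

Note that the channel code is independent of the transport: the same class would run against a UDP-multicast stack file, so the embedded-vs-dedicated question is largely a configuration/deployment trade-off rather than an API one.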
> 3rd-party based implementation of discovery.api
> -----------------------------------------------
>
>                  Key: SLING-2939
>                  URL: https://issues.apache.org/jira/browse/SLING-2939
>              Project: Sling
>           Issue Type: Task
>           Components: Extensions
>     Affects Versions: Discovery API 1.0.0
>             Reporter: Stefan Egli
>             Assignee: Stefan Egli
>
> The Sling Discovery API introduces the abstraction of a topology which contains (Sling) clusters and instances, supports liveliness-detection, leader-election within a cluster and property-propagation between the instances. As a default and reference implementation a resource-based, OOTB implementation was created (org.apache.sling.discovery.impl).
>
> Pros and cons of the discovery.impl
> Although the discovery.impl supports everything required by discovery.api, it has a few limitations. Here is a list of pros and cons:
>
> Pros
> * No additional software required (it leverages the repository for intra-cluster communication/storage and HTTP-REST calls for cross-cluster communication)
> * Very small footprint
> * Perfectly suited for a single cluster or instance and for small, rather stable hub-based topologies
>
> Cons
> * Config-/deployment-limitations (aka embedded-limitation): connections between clusters are peer-to-peer and explicit. To span a topology, a number of instances must be made known to each other, and changes in the topology typically require config adjustments to guarantee high availability of the discovery service
> ** Except if a natural "hub cluster" exists that can serve as connection point for all "satellite clusters"
> ** Other than that, it is less suited for large and/or dynamic topologies
> * Change propagation (for topology parts reported via connectors) is non-atomic and slow, hop-by-hop based
> * No guarantee on the order of TopologyEvents sent to individual instances, i.e. different instances might see different orders of TopologyEvents (i.e. changes in the topology), but eventually the topology is guaranteed to be consistent
> * Robustness of discovery.impl wrt storm situations depends on the robustness of the underlying cluster (not a real negative, but discovery.impl might in theory unveil repository bugs which would otherwise not have been a problem)
> * Rather new, little-tested code which might have issues with edge cases wrt network problems
> ** Although partitioning-support is not a requirement per se, similar edge-cases might exist wrt network-delays/timing/crashes
>
> Reusing a suitable 3rd party library
> To provide an additional option as implementation of the discovery.api, one idea is to use a suitable 3rd party library.
>
> Requirements
> The following is a list of requirements a 3rd party library must support:
> * liveliness detection: detect whether an instance is up and running
> * stable leader election within a cluster: stable describes the fact that a leader will remain leader until it leaves/crashes and no new, joining instance shall take over while a leader exists
> * stable instance ordering: the list of instances within a cluster is ordered and stable; new, joining instances are put at the end of the list
> * property propagation: propagate the properties provided by one instance to everybody in the topology. There are no timing requirements bound to this, but the intention is not to use it as messaging, rather to announce config parameters to the topology
> * support for large, dynamic clusters: configuration of the new discovery implementation should be easy and should support frequent changes in the (large) topology
> * no single point of failure: this is obvious, there should of course be no single point of failure in the setup
> * embedded or dedicated: this might be a hot topic. Embedding a library has the advantage of not having to install anything additional; a dedicated service on the other hand requires additional handling in deployment. Embedding implies a peer-to-peer setup: nodes communicate peer-to-peer rather than via a centralized service. This IMHO is a negative for large topologies, which would typically span data-centers, hence a dedicated service could be seen as an advantage in the end.
> * due to the need for cross data-center deployments, the transport protocol must be TCP (or HTTP for that matter)
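To make the requirements above more concrete, here is a minimal sketch of the consumer side of the existing discovery.api (types from org.apache.sling.discovery); the property name is purely hypothetical, and in a real deployment the listener would be registered as an OSGi service so the discovery implementation can call it:

{code:java}
import org.apache.sling.discovery.InstanceDescription;
import org.apache.sling.discovery.TopologyEvent;
import org.apache.sling.discovery.TopologyEventListener;
import org.apache.sling.discovery.TopologyView;

public class LoggingTopologyListener implements TopologyEventListener {

    @Override
    public void handleTopologyEvent(TopologyEvent event) {
        if (event.getType() == TopologyEvent.Type.TOPOLOGY_CHANGING) {
            return; // the view is about to change, no new view is available yet
        }
        TopologyView view = event.getNewView();
        InstanceDescription local = view.getLocalInstance();

        // stable leader election: isLeader() stays true until the leader leaves/crashes
        System.out.println("local instance " + local.getSlingId()
                + (local.isLeader() ? " is" : " is not") + " the cluster leader");

        // stable instance ordering: the list is ordered; new instances are appended
        for (InstanceDescription instance : local.getClusterView().getInstances()) {
            // property propagation: properties announced by each instance are visible here
            System.out.println(instance.getSlingId() + " -> "
                    + instance.getProperty("hypothetical.config.property"));
        }
    }
}
{code}

Whichever 3rd party library is chosen only has to back these calls; consumers of discovery.api would not need to change.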
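The announce side of property propagation could look roughly like this, assuming the PropertyProvider contract from discovery.api; the property name and value are hypothetical, and the registration would normally be done declaratively (SCR) rather than by hand:

{code:java}
import java.util.Dictionary;
import java.util.Hashtable;

import org.apache.sling.discovery.PropertyProvider;
import org.osgi.framework.BundleContext;

public class ConfigAnnouncer implements PropertyProvider {

    @Override
    public String getProperty(String name) {
        // value announced by this instance; discovery propagates it to the topology
        if ("hypothetical.config.property".equals(name)) {
            return "some-instance-local-value";
        }
        return null;
    }

    public void register(BundleContext context) {
        Dictionary<String, Object> props = new Hashtable<String, Object>();
        // PROPERTY_PROPERTIES declares which property names this provider announces
        props.put(PropertyProvider.PROPERTY_PROPERTIES,
                new String[] { "hypothetical.config.property" });
        context.registerService(PropertyProvider.class.getName(), this, props);
    }
}
{code}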