Such asymmetric splits can be problematic because unreachability
information is spread via gossip. For example, A2 detects B1 as unreachable,
and that observation is gossiped to A3, B3 and C3, which then start ignoring
gossip messages from B1 even though they still have connectivity to it. I'm
not sure if that is what you are seeing.
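
As a rough illustration of that spreading, here is a toy model in plain Java.
It is only a sketch under simplifying assumptions, not Akka's actual
implementation: the class and method names are made up for the example, each
node's view is just a map from observer to the subjects it has marked
unreachable, and a gossip exchange simply copies the sender's records to the
receiver.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical toy model; none of these names come from Akka.
class GossipReachabilitySketch {

    // A node's view: observer -> subjects that observer has marked unreachable.
    static Map<String, Set<String>> newView() {
        return new TreeMap<>();
    }

    // Record a direct observation made by one node about another.
    static void markUnreachable(Map<String, Set<String>> view,
                                String observer, String subject) {
        view.computeIfAbsent(observer, k -> new TreeSet<>()).add(subject);
    }

    // Toy gossip step: the receiver adopts every record the sender holds.
    static void gossip(Map<String, Set<String>> from, Map<String, Set<String>> to) {
        from.forEach((observer, subjects) ->
            to.computeIfAbsent(observer, k -> new TreeSet<>()).addAll(subjects));
    }

    // In this model a subject is unreachable if any observer has marked it so.
    static Set<String> unreachable(Map<String, Set<String>> view) {
        Set<String> result = new TreeSet<>();
        view.values().forEach(result::addAll);
        return result;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> viewA2 = newView();
        Map<String, Set<String>> viewA3 = newView();

        // A2 loses its connection to B1 and marks it unreachable.
        markUnreachable(viewA2, "A2", "B1");
        System.out.println("A3 before gossip: " + unreachable(viewA3)); // []

        // A3 can still reach B1 directly, but learns A2's observation via gossip.
        gossip(viewA2, viewA3);
        System.out.println("A3 after gossip:  " + unreachable(viewA3)); // [B1]
    }
}
```

The point of the sketch is that A3 ends up treating B1 as unreachable without
ever having observed a failure itself.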

Look for warning log messages with "Marking node(s) as UNREACHABLE" to see
which node is detecting which other node as unreachable.
Turning on debug log level and perhaps also
akka.cluster.verbose-heartbeat-logging=on might give you more insights into
what is going on.

Regards,
Patrik

On Wed, Mar 8, 2017 at 5:20 PM, 'Francesco laTorre' via Akka User List <
[email protected]> wrote:

> Hi everyone,
>
> Straight to the point: I've got the following cluster:
>
> Host 1) Nodes A1 - B1 - C1
> Host 2) Nodes A2 - B2 - C2
> Host 3) Nodes A3 - B3 - C3
>
> Where :
>
> Nodes AX and BX are singletons
> Nodes CX are worker routers
>
> I'm writing a downing provider to handle the
>
> *Given* a well formed cluster as described :
>
>                +-----------+
>                |           |
>       +--------+  Host-1   +----------+
>       |        |  A1 B1 C1 |          |
>       |        |           |          |
>       |        +-----------+          |
>       |                               |
>       |                               |
>       |                               |
>       |                               |
> +-----+-----+                 +-------+---+
> |           |                 |           |
> |   Host-2  |                 |   Host-3  |
> |  A2 B2 C2 +-----------------+  A3 B3 C3 |
> |           |                 |           |
> +-----------+                 +-----------+
>
> *When* I cut the connection between Host-1 and Host-2
>
>               +------------+
>               |            |
>               |   Host-1   +----------+
>               |  A1-B1-C1  |          |
>               |            |          |
>               +------------+          |
>                                       |
>                                       |
>                                       |
> +-----------+                 +-------+---+
> |           |                 |           |
> |   Host-2  |                 |   Host-3  |
> | A2-B2-C2  +-----------------+ A3-B3-C3  |
> |           |                 |           |
> +-----------+                 +-----------+
>
> *And* wait 30 seconds and also cut the connection between Host-1 and
> Host-3
>
>               +------------+
>               |            |
>               |   Host 1   |
>               |  A1-B1-C1  |
>               |            |
>               +------------+
>
>
>
> +-----------+                 +-----------+
> |           |                 |           |
> |   Host 2  |                 |   Host 3  |
> | A2-B2-C2  +-----------------+ A3-B3-C3  |
> |           |                 |           |
> +-----------+                 +-----------+
>
>
> *Then* whoever is the leader in the cluster {Host-2, Host-3} downs Host-1
> and the singletons are migrated to the active re-sized cluster.
>
> In my first implementation, I had an actor on each host that simply
> subscribes to certain cluster events and, based on the messages received,
> works out what to do. This works quite nicely, given that it still covers
> some specific specs.
>
> *The problem arises when I try to implement it as a downing provider.*
> My testing implementation looks like the following :
>
> public class TestDP extends DowningProvider {
>
>    private static final Logger logger = LoggerFactory.getLogger(TestDP.class);
>
>    ActorSystem actorSystem;
>
>    public TestDP(ActorSystem actorSystem) {
>       logger.info("Instantiating Test Downing Provider...");
>       this.actorSystem = actorSystem;
>    }
>
>    @Override
>    public FiniteDuration downRemovalMargin() {
>       // Read the removal margin (in seconds) from the configuration.
>       return new FiniteDuration(
>          actorSystem.settings().config().getDuration(
>             "akka.cluster.down-removal-margin", TimeUnit.SECONDS),
>          TimeUnit.SECONDS
>       );
>    }
>
>    @Override
>    public Option<Props> downingActorProps() {
>       final CollectorConfig collectorConfig =
>          CollectorConfig.build(actorSystem.settings().config());
>
>       return new Some<>(
>          ClusterPartitionResolver.props(
>             actorSystem.settings().config().getInt(
>                "akka.cluster.minimumNumberOfNodesForCollectorToOperate"),
>             collectorConfig.getCollectorSingleton().getRole(),
>             actorSystem.settings().config().getInt(
>                "akka.cluster.number-of-nodes-per-host"),
>             actorSystem.settings().config().getDuration(
>                "akka.cluster.max-asymmetric-network-failure-tolerance",
>                TimeUnit.SECONDS),
>             actorSystem.settings().config().getDuration(
>                "akka.cluster.down-unreachable-after", TimeUnit.SECONDS),
>             actorSystem.settings().config().getDuration(
>                "akka.cluster.down-removal-margin", TimeUnit.SECONDS)
>          )
>       );
>    }
> }
>
>
>
> and the configuration :
>
> cluster {
>         downing-provider-class                   = "com.TestDP"
>         max-asymmetric-network-failure-tolerance = 30s
>         down-removal-margin                      = 70s
>         number-of-nodes-per-host                 = 3
>         down-unreachable-after                   = 120s
>         seed-nodes = [
>             something like : { A{1..3}, B{1..3}, C{1..3} }
>         ]
>     }
>
> The problem: when using the downing provider, even after completely
> disconnecting Host-1
>
>               +------------+
>               |            |
>               |   Host 1   |
>               |  A1-B1-C1  |
>               |            |
>               +------------+
>
>
>
> +-----------+                 +-----------+
> |           |                 |           |
> |   Host 2  |                 |   Host 3  |
> | A2-B2-C2  +-----------------+ A3-B3-C3  |
> |           |                 |           |
> +-----------+                 +-----------+
>
> Nodes in the remaining cluster below don't detect C1 as Unreachable (or at
> least the messages are not propagated correctly), whilst A1 and B1 are
> correctly marked Unreachable.
> This only happens when ClusterPartitionResolver is created by the downing
> provider; if I leave it as a standalone actor, then all the messages are
> created and the gossip seems to be working fine. I checked via JMX and got
> the same output.
>
> Is anybody aware of any reason why the unreachable messages are not
> propagated ?
>
> Cheers,
> Francesco
>
> --
> >>>>>>>>>> Read the docs: http://akka.io/docs/
> >>>>>>>>>> Check the FAQ: http://doc.akka.io/docs/akka/
> current/additional/faq.html
> >>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user
> ---
> You received this message because you are subscribed to the Google Groups
> "Akka User List" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/akka-user.
> For more options, visit https://groups.google.com/d/optout.
>
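
For the *Then* step in the quoted scenario (the {Host-2, Host-3} side downs
Host-1), one common way to make that decision is a strict-majority rule. The
sketch below is only an assumption about how such a rule could look; it is
not the ClusterPartitionResolver from the quoted message, whose logic is not
shown, and the class and method names are hypothetical. It also ignores what
the minority side should do.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical majority rule; not taken from Akka or from the quoted code.
class MajorityDowningSketch {

    // Returns the members this side would down. If this side cannot see a
    // strict majority of all members, it downs nobody (empty set).
    // Assumes 'unreachable' is a subset of 'allMembers'.
    static Set<String> membersToDown(Set<String> allMembers, Set<String> unreachable) {
        int reachableCount = allMembers.size() - unreachable.size();
        boolean haveMajority = reachableCount * 2 > allMembers.size();
        return haveMajority ? new TreeSet<>(unreachable) : new TreeSet<>();
    }

    public static void main(String[] args) {
        Set<String> all = new TreeSet<>(Arrays.asList(
            "A1", "B1", "C1", "A2", "B2", "C2", "A3", "B3", "C3"));
        // View from Host-2 and Host-3 once Host-1 is fully cut off.
        Set<String> unreachable = new TreeSet<>(Arrays.asList("A1", "B1", "C1"));

        // 6 of 9 members are reachable, so this side may down the other 3.
        System.out.println(membersToDown(all, unreachable)); // [A1, B1, C1]
    }
}
```

Note that a rule like this only gives the expected result if all three Host-1
nodes actually show up as unreachable on the surviving side.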



-- 

Patrik Nordwall
Akka Tech Lead
Lightbend <http://www.lightbend.com/> -  Reactive apps on the JVM
Twitter: @patriknw
