Hi Patrik,
I think indeed that is what we're seeing. It happened in production and
we can easily reproduce it by bringing down network connections one at a
time.
However, I wonder if this is something that could be solved in Akka
itself, as opposed to implementing a custom downing provider.
I'm thinking along these lines:
If node B is reported unreachable by node A, but node A itself is
unreachable by node C and node B is reachable by node C, it is pretty
clear that it's node A who's isolated from both B and C.
Perhaps a config can be added, something like
"prune-unreachable-watchers-after" which would
a) remove from the cluster state any unreachability information from
watchers that are themselves marked as unreachable.
Or perhaps more strictly:
b) remove from the cluster state any unreachability information from
watchers that are themselves marked as unreachable iff the node they
report as unreachable is reachable by its other reachable watchers.
Possibly a new watcher should be assigned after a watcher is pruned.
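To make the two variants concrete, here is a rough sketch of the pruning rules as plain Java over a set of "observer reports subject as unreachable" records. It is only an illustration of the idea, not Akka's internal reachability code: the class UnreachabilityPruning, the Observation record and the watchersOf map are hypothetical names, and for variant (b) I assume that at least one other reachable watcher must exist before a record is pruned.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical illustration of the proposed pruning rules; not Akka's API. */
final class UnreachabilityPruning {

    /** One record: `observer` reports `subject` as unreachable. */
    static final class Observation {
        final String observer;
        final String subject;

        Observation(String observer, String subject) {
            this.observer = observer;
            this.subject = subject;
        }

        @Override public boolean equals(Object other) {
            if (!(other instanceof Observation)) return false;
            Observation that = (Observation) other;
            return observer.equals(that.observer) && subject.equals(that.subject);
        }

        @Override public int hashCode() { return Objects.hash(observer, subject); }

        @Override public String toString() { return observer + "->" + subject; }
    }

    /** Nodes that at least one observer reports as unreachable. */
    static Set<String> unreachableNodes(Set<Observation> records) {
        return records.stream().map(o -> o.subject).collect(Collectors.toSet());
    }

    /** Variant (a): drop every record whose observer is itself marked unreachable. */
    static Set<Observation> pruneLoose(Set<Observation> records) {
        Set<String> unreachable = unreachableNodes(records);
        return records.stream()
                .filter(o -> !unreachable.contains(o.observer))
                .collect(Collectors.toSet());
    }

    /**
     * Variant (b): drop such a record only if the subject's other reachable
     * watchers exist (an assumption of this sketch) and none of them reports it.
     */
    static Set<Observation> pruneStrict(Set<Observation> records,
                                        Map<String, Set<String>> watchersOf) {
        Set<String> unreachable = unreachableNodes(records);
        return records.stream().filter(o -> {
            if (!unreachable.contains(o.observer)) return true; // observer is trusted
            Set<String> otherWatchers = watchersOf
                    .getOrDefault(o.subject, Collections.emptySet()).stream()
                    .filter(w -> !w.equals(o.observer) && !unreachable.contains(w))
                    .collect(Collectors.toSet());
            boolean vouchedFor = !otherWatchers.isEmpty()
                    && otherWatchers.stream()
                        .noneMatch(w -> records.contains(new Observation(w, o.subject)));
            return !vouchedFor; // keep the record unless reachable watchers vouch for the subject
        }).collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        // A reports B unreachable, C reports A unreachable, C can still reach B.
        Set<Observation> records = new HashSet<>(Arrays.asList(
                new Observation("A", "B"), new Observation("C", "A")));
        Map<String, Set<String>> watchersOf = new HashMap<>();
        watchersOf.put("B", new HashSet<>(Arrays.asList("A", "C")));
        watchersOf.put("A", new HashSet<>(Arrays.asList("B", "C")));
        System.out.println(pruneLoose(records));
        System.out.println(pruneStrict(records, watchersOf));
    }
}
```

In the A/B/C example both variants keep only C's report about A. They differ when two nodes merely report each other: variant (a) would drop both reports, while variant (b) keeps them because no other reachable watcher vouches for either node.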
All comments welcome.
Happy to work on a PR if this approach is deemed valid.
Thanks,
Michał
On 10/03/17 09:32, Patrik Nordwall wrote:
Such asymmetric splits can be problematic, because the unreachability
information is spread via the gossip. E.g. A2 detects B1 as
unreachable and that is spread to A3-B3-C3, which will start ignoring
gossip messages from B1 even though they have some connectivity. I'm
not sure if that is what you are seeing.
Look for warning log messages with "Marking node(s) as UNREACHABLE" to
see which node is detecting which other node as unreachable.
Turning on debug log level and perhaps also
akka.cluster.verbose-heartbeat-logging=on might give you more insights
into what is going on.
Regards,
Patrik
On Wed, Mar 8, 2017 at 5:20 PM, 'Francesco laTorre' via Akka User List
wrote:
Hi everyone,
Straight to the point: I've got the following cluster :
Host 1) Nodes A1 - B1 - C1
Host 2) Nodes A2 - B2 - C2
Host 3) Nodes A3 - B3 - C3
Where :
Nodes AX and BX are singletons
Nodes CX are worker routers
I'm writing a downing provider to handle the following scenario.
/*Given*/ a well formed cluster as described :
+-----------+
| |
+--------+ Host-1 +----------+
| | A1 B1 C1 | |
| | | |
| +-----------+ |
| |
| |
| |
| |
+-----+-----+ +-------+---+
| | | |
| Host-2 | | Host-3 |
| A2 B2 C2 +-----------------+ A3 B3 C3 |
| | | |
+-----------+ +-----------+
/*When*/ I cut the connection between Host-1 and Host-2
+------------+
| |
| Host-1 +----------+
| A1-B1-C1 | |
| | |
+------------+ |
|
|
|
+-----------+ +-------+---+
| | | |
| Host-2 | | Host-3 |
| A2-B2-C2 +-----------------+ A3-B3-C3 |
| | | |
+-----------+ +-----------+
/*And*/ wait 30 seconds and also cut the connection between Host-1
and Host-3
+------------+
| |
| Host 1 |
| A1-B1-C1 |
| |
+------------+
+-----------+ +-----------+
| | | |
| Host 2 | | Host 3 |
| A2-B2-C2 +-----------------+ A3-B3-C3 |
| | | |
+-----------+ +-----------+
/*Then*/ whoever is the leader in the cluster {Host-2, Host-3}
downs Host-1 and the singletons are migrated to the active
re-sized cluster.
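The expected outcome above can be sketched as a small decision function. This is only an illustration under stated assumptions, not the ClusterPartitionResolver discussed in this thread: the class HostDowningSketch, the helper threeHosts and the "strict majority of hosts" rule are hypothetical, and the 30 second tolerance for asymmetric failures is left out.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of the Given/When/Then scenario; not a real downing provider. */
final class HostDowningSketch {

    /** The layout from this thread: nodes A, B and C on each of Host-1..Host-3. */
    static Map<String, String> threeHosts() {
        Map<String, String> nodeToHost = new HashMap<>();
        for (int host = 1; host <= 3; host++) {
            for (String role : new String[] {"A", "B", "C"}) {
                nodeToHost.put(role + host, "Host-" + host);
            }
        }
        return nodeToHost;
    }

    /**
     * Returns the nodes the leader on this side would down.
     * nodeToHost maps every cluster member to its host; unreachable holds the
     * members this side currently sees as unreachable.
     */
    static Set<String> nodesToDown(Map<String, String> nodeToHost, Set<String> unreachable) {
        // Hosts that still have at least one reachable node.
        Set<String> reachableHosts = new HashSet<>();
        for (Map.Entry<String, String> entry : nodeToHost.entrySet()) {
            if (!unreachable.contains(entry.getKey())) {
                reachableHosts.add(entry.getValue());
            }
        }
        Set<String> allHosts = new HashSet<>(nodeToHost.values());
        Set<String> toDown = new TreeSet<>();
        // Assumed rule: act only while this side holds a strict majority of hosts.
        if (reachableHosts.size() * 2 <= allHosts.size()) {
            return toDown;
        }
        // Down every node that lives on a host with no reachable node left.
        for (Map.Entry<String, String> entry : nodeToHost.entrySet()) {
            if (!reachableHosts.contains(entry.getValue())) {
                toDown.add(entry.getKey());
            }
        }
        return toDown;
    }

    public static void main(String[] args) {
        // View from Host-2/Host-3 after both links to Host-1 are cut.
        Set<String> unreachable = new HashSet<>(Arrays.asList("A1", "B1", "C1"));
        System.out.println(nodesToDown(threeHosts(), unreachable)); // prints [A1, B1, C1]
    }
}
```

Under this rule the isolated Host-1 side, which sees the other six nodes as unreachable, downs nothing. If only A1 and B1 are marked unreachable, Host-1 still counts as reachable and nothing is downed either, so the outcome depends on all three nodes of the host being reported.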
In my first implementation, I had an actor on each host that simply
subscribes to certain cluster events and, based on the messages
received, works out what to do. It works quite nicely, given that it
still only covers some specific specs.
_The problem arises when I try to implement it as a downing provider._
My testing implementation looks like the following :
    public class TestDP extends DowningProvider {

        private static final Logger logger = LoggerFactory.getLogger(TestDP.class);

        ActorSystem actorSystem;

        public TestDP(ActorSystem actorSystem) {
            logger.info("Instantiating Test Downing Provider...");
            this.actorSystem = actorSystem;
        }

        @Override
        public FiniteDuration downRemovalMargin() {
            return new FiniteDuration(
                actorSystem.settings().config().getDuration("akka.cluster.down-removal-margin", TimeUnit.SECONDS),
                TimeUnit.SECONDS
            );
        }

        @Override
        public Option<Props> downingActorProps() {
            final CollectorConfig collectorConfig = CollectorConfig.build(actorSystem.settings().config());
            return new Some<>(
                ClusterPartitionResolver.props(
                    actorSystem.settings().config().getInt("akka.cluster.minimumNumberOfNodesForCollectorToOperate"),
                    collectorConfig.getCollectorSingleton().getRole(),
                    actorSystem.settings().config().getInt("akka.cluster.number-of-nodes-per-host"),
                    actorSystem.settings().config().getDuration("akka.cluster.max-asymmetric-network-failure-tolerance", TimeUnit.SECONDS),
                    actorSystem.settings().config().getDuration("akka.cluster.down-unreachable-after", TimeUnit.SECONDS),
                    actorSystem.settings().config().getDuration("akka.cluster.down-removal-margin", TimeUnit.SECONDS)
                )
            );
        }
    }
and the configuration :
    cluster {
      downing-provider-class = "com.TestDP"
      max-asymmetric-network-failure-tolerance = 30s
      down-removal-margin = 70s
      number-of-nodes-per-host = 3
      down-unreachable-after = 120s
      seed-nodes = [
        something like : { A{1..3}, B{1..3}, C{1..3} }
      ]
    }
The problem : when using the downing provider, even after
completely disconnecting Host-1
+------------+
| |
| Host 1 |
| A1-B1-C1 |
| |
+------------+
+-----------+ +-----------+
| | | |
| Host 2 | | Host 3 |
| A2-B2-C2 +-----------------+ A3-B3-C3 |
| | | |
+-----------+ +-----------+
Nodes on the cluster underneath don't detect C1 as Unreachable (or
at least the messages are not propagated correctly), whilst A1 and B1
are correctly marked Unreachable.
This only happens when ClusterPartitionResolver is created by the
downing provider; if I leave it as a standalone actor, all the
messages are created and the gossip seems to be working fine. I
checked via JMX and got the same output.
Is anybody aware of any reason why the unreachable messages are
not propagated ?
Cheers,
Francesco
--
>>>>>>>>>> Read the docs: http://akka.io/docs/
>>>>>>>>>> Check the FAQ:
http://doc.akka.io/docs/akka/current/additional/faq.html
<http://doc.akka.io/docs/akka/current/additional/faq.html>
>>>>>>>>>> Search the archives:
https://groups.google.com/group/akka-user
<https://groups.google.com/group/akka-user>
---
You received this message because you are subscribed to the Google
Groups "Akka User List" group.
To unsubscribe from this group and stop receiving emails from it,
send an email to [email protected]
<mailto:[email protected]>.
To post to this group, send email to [email protected]
<mailto:[email protected]>.
Visit this group at https://groups.google.com/group/akka-user
<https://groups.google.com/group/akka-user>.
For more options, visit https://groups.google.com/d/optout
<https://groups.google.com/d/optout>.
--
Patrik Nordwall
Akka Tech Lead
Lightbend <http://www.lightbend.com/> - Reactive apps on the JVM
Twitter: @patriknw
--
Michal Borowiecki
Senior Software Engineer L4
T: +44 208 742 1600
+44 203 249 8448
E: [email protected]
W: www.openbet.com <http://www.openbet.com/>
OpenBet Ltd
Chiswick Park Building 9
566 Chiswick High Rd
London
W4 5XT
UK