I think it would be better to expose more details from the Reachability
state and then that information can be interpreted in various smart ways
depending on what is needed.
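
As a minimal, purely illustrative sketch of what is exposed today: the closest
thing available is subscribing to the reachability-related cluster events; the
detailed per-observer Reachability table itself is internal and not part of the
public API. Assuming plain akka-cluster with the Java API:

import akka.actor.AbstractActor;
import akka.cluster.Cluster;
import akka.cluster.ClusterEvent;

// Logs which members become unreachable/reachable, as seen from this node's
// view of the gossiped cluster state.
public class ReachabilityObserver extends AbstractActor {

   private final Cluster cluster = Cluster.get(getContext().getSystem());

   @Override
   public void preStart() {
      cluster.subscribe(getSelf(), ClusterEvent.initialStateAsEvents(),
            ClusterEvent.UnreachableMember.class,
            ClusterEvent.ReachableMember.class);
   }

   @Override
   public void postStop() {
      cluster.unsubscribe(getSelf());
   }

   @Override
   public Receive createReceive() {
      return receiveBuilder()
            .match(ClusterEvent.UnreachableMember.class,
                  e -> System.out.println("Unreachable: " + e.member()))
            .match(ClusterEvent.ReachableMember.class,
                  e -> System.out.println("Reachable again: " + e.member()))
            .build();
   }
}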

/Patrik

On Sat, May 27, 2017 at 10:12 PM, Michal Borowiecki <
[email protected]> wrote:

> Hi Patrik,
>
> I think indeed that is what we're seeing. It happened in production and we
> can easily reproduce it by bringing down network connections one at a time.
>
> However, I wonder if this is something that could be solved in akka
> itself, as opposed to trying to implement a custom downing provider.
>
> I'm thinking along these lines:
>
> If node B is reported unreachable by node A, but node A itself is reported
> unreachable by node C while node B is still reachable from node C, it is
> pretty clear that node A is the one isolated from both B and C.
>
> Perhaps a config setting could be added, something like
> "prune-unreachable-watchers-after", which would:
>
> a) remove from the cluster state any unreachability information from
> watchers that are themselves marked as unreachable.
>
> Or perhaps more strictly:
>
> b) remove from the cluster state any unreachability information from
> watchers that are themselves marked as unreachable, iff the node they report
> as unreachable is reachable by its other reachable watchers.
>
> Possibly a new watcher should be assigned after a watcher is pruned.
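>
> As a rough, purely illustrative sketch of option (b) (not an existing Akka
> API; the observer -> subjects map here is hypothetical and would have to be
> derived from the reachability information discussed above):
>
> import java.util.HashMap;
> import java.util.Map;
> import java.util.Set;
> import java.util.stream.Collectors;
>
> import akka.actor.Address;
>
> public class ReachabilityPruning {
>
>    // reports: observer node -> set of nodes that observer reports as unreachable.
>    // Drop the records of observers that are themselves marked unreachable,
>    // unless a reachable observer also reports one of the same subjects.
>    static Map<Address, Set<Address>> prune(Map<Address, Set<Address>> reports) {
>       Set<Address> unreachable = reports.values().stream()
>             .flatMap(Set::stream)
>             .collect(Collectors.toSet());
>
>       Map<Address, Set<Address>> pruned = new HashMap<>();
>       for (Map.Entry<Address, Set<Address>> entry : reports.entrySet()) {
>          Address observer = entry.getKey();
>          Set<Address> subjects = entry.getValue();
>
>          boolean observerIsUnreachable = unreachable.contains(observer);
>          boolean confirmedByReachableObserver = reports.entrySet().stream()
>                .filter(other -> !other.getKey().equals(observer))
>                .filter(other -> !unreachable.contains(other.getKey()))
>                .anyMatch(other -> other.getValue().stream().anyMatch(subjects::contains));
>
>          if (observerIsUnreachable && !confirmedByReachableObserver) {
>             continue; // prune this watcher's unreachability records
>          }
>          pruned.put(observer, subjects);
>       }
>       return pruned;
>    }
> }
>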
> All comments welcome.
>
> Happy to work on a PR if this approach is deemed valid.
>
> Thanks,
>
> Michał
>
> On 10/03/17 09:32, Patrik Nordwall wrote:
>
> Such asymmetric splits can be problematic, because the unreachability
> information is spread via gossip. E.g. A2 detects B1 as unreachable and that
> information is spread to A3-B3-C3, which will start ignoring gossip messages
> from B1 even though they still have some connectivity to it. I'm not sure if
> that is what you are seeing.
>
> Look for warning log messages with "Marking node(s) as UNREACHABLE" to see
> which node is detecting which other node as unreachable.
> Turning on debug log level and perhaps also 
> akka.cluster.verbose-heartbeat-logging=on
> might give you more insights into what is going on.
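>
> A minimal configuration sketch for the above (the exact path of the
> verbose-heartbeat-logging setting may differ between Akka versions):
>
> akka {
>   loglevel = "DEBUG"
>   cluster.verbose-heartbeat-logging = on
> }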
>
> Regards,
> Patrik
>
> On Wed, Mar 8, 2017 at 5:20 PM, 'Francesco laTorre' via Akka User List <
> [email protected]> wrote:
>
>> Hi everyone,
>>
>> Straight to the point, the scenario is the following. I've got this cluster:
>>
>> Host 1) Nodes A1 - B1 - C1
>> Host 2) Nodes A2 - B2 - C2
>> Host 3) Nodes A3 - B3 - C3
>>
>> Where :
>>
>> Nodes AX and BX are singletons
>> Nodes CX are worker routers
>>
>> I'm writing a downing provider to handle the following scenario:
>>
>> *Given* a well-formed cluster as described:
>>
>>                +-----------+
>>                |           |
>>       +--------+  Host-1   +----------+
>>       |        |  A1 B1 C1 |          |
>>       |        |           |          |
>>       |        +-----------+          |
>>       |                               |
>>       |                               |
>>       |                               |
>>       |                               |
>> +-----+-----+                 +-------+---+
>> |           |                 |           |
>> |   Host-2  |                 |   Host-3  |
>> |  A2 B2 C2 +-----------------+  A3 B3 C3 |
>> |           |                 |           |
>> +-----------+                 +-----------+
>>
>> *When* I cut the connection between Host-1 and Host-2
>>
>>               +------------+
>>               |            |
>>               |   Host-1   +----------+
>>               |  A1-B1-C1  |          |
>>               |            |          |
>>               +------------+          |
>>                                       |
>>                                       |
>>                                       |
>> +-----------+                 +-------+---+
>> |           |                 |           |
>> |   Host-2  |                 |   Host-3  |
>> | A2-B2-C2  +-----------------+ A3-B3-C3  |
>> |           |                 |           |
>> +-----------+                 +-----------+
>>
>> *And* wait 30 seconds and also cut the connection between Host-1 and
>> Host-3
>>
>>               +------------+
>>               |            |
>>               |   Host 1   |
>>               |  A1-B1-C1  |
>>               |            |
>>               +------------+
>>
>>
>>
>> +-----------+                 +-----------+
>> |           |                 |           |
>> |   Host 2  |                 |   Host 3  |
>> | A2-B2-C2  +-----------------+ A3-B3-C3  |
>> |           |                 |           |
>> +-----------+                 +-----------+
>>
>>
>> *Then* whoever is the leader in the cluster {Host-2, Host-3} downs
>> Host-1 and the singletons are migrated to the active re-sized cluster.
>>
>> In my first implementation, I have an actor on each host that simply
>> subscribes to certain cluster events and, based on the messages received,
>> works out what to do. It works quite nicely, given that it still covers some
>> specific specs.
>>
>> *The problem arises when I try to implement this as a downing provider.*
>> My test implementation looks like the following:
>>
>> import java.util.concurrent.TimeUnit;
>>
>> import org.slf4j.Logger;
>> import org.slf4j.LoggerFactory;
>>
>> import akka.actor.ActorSystem;
>> import akka.actor.Props;
>> import akka.cluster.DowningProvider;
>> import scala.Option;
>> import scala.Some;
>> import scala.concurrent.duration.FiniteDuration;
>>
>> // CollectorConfig and ClusterPartitionResolver are application classes (not shown).
>> public class TestDP extends DowningProvider {
>>
>>    private static final Logger logger = LoggerFactory.getLogger(TestDP.class);
>>
>>    private final ActorSystem actorSystem;
>>
>>    public TestDP(ActorSystem actorSystem) {
>>       logger.info("Instantiating Test Downing Provider...");
>>       this.actorSystem = actorSystem;
>>    }
>>
>>    @Override
>>    public FiniteDuration downRemovalMargin() {
>>       return new FiniteDuration(
>>          actorSystem.settings().config()
>>             .getDuration("akka.cluster.down-removal-margin", TimeUnit.SECONDS),
>>          TimeUnit.SECONDS);
>>    }
>>
>>    @Override
>>    public Option<Props> downingActorProps() {
>>       final CollectorConfig collectorConfig =
>>          CollectorConfig.build(actorSystem.settings().config());
>>
>>       return new Some<>(
>>          ClusterPartitionResolver.props(
>>             actorSystem.settings().config()
>>                .getInt("akka.cluster.minimumNumberOfNodesForCollectorToOperate"),
>>             collectorConfig.getCollectorSingleton().getRole(),
>>             actorSystem.settings().config()
>>                .getInt("akka.cluster.number-of-nodes-per-host"),
>>             actorSystem.settings().config()
>>                .getDuration("akka.cluster.max-asymmetric-network-failure-tolerance", TimeUnit.SECONDS),
>>             actorSystem.settings().config()
>>                .getDuration("akka.cluster.down-unreachable-after", TimeUnit.SECONDS),
>>             actorSystem.settings().config()
>>                .getDuration("akka.cluster.down-removal-margin", TimeUnit.SECONDS)));
>>    }
>> }
>>
>>
>>
>> and the configuration:
>>
>> cluster {
>>     downing-provider-class                   = "com.TestDP"
>>     max-asymmetric-network-failure-tolerance = 30s
>>     down-removal-margin                      = 70s
>>     number-of-nodes-per-host                 = 3
>>     down-unreachable-after                   = 120s
>>     seed-nodes = [
>>         something like: { A{1..3}, B{1..3}, C{1..3} }
>>     ]
>> }
>>
>> The problem: when using the downing provider, even after completely
>> disconnecting Host-1
>>
>>               +------------+
>>               |            |
>>               |   Host 1   |
>>               |  A1-B1-C1  |
>>               |            |
>>               +------------+
>>
>>
>>
>> +-----------+                 +-----------+
>> |           |                 |           |
>> |   Host 2  |                 |   Host 3  |
>> | A2-B2-C2  +-----------------+ A3-B3-C3  |
>> |           |                 |           |
>> +-----------+                 +-----------+
>>
>> Nodes in the remaining cluster don't detect C1 as unreachable (or at least
>> the messages are not propagated correctly), whilst A1 and B1 are correctly
>> marked unreachable.
>> This only happens when ClusterPartitionResolver is created by the downing
>> provider; if I leave it as a standalone actor then all the messages are
>> created and the gossip seems to work fine. I checked via JMX and got the
>> same output.
>>
>> Is anybody aware of any reason why the unreachable messages are not
>> propagated?
>>
>> Cheers,
>> Francesco
>>
>
>
>
> --
>
> Patrik Nordwall
> Akka Tech Lead
> Lightbend <http://www.lightbend.com/> -  Reactive apps on the JVM
> Twitter: @patriknw
>
>
> --
> <http://www.openbet.com/> Michal Borowiecki
> Senior Software Engineer L4
> T: +44 208 742 1600 <+44%2020%208742%201600>
>
>
> +44 203 249 8448 <+44%2020%203249%208448>
>
>
>
> E: [email protected]
> W: www.openbet.com
> OpenBet Ltd
>
> Chiswick Park Building 9
>
> 566 Chiswick High Rd
>
> London
>
> W4 5XT
>
> UK
> <https://www.openbet.com/email_promo>
> This message is confidential and intended only for the addressee. If you
> have received this message in error, please immediately notify the
> [email protected] and delete it from your system as well as any
> copies. The content of e-mails as well as traffic data may be monitored by
> OpenBet for employment and security purposes. To protect the environment
> please do not print this e-mail unless necessary. OpenBet Ltd. Registered
> Office: Chiswick Park Building 9, 566 Chiswick High Road, London, W4 5XT,
> United Kingdom. A company registered in England and Wales. Registered no.
> 3134634. VAT no. GB927523612
>



-- 

Patrik Nordwall
Akka Tech Lead
Lightbend <http://www.lightbend.com/> -  Reactive apps on the JVM
Twitter: @patriknw
