Vladimir,

Such behavior can be introduced by an erroneous firewall configuration (I can't find a link, but I recall that quite a large number of major incidents are caused by an incorrect configuration change). If such a case can be detected, we prefer that Ignite shut down some of the nodes rather than leave the whole cluster hanging while it waits for connections.
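For concreteness, below is a minimal sketch of how such a partial-connectivity scenario could be reproduced locally. This is not the test from IGNITE-7163; the host addresses, ports and the firewall rule in the comments are assumptions made up for the illustration only.

// Hypothetical reproduction sketch: two Ignite nodes with a static IP finder.
// Addresses, ports and the firewall rule below are assumptions for illustration,
// not taken from IGNITE-7163 or its test.
import java.util.Arrays;

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
import org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder;

public class PartialConnectivityExample {
    public static void main(String[] args) {
        TcpDiscoveryVmIpFinder ipFinder = new TcpDiscoveryVmIpFinder();
        // Assumed addresses of the two nodes under test.
        ipFinder.setAddresses(Arrays.asList("10.0.0.3:47500", "10.0.0.4:47500"));

        TcpDiscoverySpi discoSpi = new TcpDiscoverySpi()
            .setIpFinder(ipFinder)
            .setLocalPort(47500);

        IgniteConfiguration cfg = new IgniteConfiguration()
            .setIgniteInstanceName("node-3")
            .setDiscoverySpi(discoSpi);

        // With both nodes started, a one-way firewall rule applied on 10.0.0.4, e.g.
        //   iptables -A INPUT -s 10.0.0.3 -p tcp --dport 47500 -j DROP
        // blocks connections from node 3 only, leaving the rest of the cluster unaffected.
        Ignite ignite = Ignition.start(cfg);
        System.out.println("Topology: " + ignite.cluster().nodes());
    }
}

With such a one-way rule, node 3 fails to open connections to node 4 while other nodes still can, which is the kind of asymmetry a misconfigured firewall produces.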
Sat, Jun 6, 2020 at 17:09, Denis Magda <dma...@apache.org>:

> Finally, I got your question.
>
> Back in 2017-2018, there was a Discovery SPI stabilization activity. The
> networking component could fail in various hard-to-reproduce scenarios
> affecting cluster availability and consistency. That ticket reminds me of
> those notorious issues that would fire once a week or month under specific
> configuration settings. So, I would not touch the code that fixes the issue
> unless @Alexey Goncharuk <alexey.goncha...@gmail.com> or @Sergey Chugunov
> <schugu...@gridgain.com> confirms that it's safe to do. Also, there should
> be a test for this scenario.
>
> -
> Denis
>
>
> On Fri, Jun 5, 2020 at 12:28 AM Vladimir Steshin <vlads...@gmail.com>
> wrote:
>
> > Denis,
> >
> > I have no nodes that I'm unable to interconnect. This case is simulated
> > in IgniteDiscoveryMassiveNodeFailTest.testMassiveFailSelfKill(),
> > introduced in [1].
> >
> > I'm asking whether it is a real or a supposed problem. Where was it met?
> > Which network configuration/issues could cause it?
> >
> >
> > [1] https://issues.apache.org/jira/browse/IGNITE-7163
> >
> > On 05.06.2020 1:01, Denis Magda wrote:
> > > Vladimir,
> > >
> > > I'm suggesting to share the log files from the nodes that are unable to
> > > interconnect so that the community can check them for potential issues.
> > > Instead of sharing the logs from all the 5 nodes, try to start a
> > > two-node cluster with the nodes that fail to discover each other and
> > > attach the logs from those.
> > >
> > > -
> > > Denis
> > >
> > >
> > > On Thu, Jun 4, 2020 at 1:57 PM Vladimir Steshin <vlads...@gmail.com>
> > > wrote:
> > >
> > >> Denis, hi.
> > >>
> > >> Sorry, I didn't catch your idea. Are you saying this can happen and
> > >> suggesting an experiment? I'm not describing a probable case. It is
> > >> already done in [1]. I'm asking whether it is real and where it was met.
> > >>
> > >>
> > >> On 04.06.2020 23:33, Denis Magda wrote:
> > >>> Vladimir,
> > >>>
> > >>> Please do the following experiment. Start a 2-node cluster booting
> > >>> node 3 and, for instance, node 5. Those won't be able to interconnect
> > >>> according to your description. Attach the log files from both nodes
> > >>> for analysis. This should be a networking issue.
> > >>>
> > >>> -
> > >>> Denis
> > >>>
> > >>>
> > >>> On Thu, Jun 4, 2020 at 1:24 PM Vladimir Steshin <vlads...@gmail.com>
> > >>> wrote:
> > >>>> Hi, Igniters.
> > >>>>
> > >>>>
> > >>>> I wanted to ask how one node may be unable to connect to another
> > >>>> whereas the rest of the cluster can. This is covered in [1]. In short:
> > >>>> node 3 can't connect to nodes 4 and 5 but can to 1. At the same time,
> > >>>> node 2 can connect to 4. Questions:
> > >>>>
> > >>>> 1) Is it a real case? Where did this problem come from?
> > >>>>
> > >>>> 2) If node 3 can't connect to 4 and 5, does it mean node 2 can't
> > >>>> connect to 4 (and 5) too?
> > >>>>
> > >>>> Sergey, Dmitry, maybe you can shed light on this (I see you in [1])?
> > >>>> I'm participating in [2] and found this backward connection checking.
> > >>>> Answering would help us a lot.
> > >>>>
> > >>>> Thanks!
> > >>>>
> > >>>> [1] https://issues.apache.org/jira/browse/IGNITE-7163
> > >>>>
> > >>>> [2] https://cwiki.apache.org/confluence/display/IGNITE/IEP-45%3A+Crash+Recovery+Speed-Up