Vladimir,

Adding to what Alexey has said, I remember that cases of short-term network
issues (a "blinking" network) were also a driver for this improvement. They
are indeed hard to reproduce, but they have been seen in real-world setups,
and the fix has proven to increase cluster stability.
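For illustration only (this is not Ignite's actual code, and all names below are hypothetical): a "blinking" network is one where a connection attempt fails transiently and then succeeds again moments later. A liveness check that declares a peer failed on the first refused connection will kill healthy nodes during such a blink, while a check that retries a few times before giving up tolerates it. A minimal sketch of that difference:

```java
public class BlinkingNetworkSketch {
    /** Simulates a link that fails for the first N attempts (the "blink"), then recovers. */
    static class BlinkingLink {
        private int failuresLeft;

        BlinkingLink(int transientFailures) {
            this.failuresLeft = transientFailures;
        }

        boolean tryConnect() {
            if (failuresLeft > 0) {
                failuresLeft--;
                return false; // transient failure
            }
            return true; // link has recovered
        }
    }

    /** Declares the peer dead on the very first refused connection. */
    static boolean naiveCheck(BlinkingLink link) {
        return link.tryConnect();
    }

    /** Retries a few times before giving up, so short blinks are tolerated. */
    static boolean retryingCheck(BlinkingLink link, int maxAttempts) {
        for (int i = 0; i < maxAttempts; i++) {
            if (link.tryConnect())
                return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // The link blinks: the first 2 attempts fail, then it recovers.
        System.out.println("naive: " + naiveCheck(new BlinkingLink(2)));        // false -> healthy node declared dead
        System.out.println("retry: " + retryingCheck(new BlinkingLink(2), 3));  // true  -> blink tolerated
    }
}
```

The real discovery-layer logic is of course far more involved (timeouts, backward connection checks, topology coordination), but this is the basic failure mode a single-attempt check runs into.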

On Sat, Jun 6, 2020 at 5:09 PM Denis Magda <dma...@apache.org> wrote:

> Finally, I got your question.
>
> Back in 2017-2018, there was a Discovery SPI's stabilization activity. The
> networking component could fail in various hard-to-reproduce scenarios
> affecting cluster availability and consistency. That ticket reminds me of
> those notorious issues that would fire once a week or month under specific
> configuration settings. So, I would not touch the code that fixes the issue
> unless @Alexey Goncharuk <alexey.goncha...@gmail.com> or @Sergey Chugunov
> <schugu...@gridgain.com> confirms that it's safe to do. Also, there should
> be a test for this scenario.
>
> -
> Denis
>
>
> On Fri, Jun 5, 2020 at 12:28 AM Vladimir Steshin <vlads...@gmail.com>
> wrote:
>
> > Denis,
> >
> > I have no nodes that are unable to interconnect. This case is simulated
> > in IgniteDiscoveryMassiveNodeFailTest.testMassiveFailSelfKill(),
> > introduced in [1].
> >
> > I’m asking whether it is a real or a hypothetical problem. Where was it
> > met? Which network configuration or issues could cause it?
> >
> >
> > [1] https://issues.apache.org/jira/browse/IGNITE-7163
> >
> > On 05.06.2020 1:01, Denis Magda wrote:
> > > Vladimir,
> > >
> > > I'm suggesting to share the log files from the nodes that are unable to
> > > interconnect so that the community can check them for potential issues.
> > > Instead of sharing the logs from all 5 nodes, try to start a two-node
> > > cluster with the nodes that fail to discover each other and attach the
> > logs
> > > from those.
> > >
> > > -
> > > Denis
> > >
> > >
> > > On Thu, Jun 4, 2020 at 1:57 PM Vladimir Steshin <vlads...@gmail.com>
> > wrote:
> > >
> > >> Denis, hi.
> > >>
> > >>       Sorry, I didn’t catch your idea. Are you saying this can happen
> > >> and suggesting an experiment? I’m not describing a hypothetical case;
> > >> it is already done in [1]. I’m asking whether it is real and where it
> > >> was met.
> > >>
> > >>
> > >> On 04.06.2020 23:33, Denis Magda wrote:
> > >>> Vladimir,
> > >>>
> > >>> Please do the following experiment. Start a 2-node cluster booting
> > >>> node 3 and, for instance, node 5. Those won't be able to interconnect
> > >>> according to your description. Attach the log files from both nodes
> > >>> for analysis. This should be a networking issue.
> > >>>
> > >>> -
> > >>> Denis
> > >>>
> > >>>
> > >>> On Thu, Jun 4, 2020 at 1:24 PM Vladimir Steshin <vlads...@gmail.com>
> > >> wrote:
> > >>>>        Hi, Igniters.
> > >>>>
> > >>>>
> > >>>>        I wanted to ask how one node may not be able to connect to
> > >>>> another whereas the rest of the cluster can. This case is covered in
> > >>>> [1]. In short: node 3 can't connect to nodes 4 and 5 but can connect
> > >>>> to node 1. At the same time, node 2 can connect to node 4. Questions:
> > >>>>
> > >>>> 1) Is it a real case? Where did this problem come from?
> > >>>>
> > >>>> 2) If node 3 can’t connect to nodes 4 and 5, does it mean node 2
> > >>>> can’t connect to node 4 (and 5) too?
> > >>>>
> > >>>> Sergey, Dmitry, maybe you could shed some light on this (I see you
> > >>>> in [1])? I'm participating in [2] and found this backward connection
> > >>>> checking. An answer would help us a lot.
> > >>>>
> > >>>> Thanks!
> > >>>>
> > >>>> [1]
> > >>>> https://issues.apache.org/jira/browse/IGNITE-7163
> > >>>>
> > >>>> [2]
> > >>>> https://cwiki.apache.org/confluence/display/IGNITE/IEP-45%3A+Crash+Recovery+Speed-Up
> >
>
