Re: Nodes which started in separate JVM couldn't stop properly (in tests)

Dmitry Pavlov Thu, 15 Mar 2018 08:59:49 -0700

I see now. Thank you.

Nikolay, could you please merge this change?


On Thu, 15 Mar 2018 at 18:48, Vyacheslav Daradur <daradu...@gmail.com> wrote:

> In brief:
> Nodes in *separate* JVMs are shut down by the computing task
> *StopGridTask*, which is sent from the *local* JVM *synchronously*. That
> means the *local* node must wait for the task to finish.
>
> At the same time, the node in the *separate* JVM executes the received
> *StopGridTask*, which *synchronously* calls *G.stop(igniteInstanceName,
> false)*. That call waits for all computing tasks to finish, including
> the *StopGridTask* that invoked it.
>
> We have some kind of deadlock:
> The *local* node waits for the computing task to finish, which
> waits for *G.stop* to return, which waits for all
> computing tasks to finish, including *StopGridTask*.
>
> We have not noticed this before because our tests use only stopAllGrids(),
> which stops the local JVM without waiting for nodes in other
> JVMs.
>
>
>
> On Thu, Mar 15, 2018 at 6:11 PM, Dmitry Pavlov <dpavlov....@gmail.com>
> wrote:
> > Please address comments in PR.
> >
> > I did not fully understand why the sync GridStopMessage message was lost
> > while the async one will succeed. We probably need to discuss it briefly.
> >
> > On Thu, 1 Mar 2018 at 12:11, Vyacheslav Daradur <daradu...@gmail.com> wrote:
> >>
> >> Thank you, Dmitry!
> >>
> >> I'll join this review soon.
> >>
> >> On Thu, Mar 1, 2018 at 12:07 PM, Dmitry Pavlov <dpavlov....@gmail.com>
> >> wrote:
> >> > Hi Vyacheslav,
> >> >
> >> > I will take a look, but first I am going to review
> >> > https://reviews.ignite.apache.org/ignite/review/IGNT-CR-502 since it is an
> >> > impactful change in the testing framework. I hope you will also join
> >> > this review.
> >> >
> >> > Sincerely,
> >> > Dmitry Pavlov
> >> >
> >> >
> >> > On Thu, 1 Mar 2018 at 11:13, Vyacheslav Daradur <daradu...@gmail.com> wrote:
> >> >>
> >> >> Hi Dmitry, could you please review it, since you are one of the
> >> >> people most experienced with the testing framework?
> >> >>
> >> >> Please see the comment in Jira, since it is nicely formatted there.
> >> >>
> >> >> On Thu, Feb 22, 2018 at 11:56 AM, Vyacheslav Daradur
> >> >> <daradu...@gmail.com> wrote:
> >> >> > Hi Igniters!
> >> >> >
> >> >> > I have investigated the issue [1] and found that stopping a node in
> >> >> > a separate JVM may leave a thread stuck or leave the system process
> >> >> > alive after the test has finished.
> >> >> > The main reason is *StopGridTask*, which we send from the node in the
> >> >> > local JVM to the node in the separate JVM via remote computing.
> >> >> > We send the job synchronously to be sure that the node will be
> >> >> > stopped, but the job synchronously calls
> >> >> > *G.stop(igniteInstanceName, cancel)* with *cancel = false*, which
> >> >> > means the node must wait for compute jobs to finish before it goes
> >> >> > down, and that leads to a kind of deadlock. Using *cancel = true*
> >> >> > would solve the issue but may break some tests’ logic; for this
> >> >> > reason, I've reworked the method’s synchronization logic [2].
> >> >> >
> >> >> > We have not noticed this before because our tests use only
> >> >> > *stopAllGrids()*, which stops the local JVM without waiting for
> >> >> > nodes in other JVMs.
> >> >> > I believe this fix should reduce the number of flaky tests on
> >> >> > TeamCity, especially those that fail because a cluster from the
> >> >> > previous test has not been stopped properly.
> >> >> >
> >> >> > CI tests [3] look a bit better than in master.
> >> >> > Please review the prepared PR [2] and share your thoughts.
> >> >> >
> >> >> > [1] https://issues.apache.org/jira/browse/IGNITE-5910
> >> >> > [2] https://github.com/apache/ignite/pull/2382
> >> >> > [3] https://ci.ignite.apache.org/viewLog.html?buildId=1105939
> >> >> >
> >> >> >
> >> >> > On Fri, Aug 4, 2017 at 11:41 AM, Vyacheslav Daradur
> >> >> > <daradu...@gmail.com> wrote:
> >> >> >> Hi Igniters,
> >> >> >>
> >> >> >> Working on my task, I found a bug when calling the method
> >> >> >> #stopGrid(name): it produced a ClassCastException. I created a
> >> >> >> ticket [1].
> >> >> >>
> >> >> >> After it was fixed [2], I saw that nodes which were started in a
> >> >> >> separate JVM could remain as operating system processes.
> >> >> >> That was fixed too, but I am not sure whether it is fixed in the
> >> >> >> proper way.
> >> >> >>
> >> >> >> Could someone review it?
> >> >> >>
> >> >> >> [1] https://issues.apache.org/jira/browse/IGNITE-5910
> >> >> >> [2] https://github.com/apache/ignite/pull/2382
> >> >> >>
> >> >> >> --
> >> >> >> Best Regards, Vyacheslav D.
> >> >> >
> >> >> >
> >> >> >
> >> >> > --
> >> >> > Best Regards, Vyacheslav D.
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Best Regards, Vyacheslav D.
> >>
> >>
> >>
> >> --
> >> Best Regards, Vyacheslav D.
>
>
>
> --
> Best Regards, Vyacheslav D.
>
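For readers of the archive: the circular wait described in the quoted summary (the caller waits for the stop task, the stop task waits for a graceful stop, and the graceful stop waits for every running task, including the stop task itself) can be modelled without Ignite. The sketch below is only an analogy built on java.util.concurrent. It is not the code from the PR, and the names SelfStopDemo, gracefulStop, syncStopTerminates and asyncStopTerminates are hypothetical. It also assumes a bounded wait so that the demo finishes, where an unbounded wait would hang.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical model: a thread pool stands in for the node's compute engine.
public class SelfStopDemo {
    /** Stand-in for a graceful stop: accept no new tasks, then wait for running ones. */
    public static boolean gracefulStop(ExecutorService pool, long waitMs) throws InterruptedException {
        pool.shutdown();
        return pool.awaitTermination(waitMs, TimeUnit.MILLISECONDS);
    }

    /** The stop task waits for the very pool that is executing it. */
    public static boolean syncStopTerminates(long waitMs) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        // The task counts as a running task of 'pool', so the wait inside it
        // cannot succeed; it can only run out of time.
        Callable<Boolean> stopTask = () -> gracefulStop(pool, waitMs);
        return pool.submit(stopTask).get();
    }

    /** The stop task only triggers the stop on another thread and returns. */
    public static boolean asyncStopTerminates(long waitMs) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CompletableFuture<Boolean> stopped = new CompletableFuture<>();
        Runnable stopTask = () -> new Thread(() -> {
            try {
                stopped.complete(gracefulStop(pool, waitMs));
            } catch (InterruptedException e) {
                stopped.completeExceptionally(e);
            }
        }).start();
        pool.submit(stopTask).get(); // returns once the stopper thread has been started
        return stopped.get();        // the pool can now drain and terminate
    }

    public static void main(String[] args) throws Exception {
        System.out.println("sync stop terminated: " + syncStopTerminates(300));
        System.out.println("async stop terminated: " + asyncStopTerminates(3000));
    }
}
```

In this model the synchronous variant reports false, because the wait expires while the stop task is still counted as running, and the asynchronous variant reports true, because the stop task has already returned by the time the pool is asked to drain.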
