The fix looks good, but merging it at the last minute before the release could still be risky.

On Sat, Nov 28, 2015 at 4:44 PM, Yakov Zhdanov <[email protected]> wrote:

> Cache processor has not received stop signal since stopping thread is
> trapped in job processor waiting for all jobs to finish.
>
> --Yakov
>
> 2015-11-28 15:57 GMT+03:00 Semyon Boikov <[email protected]>:
>
> > Yakov,
> >
> > When node is stopped all cache futures are completed with error, where
> > did you see hang?
> >
> >
> > On Sat, Nov 28, 2015 at 3:37 PM, Yakov Zhdanov <[email protected]>
> > wrote:
> >
> > > Guys,
> > >
> > > I see the following code
> > >
> > > (org/apache/ignite/internal/processors/cache/distributed/dht/GridDhtTxPrepareFuture.java:1129):
> > >
> > > try {
> > >     cctx.io().send(n, req, tx.ioPolicy());
> > > }
> > > catch (ClusterTopologyCheckedException e) {
> > >     fut.onNodeLeft(e);
> > > }
> > > catch (IgniteCheckedException e) {
> > >     if (!cctx.kernalContext().isStopping())
> > >         fut.onResult(e);
> > > }
> > >
> > > Which means that in case if node has just started stop procedure, all
> > > cache operations may potentially hang. If cache.put() is called from
> > > job and node is stopping gracefully, stop process hangs with 100%
> > > probability.
> > >
> > > This issue does not threaten failure detection and nodes crash cases
> > > since this is handled by separate logic.
> > >
> > > I fixed Communication SPI to use its internal stopping flag instead of
> > > the system wide one and this seems to fix the issue with graceful stop.
> > >
> > > Semyon, can you please see if this may cause any other issue of the
> > > kind?
> > >
> > > My changes are here - https://github.com/apache/ignite/pull/278
> > >
> > > --Yakov
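
P.S. For anyone reviewing the change, here is a minimal sketch of the hang discussed in the quoted thread. It is not Ignite code: `StopHangSketch`, `send`, and `hangs` are hypothetical names, a plain `CompletableFuture` stands in for the prepare future, and a local boolean stands in for the system-wide stopping flag. It only illustrates why skipping `fut.onResult(e)` while the node is stopping can leave a caller blocked.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Illustration only; all names here are hypothetical, not Ignite classes. */
class StopHangSketch {
    /**
     * Simulates a send that fails because the node has begun stopping.
     * If swallowWhenStopping is true, the error is dropped and the future
     * is never completed (the pattern from the quoted snippet).
     */
    static CompletableFuture<Void> send(boolean swallowWhenStopping) {
        CompletableFuture<Void> fut = new CompletableFuture<>();
        boolean nodeStopping = true; // assumed: the system-wide flag is already set

        try {
            throw new IllegalStateException("Failed to send: node is stopping");
        }
        catch (IllegalStateException e) {
            // Complete the future unless we are told to swallow errors while stopping.
            if (!swallowWhenStopping || !nodeStopping)
                fut.completeExceptionally(e);
        }

        return fut;
    }

    /** Returns true if a caller waiting on the future is still blocked after 200 ms. */
    static boolean hangs(CompletableFuture<Void> fut) throws InterruptedException {
        try {
            fut.get(200, TimeUnit.MILLISECONDS);
            return false;
        }
        catch (TimeoutException e) {
            return true; // nobody completed the future: the caller would wait forever
        }
        catch (ExecutionException e) {
            return false; // completed with an error: the caller is released
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("error swallowed:  hangs=" + hangs(send(true)));
        System.out.println("error propagated: hangs=" + hangs(send(false)));
    }
}
```

In the sketch, the job thread that waits on the swallowed future never returns, which matches the description above of the stopping thread being stuck waiting for jobs to finish.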
