Is there any way we can detect this and prevent it from happening? Or perhaps start rejecting jobs if they could potentially block the system?
On Wed, Dec 7, 2016 at 8:11 AM, Yakov Zhdanov <yzhda...@apache.org> wrote:

> The proper solution here is to have communication backpressure per policy -
> SYSTEM or PUBLIC - not a single point as it is now. I think we can achieve
> this by having two queues per communication session or (which looks a bit
> easier to implement) by having separate connections.
>
> As a workaround you can increase the limit. Setting it to 0 may lead to a
> potential OOME on the sender or receiver side.
>
> --Yakov
>
> 2016-12-07 20:35 GMT+07:00 Dmitry Karachentsev <dkarachent...@gridgain.com>:
>
> > Igniters!
> >
> > I recently ran into an arguable issue; it looks like a bug. The scenario
> > is the following:
> >
> > 1) Start two data nodes with some cache.
> >
> > 2) From one node, in async mode, post a large number of jobs to the
> > other. Those jobs perform some cache operations.
> >
> > 3) The grid hangs almost immediately and all threads are sleeping except
> > the public ones, which are waiting for responses.
> >
> > This happens because all cache and job messages are queued on the
> > communication layer and limited to the default number (1024). It looks
> > like the jobs are waiting for cache responses that cannot be received due
> > to this limit. It is hard to diagnose and inconvenient (as far as I know,
> > the docs place no limitation on using cache operations from compute
> > jobs).
> >
> > So, my question is: should we try to solve this, or is it enough to
> > update the documentation with a recommendation to disable the queue
> > limit in such cases?
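
For reference, the workaround Yakov describes (raising the communication message queue limit) is controlled by the `messageQueueLimit` property of `TcpCommunicationSpi`. A minimal Spring XML sketch is below; the value 4096 is just an illustrative choice, and as noted in the thread, setting it to 0 removes the limit but risks OOME on either side:

```xml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
  <property name="communicationSpi">
    <bean class="org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi">
      <!-- Raise the outbound message queue limit above the 1024 default.
           0 disables the limit entirely (potential OOME on sender/receiver). -->
      <property name="messageQueueLimit" value="4096"/>
    </bean>
  </property>
</bean>
```

The equivalent Java call would be `TcpCommunicationSpi#setMessageQueueLimit(int)` on the SPI instance set into the `IgniteConfiguration`.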