Is there any way we can detect this and prevent it from happening? Or perhaps
start rejecting jobs if they could potentially block the system?
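
For reference, the workaround Yakov suggests below (increasing the communication
message queue limit) would look roughly like this on the node configuration side.
This is only a sketch: it relies on TcpCommunicationSpi.setMessageQueueLimit(),
and the value 4096 is an arbitrary example rather than a recommendation.

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi;

    public class QueueLimitWorkaround {
        public static void main(String[] args) {
            // Raise the per-connection message queue limit above the default
            // of 1024 mentioned below; 4096 here is purely illustrative.
            // Setting it to 0 removes the limit and risks OOME, as Yakov notes.
            TcpCommunicationSpi commSpi = new TcpCommunicationSpi();
            commSpi.setMessageQueueLimit(4096);

            IgniteConfiguration cfg = new IgniteConfiguration();
            cfg.setCommunicationSpi(commSpi);

            try (Ignite ignite = Ignition.start(cfg)) {
                // Node started with the relaxed communication queue limit.
            }
        }
    }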

On Wed, Dec 7, 2016 at 8:11 AM, Yakov Zhdanov <yzhda...@apache.org> wrote:

> The proper solution here is to have communication backpressure per policy
> (SYSTEM or PUBLIC), not a single point as it is now. I think we can achieve
> this by having two queues per communication session or (which looks a bit
> easier to implement) by having separate connections.
>
> As a workaround you can increase the limit. Setting it to 0 may lead to a
> potential OOME on the sender or receiver side.
>
> --Yakov
>
> 2016-12-07 20:35 GMT+07:00 Dmitry Karachentsev <dkarachent...@gridgain.com>:
>
> > Igniters!
> >
> > I recently faced a debatable issue that looks like a bug. The scenario is
> > the following:
> >
> > 1) Start two data nodes with some cache.
> >
> > 2) From one node, post a large number of jobs in async mode to the other
> > node. Those jobs perform some cache operations.
> >
> > 3) The grid hangs almost immediately: all threads are sleeping except the
> > public ones, which are waiting for a response.
> >
> > This happens because all cache and job messages are queued in the
> > communication layer, which is capped at the default limit (1024). It looks
> > like the jobs are waiting for cache responses that cannot be received due
> > to this limit. It is hard to diagnose and rather inconvenient (as far as I
> > know, the docs place no restriction on using cache operations from compute
> > jobs).
> >
> > So, my question is: should we try to solve that, or is it perhaps enough
> > to update the documentation with a recommendation to disable the queue
> > limit for such cases?
> >
> >
>
