The proper solution here is to apply communication backpressure per policy
(SYSTEM or PUBLIC) instead of at a single point, as it is now. I think we can
achieve this by having two queues per communication session or, which looks a
bit easier to implement, by using separate connections.
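
To make the two-queue idea concrete, here is a minimal sketch using only the
JDK. It is not Ignite code: the class name, the limit and the ack callback are
all hypothetical. Each policy gets its own pool of send permits, so a full
PUBLIC queue cannot block SYSTEM traffic.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.Semaphore;

/** Hypothetical sketch: one backpressure limit per policy instead of one shared limit. */
public class PerPolicyBackpressure {
    /** The two policies named in this thread. */
    enum Policy { SYSTEM, PUBLIC }

    /** One permit pool per policy; a permit stands for one unacknowledged message. */
    private final Map<Policy, Semaphore> permits = new EnumMap<>(Policy.class);

    PerPolicyBackpressure(int limitPerPolicy) {
        for (Policy p : Policy.values())
            permits.put(p, new Semaphore(limitPerPolicy));
    }

    /** Tries to reserve a send slot; returns false when that policy's queue is full. */
    boolean trySend(Policy p) {
        return permits.get(p).tryAcquire();
    }

    /** Frees a slot once the receiver has acknowledged a message of that policy. */
    void onAck(Policy p) {
        permits.get(p).release();
    }

    public static void main(String[] args) {
        PerPolicyBackpressure session = new PerPolicyBackpressure(2);

        // Job messages fill the PUBLIC queue up to its limit.
        session.trySend(Policy.PUBLIC);
        session.trySend(Policy.PUBLIC);

        System.out.println("PUBLIC accepted: " + session.trySend(Policy.PUBLIC));
        System.out.println("SYSTEM accepted: " + session.trySend(Policy.SYSTEM));
    }
}
```

With a single shared limit, the last SYSTEM send would be rejected as well,
which is the situation the jobs end up in when they wait for cache responses.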

As a workaround, you can increase the limit. Setting it to 0 removes the
limit entirely and may lead to an OOME on the sender or receiver side.
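
For reference, a configuration sketch of the workaround, assuming the limit
discussed here is the message queue limit of TcpCommunicationSpi and that
ignite-core is on the classpath. The value 8192 is an arbitrary illustration,
not a recommendation; size it to your workload and available heap.

```java
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi;

public class QueueLimitWorkaround {
    public static void main(String[] args) {
        TcpCommunicationSpi commSpi = new TcpCommunicationSpi();

        // Raise the queue limit above the default mentioned in this thread (1024).
        // 8192 is an assumed example value; 0 would disable the limit and risk an OOME.
        commSpi.setMessageQueueLimit(8192);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setCommunicationSpi(commSpi);

        // Start a node with the adjusted communication SPI.
        Ignition.start(cfg);
    }
}
```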

--Yakov

2016-12-07 20:35 GMT+07:00 Dmitry Karachentsev <dkarachent...@gridgain.com>:

> Igniters!
>
> I recently ran into a debatable issue that looks like a bug. The scenario
> is as follows:
>
> 1) Start two data nodes with some cache.
>
> 2) From one node, asynchronously post a large number of jobs to the other.
> Those jobs perform some cache operations.
>
> 3) The grid hangs almost immediately, and all threads are sleeping except
> the public ones, which are waiting for responses.
>
> This happens because all cache and job messages are queued on
> communication, and that queue is limited to the default size (1024). It
> looks like the jobs are waiting for cache responses that cannot be received
> because of this limit. This is hard to diagnose and inconvenient (as far as
> I know, the docs mention no limitation on using cache ops from compute jobs).
>
> So, my question is: should we try to solve this, or is it perhaps enough to
> update the documentation with a recommendation to disable the queue limit
> for such cases?
>
>
