The proper solution here is to apply communication backpressure per policy (SYSTEM or PUBLIC) rather than at a single point, as it works now. I think we can achieve this by having two queues per communication session or, which looks a bit easier to implement, by using separate connections.
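
To make the two-queue idea a bit more concrete, here is a rough sketch. It is not Ignite code: the class name, the Policy enum and the offer/poll methods are made up for illustration, and it only assumes that each policy gets its own bounded queue, so that a full PUBLIC queue cannot block SYSTEM messages such as cache responses.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical per-session message queues, one bounded queue per policy. */
public class PerPolicyBackpressure {
    /** Illustrative policy names mirroring the SYSTEM / PUBLIC split. */
    public enum Policy { SYSTEM, PUBLIC }

    private final Map<Policy, BlockingQueue<String>> queues = new EnumMap<>(Policy.class);

    /** Creates one queue per policy, each with its own limit. */
    public PerPolicyBackpressure(int limitPerPolicy) {
        for (Policy p : Policy.values())
            queues.put(p, new ArrayBlockingQueue<>(limitPerPolicy));
    }

    /** Returns false when the queue of this policy is full; other policies are unaffected. */
    public boolean offer(Policy policy, String msg) {
        return queues.get(policy).offer(msg);
    }

    /** Takes the next message of the given policy, or null if there is none. */
    public String poll(Policy policy) {
        return queues.get(policy).poll();
    }

    public static void main(String[] args) {
        PerPolicyBackpressure session = new PerPolicyBackpressure(2);

        // Flood the PUBLIC queue with job messages until it pushes back.
        int accepted = 0;
        while (session.offer(Policy.PUBLIC, "job-" + accepted))
            accepted++;
        System.out.println("PUBLIC messages accepted: " + accepted);

        // A SYSTEM message (e.g. a cache response) still gets through.
        boolean ok = session.offer(Policy.SYSTEM, "cache-response");
        System.out.println("SYSTEM accepted while PUBLIC is full: " + ok);
    }
}
```

With a single shared queue the last offer would be rejected, which is roughly the situation described in the original report. The separate-connections variant should give the same isolation without touching the queueing code.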

As a workaround you can increase the limit. Setting it to 0 (no limit) may lead to an OOME on the sender or receiver side.

--Yakov

2016-12-07 20:35 GMT+07:00 Dmitry Karachentsev <dkarachent...@gridgain.com>:

> Igniters!
>
> I recently ran into a questionable issue that looks like a bug. The
> scenario is as follows:
>
> 1) Start two data nodes with some cache.
>
> 2) From one node, post a large number of jobs to the other in async mode.
> Those jobs do some cache operations.
>
> 3) The grid hangs almost immediately and all threads are sleeping except
> the public ones, which are waiting for a response.
>
> This happens because all cache and job messages are queued on
> communication and limited by the default number (1024). It looks like the
> jobs are waiting for cache responses that cannot be received because of
> this limit. It's hard to diagnose and is not convenient (as far as I know,
> the docs state no limitation on using cache ops from compute jobs).
>
> So my question is: should we try to solve this or, maybe, is it enough to
> update the documentation with a recommendation to disable the queue limit
> for such cases?