Is there any way we can detect this and prevent it from happening? Or perhaps start rejecting jobs if they could potentially block the system?
On Wed, Dec 7, 2016 at 8:11 AM, Yakov Zhdanov <yzhda...@apache.org> wrote:

> The proper solution here is to have communication backpressure per policy -
> SYSTEM or PUBLIC - not a single point as it is now. I think we can achieve
> this by having two queues per communication session or (which looks a bit
> easier to implement) by having separate connections.
>
> As a workaround you can increase the limit. Setting it to 0 may lead to a
> potential OOME on the sender or receiver side.
>
> --Yakov
>
> 2016-12-07 20:35 GMT+07:00 Dmitry Karachentsev <dkarachent...@gridgain.com>:
>
> > Igniters!
> >
> > I recently ran into an arguable issue; it looks like a bug. The scenario
> > is the following:
> >
> > 1) Start two data nodes with some cache.
> >
> > 2) From one node, in async mode, post a large number of jobs to the
> > other. Those jobs perform some cache operations.
> >
> > 3) The grid hangs almost immediately and all threads are sleeping except
> > the public ones, which are waiting for responses.
> >
> > This happens because all cache and job messages are queued on the
> > communication layer and limited to the default number (1024). It looks
> > like the jobs are waiting for cache responses that cannot be received due
> > to this limit. It is hard to diagnose and inconvenient (as far as I know,
> > the docs place no limitation on using cache operations from compute
> > jobs).
> >
> > So, my question is: should we try to solve this, or is it enough to
> > update the documentation with a recommendation to disable the queue
> > limit in such cases?
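
For reference, the workaround Yakov describes (raising the communication message queue limit) is controlled by the `messageQueueLimit` property of `TcpCommunicationSpi`. A minimal Spring XML sketch is below; the value 4096 is just an illustrative choice, and as noted in the thread, setting it to 0 removes the limit but risks OOME on either side:

```xml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
  <property name="communicationSpi">
    <bean class="org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi">
      <!-- Raise the outbound message queue limit above the 1024 default.
           0 disables the limit entirely (potential OOME on sender/receiver). -->
      <property name="messageQueueLimit" value="4096"/>
    </bean>
  </property>
</bean>
```

The equivalent Java call would be `TcpCommunicationSpi#setMessageQueueLimit(int)` on the SPI instance set into the `IgniteConfiguration`.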