We had this same issue with the reduceByKeyAndWindow API that you are using. To fix it, you have to use a different flavor of that API, specifically the two variants that let you pass a filter function. Adding the filter function helped stabilize our application too.
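Something along these lines worked for us. This is only a rough sketch in PySpark (since your job is in Python); the Kafka topic, broker address, and checkpoint directory are placeholders, and I'm assuming a direct Kafka stream. The key part is passing an inverse reduce function together with filterFunc, and enabling checkpointing, which that variant requires:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="LinkCounter")
ssc = StreamingContext(sc, 30)                          # 30-second batches, as in your job
ssc.checkpoint("hdfs:///tmp/link-counter-checkpoint")   # required by the inverse-reduce variant

# Placeholder topic and broker; substitute your own Kafka setup.
stream = KafkaUtils.createDirectStream(ssc, ["links"], {"metadata.broker.list": "broker:9092"})
links = stream.map(lambda kv: (kv[1], 1))               # assuming the message value is the link

counts = links.reduceByKeyAndWindow(
    lambda a, b: a + b,               # add counts for data entering the window
    lambda a, b: a - b,               # subtract counts for data leaving the window
    windowDuration=30 * 60,           # 30-minute window
    slideDuration=60,                 # 60-second slide
    filterFunc=lambda kv: kv[1] > 0,  # drop keys whose count has fallen to zero
)

counts.pprint()
ssc.start()
ssc.awaitTermination()

As I understand it, the inverse function lets Spark update the window incrementally instead of recomputing it from scratch, and the filter function lets it drop keys whose count has gone to zero so the window state doesn't keep growing.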
HTH
NB

On Sun, Jun 12, 2016 at 11:19 PM, Roshan Singh <singh.rosha...@gmail.com> wrote:
> Hi all,
> I have a Python streaming job which is supposed to run 24x7. I am unable
> to stabilize it. The job just counts the number of links shared in a
> 30-minute sliding window. I am using the reduceByKeyAndWindow operation
> with a batch of 30 seconds and a slide interval of 60 seconds.
>
> The Kafka queue has a rate of nearly 2200 messages/second, which can
> increase to 3000, but the mean is 2200.
>
> I have played around with batch size, slide interval, and increased
> parallelism with no fruitful result. These just delay the destabilization.
>
> GC time is usually between 60-100 ms.
>
> I also noticed in the Spark UI that the jobs were not being distributed to
> other nodes, for which I configured spark.locality.wait to 100ms. After
> that, the jobs are getting distributed properly.
>
> I have a cluster of 6 slaves and one master, each with 16 cores and 15 GB
> of RAM.
>
> Code and configuration: http://pastebin.com/93DMSiji
>
> Streaming screenshot: http://imgur.com/psNfjwJ
>
> I need help in debugging the issue. Any help will be appreciated.
>
> --
> Roshan Singh