Hi all,

I have a Python Spark Streaming job which is supposed to run 24x7, but I am unable to stabilize it. The job simply counts the number of links shared in a 30-minute sliding window. I am using the reduceByKeyAndWindow operation with a batch interval of 30 seconds and a slide interval of 60 seconds.
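For context, the incremental counting that reduceByKeyAndWindow performs (with an inverse-reduce function) can be sketched in plain Python: each slide, new batch counts are added in and counts from batches that fell out of the window are subtracted, instead of recounting the whole window. The function and parameter names below are hypothetical, not from my actual job:

```python
from collections import Counter, deque

def sliding_link_counts(batches, window_batches, slide_batches):
    """batches: iterable of lists of link keys, one list per batch interval.
    Yields the windowed counts once per slide interval.
    window_batches = window duration / batch interval (e.g. 30 min / 30 s = 60)
    slide_batches  = slide duration / batch interval (e.g. 60 s / 30 s = 2)
    """
    window = deque()    # per-batch Counters currently inside the window
    totals = Counter()  # running windowed counts
    for i, batch in enumerate(batches, 1):
        counts = Counter(batch)
        window.append(counts)
        totals.update(counts)                  # "reduce" the new batch in
        while len(window) > window_batches:
            totals.subtract(window.popleft())  # "inverse reduce" old batch out
            totals += Counter()                # drop keys that reached zero
        if i % slide_batches == 0:
            yield dict(totals)
```

This is only meant to illustrate the semantics; in the real job Spark does this per key across the cluster.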
The Kafka queue receives nearly 2,200 messages/second on average, with peaks up to 3,000. I have played around with the batch size and the slide interval, and increased the parallelism, with no fruitful result; these changes only delay the destabilization. GC time is usually between 60 and 100 ms.

I also noticed in the Spark UI that tasks were not being distributed to the other nodes, so I set spark.locality.wait to 100ms, after which the job is distributed properly.

The cluster has one master and 6 slaves, each with 16 cores and 15 GB of RAM.

Code and configuration: http://pastebin.com/93DMSiji
Streaming screenshot: http://imgur.com/psNfjwJ

I need help debugging this issue. Any help will be appreciated.

-- Roshan Singh