Hi all,
I have a Python Spark Streaming job that is supposed to run 24x7, and I am
unable to stabilize it. The job simply counts the number of links shared in
a 30-minute sliding window. I am using the reduceByKeyAndWindow operation
with a batch interval of 30 seconds and a slide interval of 60 seconds.
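
For reference, the core of the job looks roughly like this (a trimmed
sketch, not the exact code; the topic name, Zookeeper quorum and checkpoint
path are placeholders, and the full code and configuration are in the
pastebin linked below):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="LinkCounter")
    ssc = StreamingContext(sc, 30)                      # 30-second batches
    ssc.checkpoint("hdfs:///checkpoints/link-counter")  # required for the inverse-reduce window

    # (key, value) pairs from Kafka; the value is assumed to be the link itself
    stream = KafkaUtils.createStream(ssc, "zk-host:2181", "link-counter",
                                     {"links": 1})

    counts = stream.map(lambda kv: (kv[1], 1)) \
                   .reduceByKeyAndWindow(lambda a, b: a + b,  # counts entering the window
                                         lambda a, b: a - b,  # counts leaving the window
                                         1800, 60)            # 30-min window, 60-s slide

    counts.pprint()
    ssc.start()
    ssc.awaitTermination()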

The Kafka queue has a mean rate of about 2,200 messages/second, with peaks
of up to 3,000.

I have played around with the batch size, the slide interval, and with
increased parallelism, but with no fruitful result; these changes only
delay the destabilization.
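
Concretely, the parallelism knobs were along these lines (continuing the
sketch above; the numbers here are illustrative, not my exact values):

    stream = stream.repartition(96)  # spread the receiver's blocks across executors

    counts = stream.map(lambda kv: (kv[1], 1)) \
                   .reduceByKeyAndWindow(lambda a, b: a + b,
                                         lambda a, b: a - b,
                                         1800, 60,
                                         numPartitions=96)  # shuffle partitions per window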

GC time is usually between 60 and 100 ms.

I also noticed in the Spark UI that tasks were not being distributed to the
other nodes, so I set spark.locality.wait to 100ms, after which the tasks
were being distributed properly.
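
That setting is applied on the SparkConf, i.e. (trimmed; the full
configuration is in the pastebin):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("LinkCounter")
            .set("spark.locality.wait", "100ms"))
    sc = SparkContext(conf=conf)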

The cluster has one master and 6 slaves, each with 16 cores and 15 GB of
RAM.
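
For completeness, the resource-related settings look roughly like this
(placeholder values derived from the hardware; the actual configuration is
in the pastebin):

    from pyspark import SparkConf

    conf = (SparkConf()
            .set("spark.executor.memory", "10g")  # out of 15 GB per node
            .set("spark.cores.max", "96"))        # 6 slaves x 16 cores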

Code and configuration: http://pastebin.com/93DMSiji

Streaming screenshot: http://imgur.com/psNfjwJ

I need help debugging this issue. Any help will be appreciated.

-- 
Roshan Singh
