Hi Folks:

I am working on an application which uses Spark Streaming (version 1.1.0 
snapshot on a standalone cluster) to process text files and save counters in 
Cassandra based on fields in each row.  I am testing the application in two 
modes (a sketch of both follows the list):

        * Process each row and save the counter in Cassandra.  In this 
scenario, after the text file has been consumed, no tasks/stages are seen in 
the Spark UI.

        * If instead I use reduceByKey before saving to Cassandra, the Spark 
UI shows continuous generation of tasks/stages even after processing of the 
file has completed.
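
Here is a minimal sketch of the two modes, assuming the DataStax 
spark-cassandra-connector; the keyspace/table names, column mapping, input 
directory, and comma-split parsing are placeholders for illustration:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import com.datastax.spark.connector.SomeColumns
import com.datastax.spark.connector.streaming._

val conf = new SparkConf()
  .setAppName("CounterApp")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val ssc = new StreamingContext(conf, Seconds(10))

// Watch a directory for new text files.
val lines = ssc.textFileStream("/data/incoming")

// Parse each row into a (key, 1) pair keyed on some field.
val pairs = lines.map { line =>
  val fields = line.split(",")
  (fields(0), 1L)
}

// Mode 1: save one counter row per input row -- no shuffle involved;
// the UI goes quiet once the file has been consumed.
pairs.saveToCassandra("my_keyspace", "counters", SomeColumns("key", "count"))

// Mode 2 (run instead of mode 1): reduceByKey introduces a shuffle,
// and the UI keeps generating tasks/stages every batch interval.
// pairs.reduceByKey(_ + _)
//   .saveToCassandra("my_keyspace", "counters", SomeColumns("key", "count"))

ssc.start()
ssc.awaitTermination()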

I believe this is because reduceByKey requires merging data from different 
partitions, so a shuffle stage gets scheduled for every batch interval even 
when the batch is empty.  But I was wondering if anyone has insights/pointers 
for understanding this difference in behavior, and how to avoid generating 
tasks/stages when there is no data (new file) available.
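
As a point of reference, here is a hypothetical guard inside foreachRDD; it 
only skips the Cassandra write for empty batches, and I am not sure it would 
stop the shuffle stages from appearing in the UI, which is part of my question:

import com.datastax.spark.connector._

// Hypothetical guard: skip the Cassandra write when a batch is empty.
// take(1) itself triggers a job for the batch, so I suspect the shuffle
// stages would still show up in the UI regardless.
pairs.reduceByKey(_ + _).foreachRDD { rdd =>
  if (rdd.take(1).nonEmpty) {
    rdd.saveToCassandra("my_keyspace", "counters", SomeColumns("key", "count"))
  }
}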

Thanks

Mans
