The model for file streams is to pick up and process new files that are written atomically (i.e., by a move) into a directory. So your file is processed in a single batch, and then the stream waits for any new files to be written into that directory.
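For illustration, here is a minimal sketch of this model (the paths /data/staging and /data/incoming are hypothetical):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object FileStreamSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("FileStreamSketch").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(10))

        // Each 10s batch picks up whole files that appeared in the monitored
        // directory since the previous batch; files still being written are not safe.
        val lines = ssc.textFileStream("/data/incoming")  // hypothetical path
        lines.count().print()

        ssc.start()
        ssc.awaitTermination()
      }
    }

To make a file appear atomically, write it elsewhere on the same filesystem and then move it into the monitored directory, e.g.:

    hdfs dfs -mv /data/staging/input.txt /data/incoming/input.txt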
TD

On Fri, Jul 11, 2014 at 11:46 AM, M Singh <mans6si...@yahoo.com> wrote:
> So, is it expected for the process to generate stages/tasks even after
> processing a file?
>
> Also, is there a way to figure out the file that is getting processed and
> when that processing is complete?
>
> Thanks
>
>
> On Friday, July 11, 2014 1:51 PM, Tathagata Das <
> tathagata.das1...@gmail.com> wrote:
>
>
> Whenever you need to do a shuffle-based operation like reduceByKey,
> groupByKey, join, etc., the system is essentially redistributing the data
> across the cluster, and it needs to know how many parts it should divide
> the data into. That's where the default parallelism is used.
>
> TD
>
>
> On Fri, Jul 11, 2014 at 3:16 AM, M Singh <mans6si...@yahoo.com> wrote:
>
> Hi TD:
>
> The input file is on HDFS.
>
> The file is approx. 2.7 GB, and when the process starts there are 11
> tasks (since the HDFS block size is 256 MB) for processing and 2 tasks for
> reduceByKey. After the file has been processed, I see new stages with 2
> tasks that continue to be generated. I understand this value (2) is the
> default value for spark.default.parallelism, but I don't quite understand
> how this value is determined when generating tasks for reduceByKey, how it
> is used besides reduceByKey, and what the optimal value for it should be.
>
> Thanks.
>
>
> On Thursday, July 10, 2014 7:24 PM, Tathagata Das <
> tathagata.das1...@gmail.com> wrote:
>
>
> How are you supplying the text file?
>
>
> On Wed, Jul 9, 2014 at 11:51 AM, M Singh <mans6si...@yahoo.com> wrote:
>
> Hi Folks:
>
> I am working on an application which uses Spark Streaming (version 1.1.0
> snapshot on a standalone cluster) to process a text file and save counters
> in Cassandra based on fields in each row. I am testing the application in
> two modes:
>
> - Process each row and save the counter in Cassandra. In this
> scenario, after the text file has been consumed, there are no tasks/stages
> seen in the Spark UI.
> - If instead I use reduceByKey before saving to Cassandra, the Spark
> UI shows continuous generation of tasks/stages even after processing of
> the file has been completed.
>
> I believe this is because reduceByKey requires merging of data from
> different partitions. But I was wondering if anyone has any
> insights/pointers for understanding this difference in behavior and how to
> avoid generating tasks/stages when there is no data (new file) available.
>
> Thanks
>
> Mans
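To tie the thread together, here is a minimal sketch of the reduceByKey path discussed above (the row format, key field, and input path are hypothetical, and the Cassandra save is elided):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._  // pair-DStream implicits on 1.1

    object ReduceByKeySketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("ReduceByKeySketch")
          // Default number of partitions for shuffles (reduceByKey, groupByKey,
          // join, ...) when no count is given; 2 was the default on this cluster.
          .set("spark.default.parallelism", "11")

        val ssc = new StreamingContext(conf, Seconds(10))
        val lines = ssc.textFileStream("/data/incoming")  // hypothetical path

        val counts = lines
          .map(_.split(","))               // hypothetical row format
          .map(fields => (fields(0), 1L))  // key on the first field
          .reduceByKey(_ + _)              // shuffle: partition count from the config above

        // The shuffle stage is scheduled every batch interval, even for empty
        // batches, which is why tasks keep appearing in the UI after the file
        // has been consumed.
        counts.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }

Passing the partition count explicitly, e.g. reduceByKey(_ + _, 11), overrides spark.default.parallelism for that one shuffle.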