Thanks TD. BTW, if I have an input file of ~250 GB, is there any guideline on whether to use:
* a single input file of 250 GB (in this case, is there any maximum upper bound), or
* a split into 1000 files of 250 MB each (the hdfs block size is 250 MB), or
* a multiple of the hdfs block size?

Mans

On Friday, July 11, 2014 4:38 PM, Tathagata Das <tathagata.das1...@gmail.com> wrote:

The model for the file stream is to pick up and process new files written atomically (by move) into a directory. So your file is being processed in a single batch, and then it's waiting for any new files to be written into that directory.

TD

On Fri, Jul 11, 2014 at 11:46 AM, M Singh <mans6si...@yahoo.com> wrote:

> So, is it expected for the process to generate stages/tasks even after processing a file?
>
> Also, is there a way to figure out which file is being processed and when that processing is complete?
>
> Thanks
>
> On Friday, July 11, 2014 1:51 PM, Tathagata Das <tathagata.das1...@gmail.com> wrote:
>
> Whenever you need to do a shuffle-based operation like reduceByKey, groupByKey, join, etc., the system is essentially redistributing the data across the cluster, and it needs to know how many parts it should divide the data into. That's where the default parallelism is used.
>
> TD
>
> On Fri, Jul 11, 2014 at 3:16 AM, M Singh <mans6si...@yahoo.com> wrote:
>
>> Hi TD:
>>
>> The input file is on hdfs.
>>
>> The file is approx 2.7 GB, and when the process starts there are 11 tasks (since the hdfs block size is 256M) for processing and 2 tasks for reduce by key. After the file has been processed, I see new stages with 2 tasks that continue to be generated. I understand this value (2) is the default value for spark.default.parallelism, but I don't quite understand how the value is determined for generating reduceByKey tasks, how it is used besides reduceByKey, and what its optimal value should be.
>>
>> Thanks.
>>
>> On Thursday, July 10, 2014 7:24 PM, Tathagata Das <tathagata.das1...@gmail.com> wrote:
>>
>> How are you supplying the text file?
>>
>> On Wed, Jul 9, 2014 at 11:51 AM, M Singh <mans6si...@yahoo.com> wrote:
>>
>>> Hi Folks:
>>>
>>> I am working on an application which uses spark streaming (version 1.1.0 snapshot on a standalone cluster) to process a text file and save counters in cassandra based on fields in each row. I am testing the application in two modes:
>>>
>>> * Process each row and save the counter in cassandra. In this scenario, after the text file has been consumed, there are no tasks/stages seen in the spark UI.
>>>
>>> * If instead I use reduce by key before saving to cassandra, the spark UI shows continuous generation of tasks/stages even after processing of the file has been completed.
>>>
>>> I believe this is because reduce by key requires merging of data from different partitions. But I was wondering if anyone has any insights/pointers for understanding this difference in behavior and how to avoid generating tasks/stages when there is no data (new file) available.
>>>
>>> Thanks
>>>
>>> Mans
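For reference, a minimal sketch of the kind of file-stream pipeline discussed in this thread (Spark Streaming 1.x, Scala API). The directory path, batch interval, parallelism value, and the word-count logic are illustrative assumptions, not details taken from the thread; the only point it demonstrates is that textFileStream picks up files moved atomically into the monitored directory, and that the shuffle in reduceByKey takes an explicit partition count instead of falling back on spark.default.parallelism.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
// Needed in Spark 1.x for the pair-DStream implicits (reduceByKey, etc.).
import org.apache.spark.streaming.StreamingContext._

object FileStreamWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("FileStreamWordCount")
      // spark.default.parallelism is the partition count used by shuffle
      // operations (reduceByKey, groupByKey, join) when no explicit count
      // is supplied; the value 8 is only an illustration.
      .set("spark.default.parallelism", "8")

    // 30-second batch interval (illustrative).
    val ssc = new StreamingContext(conf, Seconds(30))

    // textFileStream monitors a directory and, in each batch, processes the
    // files that newly appeared there, i.e. files written elsewhere on HDFS
    // and then moved/renamed into the directory. The path is a placeholder.
    val lines = ssc.textFileStream("hdfs:///input/dir")

    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1L))
      // Passing an explicit partition count overrides
      // spark.default.parallelism for this shuffle only.
      .reduceByKey(_ + _, 8)

    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```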