The model for file streams is to pick up and process new files that are written atomically (i.e., by a move) into a directory. So your file is processed in a single batch, and then the stream waits for any new files to be written into that directory.
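For illustration, here is a minimal sketch of this model (the paths /data/staging and /data/incoming are hypothetical):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object FileStreamSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("FileStreamSketch").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(10))

        // Each 10s batch picks up whole files that appeared in the monitored
        // directory since the previous batch; files still being written are not safe.
        val lines = ssc.textFileStream("/data/incoming")  // hypothetical path
        lines.count().print()

        ssc.start()
        ssc.awaitTermination()
      }
    }

To make a file appear atomically, write it elsewhere on the same filesystem and then move it into the monitored directory, e.g.:

    hdfs dfs -mv /data/staging/input.txt /data/incoming/input.txt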
TD

On Fri, Jul 11, 2014 at 11:46 AM, M Singh <mans6si...@yahoo.com> wrote:
> So, is it expected for the process to generate stages/tasks even after
> processing a file?
>
> Also, is there a way to figure out the file that is getting processed and
> when that processing is complete?
>
> Thanks
>
>
> On Friday, July 11, 2014 1:51 PM, Tathagata Das <
> tathagata.das1...@gmail.com> wrote:
>
>
> Whenever you need to do a shuffle-based operation like reduceByKey,
> groupByKey, join, etc., the system is essentially redistributing the data
> across the cluster, and it needs to know how many parts it should divide
> the data into. That's where the default parallelism is used.
>
> TD
>
>
> On Fri, Jul 11, 2014 at 3:16 AM, M Singh <mans6si...@yahoo.com> wrote:
>
> Hi TD:
>
> The input file is on HDFS.
>
> The file is approx. 2.7 GB, and when the process starts there are 11
> tasks (since the HDFS block size is 256 MB) for processing and 2 tasks for
> reduceByKey. After the file has been processed, I see new stages with 2
> tasks that continue to be generated. I understand this value (2) is the
> default value for spark.default.parallelism, but I don't quite understand
> how this value is determined when generating tasks for reduceByKey, how it
> is used besides reduceByKey, and what the optimal value for it should be.
>
> Thanks.
>
>
> On Thursday, July 10, 2014 7:24 PM, Tathagata Das <
> tathagata.das1...@gmail.com> wrote:
>
>
> How are you supplying the text file?
>
>
> On Wed, Jul 9, 2014 at 11:51 AM, M Singh <mans6si...@yahoo.com> wrote:
>
> Hi Folks:
>
> I am working on an application which uses Spark Streaming (version 1.1.0
> snapshot on a standalone cluster) to process a text file and save counters
> in Cassandra based on fields in each row. I am testing the application in
> two modes:
>
> - Process each row and save the counter in Cassandra. In this
> scenario, after the text file has been consumed, there are no tasks/stages
> seen in the Spark UI.
> - If instead I use reduceByKey before saving to Cassandra, the Spark
> UI shows continuous generation of tasks/stages even after processing of
> the file has been completed.
>
> I believe this is because reduceByKey requires merging of data from
> different partitions. But I was wondering if anyone has any
> insights/pointers for understanding this difference in behavior and how to
> avoid generating tasks/stages when there is no data (new file) available.
>
> Thanks
>
> Mans
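To tie the thread together, here is a minimal sketch of the reduceByKey path discussed above (the row format, key field, and input path are hypothetical, and the Cassandra save is elided):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._  // pair-DStream implicits on 1.1

    object ReduceByKeySketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("ReduceByKeySketch")
          // Default number of partitions for shuffles (reduceByKey, groupByKey,
          // join, ...) when no count is given; 2 was the default on this cluster.
          .set("spark.default.parallelism", "11")

        val ssc = new StreamingContext(conf, Seconds(10))
        val lines = ssc.textFileStream("/data/incoming")  // hypothetical path

        val counts = lines
          .map(_.split(","))               // hypothetical row format
          .map(fields => (fields(0), 1L))  // key on the first field
          .reduceByKey(_ + _)              // shuffle: partition count from the config above

        // The shuffle stage is scheduled every batch interval, even for empty
        // batches, which is why tasks keep appearing in the UI after the file
        // has been consumed.
        counts.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }

Passing the partition count explicitly, e.g. reduceByKey(_ + _, 11), overrides spark.default.parallelism for that one shuffle.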