[jira] [Comment Edited] (SPARK-4040) calling count() on RDD's emitted from a DStream blocks forEachRDD progress.

jay vyas (JIRA) Mon, 27 Oct 2014 13:37:01 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-4040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14185764#comment-14185764
 ]


jay vyas edited comment on SPARK-4040 at 10/27/14 8:36 PM:
-----------------------------------------------------------

Ah, i think i see the issue. 
http://stackoverflow.com/questions/25785581/custom-receiver-stalls-worker-in-spark-streaming

When running spark in local mode - its possible to submit multiple tasks, but 
they dont actually ever get picked up unless you tell local mode to open up 
more than one slot, so you get resource starvation.

{noformat}
setMaster("local")
{noformat}

instead of 
{noformat}
setMaster("local[2]")
{noformat}

(so this isnt really a bug) .  closing.


was (Author: jayunit100):
Ah, i think i see the issue. 
http://stackoverflow.com/questions/25785581/custom-receiver-stalls-worker-in-spark-streaming

When running spark in local mode - its possible to submit multiple tasks, but 
they dont actually ever get picked up unless you tell local mode to open up 
more than one slot, so you get resource starvation.

{noformat}
setMaster("local")
{noformat}

instead of 
{noformat}
setMaster("local[2]")
{noformat}

> calling count() on RDD's emitted from a DStream blocks forEachRDD progress.
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-4040
>                 URL: https://issues.apache.org/jira/browse/SPARK-4040
>             Project: Spark
>          Issue Type: Bug
>          Components: Streaming
>            Reporter: jay vyas
>
> Please note that Im somewhat new to spark streaming's API, and am not a spark 
> expert - so I've done the best to write up and reproduce this "bug".  If its 
> not a bug i hope an expert will help to explain why and promptly close it.  
> However, it appears it could be a bug after discussing with [~rnowling] who 
> is a spark contributor.
> CC [~rnowling] [~willbenton] 
>  
> It appears that in a DStream context, a call to   {{MappedRDD.count()}} 
> blocks progress and prevents emission of RDDs from a stream.
> {noformat}
>     tweetStream.foreachRDD((rdd,lent)=> {
>       tweetStream.repartition(1)
>       //val count = rdd.count()  DONT DO THIS !
>       checks += 1;
>       if (checks > 20) {
>         ssc.stop()
>       }
>    }
> {noformat} 
> The above code block should inevitably halt, after 20 intervals of RDDs... 
> However, if we *uncomment the call* to {{rdd.count()}}, it turns out that we 
> get an *infinite stream which emits no RDDs*, and thus our program *runs 
> forever* (ssc.stop is unreachable), because *forEach doesnt receive any more 
> entries*.  
> I suspect this is actually because the foreach block never completes, because 
> {{count()}} is winds up calling {{compute}}, which ultimately just reads from 
> the stream.
> I havent put together a minimal reproducer or unit test yet, but I can work 
> on doing so if more info is needed.
> I guess this could be seen as an application bug - but i think spark might be 
> made smarter to throw its hands up when people execute blocking code in a 
> stream processor. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Comment Edited] (SPARK-4040) calling count() on RDD's emitted from a DStream blocks forEachRDD progress.

Reply via email to