[ https://issues.apache.org/jira/browse/SPARK-23397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361851#comment-16361851 ]

Vivek Patangiwar commented on SPARK-23397:
------------------------------------------

Thanks for your response Sean.

An example to elaborate on what Shahbaz said: suppose I have code like this (Scala):

import org.apache.spark.sql.SQLContext

sourceDstream.transform { rdd =>
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  import sqlContext.implicits._  // needed for the $"col" column syntax
  sqlContext.createDataFrame(rdd, someStruct).select($"col1", $"col2").rdd
}.foreachRDD { rdd => rdd.foreach(println) }

The action results in the generation of a Spark plan. The problem is that Spark Streaming
does this for every mini-batch. The example above is simple, so turning it into a Spark
plan is cheap, but the operations inside transform() can get really complex. In that case,
generating the Spark plan can take a considerable amount of time (comparable to the batch
interval), and the batches get delayed significantly.

I would instead like to generate the Spark plan once (the logic stays the same; only the
input RDD changes) and reuse it for every subsequent mini-batch by substituting the new
RDD into the plan.

I'm sure Structured Streaming solves this problem, but is there any way I could do that
with Spark Streaming (the DStream API)?
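
For comparison, here is a minimal Structured Streaming sketch of the same kind of selection. This is only an illustration: it assumes the built-in "rate" source, and its columns timestamp/value stand in for col1/col2 from the snippet above.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("structured-select-sketch").getOrCreate()
import spark.implicits._

// The built-in "rate" source emits rows with columns `timestamp` and `value`;
// a real job would plug in its own source and column names (col1, col2 above).
val source = spark.readStream.format("rate").option("rowsPerSecond", "5").load()

// The selection is expressed once as a streaming query; the engine then runs it
// incrementally for each micro-batch instead of the application rebuilding it
// inside transform() every batch interval.
val query = source.select($"timestamp", $"value")
  .writeStream
  .format("console")
  .outputMode("append")
  .start()

query.awaitTermination()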

> Scheduling delay causes Spark Streaming to miss batches.
> --------------------------------------------------------
>
>                 Key: SPARK-23397
>                 URL: https://issues.apache.org/jira/browse/SPARK-23397
>             Project: Spark
>          Issue Type: Bug
>          Components: DStreams
>    Affects Versions: 2.2.1
>            Reporter: Shahbaz Hussain
>            Priority: Major
>
> * For complex Spark (Scala) DStream-based applications that create, say, 40 jobs for
> every batch, it has been observed that batches are not created at the expected time.
> For example, if a Spark Streaming application is started with a batch interval of 20
> seconds and creates around 40 jobs per batch, the next batch is not created 20 seconds
> after the previous job creation time.
>  * This is because job creation is single-threaded: if the job creation delay is
> greater than the batch interval, the batch execution misses its schedule.


