[
https://issues.apache.org/jira/browse/SPARK-11175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14962708#comment-14962708
]
Saisai Shao commented on SPARK-11175:
-------------------------------------
There's a configuration, "spark.streaming.concurrentJobs", that may satisfy
your needs; it defaults to 1, which means jobs are submitted one at a time.
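A minimal sketch of setting that property (assuming PySpark; only the property
name comes from this thread, the surrounding setup is assumed):

```python
from pyspark import SparkConf

# Hypothetical sketch: allow up to 4 jobs from the same batch to be
# submitted concurrently instead of one at a time. The property name is
# taken from the comment above; the rest of the application is assumed.
conf = SparkConf().set("spark.streaming.concurrentJobs", "4")
```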
> Concurrent execution of JobSet within a batch in Spark streaming
> ----------------------------------------------------------------
>
> Key: SPARK-11175
> URL: https://issues.apache.org/jira/browse/SPARK-11175
> Project: Spark
> Issue Type: Improvement
> Components: Streaming
> Reporter: Yongjia Wang
>
> A Spark StreamingContext can register multiple independent Input DStreams
> (such as from different Kafka topics), which results in multiple independent
> jobs for each batch. These jobs would be better run concurrently, to make
> full use of the available resources. The current behavior is that these jobs
> end up in an invisible job queue and are submitted one by one.
> I went through a few hacks:
> 1. Launch the RDD action in a new thread from the function passed to
> foreachRDD. However, this messes up the streaming statistics, since the
> batch finishes immediately even though the jobs it launched are still
> running in another thread. This can further affect resuming from a
> checkpoint, since all batches are marked complete right away even though
> the threaded jobs may fail, and the checkpoint only resumes the last batch.
> 2. It's possible, using just foreachRDD and the available APIs, to block the
> JobSet while waiting for all threads to join, but doing so messes up closure
> serialization and makes checkpointing unusable.
> 3. Instead of running multiple DStreams in one streaming context, run them
> in separate streaming contexts (separate Spark applications). Setting aside
> the extra deployment overhead, on a Spark standalone cluster, which only has
> a FIFO scheduler across applications, the resources have to be allocated in
> advance and will not adjust automatically when the cluster is resized.
> Therefore, I think there is a good case for making the default behavior run
> all jobs of the current batch concurrently, and marking the batch complete
> only when all of its jobs have completed.
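The pitfall described in hack 1 (fire-and-forget threads making the batch look
complete immediately) can be illustrated without Spark at all, using plain
Python threads as a hypothetical stand-in for the real foreachRDD code:

```python
import threading
import time

results = []

def batch_job(i):
    # Stand-in for an RDD action launched from foreachRDD.
    time.sleep(0.2)
    results.append(i)

# Fire-and-forget, as in hack 1: start the jobs and return immediately.
threads = [threading.Thread(target=batch_job, args=(i,)) for i in range(3)]
for t in threads:
    t.start()

# At this point the "batch" would be marked complete, but the jobs are
# almost certainly still running (results is likely still empty here).
print("completed before join:", len(results))

# Blocking until every thread joins is what hack 2 attempts; only now
# have all three jobs actually finished.
for t in threads:
    t.join()
print("completed after join:", sorted(results))  # prints [0, 1, 2]
```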
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]