[ https://issues.apache.org/jira/browse/SPARK-11175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yongjia Wang closed SPARK-11175.
--------------------------------
    Resolution: Not A Problem

> Concurrent execution of JobSet within a batch in Spark streaming
> ----------------------------------------------------------------
>
>                 Key: SPARK-11175
>                 URL: https://issues.apache.org/jira/browse/SPARK-11175
>             Project: Spark
>          Issue Type: Improvement
>          Components: Streaming
>            Reporter: Yongjia Wang
>
> A Spark StreamingContext can register multiple independent input DStreams
> (such as from different Kafka topics), which results in multiple
> independent jobs for each batch. These jobs would be better run
> concurrently to take full advantage of available resources. The current
> behavior is that these jobs end up in an invisible job queue and are
> submitted one by one.
> I went through a few hacks:
> 1. Launch the RDD action in a new thread from the function passed to
> foreachRDD. However, this distorts the streaming statistics, since the
> batch finishes immediately even though the jobs it launched are still
> running in another thread. It can also affect resuming from a checkpoint,
> since all batches are marked complete right away even though the threaded
> jobs may fail, and the checkpoint only resumes the last batch.
> 2. It is possible, using only foreachRDD and the available APIs, to block
> the JobSet until all threads have joined, but doing so interferes with
> closure serialization and makes checkpointing unusable.
> 3. Instead of running multiple DStreams in one streaming context, run them
> in separate streaming contexts (separate Spark applications). Extra
> deployment overhead aside, on a Spark standalone cluster, which only has a
> FIFO scheduler across applications, the resources have to be set in
> advance and will not adjust automatically when the cluster is resized.
> Therefore, I think there is a good use case for making the default
> behavior run all jobs of the current batch concurrently and mark the batch
> complete only when all of its jobs have completed.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
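
Editor's note: the difference between the reported behavior and the requested one can be sketched without Spark at all. The plain-Python model below is only an illustration under stated assumptions: `run_batch_sequential`, `run_batch_concurrent`, and `make_job` are hypothetical names, not Spark APIs, and each "job" is just a callable standing in for the output action of one input stream. It shows the completion rule the issue asks for, where the batch returns only after every job has finished and a failed job fails the batch.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_batch_sequential(jobs):
    """Model of the reported behavior: a batch's jobs run one after another."""
    return [job() for job in jobs]


def run_batch_concurrent(jobs):
    """Model of the proposal: start every job of the batch at once and
    treat the batch as complete only when all of them have finished."""
    if not jobs:
        return []
    with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
        futures = [pool.submit(job) for job in jobs]
        # result() blocks until the job ends and re-raises its exception,
        # so a failed job fails the batch instead of being lost in a thread.
        return [future.result() for future in futures]


def make_job(name, seconds):
    """Hypothetical stand-in for the output action of one input stream."""
    def job():
        time.sleep(seconds)
        return name
    return job


if __name__ == "__main__":
    jobs = [make_job("topic-a", 0.2), make_job("topic-b", 0.2)]
    for runner in (run_batch_sequential, run_batch_concurrent):
        start = time.perf_counter()
        results = runner(jobs)
        elapsed = round(time.perf_counter() - start, 2)
        print(runner.__name__, results, elapsed)
```

Unlike hack 1 in the description, nothing here is fire-and-forget: the caller blocks until the last job ends, which is what keeps batch completion and failure reporting tied to the jobs themselves.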