Github user thunterdb commented on the issue:
https://github.com/apache/spark/pull/16774
Thanks for working on this task; this is a much-requested feature. While it
will work for simple cases in its current shape, it is going to cause
issues for complex deployments (Apache Toree, Livy, Databricks, etc.)
because the thread pool that controls the computations is not managed. The
default assumption with `.par` is a lot of quick tasks. With the current
implementation, because the same thread pool is shared across all
the parallel collections, users are going to encounter mysterious freezes
in other places while the ML models finish training (I am speaking from
experience here).
While the situation with `.par` has notably improved with Scala 2.10, it is
better to:
- create a dedicated thread pool for each `.fit`, which users can replace.
- use futures whose execution context is tied to the thread pool
above.
- not use semaphores, but instead rely on the thread limit at the thread
pool level to cap the number of concurrent executions.
If you do not do that, users in a shared environment like any of the above
will experience mysterious freezes depending on what other users are
doing. Ideally the default resources should be tied to `SparkSession`, but we
can start with a default static pool marked as an experimental API.
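To make the three recommendations above concrete, here is a minimal sketch in plain Scala. It is not the code from this PR: `DedicatedPoolSketch`, `trainOne` and `fitAll` are hypothetical names, and `trainOne` merely stands in for fitting one model. The pool is created per call, the futures run on an execution context backed by that pool, and the pool size alone caps concurrency.

```scala
import java.util.concurrent.{ExecutorService, Executors}
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object DedicatedPoolSketch {
  // Hypothetical stand-in for training one model on one parameter setting.
  def trainOne(param: Int): Int = param * param

  // Runs trainOne over all params on a pool created just for this call.
  // The pool size, not a semaphore, caps the number of concurrent tasks.
  def fitAll(params: Seq[Int], numThreads: Int): Seq[Int] = {
    val pool: ExecutorService = Executors.newFixedThreadPool(numThreads)
    // Futures created below are scheduled on this pool only.
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      val futures = params.map(p => Future(trainOne(p)))
      // The one-minute timeout is an arbitrary choice for this sketch.
      Await.result(Future.sequence(futures), 1.minute)
    } finally {
      pool.shutdown() // release the threads once this call is done
    }
  }
}
```

Because nothing here touches the pool behind `.par` or the global execution context, a long-running call to `fitAll` should not starve unrelated parallel work in the same JVM.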
More concretely, the API should look like this, I believe:
```
// Creates an execution context with the given max number of threads.
def setNumParallelEval(num: Int)
// Uses the given executor service instead of an executor service shared
// by all the ML calculations.
def setExecutionExecutorService(exec: ExecutorService)
```
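One way these two setters could be backed is sketched below. Everything beyond the two method names is an assumption made for illustration: the trait name `HasExecutionPool`, the `this.type` return type, the `executionContext` accessor and the size of the default pool are all hypothetical, and pool lifecycle (shutting down a replaced pool) is deliberately left out.

```scala
import java.util.concurrent.{ExecutorService, Executors}
import scala.concurrent.ExecutionContext

// Hypothetical mix-in for illustration; not an existing Spark trait.
trait HasExecutionPool {
  // Starts on the shared static default pool until the user replaces it.
  private var executor: ExecutorService = HasExecutionPool.defaultPool

  // Creates an execution context with the given max number of threads.
  def setNumParallelEval(num: Int): this.type = {
    require(num > 0, s"num must be positive, got $num")
    executor = Executors.newFixedThreadPool(num)
    this
  }

  // Uses the given executor service instead of the shared default.
  def setExecutionExecutorService(exec: ExecutorService): this.type = {
    executor = exec
    this
  }

  // Futures launched during fitting would run on this context.
  protected def executionContext: ExecutionContext =
    ExecutionContext.fromExecutorService(executor)
}

object HasExecutionPool {
  // Static default pool; sizing it by core count is an assumption of this sketch.
  lazy val defaultPool: ExecutorService =
    Executors.newFixedThreadPool(Runtime.getRuntime.availableProcessors())
}
```

In a shared environment, a caller could then pass in an executor service that it owns and shuts down itself, so that one user's training run does not take threads from another's.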
See the doc in:
https://twitter.github.io/scala_school/concurrency.html#executor
https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ExecutorService.html
More about parallel collections:
http://stackoverflow.com/questions/5424496/scala-parallel-collections-degree-of-parallelism
http://stackoverflow.com/questions/14214757/what-is-the-benefit-of-using-futures-over-parallel-collections-in-scala