Github user thunterdb commented on the issue:

    https://github.com/apache/spark/pull/16774
  
    Thanks for working on this task; this is a much-requested feature. While it will work for simple cases in its current shape, it is going to cause issues in more complex deployments (Apache Toree, Livy, Databricks, etc.) because the thread pool that runs the computations is not managed. `.par` assumes a lot of quick tasks, and because the same thread pool is shared across all parallel collections, users are going to encounter mysterious freezes in other places while the ML models finish training (I am speaking from experience here).
    
    While the situation with `.par` has notably improved since Scala 2.10, it is better to:
     - create a dedicated thread pool for each `.fit`, which users can replace,
     - use futures whose execution context is tied to that thread pool, and
     - not use semaphores, relying instead on the thread limit of the pool to cap the number of concurrent executions (see the sketch after this list).
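
    A minimal sketch of that pattern, assuming a standalone Scala program: `fitOne` and `paramMaps` are placeholders for the estimator's `fit` call and the param maps being evaluated, not anything in this PR. The fixed-size pool itself caps the concurrency, so no semaphore is needed.

    ```scala
    import java.util.concurrent.Executors
    import scala.concurrent.duration.Duration
    import scala.concurrent.{Await, ExecutionContext, Future}

    object ParallelFitSketch {
      // Placeholders for the estimator's fit(dataset, paramMap) call and the param maps to evaluate.
      def fitOne(param: Int): String = s"model-$param"
      val paramMaps: Seq[Int] = Seq(1, 2, 3, 4, 5, 6, 7, 8)

      def main(args: Array[String]): Unit = {
        // Dedicated pool for this fit(); its size is the concurrency cap.
        val numParallelEval = 4
        val pool = Executors.newFixedThreadPool(numParallelEval)
        implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

        // One Future per param map; at most numParallelEval of them run at a time.
        val futures = paramMaps.map(p => Future(fitOne(p)))
        val models = Await.result(Future.sequence(futures), Duration.Inf)
        println(models)

        pool.shutdown()
      }
    }
    ```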
    
    If you do not do that, users in a shared environment like any of the above will experience mysterious freezes depending on what other users are doing. Ideally the default resources should be tied to the `SparkSession`, but we can start with a default static pool marked as an experimental API.
    
    More concretely, the API should look like this, I believe:
    ```scala
      // Creates an execution context with the given maximum number of threads.
      def setNumParallelEval(num: Int)

      // Uses the given executor service instead of an executor service shared by
      // all the ML calculations.
      def setExecutionExecutorService(exec: ExecutorService)
    ```
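
    To give an idea of how those setters could be backed (the trait name and `getExecutionContext` below are purely illustrative, not part of this PR), something along these lines:

    ```scala
    import java.util.concurrent.{ExecutorService, Executors}
    import scala.concurrent.ExecutionContext

    // Illustrative only: the estimator keeps either a user-supplied ExecutorService
    // or builds a fixed-size pool on demand; the pool size is the concurrency cap.
    trait HasParallelEval {
      private var numParallelEval: Int = 1
      private var executorService: Option[ExecutorService] = None

      def setNumParallelEval(num: Int): this.type = {
        require(num >= 1, "numParallelEval must be >= 1")
        numParallelEval = num
        this
      }

      def setExecutionExecutorService(exec: ExecutorService): this.type = {
        executorService = Some(exec)
        this
      }

      // Execution context used by fit(); honors a user-supplied executor if one was set.
      protected def getExecutionContext: ExecutionContext = executorService match {
        case Some(exec) => ExecutionContext.fromExecutorService(exec)
        case None =>
          ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(numParallelEval))
      }
    }
    ```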
    
    See the docs:
    https://twitter.github.io/scala_school/concurrency.html#executor
    
https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ExecutorService.html
    
    More about parallel collections:
    
http://stackoverflow.com/questions/5424496/scala-parallel-collections-degree-of-parallelism
    
http://stackoverflow.com/questions/14214757/what-is-the-benefit-of-using-futures-over-parallel-collections-in-scala

