mengxr edited a comment on issue #27722: [SPARK-30969][CORE] Remove resource 
coordination support from Standalone
URL: https://github.com/apache/spark/pull/27722#issuecomment-592057231
 
 
   @tgravescs I suggested removing this resource coordination code in an 
offline 3.0 API audit discussion to keep the Spark codebase simple. Scheduling 
is already an area that lacks maintainers, and increasing its complexity would 
keep potential maintainers away. This is why we were very careful when we 
introduced the resource-aware scheduling feature. Here are the reasons to 
remove this feature before the Spark 3.0 release:
   
   * In a real standalone deployment, having multiple workers on the same host 
is no longer needed. As @Ngone51 mentioned, the feature was introduced to keep 
the worker JVM small and avoid long GC pauses when the worker and executors 
shared the same JVM. Now executors run in a different JVM, so having one worker 
per host is sufficient. Correct me if I'm wrong, since you mentioned several 
deployments using multiple workers on the same host.
   * For the reason above, instead of supporting GPU scheduling for multiple 
workers on the same host, we should deprecate support for multiple workers on 
the same host entirely in 3.0 and remove it in a future release, to further 
simplify the codebase.
   * The local-cluster mode is not a public feature; it should only be used 
in Spark tests. In fact, this change is the first to mention "local-cluster" in 
the user guide, which makes it "public". I don't think we want to add (even 
localized) complexity just for this mode. In a test setup, we can use one 
worker process and separate driver/worker scripts to simplify the resource 
allocation. Again, the point of using multiple workers in local-cluster mode is 
to simulate a real cluster, not to provide a "production" setup. If this is for 
simulation, we just need fake discovery scripts.
   * For worker recovery, my understanding is that the old worker and the 
recovery worker should never be running at the same time, because recovery is 
usually done by monit, which monitors the worker process. cc: @jiangxb1987 
   
   cc: @squito 
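
To illustrate what a "fake discovery script" for a simulated cluster could look like, here is a minimal sketch. It assumes the discovery output is a single JSON object with a `name` and a list of `addresses`, as described in the Spark resource scheduling docs; the function name `fake_gpu_discovery` and the GPU count are hypothetical and not part of this PR.

```python
import json


def fake_gpu_discovery(num_gpus: int = 2) -> str:
    """Return the JSON a fake discovery script would print for a simulated host.

    Assumption: the expected format is {"name": ..., "addresses": [...]}.
    No real hardware is probed; the addresses are made-up indices.
    """
    addresses = [str(i) for i in range(num_gpus)]
    return json.dumps({"name": "gpu", "addresses": addresses})


if __name__ == "__main__":
    # A discovery script reports its resources on stdout.
    print(fake_gpu_discovery())
```

In a test setup, a script like this could be given to the single worker process, so no per-host coordination between workers would be needed.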

