jiangxb1987 commented on issue #24841: [SPARK-27369][CORE] Setup resources when Standalone Worker starts up
URL: https://github.com/apache/spark/pull/24841#issuecomment-505987017
 
 
Just my 2 cents: we may rely on the discovery script to discover all the available resources on the host, but that doesn't mean the Worker shall occupy them all. My idea is that the Worker shall send the information about all available resources to the Master on register, and the Master shall then be responsible for coordination and tell the Worker which addresses it shall use, roughly as sketched below.
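A minimal sketch of that handshake, with hypothetical message shapes (the names and fields are assumptions, not the PR's actual RPC protocol):

```scala
// Hypothetical messages for the register-time resource handshake.
case class RegisterWorkerWithResources(
    workerId: String,
    // everything the discovery script found, e.g. "gpu" -> Seq("0", "1")
    discoveredResources: Map[String, Seq[String]])

case class RegisteredWorker(
    // the subset the Master assigned; the Worker must not touch
    // any address outside this map
    allocatedResources: Map[String, Seq[String]])
```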
One tricky thing here is when to recycle the resources. Ideally we shall recycle them when a Worker is lost, but as Tom mentioned, a lost Worker may still be running, so there is a possibility that the resource addresses are still in use. Instead, we shall make the Worker explicitly send a `ReleaseResource` message to the Master to release the resource addresses it no longer needs. This way we can completely avoid resource address collisions, but when a Worker dies silently, we may never recycle the resources allocated to it.
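The `ReleaseResource` message could look like the following (the name follows the proposal above; the fields are assumptions). On receipt, the Master would simply return the listed addresses to its free pool:

```scala
// Hypothetical shape for the explicit release path proposed above.
case class ReleaseResource(
    workerId: String,
    // addresses the Worker no longer needs, e.g. "gpu" -> Seq("1")
    resources: Map[String, Seq[String]])
```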
    
> **Issue1**
> I said, suppose this case: on node A we have provided a discovery script that returns `{"name": "gpu", "addresses": ["0","1"]}`; if we then launch 2 Workers on node A, what will happen? Both Workers are allocated gpu-0 and gpu-1. Where is the code to detect this case and raise an error?

The Master will keep track of which addresses have been allocated to each Worker, so when a new Worker joins, the Master can detect which addresses are free and make sure to allocate only the free addresses to it, along the lines of the sketch below.
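A minimal sketch of that bookkeeping, assuming the Master tracks allocations per (host, resource) pair (all names here are hypothetical):

```scala
import scala.collection.mutable

// (host, resourceName) -> addresses already handed out on that host
val allocated = mutable.Map.empty[(String, String), Set[String]]

def allocate(host: String, resource: String,
    discovered: Seq[String], amount: Int): Seq[String] = {
  val used = allocated.getOrElse((host, resource), Set.empty[String])
  val free = discovered.filterNot(used)
  // The second Worker on node A fails here if both request 2 GPUs but
  // the discovery script only reports gpu-0 and gpu-1.
  require(free.size >= amount,
    s"only ${free.size} free '$resource' addresses left on $host, need $amount")
  val grant = free.take(amount)
  allocated((host, resource)) = used ++ grant
  grant
}
```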
   
> **Issue2**
> I said, for example, suppose we have configured the Worker to request 2 GPUs, but the discovery script allocates more than 2 GPUs when the Worker launches; is that a good discovery script? I think the discovery script should allocate exactly the GPU amount the Worker requests, so I suggest the verification in `assertAllResourceAllocationsMeetRequests` be `numAllocatedByDiscoveryScript == numRequested` (the current verification is `numAllocatedByDiscoveryScript >= numRequested`).
   
We don't require the number of addresses detected by the discovery script to exactly match the request, because that assumption is too strong, and the hardware on a node may also change over time (we may add new devices, and devices may break). So we only require that the number of discovered addresses is at least the requested number, and Spark will ensure we only use the requested number of devices.
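A sketch of that `>=` check; the method name matches the PR, but the inputs here are simplified assumptions:

```scala
def assertAllResourceAllocationsMeetRequests(
    discovered: Map[String, Seq[String]],
    requested: Map[String, Int]): Unit = {
  requested.foreach { case (name, amount) =>
    val found = discovered.getOrElse(name, Seq.empty).size
    // ">=" rather than "==": extra addresses are tolerated, Spark just
    // won't use more than the requested amount.
    require(found >= amount,
      s"discovery script found $found '$name' address(es) but $amount requested")
  }
}
```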
