jiangxb1987 commented on issue #24841: [SPARK-27369][CORE] Setup resources when Standalone Worker starts up
URL: https://github.com/apache/spark/pull/24841#issuecomment-505987017

Just my 2 cents: we may rely on the discovery script to discover all the available resources on the host, but that doesn't mean the Worker should occupy them all. My idea is that the Worker should send the information about all available resources to the Master on registration, and the Master should then be responsible for coordination and tell the Worker which addresses it should use.

One tricky thing here is when to recycle the resources. Ideally we should recycle them when a Worker is lost, but as Tom mentioned, a lost Worker may still be running, so there is a possibility that its resource addresses are still in use. Instead, we should make the Worker explicitly send a `ReleaseResource` message to the Master to release the resource addresses it no longer needs (see the sketch below). This way we can completely avoid resource address collisions, but when a Worker dies silently, we may never recycle the resources allocated to it.

> **Issue1**
> I said, suppose the case: on machine node A we have provided a discovery script like `{"name": "gpu", "addresses": ["0","1"]}`; if we then launch 2 Workers on node A, what will happen? The 2 Workers are both allocated gpu-0 and gpu-1. Where is the code that detects this case and raises an error?

The Master will keep track of which addresses have been allocated to each Worker, so when a new Worker joins, the Master will detect which addresses are free and make sure to allocate only the free addresses to it.

> **Issue2**
> I said, for example, suppose we have configured the Worker to request a gpu amount of 2, but the discovery script allocates more than 2 gpus when this Worker launches — is that a good discovery script? I think the discovery script should allocate exactly the gpu amount the Worker requests, so I suggest the verification in `assertAllResourceAllocationsMeetRequests` should be "numAllocatedByDiscoveryScript == numRequested" (the current verification is "numAllocatedByDiscoveryScript >= numRequested").

We don't require the number of addresses detected by the discovery script to be exactly the same as the request, because that assumption is too strong, and the hardware on a node may also change over time (we may add new devices, and devices may break). So we only require the number of discovered addresses to be greater than or equal to the request, and Spark will ensure we only use the requested number of devices.
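For concreteness, here is a minimal sketch of the Master-side bookkeeping described above. The names (`ResourceCoordinatorSketch`, `allocate`, `release`) are illustrative and not the actual Master implementation; the `ReleaseResource` message is represented simply as a call to `release`:

```scala
import scala.collection.mutable

// A minimal sketch of the proposed coordination, not the actual Spark code.
object ResourceCoordinatorSketch {

  // Per (host, resource name), the set of addresses currently assigned to some Worker.
  private val allocated = mutable.Map[(String, String), mutable.Set[String]]()

  /**
   * Called when a Worker registers and reports the addresses its discovery
   * script found. Returns only the addresses that are still free on that host,
   * capped at `amount`, and records them as allocated to that Worker.
   */
  def allocate(host: String, resourceName: String,
               discovered: Seq[String], amount: Int): Seq[String] = synchronized {
    val used = allocated.getOrElseUpdate((host, resourceName), mutable.Set.empty)
    val free = discovered.filterNot(used.contains).take(amount)
    used ++= free
    free
  }

  /**
   * Called when a Worker explicitly reports (e.g. via a `ReleaseResource`
   * message) that it no longer needs some addresses.
   */
  def release(host: String, resourceName: String, addresses: Seq[String]): Unit = synchronized {
    allocated.get((host, resourceName)).foreach(_ --= addresses)
  }
}
```

With this, in the Issue1 scenario the first Worker on node A requesting 2 gpus would get `Seq("0", "1")`, while a second Worker registering on the same node would get an empty allocation and its registration could fail with a clear error instead of silently double-allocating gpu-0 and gpu-1.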
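And for the Issue2 point, a minimal sketch of the relaxed check argued for above, again with illustrative names rather than the actual `assertAllResourceAllocationsMeetRequests` implementation:

```scala
// Sketch only: the discovery script may report more addresses than requested;
// Spark only rejects the case where it reports fewer, and then uses just the
// requested number of addresses.
def validateAndTrim(resourceName: String,
                    discoveredAddresses: Seq[String],
                    amountRequested: Int): Seq[String] = {
  require(discoveredAddresses.size >= amountRequested,
    s"Resource $resourceName: discovery script only found ${discoveredAddresses.size} " +
      s"addresses but $amountRequested were requested")
  // Hardware can change over time, so extra addresses are tolerated; only the
  // requested number is actually assigned to the Worker.
  discoveredAddresses.take(amountRequested)
}
```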
