tgravescs commented on issue #25047: [WIP][SPARK-27371][CORE] Support GPU-aware resources scheduling in Standalone URL: https://github.com/apache/spark/pull/25047#issuecomment-509276365

I have a few general questions; note I haven't looked at all of the code yet.

I'm not an expert in standalone mode, but it supports both a client mode and a cluster mode. In your description, are you saying even client mode will use the resource file and lock it? How do you know the client is running on a node with GPUs, or on a worker node? I guess as long as the location is the same it doesn't matter. This is one thing we never handled in YARN: in client mode the user is on their own for resource coordination.

It seems unreliable to assume you have multiple workers per node (for the case where a worker crashes). When a worker dies it automatically kills any executors, correct? Is there any chance it doesn't? It feels like what you really want is something like worker restart and recovery: if a worker crashes and you restart it, it should come back with the same id and rediscover what it had reserved before, including any executors still running. But that is probably a much bigger change.

Standalone uses a PID dir to tell which workers are running, correct? It could use this to track and check allocated resources. If you track the PID along with each assignment, then when a new worker or driver starts it can check whether enough resources remain for it to allocate, either based on the recorded PIDs, or based on what the Master says, or both: when a new Worker comes up, it asks the Master which workers are supposed to be there, then checks whether those processes still exist.
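To make the PID-tracking idea concrete, here is a rough sketch (my own invention, not code from this PR; the class name, method, and assignment-map layout are all hypothetical). Given a map of owning PID to reserved GPU addresses, a restarting worker could reclaim any GPUs whose owner process no longer exists. The liveness check is passed in as a predicate so it can be backed by e.g. `ProcessHandle.of(pid).isPresent()` in real use:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.LongPredicate;

// Hypothetical sketch of PID-based resource reclamation, not Spark's
// actual implementation.
public class ResourceReclaimer {
    /**
     * assignments: owning PID -> GPU addresses that process had reserved.
     * isAlive: liveness check for a PID, e.g.
     *          pid -> ProcessHandle.of(pid).isPresent() on Java 9+.
     * Returns the GPU addresses whose owner is no longer running and
     * which can therefore be handed out again.
     */
    public static List<String> reclaimable(Map<Long, List<String>> assignments,
                                           LongPredicate isAlive) {
        List<String> freed = new ArrayList<>();
        for (Map.Entry<Long, List<String>> e : assignments.entrySet()) {
            if (!isAlive.test(e.getKey())) {
                // Owner process is gone; its reservations are stale.
                freed.addAll(e.getValue());
            }
        }
        return freed;
    }
}
```

A real version would also have to hold the resource file's lock while reading and rewriting the assignments, so two workers starting at once don't both reclaim the same GPUs.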
