viirya commented on pull request #32136: URL: https://github.com/apache/spark/pull/32136#issuecomment-820814520
> Correct me if I'm wrong: Spark tries its best to schedule SS tasks on executors that have existing state store data. This is already the case and is implemented via the preferred location. The problem we are solving here is the first micro-batch, where there is no existing state store data and we want to schedule the tasks of the first micro-batch evenly on the cluster. This is to avoid skews in the future where many SS tasks are running on very few executors.

That is correct. However, even for non-first micro-batches, we currently use preferred locations plus a non-trivial locality config (e.g., 10h) to force Spark to schedule tasks onto their previous locations. I think this is not flexible, because locality is a global setting: a non-trivial locality config might cause sub-optimal results for other stages. It also requires end users to set it themselves, which does not feel like a user-friendly approach.
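For context, the global locality setting referred to above is Spark's `spark.locality.wait` configuration. As a sketch (the 10h value mirrors the example in the comment and is illustrative, not a recommendation), the workaround looks like:

```properties
# spark-defaults.conf (illustrative value)
# Makes the scheduler wait up to 10 hours for a slot at a task's
# preferred location before falling back to a less-local one.
# Because this knob is global, it affects every stage in the
# application, not just the stateful streaming stages whose tasks
# carry state store data -- which is the inflexibility being
# pointed out here.
spark.locality.wait  10h
```

The per-level variants (`spark.locality.wait.process`, `spark.locality.wait.node`, `spark.locality.wait.rack`) inherit this value unless overridden, so a large global wait can delay scheduling across the whole job.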
