viirya commented on pull request #32136:
URL: https://github.com/apache/spark/pull/32136#issuecomment-820814520


   > Correct me if I'm wrong: Spark tries its best to schedule SS tasks on 
executors that have existing state store data. This is already the case and is 
implemented via the preferred location. The problem we are solving here is the 
first micro-batch, where there is no existing state store data and we want to 
schedule the tasks of the first micro-batch evenly on the cluster. This is to 
avoid skews in the future that many SS tasks are running on very few executors.
   
   That is correct. However, even for micro-batches after the first, we 
currently rely on preferred locations plus a non-trivial locality config 
(e.g., 10h) to force Spark to schedule tasks onto their previous locations. I 
think this is not flexible because locality is a global setting: a non-trivial 
locality config might produce sub-optimal results for other stages. It also 
requires end-users to set it themselves, which makes it feel like an 
unfriendly approach.
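   To make the current workaround concrete: a sketch of the global setting in 
question, as it might appear in `spark-defaults.conf` (the 10h value is the 
illustrative one mentioned above, not a recommendation):

```
# Global locality wait: the scheduler holds a task for up to this long
# waiting for a slot on its preferred (state-store) executor.
# Because this applies to ALL stages, not just stateful ones, other
# stages may also stall waiting for locality they do not need.
spark.locality.wait    10h
```

   The point is that this single knob cannot distinguish stateful streaming 
stages (where waiting is worthwhile) from ordinary stages (where it is not).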


