viirya commented on pull request #32136: URL: https://github.com/apache/spark/pull/32136#issuecomment-819145361
> a) In the case of streaming workloads, I think the locality info here is about the state store instead of data. e.g.,
>
> So I think that locality preference scheduling (or delay scheduling) would also apply to it (the state store location).
>
> b) That being said, I actually had the same concern when I touched the streaming code. Because I know that delay scheduling doesn't guarantee that the final scheduling location is the preferred location provided by the task, the cost of reloading the state store could still exist.

Let me figure out the difference between (a) and (b). So (a) is about using locality for the state store location, and (b) is that locality cannot guarantee the actual location. Right? Please let me know if I misunderstand.

I have tried to use locality for tasks with state stores in #30812. As you know (in the code snippet), SS actually already uses locality for the state store location. However, it has a few problems:

1. It uses the previous state store location as the locality preference, so if there is no previous location info, we still let Spark pick an executor arbitrarily.
2. It depends on the initially chosen executor-to-state-store mapping. So if Spark chooses a sub-optimal mapping, locality doesn't work well for later batches.
3. Forcibly assigning state stores to executors can possibly lead to unreasonable scheduling decisions. For example, we don't know whether the executor satisfies the resource requirements.
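The interaction of problems 1 and 2 above can be sketched with a toy model (this is not Spark code; `assign` and `pick_arbitrary` are hypothetical names for illustration): each micro-batch, a partition's preferred executor is wherever its state store lived in the previous batch, and with no history the executor is picked arbitrarily, so a skewed initial mapping is then pinned by locality for every later batch.

```python
def assign(partitions, executors, prev_location, pick_arbitrary):
    """Toy model: return a {partition: executor} mapping for one micro-batch."""
    mapping = {}
    for p in partitions:
        if p in prev_location:
            # Locality preference: reuse the previous state store host (problem 2:
            # whatever mapping was chosen first is kept, optimal or not).
            mapping[p] = prev_location[p]
        else:
            # No previous location info: arbitrary pick (problem 1).
            mapping[p] = pick_arbitrary(p, executors)
    return mapping

executors = ["exec-1", "exec-2", "exec-3"]
partitions = [0, 1, 2]

# Batch 0: no history, and an arbitrary picker that piles everything on exec-1.
batch0 = assign(partitions, executors, {}, lambda p, ex: ex[0])

# Batch 1: locality now pins every partition to the skewed batch-0 mapping.
batch1 = assign(partitions, executors, batch0, lambda p, ex: ex[0])

print(batch0)           # {0: 'exec-1', 1: 'exec-1', 2: 'exec-1'}
print(batch1 == batch0) # True: the sub-optimal mapping persists
```

In this sketch nothing ever rebalances the stores, which is the gist of problem 2: delay scheduling only expresses a *preference* for the previous location, so a bad first mapping is self-reinforcing rather than corrected.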
