viirya commented on pull request #32136:
URL: https://github.com/apache/spark/pull/32136#issuecomment-819145361


   > a) In the case of streaming workloads, I think the locality info here is about the state store instead of the data. e.g.,
   > So I think that locality preference scheduling (or delay scheduling) would also apply to it (the state store location).
   >
   > b) That being said, I actually had the same concern when I touched the streaming code, because I know that delay scheduling doesn't guarantee that the final scheduling location will be the preferred location provided by the task. So the cost of reloading the state store could still exist.
   
   Let me clarify the difference between (a) and (b). So (a) is about using locality for the state store location, and (b) is that locality cannot guarantee the actual scheduling location. Right? Please let me know if I misunderstand.
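   To make point (b) concrete, here is a toy model of delay scheduling (not Spark's actual `TaskSetManager`; the function name, executor ids, and round-based wait are all illustrative): a task waits a bounded number of scheduling rounds for its preferred executor, then falls back to any free one, which for a stateful task means reloading the state store.

   ```python
   # Toy model of delay scheduling: a task with a preferred executor waits up
   # to `locality_wait` scheduling rounds for a free slot there; after that it
   # falls back to any free executor. For a stateful streaming task, the
   # fallback implies reloading the state store from the checkpoint.
   def assign_task(preferred, free_by_round, locality_wait):
       """free_by_round: list of sets of free executors, one per round."""
       for round_no, free in enumerate(free_by_round):
           if preferred in free:
               return preferred, False          # locality honored, no reload
           if round_no >= locality_wait and free:
               return next(iter(free)), True    # fallback: state store reload
       return None, False                       # never scheduled

   # The preferred executor "exec-1" stays busy past the wait budget, so the
   # task lands on "exec-2" and must reload its state store.
   placement, reload = assign_task(
       "exec-1",
       [{"exec-2"}, {"exec-2"}, {"exec-2"}],
       locality_wait=1,
   )
   ```

   This is exactly the gap described above: the preferred location is only a hint, and once the wait budget expires the task can land anywhere.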
   
   I have tried to use locality for tasks with state stores in #30812. As you know (from the code snippet), SS actually already uses locality for the state store location. However, it has a few problems:
   
   1. It uses the previous state store location as the locality preference, so if there is no previous location info, we still let Spark pick an executor arbitrarily.
   2. It depends on the initially chosen executor-to-state-store mapping. So if Spark chooses a sub-optimal mapping, locality doesn't work well for later batches.
   3. Forcibly assigning state stores to executors can possibly lead to unreasonable scheduling decisions. For example, we don't know whether the executor satisfies the resource requirements.
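   A minimal sketch of problem 1, assuming the previous-batch locations are tracked per partition (the function name, dictionary shape, and executor ids are hypothetical, standing in for what the state store coordinator reports): with no history the preference list is empty, Spark places the partition arbitrarily, and that possibly sub-optimal choice then becomes sticky for later batches.

   ```python
   # Sketch of deriving a partition's preferred location from where its state
   # store lived in the previous batch. An empty list means Spark is free to
   # pick any executor, and whatever it picks becomes the "previous location"
   # for every batch after that.
   def preferred_locations(partition, prev_batch_locations):
       loc = prev_batch_locations.get(partition)
       return [loc] if loc is not None else []   # empty => arbitrary placement

   # Batch 1: no history, so no locality preference at all.
   first_batch = preferred_locations(0, {})
   # Batch 2+: the executor batch 1 happened to choose is now sticky.
   later_batch = preferred_locations(0, {0: "exec-3"})
   ```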


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


