viirya commented on pull request #30812: URL: https://github.com/apache/spark/pull/30812#issuecomment-749356702
> I see. This makes sense. But why do we need to avoid this?
> What's the cost did you mean? The execution memory used by states?
> It would be great if you can explain your case and what issue you would like to solve in the PR description.

To avoid skewed memory usage on an executor. Yes, it is mainly about memory. For streaming queries that store large states, memory usage is a serious concern. I will update the PR description to make this clearer.

> Ideally, we should let the Spark task scheduler to do its work rather than doing the task scheduling work in SS because we don't have the full context of the executors. For example, this PR has to assume each executor has the same capability, while the task scheduler knows more about slow and fast executors.

A preferred location doesn't replace the task scheduler. It is only a suggestion, and the task scheduler can choose to use it or not. For example, we already ask later batches to schedule tasks on the same executors that stored the state in the previous batch. This is how preferred locations work, isn't it? This PR doesn't assume anything about executor capacity. It only suggests that the task scheduler distribute stateful tasks evenly across executors, if possible, when no state store location is available.
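
To make the idea concrete, here is a minimal sketch of the kind of suggestion described above. It is not the code in this PR and does not use any Spark API: the function `suggest_preferred_locations` and its parameters are hypothetical names chosen for illustration. It keeps a partition on the executor that already holds its state, and otherwise suggests the currently least-loaded executor, so that stateful partitions end up spread evenly. The result is only a hint; the scheduler would still be free to ignore it.

```python
from typing import Dict, List, Optional


def suggest_preferred_locations(
    num_partitions: int,
    executors: List[str],
    known_locations: Dict[int, str],
) -> Dict[int, Optional[str]]:
    """Suggest one executor per stateful partition (hypothetical helper)."""
    if not executors:
        # No active executors known: only pass through existing locations.
        return {p: known_locations.get(p) for p in range(num_partitions)}

    # Count partitions whose state already lives on an active executor.
    load = {e: 0 for e in executors}
    for p in range(num_partitions):
        executor = known_locations.get(p)
        if executor in load:
            load[executor] += 1

    suggestions: Dict[int, Optional[str]] = {}
    for p in range(num_partitions):
        executor = known_locations.get(p)
        if executor is None or executor not in load:
            # No usable state store location: pick the least-loaded executor.
            executor = min(executors, key=lambda e: load[e])
            load[executor] += 1
        suggestions[p] = executor
    return suggestions


if __name__ == "__main__":
    hints = suggest_preferred_locations(
        num_partitions=6,
        executors=["exec-1", "exec-2", "exec-3"],
        known_locations={0: "exec-1", 1: "exec-1"},
    )
    for partition, executor in hints.items():
        print(partition, executor)
```

In this sketch, partitions 0 and 1 stay where their state is, and the remaining four partitions are spread over the other two executors, so each executor ends up with two stateful partitions.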
