viirya commented on pull request #30812: URL: https://github.com/apache/spark/pull/30812#issuecomment-749121051
> IMO, this looks a hacky approach to workaround a Spark task scheduler issue. Have we tried to improve Spark task scheduler which will be a general improvement for all cases that miss preferred locations? For example, reading files from a cloud storage such as S3 has the same issue. Right?

This is an issue for stateful operations because of the expected high cost of maintaining multiple states on the same executor. Because we prefer the executor used by the previous streaming batch as the state location, once the first batch chooses a bad state distribution, it can cause performance issues in later batches. Moreover, it is more severe for Structured Streaming because the first batch usually finishes very quickly when it takes its payload from the latest offsets. I don't think this is a Spark task scheduler issue for general tasks.
