[GitHub] [spark] viirya commented on pull request #30812: [SPARK-33814][SS] Provide preferred locations for stateful operations without reported state store locations

GitBox Mon, 21 Dec 2020 10:17:26 -0800


viirya commented on pull request #30812:
URL: https://github.com/apache/spark/pull/30812#issuecomment-749121051



   > IMO, this looks a hacky approach to workaround a Spark task scheduler 
issue. Have we tried to improve Spark task scheduler which will be a general 
improvement for all cases that miss preferred locations? For example, reading 
files from a cloud storage such as S3 has the same issue. Right?
   
   This is an issue for stateful operations because of the expected high cost of 
maintaining multiple states in the same executor. Because we prefer the executor 
of the previous streaming batch as the state location, once the first batch 
chooses a bad state distribution, it can cause performance issues in later batches.
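   
   A toy sketch of this effect, assuming a simplified placement model. This is 
not Spark code: the function names, the executor ids, and the "skewed" first-batch 
outcome are all hypothetical and only illustrate how a first-batch placement 
carries over once later batches prefer the previous state location.

```python
from collections import Counter


def assign_first_batch(num_partitions, executors, skewed):
    """Toy placement for the first batch, before any state location is known."""
    if skewed:
        # Hypothetical worst case: every task lands on the same executor.
        return {p: executors[0] for p in range(num_partitions)}
    # Hypothetical even spread across all executors.
    return {p: executors[p % len(executors)] for p in range(num_partitions)}


def next_batch(previous):
    """Later batches reuse the executor that already holds each partition's state."""
    return dict(previous)


def states_per_executor(placement):
    """Count how many state partitions each executor maintains."""
    return Counter(placement.values())


executors = ["exec-1", "exec-2", "exec-3", "exec-4"]  # hypothetical executor ids
skewed = assign_first_batch(8, executors, skewed=True)
even = assign_first_batch(8, executors, skewed=False)

# Run a few more batches; the placement never changes in this model.
for _ in range(3):
    skewed = next_batch(skewed)
    even = next_batch(even)

print(states_per_executor(skewed))
print(states_per_executor(even))
```

   In this model the skewed placement keeps all 8 states on one executor in every 
later batch, while the even placement keeps 2 per executor.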
   
   Moreover, this is more severe for Structured Streaming because the first 
batch usually finishes very quickly when it takes its payload from the 
latest offsets.
   
   I don't think this is an issue in the Spark task scheduler for general tasks.


