hiboyang opened a new pull request, #45803: URL: https://github.com/apache/spark/pull/45803
### What changes were proposed in this pull request?

Check the `spark.shuffle.readHostLocalDisk` config to determine whether to read a shuffle block from the same local host machine.

### Why are the changes needed?

Spark has a shuffle optimization that checks whether a shuffle block resides on the same host machine and, if so, reads it directly from local disk. This host check is done by comparing the block's IP address with the current host's IP address. That causes a problem when running Spark on Kubernetes, because Kubernetes may reuse a pod IP after an old executor exits and a new executor starts. Consider the following sequence:

1. Executor 1 starts with IP address 10.0.0.1.
2. A shuffle block (e.g. block1) is written on Executor 1.
3. Executor 1 terminates.
4. Executor 2 starts with the same IP address 10.0.0.1 (this is rare, but did happen in our tests, because Kubernetes may reuse IPs when launching pods).
5. Executor 2 tries to read block1. It finds that block1's address matches its own host address, so it assumes block1 exists on its local disk.
6. Executor 2 reads from local disk and gets an error, since block1 is not there (block1 was on Executor 1, which is gone).

There is already a Spark config for this (`spark.shuffle.readHostLocalDisk`). We can reuse this config and check it in `BlockStoreShuffleReader`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually tested.

### Was this patch authored or co-authored using generative AI tooling?

No.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
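To illustrate the failure mode and the proposed guard, here is a minimal, self-contained Scala sketch. It is not the actual Spark internals: `HostLocalReadCheck` and `shouldReadFromLocalDisk` are hypothetical names, and the sketch only models the decision logic (an IP match alone is not proof the block is on local disk, because Kubernetes can reuse pod IPs; the config gates the optimization).

```scala
// Hypothetical sketch of the host-local read decision (illustrative names,
// not Spark's real API). The real change checks the existing config
// spark.shuffle.readHostLocalDisk in BlockStoreShuffleReader.
object HostLocalReadCheck {

  /** Decide whether a shuffle block may be read from the local disk.
    *
    * @param blockHost   IP address recorded for the block's location
    * @param currentHost IP address of the executor doing the read
    * @param readHostLocalDiskEnabled value of spark.shuffle.readHostLocalDisk
    */
  def shouldReadFromLocalDisk(
      blockHost: String,
      currentHost: String,
      readHostLocalDiskEnabled: Boolean): Boolean = {
    // Only treat a block as host-local when the feature is enabled.
    // An IP match by itself is unsafe: a new executor may have inherited
    // the IP of a terminated executor that actually wrote the block.
    readHostLocalDiskEnabled && blockHost == currentHost
  }

  def main(args: Array[String]): Unit = {
    // Executor 2 reuses IP 10.0.0.1. With the config disabled, the block
    // is fetched remotely instead of being (wrongly) read from local disk.
    println(shouldReadFromLocalDisk("10.0.0.1", "10.0.0.1",
      readHostLocalDiskEnabled = false)) // false: fetch remotely
    println(shouldReadFromLocalDisk("10.0.0.1", "10.0.0.1",
      readHostLocalDiskEnabled = true))  // true: local-disk read allowed
  }
}
```

With the config disabled, the reader falls back to the normal remote fetch path, which is correct even when the writer's pod IP has been recycled.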
