akirillov opened a new pull request #25500: [MESOS] Fixed executors advertised 
address when running in virtual network
URL: https://github.com/apache/spark/pull/25500
 
 
   ### What changes were proposed in this pull request?
   This patch fixes a bug that occurs when jobs involving shuffle are launched 
on Mesos in a virtual network. When `spark.mesos.network.name` is set, the 
Mesos scheduler sets the executor `--hostname` parameter to `0.0.0.0`. Executors 
then advertise `0.0.0.0` as their address, and fetching shuffle blocks between 
executors fails because `0.0.0.0` is not an address other executors can reach. 
In a virtual network the hostname or IP address is not known up front; it is 
assigned to the container when it starts, so the executor process must 
advertise the dynamically assigned address to be reachable by other executors.
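
   To illustrate why `0.0.0.0` cannot work as an advertised address, here is a 
minimal sketch using only the JDK. The object and method names are 
hypothetical and are not part of Spark; the check relies on 
`InetAddress.isAnyLocalAddress`, which reports whether an address is the 
wildcard address.

```scala
import java.net.InetAddress

// Hypothetical helper for illustration only; not part of Spark.
object AdvertisedAddressCheck {
  // The wildcard address is fine for binding a listening socket, but a
  // remote peer cannot use it as a destination to connect back to.
  def isAdvertisable(host: String): Boolean = {
    // For a literal IP string, getByName parses it without a DNS lookup.
    val addr = InetAddress.getByName(host)
    !addr.isAnyLocalAddress
  }
}
```

   Under these assumptions, `AdvertisedAddressCheck.isAdvertisable("0.0.0.0")` 
is `false`, while an address assigned to the container, such as `10.0.0.5`, 
passes the check.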
   
   Changes:
   - added a fallback to `Utils.localHostName()` in Spark Executors when 
`--hostname` is not provided
   - removed setting executor address to `0.0.0.0` from Mesos scheduler
   - refactored the code related to building executor command in Mesos scheduler
   - added network configuration support to Docker containerizer
   - added unit tests
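
   The fallback in the first item can be sketched roughly as follows. This is 
not the code from the patch: `resolveHostname` and the local `localHostName` 
are hypothetical stand-ins (the real executor uses `Utils.localHostName()`), 
and the argument parsing is simplified to a single flag lookup.

```scala
import java.net.InetAddress

// Illustrative sketch only; names and parsing are assumptions.
object HostnameFallbackSketch {
  // Stand-in for Spark's Utils.localHostName(); here it simply asks the JDK
  // for the address of the local host as seen from inside the container.
  def localHostName(): String = InetAddress.getLocalHost.getHostAddress

  // An explicit --hostname value wins; otherwise fall back to the address
  // discovered at start time.
  def resolveHostname(
      args: Seq[String],
      fallback: () => String = () => localHostName()): String = {
    val idx = args.indexOf("--hostname")
    if (idx >= 0 && idx + 1 < args.length) args(idx + 1) else fallback()
  }
}
```

   Passing the fallback as a function keeps the lookup lazy, so the local 
address is only resolved when `--hostname` is actually missing.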
   
   ### Why are the changes needed?
   The bug described above prevents Mesos users from running any job that 
involves shuffle in a virtual network: executors advertise an incorrect 
address, so they cannot fetch shuffle blocks from each other.
   
   ### Does this PR introduce any user-facing change?
   No
   
   ### How was this patch tested?
   - added a unit test to `MesosCoarseGrainedSchedulerBackendSuite` which 
verifies the absence of the `--hostname` parameter when 
`spark.mesos.network.name` is provided and its presence otherwise
   - added a unit test to `MesosSchedulerBackendUtilSuite` which verifies that 
`MesosSchedulerBackendUtil.buildContainerInfo` sets network-related properties 
for the Docker containerizer
   - integration tests from the [DCOS Spark 
repo](https://github.com/mesosphere/spark-build), specifically 
[test_spark_cni.py](https://github.com/mesosphere/spark-build/blob/master/tests/test_spark_cni.py),
 which runs a [shuffle 
job](https://github.com/mesosphere/spark-build/blob/master/tests/jobs/scala/src/main/scala/ShuffleApp.scala)
 and verifies successful job completion, Mesos task network configuration, and 
IP addresses for both the Mesos and Docker containerizers
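
   The behavior checked by the first unit test can be sketched as below. This 
is a simplified illustration, not the scheduler code from the patch: 
`executorArgs` is a hypothetical helper, and the configuration is modelled as a 
plain `Map` rather than a `SparkConf`.

```scala
// Illustrative sketch only; not the actual Mesos scheduler backend code.
object ExecutorCommandSketch {
  // Builds the executor arguments. When a virtual network is configured the
  // --hostname flag is omitted, so the executor resolves its own address.
  def executorArgs(
      conf: Map[String, String],
      agentHostname: String,
      cores: Int): Seq[String] = {
    val hostnameArg =
      if (conf.contains("spark.mesos.network.name")) Seq.empty[String]
      else Seq("--hostname", agentHostname)
    hostnameArg ++ Seq("--cores", cores.toString)
  }
}
```

   With `spark.mesos.network.name` present in the map the result contains no 
`--hostname` entry; without it, the agent hostname is passed through as before.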

