[
https://issues.apache.org/jira/browse/SPARK-14849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15253968#comment-15253968
]
Luc Bourlier commented on SPARK-14849:
--------------------------------------
I have dug into the problem. It occurs during the registration of the
executor with the driver (spark-shell).
Because it is running in standalone mode (I assume), the executor doesn't have
its own Netty instance, but uses the worker's. So in
[NettyRpcEnv|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala#L125]
the address is set to {{null}}.
This information (or lack thereof) is sent to the driver in the registration
message. The driver (in
[CoarseGrainedSchedulerBackend|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L158]),
seeing that the information is missing, attempts to guess the IP address of the
worker by looking at the TCP/IP connection that is up, and in this case picks
the external IP address of the machine acting as the router.
This bad information is then sent back to the worker in the registration
confirmation message, and used by the worker as its 'external' IP address.
Later in the execution, the worker needs to share information about the
blocks it holds, and uses the bad IP address in the BlockManagerIds. These
BlockManagerIds are then unusable by the rest of the system.
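In outline, the problematic fallback looks like this (a minimal illustrative sketch, not the actual Spark code; {{RpcAddress}} and {{resolveExecutorAddress}} here are invented stand-ins for the real registration path):

```scala
// Illustrative sketch only -- not the real Spark code.
// Models the driver-side fallback: when the registering executor reports
// no address (its NettyRpcEnv address is null), the driver substitutes the
// remote end of the TCP connection. Behind NAT, that peer is the router's
// external IP, not an address the worker is actually reachable at.
case class RpcAddress(host: String, port: Int)

def resolveExecutorAddress(
    reported: Option[RpcAddress], // address carried in the registration message
    tcpPeer: RpcAddress           // other side of the live TCP connection
): RpcAddress =
  reported.getOrElse(tcpPeer)     // the guess that goes wrong behind NAT

// The executor behind NAT reports nothing, so the driver records the
// router's IP, which later ends up in the worker's BlockManagerId.
val guessed = resolveExecutorAddress(None, RpcAddress("10.110.101.1", 42842))
println(guessed.host) // prints the router's IP: 10.110.101.1
```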
I'll push a PR with a fix shortly. The executor should always send its 'public'
address, and the driver should not try to derive an address just by looking at
the other side of a TCP connection; that can easily be wrong.
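The direction of the fix can be sketched as follows (again illustrative, with invented names; the real change lives in Spark's registration code, not in a helper like this):

```scala
// Illustrative sketch of the proposed behaviour, with invented names.
// The executor always advertises a concrete address: when it has no Netty
// server of its own, it falls back to the host it was configured/launched
// with, instead of sending nothing and letting the driver guess from the
// TCP connection.
case class RpcAddress(host: String, port: Int)

def advertisedAddress(
    nettyAddress: Option[RpcAddress], // None when sharing the worker's Netty instance
    configuredHost: String,           // hostname the worker/executor was started with
    configuredPort: Int
): RpcAddress =
  nettyAddress.getOrElse(RpcAddress(configuredHost, configuredPort))

// With the fix, the registration message always carries a usable address,
// so the driver never has to inspect the connection's peer.
val addr = advertisedAddress(None, "10.110.100.2", 7078)
println(addr.host + ":" + addr.port) // 10.110.100.2:7078
```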
> shuffle broken when accessing standalone cluster through NAT
> ------------------------------------------------------------
>
> Key: SPARK-14849
> URL: https://issues.apache.org/jira/browse/SPARK-14849
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.6.1
> Reporter: Luc Bourlier
> Labels: nat, network
>
> I have the following network configuration:
> {code}
>             +--------------------+
>             |                    |
>             |    spark-shell     |
>             |                    |
>             +- ip: 10.110.101.2 -+
>                       |
>                       |
>             +- ip: 10.110.101.1 -+
>             |                    |  NAT + routing
>             |    spark-master    |  configured
>             |                    |
>             +- ip: 10.110.100.1 -+
>                       |
>           +------------------------+
>           |                        |
> +- ip: 10.110.100.2 -+   +- ip: 10.110.100.3 -+
> |                    |   |                    |
> |   spark-worker 1   |   |   spark-worker 2   |
> |                    |   |                    |
> +--------------------+   +--------------------+
> {code}
> I have NAT, DNS and routing correctly configured, so that each machine can
> communicate with the others.
> Launching spark-shell against the cluster works well. Simple map operations
> work too:
> {code}
> scala> sc.makeRDD(1 to 5).map(_ * 5).collect
> res0: Array[Int] = Array(5, 10, 15, 20, 25)
> {code}
> But operations requiring shuffling fail:
> {code}
> scala> sc.makeRDD(1 to 5).map(i => (i,1)).reduceByKey(_ + _).collect
> 16/04/22 15:33:17 WARN TaskSetManager: Lost task 4.0 in stage 2.0 (TID 19, 10.110.101.1): FetchFailed(BlockManagerId(0, 10.110.101.1, 42842), shuffleId=0, mapId=6, reduceId=4, message=
> org.apache.spark.shuffle.FetchFailedException: Failed to connect to /10.110.101.1:42842
>     at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323)
> [ ... ]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to connect to /10.110.101.1:42842
>     at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
> [ ... ]
>     at org.apache.spark.network.shuffle.RetryingBlockFetcher.access
> [ ... ]
> {code}
> It makes sense that a connection to 10.110.101.1:42842 would fail; no part of
> the system should have direct knowledge of the IP address 10.110.101.1.
> So some part of the system is wrongly discovering this IP address.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)