Luc Bourlier created SPARK-14849:
------------------------------------
Summary: shuffle broken when accessing standalone cluster through NAT
Key: SPARK-14849
URL: https://issues.apache.org/jira/browse/SPARK-14849
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.6.1
Reporter: Luc Bourlier
I have the following network configuration:
{code}
+--------------------+
|                    |
|    spark-shell     |
|                    |
+- ip: 10.110.101.2 -+
          |
          |
+- ip: 10.110.101.1 -+
|                    | NAT + routing
|    spark-master    | configured
|                    |
+- ip: 10.110.100.1 -+
          |
          +------------------------+
          |                        |
+- ip: 10.110.101.2 -+   +- ip: 10.110.101.3 -+
|                    |   |                    |
|   spark-worker 1   |   |   spark-worker 2   |
|                    |   |                    |
+--------------------+   +--------------------+
{code}
I have NAT, DNS and routing correctly configured so that every machine can
communicate with every other machine.
Launching spark-shell against the cluster works well. Simple map operations work
too:
{code}
scala> sc.makeRDD(1 to 5).map(_ * 5).collect
res0: Array[Int] = Array(5, 10, 15, 20, 25)
{code}
But operations requiring shuffling fail:
{code}
scala> sc.makeRDD(1 to 5).map(i => (i,1)).reduceByKey(_ + _).collect
16/04/22 15:33:17 WARN TaskSetManager: Lost task 4.0 in stage 2.0 (TID 19, 10.110.101.1): FetchFailed(BlockManagerId(0, 10.110.101.1, 42842), shuffleId=0, mapId=6, reduceId=4, message=
org.apache.spark.shuffle.FetchFailedException: Failed to connect to /10.110.101.1:42842
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323)
    [ ... ]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to connect to /10.110.101.1:42842
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
    [ ... ]
    at org.apache.spark.network.shuffle.RetryingBlockFetcher.access
    [ ... ]
{code}
It makes sense that a connection to 10.110.101.1:42842 would fail: no part of
the system should have direct knowledge of the IP address 10.110.101.1.
So some part of the system is wrongly discovering this IP address.
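One way to narrow down where the address comes from might be to list the block manager endpoints the driver knows about and flag any whose host is not expected. The following is only a minimal sketch: `unexpectedEndpoints` is a hypothetical helper, not part of Spark, and the set of expected hosts is an assumption to be filled in for the actual network.

```scala
import scala.util.Try

// Hypothetical helper (not part of Spark): parse "host:port" strings and
// return the endpoints whose host is NOT in the expected set.
// Strings without a numeric port are silently skipped.
def unexpectedEndpoints(endpoints: Iterable[String],
                        expectedHosts: Set[String]): Seq[(String, Int)] =
  endpoints.toSeq.flatMap { e =>
    e.lastIndexOf(':') match {
      case -1 => None
      case i  => Try(e.substring(i + 1).toInt).toOption.map(p => (e.substring(0, i), p))
    }
  }.filterNot { case (host, _) => expectedHosts.contains(host) }
```

In spark-shell this could be fed with `sc.getExecutorMemoryStatus.keys`, which returns the known block managers as "host:port" strings (the driver's own block manager is included), for example `unexpectedEndpoints(sc.getExecutorMemoryStatus.keys, Set("10.110.101.3"))`. The expected host in that call is only illustrative.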