Luc Bourlier created SPARK-14849:
------------------------------------

             Summary: shuffle broken when accessing standalone cluster through NAT
                 Key: SPARK-14849
                 URL: https://issues.apache.org/jira/browse/SPARK-14849
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.6.1
            Reporter: Luc Bourlier


I have the following network configuration:

{code}
             +--------------------+
             |                    |
             |  spark-shell       |
             |                    |
             +- ip: 10.110.101.2 -+
                       |
                       |
             +- ip: 10.110.101.1 -+
             |                    | NAT + routing
             |  spark-master      | configured
             |                    |
             +- ip: 10.110.100.1 -+
                       |
          +------------------------+
          |                        |
+- ip: 10.110.101.2 -+    +- ip: 10.110.101.3 -+
|                    |    |                    |
|  spark-worker 1    |    |  spark-worker 2    |
|                    |    |                    |
+--------------------+    +--------------------+
{code}

NAT, DNS and routing are configured correctly, so that every machine can 
communicate with every other machine.

Launching spark-shell against the cluster works well. Simple map operations 
work too:

{code}
scala> sc.makeRDD(1 to 5).map(_ * 5).collect
res0: Array[Int] = Array(5, 10, 15, 20, 25)
{code}

But operations requiring shuffling fail:

{code}
scala> sc.makeRDD(1 to 5).map(i => (i,1)).reduceByKey(_ + _).collect

16/04/22 15:33:17 WARN TaskSetManager: Lost task 4.0 in stage 2.0 (TID 19, 10.110.101.1): FetchFailed(BlockManagerId(0, 10.110.101.1, 42842), shuffleId=0, mapId=6, reduceId=4, message=
org.apache.spark.shuffle.FetchFailedException: Failed to connect to /10.110.101.1:42842
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323)
[ ... ]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to connect to /10.110.101.1:42842
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
[ ... ]
        at org.apache.spark.network.shuffle.RetryingBlockFetcher.access

[ ... ]
{code}

It makes sense that a connection to 10.110.101.1:42842 fails: no part of the 
system should have direct knowledge of the IP address 10.110.101.1.
So some part of the system is wrongly discovering this IP address.
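
A possible way to narrow this down is to list the addresses the block managers 
advertise and flag the ones on an IP that should not be visible. The snippet 
below is only a sketch: addressesOnHosts is a hypothetical helper, not part of 
Spark, and the second sample address is a made-up placeholder. In spark-shell 
the input could be taken from sc.getExecutorMemoryStatus.keys, assuming its 
keys have the form "host:port".

{code}
// Hypothetical helper (not a Spark API): keeps the "host:port" entries
// whose host part is one of the given hosts.
def addressesOnHosts(advertised: Iterable[String], hosts: Set[String]): Seq[String] =
  advertised.toSeq.filter { addr =>
    val idx = addr.lastIndexOf(':')
    // Entries without a port are compared as a bare host name.
    val host = if (idx >= 0) addr.substring(0, idx) else addr
    hosts.contains(host)
  }

// First entry is the address from the FetchFailed message above;
// the second one is a placeholder for illustration only.
val advertised = Seq("10.110.101.1:42842", "10.110.101.3:35000")

// The IP that no part of the system should know about directly.
val flagged = addressesOnHosts(advertised, Set("10.110.101.1"))
flagged.foreach(println)
{code}

Any entry printed by this check would point at the component that registered 
itself with the unexpected IP address.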



