[ https://issues.apache.org/jira/browse/SPARK-14849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15510745#comment-15510745 ]
Shixiong Zhu commented on SPARK-14849:
--------------------------------------

[~skyluc] do you still see the error in Spark 2.0.0?

> shuffle broken when accessing standalone cluster through NAT
> ------------------------------------------------------------
>
>                 Key: SPARK-14849
>                 URL: https://issues.apache.org/jira/browse/SPARK-14849
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.1
>            Reporter: Luc Bourlier
>              Labels: nat, network
>
> I have the following network configuration:
> {code}
>             +--------------------+
>             |                    |
>             |    spark-shell     |
>             |                    |
>             +- ip: 10.110.101.2 -+
>                       |
>                       |
>             +- ip: 10.110.101.1 -+
>             |                    | NAT + routing
>             |    spark-master    | configured
>             |                    |
>             +- ip: 10.110.100.1 -+
>                       |
>           +------------------------+
>           |                        |
> +- ip: 10.110.101.2 -+ +- ip: 10.110.101.3 -+
> |                    | |                    |
> |   spark-worker 1   | |   spark-worker 2   |
> |                    | |                    |
> +--------------------+ +--------------------+
> {code}
> I have NAT, DNS and routing correctly configured so that each machine can
> communicate with every other.
> Launching spark-shell against the cluster works well. Simple map operations
> work too:
> {code}
> scala> sc.makeRDD(1 to 5).map(_ * 5).collect
> res0: Array[Int] = Array(5, 10, 15, 20, 25)
> {code}
> But operations requiring shuffling fail:
> {code}
> scala> sc.makeRDD(1 to 5).map(i => (i,1)).reduceByKey(_ + _).collect
> 16/04/22 15:33:17 WARN TaskSetManager: Lost task 4.0 in stage 2.0 (TID 19, 10.110.101.1): FetchFailed(BlockManagerId(0, 10.110.101.1, 42842), shuffleId=0, mapId=6, reduceId=4, message=
> org.apache.spark.shuffle.FetchFailedException: Failed to connect to /10.110.101.1:42842
>       at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323)
>       [ ... ]
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>       at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to connect to /10.110.101.1:42842
>       at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
>       [ ... ]
>       at org.apache.spark.network.shuffle.RetryingBlockFetcher.access
>       [ ... ]
> {code}
> It makes sense that a connection to 10.110.101.1:42842 would fail: no part of
> the system should have direct knowledge of the IP address 10.110.101.1.
> So some part of the system is wrongly discovering this IP address.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)