
Frank Rosner updated SPARK-17933:
---------------------------------
    Description: 
h4. Problem

When I run a job that requires some shuffle, some tasks fail because the 
executor cannot fetch the shuffle blocks from another executor.

{noformat}
org.apache.spark.shuffle.FetchFailedException: Failed to connect to 10-250-20-140:44042
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323)
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:300)
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:51)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:504)
        at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.<init>(TungstenAggregationIterator.scala:686)
        at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95)
        at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to connect to 10-250-20-140:44042
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
        at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:90)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        ... 3 more
Caused by: java.nio.channels.UnresolvedAddressException
        at sun.nio.ch.Net.checkAddress(Net.java:101)
        at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622)
        at io.netty.channel.socket.nio.NioSocketChannel.doConnect(NioSocketChannel.java:209)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:207)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1097)
        at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:471)
        at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:456)
        at io.netty.channel.ChannelOutboundHandlerAdapter.connect(ChannelOutboundHandlerAdapter.java:47)
        at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:471)
        at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:456)
        at io.netty.channel.ChannelDuplexHandler.connect(ChannelDuplexHandler.java:50)
        at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:471)
        at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:456)
        at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:438)
        at io.netty.channel.DefaultChannelPipeline.connect(DefaultChannelPipeline.java:908)
        at io.netty.channel.AbstractChannel.connect(AbstractChannel.java:203)
        at io.netty.bootstrap.Bootstrap$2.run(Bootstrap.java:166)
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
        ... 1 more
{noformat}

Looking closely, you can see that the executor is trying to connect not via a 
resolvable IP address but via a hostname, which is just the IP address with the 
dots replaced by dashes. Unfortunately, that hostname cannot be resolved from 
any host other than the one it belongs to, so the other executors cannot talk 
to this one.
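
The root cause is visible in the innermost exception: the dashed hostname never 
resolves from the other machines. A minimal sketch reproducing it outside of 
Spark (hostname and port taken from the trace above; it only resolves on the 
affected machine):

{code:scala}
import java.net.InetSocketAddress
import java.nio.channels.SocketChannel

// "10-250-20-140" only resolves on the machine that knows it (e.g. via
// /etc/hosts), so on any other host the address stays unresolved.
val address = new InetSocketAddress("10-250-20-140", 44042)
println(address.isUnresolved) // true on the other executors

// Netty eventually hands the unresolved address to SocketChannel.connect,
// which throws the java.nio.channels.UnresolvedAddressException in the trace.
SocketChannel.open().connect(address)
{code}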

The executor page (two screenshots below) shows what is happening (different 
IP addresses, but the same behaviour).

!screenshot-1.png!

!screenshot-2.png!

Why is the executor advertised by hostname in this particular case? Is this a 
bug or expected behaviour? It only happens when the executor runs on the same 
host as the driver.
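
A possible workaround (an untested sketch, based on the documented 
spark.driver.host setting and the SPARK_LOCAL_IP environment variable) is to 
advertise routable IP addresses explicitly instead of relying on hostname 
resolution:

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

// Workaround sketch: advertise the driver by a routable IP. The executors
// would pick up SPARK_LOCAL_IP from conf/spark-env.sh instead, so only the
// driver side is shown here. The IP below is hypothetical.
val conf = new SparkConf()
  .setAppName("shuffle-host-workaround")
  .set("spark.driver.host", "10.250.20.140") // the driver's routable IP

val sc = new SparkContext(conf)
{code}

This does not answer whether the hostname advertisement is intended, but it 
should at least make the fetch target resolvable in the meantime.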

> Shuffle fails when driver is on one of the same machines as executor
> --------------------------------------------------------------------
>
>                 Key: SPARK-17933
>                 URL: https://issues.apache.org/jira/browse/SPARK-17933
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle
>    Affects Versions: 1.6.2
>            Reporter: Frank Rosner
>         Attachments: screenshot-1.png, screenshot-2.png
>


