[
https://issues.apache.org/jira/browse/SPARK-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500429#comment-14500429
]
Jon Chase commented on SPARK-6962:
----------------------------------
Here's the stderr from the executors at the time of the lockup (there are 3
executors). The lockup happened at 18:26:00, and after 20+ minutes these are
still the most recent logs in executor 1:
15/04/17 18:26:00 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1132
15/04/17 18:26:00 INFO executor.Executor: Running task 110.0 in stage 15.0 (TID 1132)
15/04/17 18:26:00 INFO storage.ShuffleBlockFetcherIterator: Getting 1008 non-empty blocks out of 1008 blocks
15/04/17 18:26:00 INFO storage.ShuffleBlockFetcherIterator: Started 2 remote fetches in 3 ms
15/04/17 18:26:00 INFO executor.Executor: Finished task 107.0 in stage 15.0 (TID 1129). 8325 bytes result sent to driver
15/04/17 18:26:00 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1133
15/04/17 18:26:00 INFO executor.Executor: Running task 111.0 in stage 15.0 (TID 1133)
15/04/17 18:26:00 INFO storage.ShuffleBlockFetcherIterator: Getting 1008 non-empty blocks out of 1008 blocks
15/04/17 18:26:00 INFO storage.ShuffleBlockFetcherIterator: Started 2 remote fetches in 2 ms
Here's executor 2; it doesn't show any activity for about 20 minutes (again,
the lockup happened at ~18:26:00):
15/04/17 18:25:48 INFO storage.ShuffleBlockFetcherIterator: Getting 1008 non-empty blocks out of 1008 blocks
15/04/17 18:25:48 INFO storage.ShuffleBlockFetcherIterator: Started 2 remote fetches in 11 ms
15/04/17 18:25:49 INFO executor.Executor: Finished task 13.0 in stage 15.0 (TID 1035). 12013 bytes result sent to driver
15/04/17 18:25:49 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1068
15/04/17 18:25:49 INFO executor.Executor: Running task 46.0 in stage 15.0 (TID 1068)
15/04/17 18:25:49 INFO storage.ShuffleBlockFetcherIterator: Getting 1008 non-empty blocks out of 1008 blocks
15/04/17 18:25:49 INFO storage.ShuffleBlockFetcherIterator: Started 2 remote fetches in 16 ms
15/04/17 18:41:19 WARN server.TransportChannelHandler: Exception in connection from /10.106.144.109:49697
java.io.IOException: Connection timed out
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
    at sun.nio.ch.IOUtil.read(IOUtil.java:192)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
    at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)
    at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
    at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:225)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
    at java.lang.Thread.run(Thread.java:745)
15/04/17 18:41:27 WARN server.TransportChannelHandler: Exception in connection from /10.106.145.10:38473
java.io.IOException: Connection timed out
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
    at sun.nio.ch.IOUtil.read(IOUtil.java:192)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
    at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)
    at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
    at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:225)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
    at java.lang.Thread.run(Thread.java:745)
Same with executor 3:
15/04/17 18:25:52 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1092
15/04/17 18:25:52 INFO executor.Executor: Running task 70.0 in stage 15.0 (TID 1092)
15/04/17 18:25:52 INFO storage.ShuffleBlockFetcherIterator: Getting 1008 non-empty blocks out of 1008 blocks
15/04/17 18:25:52 INFO storage.ShuffleBlockFetcherIterator: Started 2 remote fetches in 4 ms
15/04/17 18:25:52 INFO executor.Executor: Finished task 56.0 in stage 15.0 (TID 1078). 9791 bytes result sent to driver
15/04/17 18:25:52 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1093
15/04/17 18:25:52 INFO executor.Executor: Running task 71.0 in stage 15.0 (TID 1093)
15/04/17 18:25:52 INFO storage.ShuffleBlockFetcherIterator: Getting 1008 non-empty blocks out of 1008 blocks
15/04/17 18:25:52 INFO storage.ShuffleBlockFetcherIterator: Started 2 remote fetches in 3 ms
15/04/17 18:41:14 WARN server.TransportChannelHandler: Exception in connection from /10.106.143.39:46278
java.io.IOException: Connection timed out
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
    at sun.nio.ch.IOUtil.read(IOUtil.java:192)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
    at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)
    at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
    at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:225)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
    at java.lang.Thread.run(Thread.java:745)
> Netty BlockTransferService hangs in the middle of SQL query
> -----------------------------------------------------------
>
> Key: SPARK-6962
> URL: https://issues.apache.org/jira/browse/SPARK-6962
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.2.0, 1.2.1, 1.3.0
> Reporter: Jon Chase
> Attachments: jstacks.txt
>
>
> Spark SQL queries (though this seems to be a Spark Core issue - I'm just
> using queries in the REPL to surface this, so I mention Spark SQL) hang
> indefinitely under certain (not totally understood) circumstances.
> This is resolved by setting spark.shuffle.blockTransferService=nio, which
> seems to point to Netty as the issue. Netty was made the default for the
> block transport layer in 1.2.0, which is when this issue started. Setting
> the service to nio allows queries to complete normally.
> I do not see this problem when running queries over smaller (~20 5MB files)
> datasets. When I increase the scope to include more data (several hundred
> ~5MB files), the queries will get through several steps but eventually hang
> indefinitely.
> Here's the email chain regarding this issue, including stack traces:
> http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/<cae61spfqt2y7d5vqzomzz2dmr-jx2c2zggcyky40npkjjx4...@mail.gmail.com>
> For context, here's the announcement regarding the block transfer service
> change:
> http://mail-archives.apache.org/mod_mbox/spark-dev/201411.mbox/<cabpqxssl04q+rbltp-d8w+z3atn+g-um6gmdgdnh-hzcvd-...@mail.gmail.com>
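For anyone hitting this, the nio workaround from the description can be applied through any of the standard Spark configuration channels. A minimal sketch (the property name is taken from the report; the surrounding invocation is ordinary Spark 1.2/1.3 configuration, not something specific to this issue):

```shell
# Option 1: cluster-wide, in conf/spark-defaults.conf:
#   spark.shuffle.blockTransferService  nio

# Option 2: per-session, at launch time:
spark-shell --conf spark.shuffle.blockTransferService=nio
```

Either way, the setting must be in place before the SparkContext starts; it cannot be changed on a running context.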
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)