[
https://issues.apache.org/jira/browse/SPARK-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14255008#comment-14255008
]
Derrick Burns commented on SPARK-2681:
--------------------------------------
Appears to still happen in 1.1.1:
{noformat}
2014-12-20 22:54:00,574 INFO [connection-manager-thread] network.ConnectionManager (Logging.scala:logInfo(59)) - Key not valid ? sun.nio.ch.SelectionKeyImpl@6045fa68
2014-12-20 22:54:00,574 INFO [handle-read-write-executor-0] network.ConnectionManager (Logging.scala:logInfo(59)) - Removing SendingConnection to ConnectionManagerId(ip-10-89-134-186.us-west-2.compute.internal,49171)
2014-12-20 22:54:00,574 INFO [handle-read-write-executor-2] network.ConnectionManager (Logging.scala:logInfo(59)) - Removing ReceivingConnection to ConnectionManagerId(ip-10-89-134-186.us-west-2.compute.internal,49171)
2014-12-20 22:54:00,575 INFO [sparkDriver-akka.actor.default-dispatcher-14] cluster.YarnClientSchedulerBackend (Logging.scala:logInfo(59)) - Executor 7 disconnected, so removing it
2014-12-20 22:54:00,576 ERROR [handle-read-write-executor-2] network.ConnectionManager (Logging.scala:logError(75)) - Corresponding SendingConnection to ConnectionManagerId(ip-10-89-134-186.us-west-2.compute.internal,49171) not found
2014-12-20 22:54:00,576 ERROR [sparkDriver-akka.actor.default-dispatcher-14] cluster.YarnClientClusterScheduler (Logging.scala:logError(75)) - Lost executor 7 on ip-10-89-134-186.us-west-2.compute.internal: remote Akka client disassociated
2014-12-20 22:54:00,576 INFO [connection-manager-thread] network.ConnectionManager (Logging.scala:logInfo(80)) - key already cancelled ? sun.nio.ch.SelectionKeyImpl@6045fa68
java.nio.channels.CancelledKeyException
        at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:392)
        at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:145)
{noformat}
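For context, the CancelledKeyException above is the standard java.nio race: one thread (here a handle-read-write-executor tearing down a dead connection) cancels a SelectionKey while the selector thread still holds a reference to it, and the selector thread's next touch of that key throws. The following is a minimal standalone sketch of that behavior, not Spark code; the class name and structure are illustrative only:

```java
import java.net.InetSocketAddress;
import java.nio.channels.CancelledKeyException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;

public class CancelledKeyDemo {
    /** Returns true when touching an already-cancelled key throws CancelledKeyException. */
    static boolean reproduces() throws Exception {
        try (Selector selector = Selector.open();
             ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress(0)); // ephemeral port
            server.configureBlocking(false);
            SelectionKey key = server.register(selector, SelectionKey.OP_ACCEPT);

            // A teardown path (e.g. removing a dead connection) cancels the key...
            key.cancel();

            // ...and the selector loop then touches the stale key, as
            // ConnectionManager.run does when it hits the lines logged above.
            try {
                key.interestOps();
                return false;
            } catch (CancelledKeyException expected) {
                return true;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("key already cancelled ? " + reproduces());
    }
}
```

The exception itself is expected under concurrent cancellation; the bug report is about the selector loop not surviving it cleanly, which can leave shuffle-block fetches hanging.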
> Spark can hang when fetching shuffle blocks
> -------------------------------------------
>
> Key: SPARK-2681
> URL: https://issues.apache.org/jira/browse/SPARK-2681
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.0.1
> Reporter: Guoqiang Li
> Priority: Blocker
> Attachments: jstack-26027.log
>
>
> executor log :
> {noformat}
> 14/07/24 22:56:52 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 53628
> 14/07/24 22:56:52 INFO executor.Executor: Running task ID 53628
> 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_3 locally
> 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_18 locally
> 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_16 locally
> 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_19 locally
> 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_20 locally
> 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_21 locally
> 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_22 locally
> 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_3 locally
> 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_18 locally
> 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_16 locally
> 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_19 locally
> 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_20 locally
> 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_21 locally
> 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_22 locally
> 14/07/24 22:56:52 INFO spark.MapOutputTrackerWorker: Updating epoch to 236 and clearing cache
> 14/07/24 22:56:52 INFO spark.CacheManager: Partition rdd_51_83 not found, computing it
> 14/07/24 22:56:52 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 9, fetching them
> 14/07/24 22:56:52 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker actor = Actor[akka.tcp://spark@tuan202:49488/user/MapOutputTracker#-1031481395]
> 14/07/24 22:56:53 INFO spark.MapOutputTrackerWorker: Got the output locations
> 14/07/24 22:56:53 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
> 14/07/24 22:56:53 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1024 non-empty blocks out of 1024 blocks
> 14/07/24 22:56:53 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Started 58 remote fetches in 8 ms
> 14/07/24 22:56:55 INFO storage.MemoryStore: ensureFreeSpace(28728) called with curMem=920109320, maxMem=4322230272
> 14/07/24 22:56:55 INFO storage.MemoryStore: Block rdd_51_83 stored as values to memory (estimated size 28.1 KB, free 3.2 GB)
> 14/07/24 22:56:55 INFO storage.BlockManagerMaster: Updated info of block rdd_51_83
> 14/07/24 22:56:55 INFO spark.CacheManager: Partition rdd_189_83 not found, computing it
> 14/07/24 22:56:55 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 28, fetching them
> 14/07/24 22:56:55 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker actor = Actor[akka.tcp://spark@tuan202:49488/user/MapOutputTracker#-1031481395]
> 14/07/24 22:56:55 INFO spark.MapOutputTrackerWorker: Got the output locations
> 14/07/24 22:56:55 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
> 14/07/24 22:56:55 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1 non-empty blocks out of 1024 blocks
> 14/07/24 22:56:55 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Started 1 remote fetches in 0 ms
> 14/07/24 22:56:55 INFO spark.CacheManager: Partition rdd_50_83 not found, computing it
> 14/07/24 22:56:55 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
> 14/07/24 22:56:55 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1024 non-empty blocks out of 1024 blocks
> 14/07/24 22:56:55 INFO storage.BlockFetcherIterator$BasicBlockFetcherIterator: Started 58 remote fetches in 4 ms
> 14/07/24 22:57:09 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(tuan221,51153)
> 14/07/24 22:57:09 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(tuan221,51153)
> 14/07/24 22:57:09 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(tuan221,51153)
> 14/07/24 23:05:07 INFO network.ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@3dcc1da1
> 14/07/24 23:05:07 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(tuan211,43828)
> 14/07/24 23:05:07 INFO network.ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@3dcc1da1
> java.nio.channels.CancelledKeyException
>         at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:363)
>         at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:116)
> 14/07/24 23:05:07 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(tuan211,43828)
> 14/07/24 23:05:07 ERROR network.ConnectionManager: Corresponding SendingConnectionManagerId not found
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)