[
https://issues.apache.org/jira/browse/SPARK-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Josh Rosen resolved SPARK-2202.
-------------------------------
Resolution: Cannot Reproduce
I'm going to resolve this as "Cannot Reproduce" since it's really old. Please
re-open or file a new issue if you're still observing this problem in newer
Spark versions.
> saveAsTextFile hangs on final 2 tasks
> -------------------------------------
>
> Key: SPARK-2202
> URL: https://issues.apache.org/jira/browse/SPARK-2202
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.0.0
> Environment: CentOS 5.7
> 16 nodes, 24 cores per node, 14g RAM per executor
> Reporter: Suren Hiraman
> Attachments: spark_trace.1.txt, spark_trace.2.txt
>
>
> I have a flow that takes in about 10 GB of data and writes out about 10 GB of
> data.
> The final step is saveAsTextFile() to HDFS. This seems to hang on 2 remaining
> tasks, always on the same node.
> It seems that the 2 tasks are waiting for data from a remote task/RDD
> partition.
> After about 2 hours or so, the stuck tasks get a closed connection exception
> and you can see the remote side logging that as well. Log lines are below.
> My custom settings are:
> conf.set("spark.executor.memory", "14g") // TODO make this configurable
>
> // shuffle configs
> conf.set("spark.default.parallelism", "320")
> conf.set("spark.shuffle.file.buffer.kb", "200")
> conf.set("spark.reducer.maxMbInFlight", "96")
>
> conf.set("spark.rdd.compress", "true")
>
> conf.set("spark.worker.timeout", "180")
>
> // akka settings
> conf.set("spark.akka.threads", "300")
> conf.set("spark.akka.timeout", "180")
> conf.set("spark.akka.frameSize", "100")
> conf.set("spark.akka.batchSize", "30")
> conf.set("spark.akka.askTimeout", "30")
>
> // block manager
> conf.set("spark.storage.blockManagerTimeoutIntervalMs", "180000")
> conf.set("spark.blockManagerHeartBeatMs", "80000")
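> For context, the settings above can be collected into one minimal, self-contained
> driver sketch. Only the conf.set calls are taken from this report (the property
> names are Spark 1.0-era and several are deprecated in later releases); the app
> name, HDFS paths, and the identity map step are hypothetical placeholders, since
> the reporter's actual flow is not shown:
>
> import org.apache.spark.{SparkConf, SparkContext}
>
> // Hypothetical repro sketch; paths and the map step are placeholders.
> object SaveAsTextFileHang {
>   def main(args: Array[String]): Unit = {
>     val conf = new SparkConf()
>       .setAppName("saveAsTextFile-hang")          // placeholder name
>       .set("spark.executor.memory", "14g")
>       .set("spark.default.parallelism", "320")
>       .set("spark.shuffle.file.buffer.kb", "200")
>       .set("spark.reducer.maxMbInFlight", "96")
>       .set("spark.rdd.compress", "true")
>       .set("spark.worker.timeout", "180")
>       .set("spark.akka.threads", "300")
>       .set("spark.akka.timeout", "180")
>       .set("spark.akka.frameSize", "100")
>       .set("spark.akka.batchSize", "30")
>       .set("spark.akka.askTimeout", "30")
>       .set("spark.storage.blockManagerTimeoutIntervalMs", "180000")
>       .set("spark.blockManagerHeartBeatMs", "80000")
>     val sc = new SparkContext(conf)
>     sc.textFile("hdfs:///path/to/input")          // ~10 GB in (path hypothetical)
>       .map(identity)                              // placeholder transformation
>       .saveAsTextFile("hdfs:///path/to/output")   // ~10 GB out; hangs on final 2 tasks
>     sc.stop()
>   }
> }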
> "STUCK" WORKER
> 14/06/18 19:41:18 WARN network.ReceivingConnection: Error reading from
> connection to ConnectionManagerId(172.16.25.103,57626)
> java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcher.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
> at sun.nio.ch.IOUtil.read(IOUtil.java:224)
> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
> at org.apache.spark.network.ReceivingConnection.read(Connection.scala:496)
> REMOTE WORKER
> 14/06/18 19:41:18 INFO network.ConnectionManager: Removing
> ReceivingConnection to ConnectionManagerId(172.16.25.124,55610)
> 14/06/18 19:41:18 ERROR network.ConnectionManager: Corresponding
> SendingConnectionManagerId not found
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]