[
https://issues.apache.org/jira/browse/SPARK-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043730#comment-14043730
]
Bharath Ravi Kumar edited comment on SPARK-1112 at 6/25/14 4:51 PM:
--------------------------------------------------------------------
Can a clear workaround be specified for this bug, please? For those unable to
upgrade to 1.0.1 or 1.1.0 in production, general instructions on the
workaround are needed; otherwise this is a huge blocker for current production
deployments (even on 1.0.0). For instance, running saveAsTextFile() on an RDD
(~400MB) causes execution to freeze, with the last log statements seen on the
driver being:
14/06/25 16:38:55 INFO spark.SparkContext: Starting job: saveAsTextFile at Test.java:99
14/06/25 16:38:55 INFO scheduler.DAGScheduler: Got job 6 (saveAsTextFile at Test.java:99) with 2 output partitions (allowLocal=false)
14/06/25 16:38:55 INFO scheduler.DAGScheduler: Final stage: Stage 6(saveAsTextFile at Test.java:99)
14/06/25 16:38:55 INFO scheduler.DAGScheduler: Parents of final stage: List()
14/06/25 16:38:55 INFO scheduler.DAGScheduler: Missing parents: List()
14/06/25 16:38:55 INFO scheduler.DAGScheduler: Submitting Stage 6 (MappedRDD[558] at saveAsTextFile at Test.java:99), which has no missing parents
14/06/25 16:38:55 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from Stage 6 (MappedRDD[558] at saveAsTextFile at Test.java:99)
14/06/25 16:38:55 INFO scheduler.TaskSchedulerImpl: Adding task set 6.0 with 2 tasks
14/06/25 16:38:55 INFO scheduler.TaskSetManager: Starting task 6.0:0 as TID 5 on executor 1: somehost.corp (PROCESS_LOCAL)
14/06/25 16:38:55 INFO scheduler.TaskSetManager: Serialized task 6.0:0 as 351777 bytes in 36 ms
14/06/25 16:38:55 INFO scheduler.TaskSetManager: Starting task 6.0:1 as TID 6 on executor 0: someotherhost.corp (PROCESS_LOCAL)
14/06/25 16:38:55 INFO scheduler.TaskSetManager: Serialized task 6.0:1 as 186453 bytes in 16 ms
The test setup for reproducing this issue has two slaves (24G each) running
Spark standalone. The driver runs with -Xmx4g.
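For reference, here is a minimal sketch of the kind of driver code that hits
this. The class name matches the log above, but the app name, paths, and
frameSize value are placeholders, not the actual job:
{code:java}
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class Test {
    public static void main(String[] args) {
        // frameSize raised above the 10 (MB) default; per this ticket, task
        // results between 10MiB and the frameSize then hang the job.
        SparkConf conf = new SparkConf()
                .setAppName("saveAsTextFileRepro")
                .set("spark.akka.frameSize", "50"); // placeholder value > 10
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("hdfs:///path/to/input"); // ~400MB
        lines.saveAsTextFile("hdfs:///path/to/output"); // execution freezes here
        sc.stop();
    }
}
{code}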
Thanks.
> When spark.akka.frameSize > 10, task results bigger than 10MiB block execution
> ------------------------------------------------------------------------------
>
> Key: SPARK-1112
> URL: https://issues.apache.org/jira/browse/SPARK-1112
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 0.9.0, 1.0.0
> Reporter: Guillaume Pitel
> Assignee: Xiangrui Meng
> Priority: Blocker
> Fix For: 1.0.1, 1.1.0
>
>
> When I set spark.akka.frameSize to something over 10, the messages sent
> from the executors to the driver completely block execution if the message
> is bigger than 10MiB and smaller than the frameSize (if it's above the
> frameSize, it's OK).
> The workaround is to set spark.akka.frameSize to 10 (a sketch follows
> below). In that case, since 0.8.1, the BlockManager deals with the data to
> be sent; it seems slower than Akka direct messaging, though.
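> For illustration, a minimal, hypothetical sketch of that workaround in a
> Java driver (the class and app names are placeholders; the property key
> and value are the ones quoted above):
> {code:java}
> import org.apache.spark.SparkConf;
> import org.apache.spark.api.java.JavaSparkContext;
>
> public class FrameSizeWorkaround { // placeholder name
>     public static void main(String[] args) {
>         // Pin the frame size to the 10 (MB) default so task results larger
>         // than 10MiB travel via the BlockManager rather than Akka messages.
>         SparkConf conf = new SparkConf()
>                 .setAppName("frameSizeWorkaround")
>                 .set("spark.akka.frameSize", "10");
>         JavaSparkContext sc = new JavaSparkContext(conf);
>         // ... run the job as usual ...
>         sc.stop();
>     }
> }
> {code}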
> The configuration seems to be correctly read (see actorSystemConfig.txt),
> so I don't see where the 10MiB limit could come from.