[
https://issues.apache.org/jira/browse/SPARK-9096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean Owen updated SPARK-9096:
-----------------------------
Priority: Minor (was: Major)
Issue Type: Improvement (was: Bug)
I am not sure it is a bug yet. The difference is worth explaining, but we need
to rule out environment factors and learn more about the cause. Can you say
more about why the data is not evenly distributed? It looks like it should be
in your sample.
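One way skew like this can arise is if the keys end up concentrated in a few hash buckets after the shuffle that subtract() performs. Below is a minimal, Spark-free sketch (stdlib only; the class name, key values, and partition count are hypothetical, chosen purely for illustration) that buckets keys the way a default hash partitioner would, showing how a pathological key pattern lands everything in one partition:

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionSkewCheck {
    // Mimics a hash partitioner: non-negative modulus of the key's hashCode.
    static int partitionFor(Object key, int numPartitions) {
        int mod = key.hashCode() % numPartitions;
        return mod < 0 ? mod + numPartitions : mod;
    }

    public static void main(String[] args) {
        int numPartitions = 4;
        int[] counts = new int[numPartitions];

        // Hypothetical worst case: Integer keys spaced by numPartitions
        // all hash to the same bucket (Integer.hashCode() is the value itself).
        List<Integer> keys = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            keys.add(i * numPartitions);
        }
        for (Integer k : keys) {
            counts[partitionFor(k, numPartitions)]++;
        }
        for (int p = 0; p < numPartitions; p++) {
            System.out.println("partition " + p + ": " + counts[p]);
        }
    }
}
```

If the real keys in the reproduction have a structure like this, one partition's task would process nearly all the data, matching the long-running tasks in the 1.4.1 log.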
> Unevenly distributed task loads after using JavaRDD.subtract()
> --------------------------------------------------------------
>
> Key: SPARK-9096
> URL: https://issues.apache.org/jira/browse/SPARK-9096
> Project: Spark
> Issue Type: Improvement
> Components: Java API
> Affects Versions: 1.4.0, 1.4.1
> Reporter: Gisle Ytrestøl
> Priority: Minor
> Attachments: ReproduceBug.java, reproduce.1.3.1.log.gz,
> reproduce.1.4.1.log.gz
>
>
> When using JavaRDD.subtract(), tasks appear to be unevenly distributed in
> subsequent operations on the new JavaRDD created by "subtract". As a result,
> in the following operation on the new JavaRDD, a few tasks process almost
> all the data, and these tasks take a long time to finish.
> I've reproduced this bug in the attached Java file, which I submit with
> spark-submit.
> The logs for 1.3.1 and 1.4.1 are attached. In 1.4.1, we see that a few tasks
> in the count job take a long time:
> 15/07/16 09:13:17 INFO TaskSetManager: Finished task 1459.0 in stage 2.0 (TID
> 4659) in 708 ms on 148.251.190.217 (1597/1600)
> 15/07/16 09:13:17 INFO TaskSetManager: Finished task 1586.0 in stage 2.0 (TID
> 4786) in 772 ms on 148.251.190.217 (1598/1600)
> 15/07/16 09:17:51 INFO TaskSetManager: Finished task 1382.0 in stage 2.0 (TID
> 4582) in 275019 ms on 148.251.190.217 (1599/1600)
> 15/07/16 09:20:02 INFO TaskSetManager: Finished task 1230.0 in stage 2.0 (TID
> 4430) in 407020 ms on 148.251.190.217 (1600/1600)
> 15/07/16 09:20:02 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks
> have all completed, from pool
> 15/07/16 09:20:02 INFO DAGScheduler: ResultStage 2 (count at
> ReproduceBug.java:56) finished in 420.024 s
> 15/07/16 09:20:02 INFO DAGScheduler: Job 0 finished: count at
> ReproduceBug.java:56, took 442.941395 s
> In comparison, all tasks are more or less equal in size when running the
> same application on Spark 1.3.1. Overall, the attached application
> (ReproduceBug.java) takes about 7 minutes on Spark 1.4.1, and completes in
> roughly 30 seconds on Spark 1.3.1.
> Spark 1.4.0 behaves similarly to Spark 1.4.1 with respect to this issue.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]