[jira] [Commented] (SPARK-2568) RangePartitioner should go through the data only once

Mark Hamstra (JIRA) Fri, 18 Jul 2014 10:20:28 -0700

    [ https://issues.apache.org/jira/browse/SPARK-2568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066553#comment-14066553 ]


Mark Hamstra commented on SPARK-2568:
-------------------------------------

Sure, if they can be cleanly separated -- but there's also interaction with the 
ShuffleManager refactoring.

Do you have some strategy in mind for addressing just SPARK-2568 in isolation?

> RangePartitioner should go through the data only once
> -----------------------------------------------------
>
>                 Key: SPARK-2568
>                 URL: https://issues.apache.org/jira/browse/SPARK-2568
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.0.0
>            Reporter: Reynold Xin
>            Assignee: Xiangrui Meng
>
> As of Spark 1.0, RangePartitioner goes through data twice: once to compute 
> the count and once to do sampling. As a result, to do sortByKey, Spark goes 
> through data 3 times (once to count, once to sample, and once to sort).
> RangePartitioner should go through data only once (remove the count step).
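
For illustration only: one general technique for sampling without a prior count pass is reservoir sampling, which keeps a fixed-size uniform sample while also counting the items it sees. The sketch below is plain Python, not Spark code, and it is not necessarily the approach taken for SPARK-2568. The names `reservoir_sample` and `range_bounds` are hypothetical, and the per-partition merging and weighting a distributed version would need are left out.

```python
import random


def reservoir_sample(items, k, seed=None):
    """Return (sample, count) after a single pass over `items`.

    The sample is a uniform random sample of at most k items; the count is
    the total number of items seen, so no separate counting pass is needed.
    """
    rng = random.Random(seed)
    reservoir = []
    count = 0
    for item in items:
        count += 1
        if len(reservoir) < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / count by overwriting
            # a uniformly chosen slot.
            j = rng.randrange(count)
            if j < k:
                reservoir[j] = item
    return reservoir, count


def range_bounds(sample, num_partitions):
    """Pick num_partitions - 1 split points from a sorted copy of the sample."""
    ordered = sorted(sample)
    if not ordered or num_partitions <= 1:
        return []
    step = len(ordered) / num_partitions
    return [
        ordered[min(int(step * i), len(ordered) - 1)]
        for i in range(1, num_partitions)
    ]


if __name__ == "__main__":
    keys = (random.Random(0).randrange(10_000) for _ in range(1_000))
    sample, count = reservoir_sample(keys, k=20, seed=1)
    print("items seen:", count)
    print("range bounds:", range_bounds(sample, num_partitions=4))
```

With this shape, a sort would need one pass to sample (and count) and one pass to sort, instead of the three passes described in the issue.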



--
This message was sent by Atlassian JIRA
(v6.2#6252)
