[ https://issues.apache.org/jira/browse/SPARK-17447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15475607#comment-15475607 ]

WangJianfei edited comment on SPARK-17447 at 9/9/16 2:28 AM:
-------------------------------------------------------------

We can scan the RDD array just once to find the RDD with the maximum number of 
partitions among those whose partitioner is defined.
Compare the current Spark code with my logic below.
The current Spark code:
  def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
    val bySize = (Seq(rdd) ++ others).sortBy(_.partitions.length).reverse
    for (r <- bySize if r.partitioner.isDefined && r.partitioner.get.numPartitions > 0) {
      return r.partitioner.get
    }
    if (rdd.context.conf.contains("spark.default.parallelism")) {
      new HashPartitioner(rdd.context.defaultParallelism)
    } else {
      new HashPartitioner(bySize.head.partitions.length)
    }
  }
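
The sortBy is O(n log n) in the number of RDDs, yet only the largest eligible 
RDD is ever used, so one linear scan gives the same result. As a sketch (the 
name withPartitioner is my own, not committed code), the selection step could 
also be written with filter and maxBy inside the same method:

    val rdds = Seq(rdd) ++ others
    // Keep the RDDs whose partitioner is defined and non-empty, then take the
    // one with the most partitions: linear time instead of a full sort.
    val withPartitioner = rdds.filter(_.partitioner.exists(_.numPartitions > 0))
    if (withPartitioner.nonEmpty) {
      return withPartitioner.maxBy(_.partitions.length).partitioner.get
    }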

The sort is O(n log n) in the number of RDDs, while a single scan is O(n). This is my basic logic:

    // One pass over rdd and others, tracking the largest defined partitioner.
    var maxP = 0
    if (rdd.partitioner.isDefined && rdd.partitioner.get.numPartitions > 0) {
      maxP = rdd.partitioner.get.numPartitions
    }
    for (other <- others) {
      if (other.partitioner.isDefined && other.partitioner.get.numPartitions > maxP) {
        maxP = other.partitioner.get.numPartitions
      }
    }
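
Putting it together, a complete single-pass replacement could look like the 
sketch below. Everything here is my own: the name defaultPartitionerOnePass is 
made up, and I call the public getConf where the real method (inside the 
org.apache.spark package) can use the internal conf. It tracks 
partitions.length as the size measure, to match the sort key of the current 
code, and remembers the maximum partition count for the fallback case:

  import org.apache.spark.{HashPartitioner, Partitioner}
  import org.apache.spark.rdd.RDD

  // Hypothetical one-pass variant of Partitioner.defaultPartitioner (a sketch).
  def defaultPartitionerOnePass(rdd: RDD[_], others: RDD[_]*): Partitioner = {
    var maxPartitions = 0                   // most partitions seen overall, for the fallback
    var chosen: Option[Partitioner] = None  // partitioner of the largest eligible RDD so far
    var chosenSize = 0
    for (r <- Seq(rdd) ++ others) {
      val n = r.partitions.length
      if (n > maxPartitions) maxPartitions = n
      if (r.partitioner.exists(_.numPartitions > 0) && n > chosenSize) {
        chosen = Some(r.partitioner.get)
        chosenSize = n
      }
    }
    chosen.getOrElse {
      if (rdd.context.getConf.contains("spark.default.parallelism")) {
        new HashPartitioner(rdd.context.defaultParallelism)
      } else {
        new HashPartitioner(maxPartitions)
      }
    }
  }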


was (Author: codlife):

This was my earlier basic logic:

    // Collect (partition count, index) for each RDD in `others` with a defined partitioner.
    var seq = List[(Int, Int)]()
    for (i <- others.indices) {
      if (others(i).partitioner.isDefined) {
        seq = seq :+ ((others(i).partitions.length, i))
      }
    }
    println("this is seq:")
    seq.foreach(println)

    var maxP = 0
    if (rdd.partitioner.isDefined && rdd.partitioner.get.numPartitions > 0) {
      maxP = rdd.partitioner.get.numPartitions
    }
    for ((_, i) <- seq) {
      if (others(i).partitioner.isDefined && others(i).partitioner.get.numPartitions > maxP) {
        maxP = others(i).partitioner.get.numPartitions
      }
    }

> performance improvement in Partitioner.DefaultPartitioner 
> ----------------------------------------------------------
>
>                 Key: SPARK-17447
>                 URL: https://issues.apache.org/jira/browse/SPARK-17447
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.0.0
>            Reporter: WangJianfei
>              Labels: performance
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> If there are many RDDs, in some situations the sort will severely hurt 
> performance. Actually we needn't sort the RDDs; we can just scan them once 
> to achieve the same goal.


