Jonny Serencsa created SPARK-22724:
--------------------------------------

             Summary: TakeOrderedAndProjectExec operator has poor performance 
when sorting on low cardinality keys
                 Key: SPARK-22724
                 URL: https://issues.apache.org/jira/browse/SPARK-22724
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.0.2
            Reporter: Jonny Serencsa


The com.google.guava.collect.Ordering implementation used by current versions 
of spark (including 2.0.2 which I use) has a performance issue when performing 
TopK operations using sort keys with relatively low cardinalities. 

For example, when performing a top-k for 1M rows of randomly chosen integers 
between 0-9 we have the following (approximate) number of compare operations 
for various k:
k, # compares
1000, 1.6E6
2000, 3.2E6
4000, 2.6E7
8000, 1E9
16000, 4E9
32000, 16E9

While the distribution isn't perfectly O(K^2), it's pretty close. 

Seems like guava has addressed this problem in their latest version. 

Here is the code for the debug script I used for the experiment.
{noformat}
import scala.collection.JavaConverters._
import com.google.common.collect.{Ordering => GuavaOrdering}
import scala.util.Random

object SortTest {

  class OrderingWithCounter extends GuavaOrdering[Int] {

    var counter = 0L

    override def compare(t: Int, t1: Int): Int = {
      counter += 1
      t.compare(t1)
    }

  }

  def run(size: Int, limit: Int, card: Int): Unit = {
    val r = new Random(13)
    val e = 0 until size map(a => r.nextInt(card))

    least(e, limit)
  }

  def least(e: Seq[Int], limit:Int): Unit = {
    val o = new OrderingWithCounter
    o.leastOf(e.asJava, limit)
    println(f"${e.size},${e.distinct.size},$limit,${o.counter}")
  }

  /**
    *
    * Output:
Limit test
1000000,10,1000,1557236
1000000,10,2000,3205997
1000000,10,4000,26011722
1000000,10,8000,102248988
1000000,10,16000,407248708
1000000,10,32000,1623467135
    *
    */
  def main(args: Array[String]) {
    val card = 10
    val size = 1e6.toInt

    println("Limit test")
    Seq(1000, 2000, 4000, 8000, 16000, 32000).foreach { limit =>
      run(size, limit, card)
    }

  }

}

{noformat}




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to