Jonny Serencsa created SPARK-22724:
--------------------------------------
Summary: TakeOrderedAndProjectExec operator has poor performance
when sorting on low cardinality keys
Key: SPARK-22724
URL: https://issues.apache.org/jira/browse/SPARK-22724
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.0.2
Reporter: Jonny Serencsa
The com.google.guava.collect.Ordering implementation used by current versions
of spark (including 2.0.2 which I use) has a performance issue when performing
TopK operations using sort keys with relatively low cardinalities.
For example, when performing a top-k for 1M rows of randomly chosen integers
between 0-9 we have the following (approximate) number of compare operations
for various k:
k, # compares
1000, 1.6E6
2000, 3.2E6
4000, 2.6E7
8000, 1E9
16000, 4E9
32000, 16E9
While the distribution isn't perfectly O(K^2), it's pretty close.
Seems like guava has addressed this problem in their latest version.
Here is the code for the debug script I used for the experiment.
{noformat}
import scala.collection.JavaConverters._
import com.google.common.collect.{Ordering => GuavaOrdering}
import scala.util.Random
object SortTest {
class OrderingWithCounter extends GuavaOrdering[Int] {
var counter = 0L
override def compare(t: Int, t1: Int): Int = {
counter += 1
t.compare(t1)
}
}
def run(size: Int, limit: Int, card: Int): Unit = {
val r = new Random(13)
val e = 0 until size map(a => r.nextInt(card))
least(e, limit)
}
def least(e: Seq[Int], limit:Int): Unit = {
val o = new OrderingWithCounter
o.leastOf(e.asJava, limit)
println(f"${e.size},${e.distinct.size},$limit,${o.counter}")
}
/**
*
* Output:
Limit test
1000000,10,1000,1557236
1000000,10,2000,3205997
1000000,10,4000,26011722
1000000,10,8000,102248988
1000000,10,16000,407248708
1000000,10,32000,1623467135
*
*/
def main(args: Array[String]) {
val card = 10
val size = 1e6.toInt
println("Limit test")
Seq(1000, 2000, 4000, 8000, 16000, 32000).foreach { limit =>
run(size, limit, card)
}
}
}
{noformat}
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]