Hi,
I have been experimenting with unioning a large number of DataFrames and noticed
that the cost of building the union itself (not executing any action) grows
worse than O(N^2). Since a union essentially just extends the lineage (i.e. a
new plan node with the current DataFrame and the other one as children), I would
expect it to be almost instantaneous, but in practice it is very costly.
I was wondering why this is and whether there is a way to avoid it.
A sample test:
A sample test:
def testUnion(n: Int): Long = {
  // Build n small DataFrames; no action is ever triggered.
  val dataframes = for {
    _ <- 0 until n
  } yield spark.range(1000)

  // Time only the construction of the unioned plan.
  val t0 = System.currentTimeMillis()
  val allDF = dataframes.reduceLeft(_.union(_))
  val t1 = System.currentTimeMillis()

  val totalTime = t1 - t0
  println(s"$totalTime milliseconds")
  totalTime
}
scala> testUnion(100)
193 milliseconds
res5: Long = 193
scala> testUnion(200)
759 milliseconds
res1: Long = 759
scala> testUnion(500)
4438 milliseconds
res2: Long = 4438
scala> testUnion(1000)
18441 milliseconds
res6: Long = 18441
scala> testUnion(2000)
88498 milliseconds
res7: Long = 88498
scala> testUnion(5000)
822305 milliseconds
res8: Long = 822305