Iterative union like this creates a deeply nested recursive structure in a similar manner to described here http://stackoverflow.com/q/34461804
You can try something like this http://stackoverflow.com/a/37612978 but there is of course on overhead of conversion between Dataset and RDD. On 12/29/2016 06:21 PM, assaf.mendelson wrote: > > Hi, > > > > I have been playing around with doing union between a large number of > dataframes and saw that the performance of the actual union (not the > action) is worse than O(N^2). Since a union basically defines a > lineage (i.e. current + union with of other as a child) this should be > almost instantaneous, however in practice this can be very costly. > > > > I was wondering why this is and if there is a way to fix this. > > > > A sample test: > > *def *testUnion(n: Int): Long = { > *val *dataframes = *for *{ > x <- 0 until n > } *yield */spark/.range(1000) > > *val *t0 = System./currentTimeMillis/() > *val *allDF = dataframes.reduceLeft(_.union(_)) > *val *t1 = System./currentTimeMillis/() > *val *totalTime = t1 - t0 > /println/(*s"**$*totalTime*miliseconds"*) > totalTime > } > > > > scala> testUnion(100) > > 193 miliseconds > > res5: Long = 193 > > > > scala> testUnion(200) > > 759 miliseconds > > res1: Long = 759 > > > > scala> testUnion(500) > > 4438 miliseconds > > res2: Long = 4438 > > > > scala> testUnion(1000) > > 18441 miliseconds > > res6: Long = 18441 > > > > scala> testUnion(2000) > > 88498 miliseconds > > res7: Long = 88498 > > > > scala> testUnion(5000) > > 822305 miliseconds > > res8: Long = 822305 > > > > > > > ------------------------------------------------------------------------ > View this message in context: repeated unioning of dataframes take > worse than O(N^2) time > <http://apache-spark-developers-list.1001551.n3.nabble.com/repeated-unioning-of-dataframes-take-worse-than-O-N-2-time-tp20394.html> > Sent from the Apache Spark Developers List mailing list archive > <http://apache-spark-developers-list.1001551.n3.nabble.com/> at > Nabble.com. -- Maciej Szymkiewicz
