Actually, when you use Dataset's union API, unlike RDD's union API, the nested structure is broken up (flattened into a single Union with many children). So that should not be the issue.
The additional time introduced as the number of DataFrames grows is spent in the analysis stage. My thinking is that because the Union has a long children list, the analyzer needs more time to traverse the tree.

When the Dataset for Union(Range1, Range2) is created, the analyzer goes through 2 Range plans. When the next union happens, i.e., Union(Range1, Range2, Range3), the analyzer goes through 3 Range plans, including the first 2 again. Those two Range plans were already analyzed before, but the analyzer still goes through them.

So for a Union built up from 5 Range logical plans, the analyzer goes through 2 + 3 + 4 + 5 = 14 Range plans in total. If you increase this to 10 Range plans, it becomes 2 + 3 + 4 + ... + 10 = 54. For a Union of 100 Range plans, 5049 Range plans need to be traversed; for 200, it becomes 20099. You can see the relation is quadratic, not linear.

-----
Liang-Chi Hsieh | @viirya
Spark Technology Center
http://www.spark.tc/

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/repeated-unioning-of-dataframes-take-worse-than-O-N-2-time-tp20394p20408.html
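The growth described above can be checked with a short sketch (plain Python, not Spark code — just the counting argument): after the k-th union the plan is Union(Range1, ..., Range_k), so the analyzer walks k Range nodes, and the total over k = 2..n is the sum of 2 through n.

```python
def ranges_traversed(n):
    """Total Range plans the analyzer walks while unioning n Range plans
    one at a time, per the argument above."""
    return sum(range(2, n + 1))  # equals n*(n+1)//2 - 1, i.e. O(n^2)

for n in (5, 10, 100, 200):
    print(n, ranges_traversed(n))
# prints: 5 14, 10 54, 100 5049, 200 20099 -- matching the figures above
```

This reproduces the 14, 54, 5049, and 20099 figures and makes the quadratic shape explicit.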