Actually, when you use Dataset's union API, unlike RDD's union API, it breaks
up (flattens) the nested structure into a single Union node. So nesting itself
should not be the issue.

The additional time introduced as the number of dataframes grows is spent in
the analysis stage. My thinking is that because the Union has a long children
list, the analyzer needs more time to traverse the tree.

When the dataset of Union(Range1, Range2) is created, the Analyzer needs to
go through 2 Range(s). When the next union happens, i.e., Union(Range1,
Range2, Range3), the Analyzer needs to go through 3 Range(s), including the
first 2 Range(s) again. Those two Range plans have already been analyzed, but
the Analyzer still goes through them.

If you build up a Union of 5 Range logical plans this way, the Analyzer goes through:

2 + 3 + 4 + 5 = 14 Range(s) under the Union

When you increase the Range plans to 10, it becomes:

2 + 3 + 4 + 5 + ... + 10 = 54 Range(s)

So for a Union of 100 Range plans, there are 5049 Range(s) to go through. For
200 Range plans, it becomes 20099.

You can see it is not a linear relation; the total grows quadratically with
the number of plans.
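The totals above can be checked with a short sketch (plain Python arithmetic, not Spark code): building the union incrementally makes the analyzer revisit 2 + 3 + ... + N = N(N+1)/2 - 1 Range plans in total. The function name here is just for illustration.

```python
def analyzer_range_visits(num_plans: int) -> int:
    """Total Range nodes the analyzer traverses when a Union of
    num_plans Range plans is built by repeated incremental unions:
    the k-th union re-analyzes all k children, so the total is
    2 + 3 + ... + num_plans = num_plans * (num_plans + 1) // 2 - 1."""
    return sum(range(2, num_plans + 1))

for n in (5, 10, 100, 200):
    print(n, analyzer_range_visits(n))
```

This reproduces the counts in the text: 14 for 5 plans, 54 for 10, 5049 for 100, and 20099 for 200.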





-----
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 
