Hi Sujith, I saw your updated post. It makes sense to me now.
If you use a very large limit, the shuffle before `GlobalLimit` would of course be a performance bottleneck, even though it can eventually move all the limited data to the single partition. Unlike `CollectLimit`, I actually think there is no reason `GlobalLimit` must shuffle all the limited data from all partitions to one single machine as far as query execution is concerned. In other words, I think we can avoid shuffling data in `GlobalLimit`. I have an idea for improving this and may update here later if I can make it work. (A sketch of the plan shape in question follows at the end of this message.)

sujith71955 wrote
> Dear Liang,
>
> Thanks for your valuable feedback.
>
> There was a mistake in the previous post; I corrected it. As you mentioned,
> in `GlobalLimit` we only take the required number of rows from the input
> iterator, which really pulls data from local blocks and remote blocks.
> But if the limit value is very high, e.g. >= 10000000, a shuffle exchange
> happens between `GlobalLimit` and `LocalLimit` to retrieve data from all
> partitions into one partition. Since the limit value is very large, the
> performance bottleneck still exists.
>
> Soon, in the next post, I will publish a test report with sample data and
> also a proposed solution for this problem.
>
> Please let me know of any clarifications or suggestions regarding this
> issue.
>
> Regards,
> Sujith

-----
Liang-Chi Hsieh | @viirya
Spark Technology Center
http://www.spark.tc/
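
For concreteness, here is a minimal sketch of how to observe the plan shape discussed above. It assumes a local Spark 2.x `SparkSession`; the dataset sizes, the `doubled` column name, and the commented plan text are illustrative, and the exact `explain()` output varies across Spark versions.

```scala
import org.apache.spark.sql.SparkSession

object LimitPlanDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("limit-plan-demo")
      .master("local[4]")
      .getOrCreate()

    // A dataset spread across many partitions.
    val df = spark.range(0L, 100000000L, 1L, numPartitions = 200)

    // A limit that is *not* the final operator is planned as
    // LocalLimit -> Exchange SinglePartition -> GlobalLimit,
    // so all limited rows are shuffled into a single task.
    df.limit(10000000).selectExpr("id * 2 as doubled").explain()
    // == Physical Plan == (abbreviated; exact text varies by version)
    // *Project [(id#0L * 2) AS doubled#2L]
    // +- GlobalLimit 10000000
    //    +- Exchange SinglePartition
    //       +- LocalLimit 10000000
    //          +- *Range (0, 100000000, step=1, splits=200)

    // When the limit is the final operator, the planner can use
    // CollectLimit instead.
    df.limit(10000000).explain()
    // CollectLimit 10000000
    // +- *Range (0, 100000000, step=1, splits=200)

    spark.stop()
  }
}
```

The node to watch for is `Exchange SinglePartition`: when a query is collected, `CollectLimit` can fetch just enough rows to the driver, whereas `GlobalLimit` first materializes all the limited rows in one task, which is the bottleneck this thread is about.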