Hi Sujith,

I saw your updated post. It makes sense to me now.

If you use a very large limit, the shuffle before `GlobalLimit` will of
course be a performance bottleneck, even though it does eventually move all
of the limited data into a single partition.
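
For example, on a recent Spark build the physical plan for a non-terminal
limit looks roughly like this (abridged from memory; note the
single-partition exchange feeding `GlobalLimit`):

    // Needs `import spark.implicits._` for the $"id" syntax.
    val df = spark.range(0L, 100000000L, 1L, numPartitions = 200)
    df.limit(10000000).where($"id" > 0).explain()
    // == Physical Plan == (abridged)
    // *Filter (id#0L > 0)
    // +- GlobalLimit 10000000
    //    +- Exchange SinglePartition   <- all limited rows go to one partition
    //       +- LocalLimit 10000000
    //          +- *Range (0, 100000000, step=1, splits=200)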

Unlike `CollectLimit`, I actually think there is no reason `GlobalLimit`
must shuffle the limited data from all partitions onto one single machine
for query execution. In other words, I think we can avoid the shuffle in
`GlobalLimit`.
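
(`CollectLimit` is different because, when the limit is the final operator,
the result is collected with the incremental `executeTake` strategy, which
scans a few partitions at a time instead of shuffling. A rough,
hand-simplified sketch of that strategy, for illustration only:)

    import org.apache.spark.rdd.RDD
    import scala.reflect.ClassTag

    // Simplified from what RDD.take / executeTake actually does.
    def incrementalTake[T: ClassTag](rdd: RDD[T], n: Int): Array[T] = {
      val buf = scala.collection.mutable.ArrayBuffer.empty[T]
      var partsScanned = 0
      val totalParts = rdd.getNumPartitions
      while (buf.size < n && partsScanned < totalParts) {
        // Try one partition first; escalate to larger batches only if
        // the rows found so far are not enough.
        val numPartsToTry = math.max(1, partsScanned * 4)
        val parts =
          partsScanned until math.min(partsScanned + numPartsToTry, totalParts)
        val left = n - buf.size
        val results = rdd.sparkContext.runJob(
          rdd, (it: Iterator[T]) => it.take(left).toArray, parts)
        results.foreach(buf ++= _)
        partsScanned += parts.size
      }
      buf.take(n).toArray
    }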

I have an idea for improving this and may update this thread later if I can
make it work.
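
Just to illustrate one possible direction (a hand-written RDD-level sketch,
not the actual `GlobalLimit` change I have in mind): run one cheap job to
count the rows each partition holds after `LocalLimit`, decide on the driver
how many rows each partition should keep, then trim each partition in place,
with no shuffle to a single partition:

    import org.apache.spark.rdd.RDD
    import scala.reflect.ClassTag

    // Hypothetical helper name, illustration only.
    def globalLimitWithoutShuffle[T: ClassTag](rdd: RDD[T], limit: Int): RDD[T] = {
      // Job 1: per-partition row counts (each partition is already
      // capped at `limit` by LocalLimit, so the counts are bounded).
      val counts = rdd.mapPartitions(it => Iterator(it.size)).collect()

      // Walk partitions in order, assigning each the prefix it keeps
      // until the limit is exhausted.
      val takes = new Array[Int](counts.length)
      var remaining = limit
      for (i <- counts.indices) {
        takes(i) = math.min(counts(i), remaining)
        remaining -= takes(i)
      }

      // Job 2: each partition drops everything past its assigned prefix.
      rdd.mapPartitionsWithIndex { (i, it) => it.take(takes(i)) }
    }

The trade-off is a second pass over the child output (or caching it), so
this only pays off when the limit, and therefore the shuffle, is large.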


sujith71955 wrote
> Dear Liang,
> 
> Thanks for your valuable feedback.
> 
> There was a mistake in the previous post; I corrected it. As you
> mentioned, in `GlobalLimit` we only take the required number of rows from
> the input iterator, which actually pulls data from both local and remote
> blocks.
> But if the limit value is very high (>= 10000000), the shuffle exchange
> between `GlobalLimit` and `LocalLimit`, which moves data from all
> partitions into one partition, is still a performance bottleneck because
> the limit value is so large.
>  
> In my next post I will publish a test report with sample data; I am also
> working out a solution for this problem.
> 
> Please let me know if you have any clarifications or suggestions regarding
> this issue.
> 
> Regards,
> Sujith





-----
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 
