Github user lucio-yz commented on the issue: https://github.com/apache/spark/pull/20472

I tested on two datasets:

1. _rcv1.binary_, which has 47,236 dimensions. Before the improvement, the shuffle write size in _findSplitsBySorting_ was 1 GB; after the improvement, it is 7.7 MB.
2. _news20.binary_, which has 1,355,191 dimensions. Before the improvement, the shuffle write size in _findSplitsBySorting_ was 51 GB; after the improvement, it is 24.1 MB.

P.S. The tests were run on a cluster of 10 nodes.
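For readers who want to reproduce a measurement like this, a minimal sketch follows. It assumes the standard setup for these benchmarks: both datasets are available in libsvm format, fitting a tree-based model triggers the internal _findSplitsBySorting_ step, and the shuffle write size of that stage is read from the Spark UI or event logs. The dataset path, object name, and tree parameters below are illustrative, not taken from the PR.

```scala
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.sql.SparkSession

// Hypothetical reproduction sketch: train a tree model on a sparse libsvm
// dataset and inspect the shuffle write size of the findSplits stage in the
// Spark UI. The path and parameters are placeholders.
object FindSplitsShuffleBenchmark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("findSplits shuffle benchmark")
      .getOrCreate()

    // rcv1.binary and news20.binary are commonly distributed in libsvm format.
    val data = spark.read.format("libsvm").load("hdfs:///data/news20.binary") // illustrative path

    val dt = new DecisionTreeClassifier()
      .setMaxDepth(5)
      .setMaxBins(32)

    // Fitting the model runs the split-finding logic (findSplitsBySorting)
    // internally; its stage-level shuffle metrics appear in the Spark UI.
    val model = dt.fit(data)

    spark.stop()
  }
}
```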