Github user lucio-yz commented on the issue:
https://github.com/apache/spark/pull/20472
I tested on two datasets:
1. _rcv1.binary_, which has 47,236 dimensions. Before the improvement, the
shuffle write size in _findSplitsBySorting_ was 1 GB; after the improvement,
it is 7.7 MB.
2. _news20.binary_, which has 1,355,191 dimensions. Before the improvement,
the shuffle write size in _findSplitsBySorting_ was 51 GB; after the
improvement, it is 24.1 MB.
P.S. These tests were run on a 10-node cluster; a rough reproduction sketch follows below.
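As a reproduction sketch only (not the exact benchmark code): assuming the datasets are stored in LIBSVM format at placeholder HDFS paths, a job like the following exercises the continuous-feature split-finding path in MLlib's tree training, which is where the _findSplitsBySorting_ shuffle occurs; the shuffle write size is then read off the corresponding stage in the Spark UI or event log.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}
import org.apache.spark.ml.classification.DecisionTreeClassifier

object FindSplitsShuffleCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("findSplitsBySorting shuffle check")
      .getOrCreate()

    // rcv1.binary / news20.binary are distributed in LIBSVM format
    // with -1/+1 labels; remap them to 0/1 for the classifier.
    // The input path below is a placeholder, not the path used in the test.
    val raw = spark.read.format("libsvm").load("hdfs:///data/rcv1.binary")
    val data = raw.withColumn(
      "label", when(col("label") < 0, 0.0).otherwise(1.0))

    // Fitting a decision tree triggers continuous-feature split
    // finding, which is where the shuffle under discussion happens.
    val model = new DecisionTreeClassifier().fit(data)
    println(s"depth=${model.depth}, nodes=${model.numNodes}")

    // The shuffle write size for the split-finding stage is read from
    // the Spark UI (Stages tab) or the event log, not from this program.
    spark.stop()
  }
}
```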