Github user lucio-yz commented on the issue:

    https://github.com/apache/spark/pull/20472
  
    I tested on 2 datasets:
    
    1. _rcv1.binary_, which has 47,236 dimensions. Before the improvement, the shuffle write size in _findSplitsBySorting_ was 1 GB; after the improvement, it is 7.7 MB.
    
    2. _news20.binary_, which has 1,355,191 dimensions. Before the improvement, the shuffle write size in _findSplitsBySorting_ was 51 GB; after the improvement, it is 24.1 MB.
    
    P.S. I tested on a cluster of 10 nodes.
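    
    For reference, a minimal sketch of the kind of driver program that exercises this code path (the dataset path and the label remapping are illustrative, not part of this PR). Fitting a decision tree on a high-dimensional LIBSVM dataset triggers continuous-feature split finding (`findSplitsBySorting` in `org.apache.spark.ml.tree.impl.RandomForest`), and the shuffle write size of that stage can be read from the Spark UI:
    
    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, when}
    import org.apache.spark.ml.classification.DecisionTreeClassifier
    
    object SplitFindingShuffleTest {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("SplitFindingShuffleTest").getOrCreate()
    
        // Hypothetical path: rcv1.binary or news20.binary from the LIBSVM
        // dataset collection, stored in LIBSVM text format.
        val raw = spark.read.format("libsvm").load("/data/news20.binary")
    
        // These datasets use -1/+1 labels; DecisionTreeClassifier expects
        // labels in {0, 1}, so remap before fitting.
        val data = raw.withColumn("label",
          when(col("label") === -1.0, 0.0).otherwise(1.0))
    
        // Fitting runs continuous-feature split finding; check the Spark UI's
        // stage page for the shuffle write size of the split-finding stage.
        val model = new DecisionTreeClassifier().fit(data)
        println(s"Tree depth: ${model.depth}")
    
        spark.stop()
      }
    }
    ```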

