Github user lucio-yz commented on the issue: https://github.com/apache/spark/pull/20472

I tested on two datasets:

1. _rcv1.binary_, which has 47,236 dimensions. Before the improvement, the shuffle write size in _findSplitsBySorting_ was 1 GB; after the improvement, it is 7.7 MB.
2. _news20.binary_, which has 1,355,191 dimensions. Before the improvement, the shuffle write size in _findSplitsBySorting_ was 51 GB; after the improvement, it is 24.1 MB.

P.S. The tests were run on a cluster of 10 nodes.
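For readers who want to reproduce a measurement like this, a minimal sketch follows. It assumes the standard setup for these benchmarks: both datasets are available in libsvm format, fitting a tree-based model triggers the internal _findSplitsBySorting_ step, and the shuffle write size of that stage is read from the Spark UI or event logs. The dataset path, object name, and tree parameters below are illustrative, not taken from the PR.

```scala
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.sql.SparkSession

// Hypothetical reproduction sketch: train a tree model on a sparse libsvm
// dataset and inspect the shuffle write size of the findSplits stage in the
// Spark UI. The path and parameters are placeholders.
object FindSplitsShuffleBenchmark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("findSplits shuffle benchmark")
      .getOrCreate()

    // rcv1.binary and news20.binary are commonly distributed in libsvm format.
    val data = spark.read.format("libsvm").load("hdfs:///data/news20.binary") // illustrative path

    val dt = new DecisionTreeClassifier()
      .setMaxDepth(5)
      .setMaxBins(32)

    // Fitting the model runs the split-finding logic (findSplitsBySorting)
    // internally; its stage-level shuffle metrics appear in the Spark UI.
    val model = dt.fit(data)

    spark.stop()
  }
}
```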