GitHub user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/14359
  
    Btw, to give back-of-the-envelope estimates, we can look at two numbers:
    (1) How many nodes will be split on each iteration?
    (2) How big is the forest that is serialized and sent to workers on each iteration?
    
    For (1), here's an example (with a code sketch after the list):
    * 1000 features, each with 50 bins -> 50000 possible splits
    * set maxMemoryInMB = 256 (default)
    * regression => 3 Double values per possible split (count, sum, and sum of squares)
    * 256 * 10^6 / (3 * 50000 * 8) = 213 nodes/iteration
    
    This implies that for trees of depth > 8 or so, many iterations will only split nodes from 1 or 2 trees: at depth 8, a single tree's frontier can already hold 2^8 = 256 nodes, more than the ~213 which fit in one iteration.  I.e., we should avoid communicating most trees.
    
    For (2), the forest can be pretty expensive to send (see the code sketch after this list):
    * Each node:
      * leaf node: 5 Doubles
      * internal node: ~8 Doubles/references + Split
        * Split: O(# categories) or 2 values for continuous, say 3 Doubles on average
      * => say 8 Doubles/node on average
    * 100 trees of depth 8 (~2^8 nodes/tree) => 25600 nodes => 1.6MB
    * 100 trees of depth 14 (~2^14 nodes/tree) => ~1.64M nodes => 105MB
    * I've heard of many cases of users wanting to fit 500-1000 trees of depth 18-20.
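    
    Here's the same kind of sketch for estimate (2), under the ~8 Doubles/node and ~2^depth nodes/tree assumptions above (ForestSizeEstimate and forestBytes are hypothetical names, not Spark code):
    
    ```scala
    // Back-of-the-envelope estimate (2): serialized forest size,
    // assuming ~8 Doubles per node and ~2^depth nodes per tree.
    object ForestSizeEstimate {
      val doublesPerNode = 8
      val bytesPerDouble = 8
    
      def forestBytes(numTrees: Int, depth: Int): Long = {
        val nodesPerTree = 1L << depth                      // ~2^depth nodes
        numTrees.toLong * nodesPerTree * doublesPerNode * bytesPerDouble
      }
    
      def main(args: Array[String]): Unit = {
        println(f"${forestBytes(100, 8) / 1e6}%.1f MB")     // 1.6 MB
        println(f"${forestBytes(100, 14) / 1e6}%.1f MB")    // 104.9 MB
        println(f"${forestBytes(1000, 20) / 1e9}%.1f GB")   // 67.1 GB for the 1000-tree, depth-20 case
      }
    }
    ```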