[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...

sethah Tue, 20 Sep 2016 11:39:04 -0700

Github user sethah commented on the issue:

    https://github.com/apache/spark/pull/14359
  
    This is a really nice improvement. Based on some simple local tests, it 
does reduce the communication overhead. I wonder how we can add a test to 
verify that the algorithm focuses on completing whole trees at once. 
Potentially, we could add a test of `selectNodesToSplit` to verify that it 
chooses nodes from fewer trees, but I'm not sure it's necessary. Thoughts?
    
    Also, it might not be too hard to take this a step further. We could group 
the nodes to be trained by tree, and keep track of the amount of memory each 
tree's nodes require. Then, to select nodes to split, we can simply pick off 
the trees that require the most memory until we exceed the threshold. This way 
we truly minimize the number of trees while still filling the available memory. 
We could leave it for another JIRA.
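
    To make the grouping idea concrete, here is a rough sketch in Python 
(chosen for brevity; this is not Spark code and not what this PR implements). 
The function name `select_trees_by_memory`, the `(tree_index, node_id, 
memory_bytes)` tuple layout, and the rule that at least one node is always 
taken so a single oversized node cannot stall training are all assumptions 
made for illustration.

```python
from collections import defaultdict


def select_trees_by_memory(pending, max_memory_bytes):
    """Greedy sketch: group pending nodes by tree, then take trees in
    descending order of total memory until the budget would be exceeded.

    pending: list of (tree_index, node_id, memory_bytes) tuples (hypothetical layout).
    Returns (selected, used), where selected maps tree_index -> [node_id, ...].
    """
    nodes_by_tree = defaultdict(list)
    memory_by_tree = defaultdict(int)
    for tree_index, node_id, memory_bytes in pending:
        nodes_by_tree[tree_index].append((node_id, memory_bytes))
        memory_by_tree[tree_index] += memory_bytes

    selected = {}
    used = 0
    # Largest trees first; ties are broken by tree index so the order is deterministic.
    for tree_index in sorted(memory_by_tree, key=lambda t: (-memory_by_tree[t], t)):
        for node_id, memory_bytes in nodes_by_tree[tree_index]:
            # Stop once the next node would push us over the budget, but
            # always accept the very first node (assumption, see above).
            if used > 0 and used + memory_bytes > max_memory_bytes:
                return selected, used
            selected.setdefault(tree_index, []).append(node_id)
            used += memory_bytes
    return selected, used


# Three trees with pending nodes; trees 0 and 2 need 60 bytes each, tree 1 needs 50.
pending_nodes = [(0, 1, 30), (0, 2, 30), (1, 1, 50), (2, 1, 20), (2, 2, 20), (2, 3, 20)]
selected, used = select_trees_by_memory(pending_nodes, max_memory_bytes=100)
print(selected, used)  # {0: [1, 2], 2: [1, 2]} 100
```

    With a budget of 100, the sketch fills the budget from two trees and never 
touches tree 1. A test of the real `selectNodesToSplit` could assert something 
similar, namely the number of distinct trees in the selected group.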



