Github user codedeft commented on the pull request:

    https://github.com/apache/spark/pull/2868#issuecomment-61155725
  
    Yea, I'm trying to run depth-30 tests, but I'm getting failures (both with
and without the node Id cache) that seem to happen often in our clusters when
using TorrentBroadcast. I'm trying again with HttpBroadcast. But anyhow, I have
a hard time imagining people training deep trees without local training, so for
now the node Id cache doesn't seem very necessary.
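
    For reference, switching the broadcast implementation for these tests is
just a configuration change. A minimal sketch, assuming the Spark 1.x codebase
this PR targets, where the spark.broadcast.factory setting and the
HttpBroadcastFactory class are available (the app name below is made up):

        import org.apache.spark.{SparkConf, SparkContext}

        // Select the broadcast implementation explicitly so the same test can
        // be rerun under TorrentBroadcastFactory or HttpBroadcastFactory.
        val conf = new SparkConf()
          .setAppName("depth-30-tree-test")  // hypothetical app name
          .set("spark.broadcast.factory",
               "org.apache.spark.broadcast.HttpBroadcastFactory")
        val sc = new SparkContext(conf)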
    
    I do think, though, that this might be a good addition to pave the way for
local training later. Eventually, once deep trees become easy to train, passing
them back and forth between the driver and the executors would not be
advisable, so this could be a check-in in preparation for that. What do you
think?
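
    To make the depth argument concrete: the cache matters because, without
it, every iteration re-derives each instance's current node by walking the
tree from the root, so the per-iteration cost grows with depth. A purely
illustrative sketch of the cached update (plain Scala, not Spark's actual
NodeIdCache API; Split, the heap-style node numbering, and all names here are
hypothetical):

        // Hypothetical split on a continuous feature.
        case class Split(feature: Int, threshold: Double)

        object NodeIdCacheSketch {
          // Advance each cached node Id by one level, given the splits chosen
          // for the nodes that were split in this iteration.
          def updateNodeIds(
              features: Array[Array[Double]],
              nodeIds: Array[Int],
              splits: Map[Int, Split]): Unit = {
            var i = 0
            while (i < nodeIds.length) {
              splits.get(nodeIds(i)).foreach { s =>
                // Children of node n are numbered 2n and 2n + 1 (heap layout).
                nodeIds(i) =
                  if (features(i)(s.feature) <= s.threshold) 2 * nodeIds(i)
                  else 2 * nodeIds(i) + 1
              }
              i += 1
            }
          }
        }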
    
    It's hard to compare against Sequoia Forest because SF has highly optimized
data structures and, even without local training, currently runs about 3 times
faster than this (e.g., it took 18 minutes to train 100 trees with a depth
limit of 10 on mnist8m without local training, whereas DecisionTreeRunner took
about an hour).
    
    I think the difference comes down to a lot of small things (e.g., SF
doesn't need to pass bin information back and forth, it avoids map structures
to prevent auto-boxing and to get faster lookups, etc.), so I'm not sure the
node Id cache had anything to do with it. These are optimizations we can add
to MLlib later as well.
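
    To sketch the map-vs-primitive point above in isolation (names and layout
here are hypothetical, not SF's or MLlib's actual code): a Map keyed by
(node, bin) boxes every key and value, whereas a flat Array[Double] indexed
arithmetically stays on primitives and gives allocation-free O(1) lookups.

        // Flat, primitive-backed bin statistics instead of a
        // Map[(Int, Int), Double]: no boxing, direct array indexing.
        class BinStats(numNodes: Int, numBins: Int) {
          private val stats = new Array[Double](numNodes * numBins)

          @inline private def idx(node: Int, bin: Int): Int =
            node * numBins + bin

          def add(node: Int, bin: Int, value: Double): Unit =
            stats(idx(node, bin)) += value

          def get(node: Int, bin: Int): Double =
            stats(idx(node, bin))
        }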

