Github user codedeft commented on the pull request:
https://github.com/apache/spark/pull/2868#issuecomment-61155725
Yeah, I'm trying to run depth-30 tests, but I'm getting failures (both with and
without the node Id cache) that seem to happen often in our clusters when using
TorrentBroadcast. I'm trying again with HttpBroadcast. In any case, I have a
hard time imagining people training deep trees without local training, so for
now the node Id cache doesn't seem very necessary.
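For reference, this is roughly how I'm switching the broadcast implementation for these runs. It's just a sketch: `spark.broadcast.factory` and `HttpBroadcastFactory` are the Spark 1.x names, and the node Id cache toggle is shown only as a comment since the exact flag comes from this PR.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: force HttpBroadcast instead of the default TorrentBroadcast
// while testing deep-tree training with and without the node Id cache.
val conf = new SparkConf()
  .setAppName("DeepTreeBroadcastTest")
  // Spark 1.x broadcast factory switch; the default is TorrentBroadcastFactory.
  .set("spark.broadcast.factory",
       "org.apache.spark.broadcast.HttpBroadcastFactory")

val sc = new SparkContext(conf)

// The node Id cache toggle from this PR would then be flipped on the tree
// Strategy (e.g. useNodeIdCache = true/false) when building the forest.
```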
I do think, though, that this could be a good addition once we add local
training later. Once deep trees become easy to train, passing them back and
forth between the driver and executors would not be advisable, so this could be
a check-in in preparation for that. What do you think?
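To make the "passing them back and forth" point concrete, here's a rough conceptual sketch of what a node Id cache buys you (names and types here are illustrative, not MLlib's actual API): each instance carries the id of the node it currently sits in, and only the splits decided in the current iteration need to be shipped, never the full partially-built deep trees.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical split record for illustration only.
case class SplitInfo(featureIndex: Int, threshold: Double,
                     leftChildId: Int, rightChildId: Int)

// One current node id per tree, per instance; update ids in place as splits
// are decided, so the growing model never has to be re-broadcast in full.
def updateNodeIdCache(
    cache: RDD[Array[Int]],          // node ids, zipped element-wise with points
    points: RDD[Array[Double]],      // feature vectors
    splits: Map[Int, SplitInfo]      // splits chosen this iteration, keyed by node id
  ): RDD[Array[Int]] = {
  cache.zip(points).map { case (nodeIds, features) =>
    nodeIds.map { id =>
      splits.get(id) match {
        case Some(s) =>
          if (features(s.featureIndex) <= s.threshold) s.leftChildId else s.rightChildId
        case None => id              // node not split this round; stays put
      }
    }
  }
}
```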
It's hard to compare against Sequoia Forest because SF's data structures have
been heavily optimized, and even without local training it currently runs about
3 times faster than this (e.g., it took 18 minutes to train 100 trees with a
depth limit of 10 on mnist8m without local training, whereas DecisionTreeRunner
took about an hour).
I think that has a lot to do with many small things (e.g., SF doesn't need to
pass bin information back and forth, and it avoids map structures so it can
prevent auto-boxing and get faster lookups, etc.), so I'm not sure the node Id
cache had anything to do with it. These are optimizations we can add to MLlib
later as well.
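To make the auto-boxing point concrete, here's a rough sketch (not SF's or MLlib's actual code) of the kind of change I mean: replacing a Map[Int, Double] of per-node statistics with a flat Array[Double] indexed by node id, so every lookup stays on primitives.

```scala
import scala.collection.mutable

// Map-based accumulation: every update boxes the Int key and Double value.
def accumulateWithMap(nodeIds: Array[Int],
                      values: Array[Double]): mutable.Map[Int, Double] = {
  val stats = mutable.Map.empty[Int, Double].withDefaultValue(0.0)
  var i = 0
  while (i < nodeIds.length) {
    stats(nodeIds(i)) += values(i)   // boxes on every call
    i += 1
  }
  stats
}

// Array-based accumulation: node ids index directly into a primitive array,
// so there is no boxing and each lookup is a single array access.
def accumulateWithArray(nodeIds: Array[Int], values: Array[Double],
                        numNodes: Int): Array[Double] = {
  val stats = new Array[Double](numNodes)
  var i = 0
  while (i < nodeIds.length) {
    stats(nodeIds(i)) += values(i)
    i += 1
  }
  stats
}
```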