Github user manishamde commented on the pull request:
https://github.com/apache/spark/pull/1941#issuecomment-52188362
Thanks for the PR @chouqin. The redundant findBin calculation should
definitely be performed only once; doing so will definitely speed up the
computation. A couple of thoughts after looking at your implementation:
1. Do you need to store the original features along with the bins?
2. You are using Int for bin ids. You could pack them tightly as Byte if the
number of bins is less than 256, and use a multi-byte format for larger bin
counts.
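To illustrate the second point, here is a minimal sketch (names are illustrative, not part of the PR) of packing a bin index into a single Byte when there are at most 256 bins. Since Byte is signed on the JVM, the unsigned bin id is recovered with a bit mask:

```scala
// Hypothetical sketch: store bin indices as Byte when the number of bins
// is at most 256. Byte is signed in Scala, so values 128-255 are stored
// as negative bytes and recovered with (b & 0xFF).
object BinPacking {
  def toByteBin(bin: Int): Byte = {
    require(bin >= 0 && bin < 256, s"bin $bin does not fit in one byte")
    bin.toByte
  }

  def fromByteBin(b: Byte): Int = b & 0xFF
}
```

An Array[Byte] of binned features then uses a quarter of the memory of an Array[Int], which matters when the binned dataset is cached across iterations.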
I have an implementation similar to yours
https://github.com/manishamde/spark/compare/ent
A slight difference is that I am creating an internal TreePoint class that
stores the bin mapping; this class is extended when performing the Random
Forest computation. Finally, I think @jkbradley is working on more
optimizations on top of these changes. I will let him elaborate on that.
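As a rough sketch of that idea (class and field names here are illustrative assumptions, not the actual Spark MLlib API), the internal representation replaces the raw feature vector with precomputed bin indices, and a subclass adds what the Random Forest computation needs:

```scala
// Hypothetical TreePoint-style class: the raw feature vector is replaced
// by precomputed bin indices, so findBin runs once per point up front.
class TreePoint(val label: Double, val binnedFeatures: Array[Int])
  extends Serializable

// For a random forest, the class could be extended to also carry per-tree
// bootstrap-sample counts alongside the shared bin mapping.
class BaggedTreePoint(label: Double, binnedFeatures: Array[Int],
    val subsampleCounts: Array[Int])
  extends TreePoint(label, binnedFeatures)
```

The advantage of subclassing is that the single-tree training code can keep operating on TreePoint while the forest code sees the extra bookkeeping.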