Github user manishamde commented on the pull request:

    https://github.com/apache/spark/pull/1941#issuecomment-52188362
  
    Thanks for the PR @chouqin. The redundant findBin calculation should 
definitely be performed only once; that will speed up the computation. A 
couple of thoughts after looking at your implementation:
    1. Do you need to store the original features along with the bins?
    2. You are using Int for bin ids. You could pack them tightly as Byte if the 
number of bins is less than 256, and use a multi-byte format for larger bin 
counts.
    
    I have a similar implementation here:
    https://github.com/manishamde/spark/compare/ent
    
    A slight difference is that I create an internal TreePoint class that 
stores the bin mapping; this class is extended when performing the Random 
Forest computation. Finally, I think @jkbradley is working on further 
optimizations on top of these changes. I will let him elaborate on that.
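
    To make the bin-caching and byte-packing idea concrete, here is a minimal, 
hypothetical sketch (in Python for brevity; the names `BinnedPoint` and 
`bin_for` are illustrative, not the actual TreePoint API). Each training point 
caches the label and the precomputed bin index per feature, and the indices are 
packed into a bytearray when every feature has at most 256 bins:

    ```python
    class BinnedPoint:
        """Caches a training point's label and per-feature bin indices."""

        def __init__(self, label, bin_indices, num_bins):
            self.label = label
            # If every feature has at most 256 bins, each bin index fits in
            # 0..255, so one byte per feature suffices; otherwise fall back
            # to plain ints (standing in for a multi-byte format).
            if max(num_bins) <= 256:
                self.bins = bytearray(bin_indices)
            else:
                self.bins = list(bin_indices)

        def bin_for(self, feature_index):
            """Return the cached bin index for the given feature."""
            return self.bins[feature_index]

    # Example: three features with 32, 256, and 32 bins respectively.
    point = BinnedPoint(label=1.0, bin_indices=[3, 200, 17],
                        num_bins=[32, 256, 32])
    ```

    The point of the sketch is only that findBin runs once per (point, feature) 
up front, and every later split evaluation reads the cached index instead of 
re-binning the raw feature value.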


