Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/2780#issuecomment-59590801
@chouqin Sorry for the slow response!
About the RandomForestSuite failure: The change to fix the failure
(maxBins) is OK with me. It is a somewhat brittle test. Good point about the
first threshold being wasted.
About the histogram method's speed: I would guess that the extra
computation will not be that bad. Even if maxBins grows, I would expect the
runtime of the whole algorithm to slow down as well, and the number of samples
is capped at 10000. I will run some tests though to make sure.
About the histogram method's references: The PLANET paper uses
"equidepth" histograms, citing the paper below. Looking at that paper,
"equidepth" means the same method which @manishamde implemented previously.
I will look into this a little more to see if I find a match for the method
you implemented.
* PLANET paper: "PLANET: Massively Parallel Learning of Tree Ensembles
with MapReduce"
* Paper they cite for histograms: G. S. Manku, S. Rajagopalan, and B. G.
Lindsay. Random sampling techniques for space efficient online computation of
order statistics of large datasets. In International Conference on ACM Special
Interest Group on Management of Data (SIGMOD), pages 251-262, 1999.
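
For reference, here is a minimal sketch of what I understand "equidepth" to mean: pick split thresholds from a sorted sample so that each bin holds roughly the same number of points. This is only an illustration in Python, not the code in this PR or in MLlib; the function name `equidepth_thresholds` and its parameters are hypothetical.

```python
def equidepth_thresholds(sample, num_bins):
    """Return up to num_bins - 1 thresholds that split `sample` into
    bins of roughly equal count (hypothetical helper, for illustration)."""
    if num_bins < 2 or not sample:
        return []
    ordered = sorted(sample)
    n = len(ordered)
    thresholds = []
    for i in range(1, num_bins):
        # Index of the first element of the i-th bin boundary.
        idx = min((i * n) // num_bins, n - 1)
        candidate = ordered[idx]
        # Skip repeated values so the thresholds stay strictly increasing.
        if not thresholds or candidate > thresholds[-1]:
            thresholds.append(candidate)
    return thresholds
```

With many repeated values the function returns fewer than `num_bins - 1` thresholds, since duplicate candidates are dropped.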
I'll make a pass now and add comments.