GitHub user chouqin opened a pull request:
https://github.com/apache/spark/pull/1941
[SPARK-3022] [mllib] FindBinsForLevel in decision tree should call findBin
only once for each feature
`findBinsForLevel` is applied to every `LabeledPoint` to find bins for all
nodes at a given level. For a given `LabeledPoint` and a given feature, the
bin that the point falls into is always the same. But in the current
implementation, `findBin` is called on each (LabeledPoint, feature) pair for
every node at every level, which wastes computation.
In my implementation, `findBin` is executed only once for each
(LabeledPoint, feature) pair, before level-wise training of the decision tree
starts, and the results are stored in a `feature2bin` array. Then, at each
level, this `feature2bin` array can be reused.
What's more, `findBinsForLevel` now returns a smaller array: all the nodes
for which this LabeledPoint is valid share the same `feature2bin` array,
instead of each node having its own copy.
CC: @mengxr @manishamde @jkbradley, please take a look at this. Thanks.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/chouqin/spark dt-findbins
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/1941.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1941
----
commit 065a42c21c5810c2a09a125b39e8d56c38a18ebc
Author: qiping.lqp <[email protected]>
Date: 2014-08-14T06:06:43Z
improve decision tree: findbins called only once for each feature
commit 45549f777eafe166bfb49e606ac12c3faec5f965
Author: qiping.lqp <[email protected]>
Date: 2014-08-14T06:49:44Z
fix unit test for decision tree
commit 4a7bc0bab5a31712ebc8a5f37349a125d0d125b0
Author: qiping.lqp <[email protected]>
Date: 2014-08-14T06:58:02Z
fix style: line length doesn't exceed 100.
commit 59b67819cd0a256beae6e2ca67cfebfb20b690d7
Author: qiping.lqp <[email protected]>
Date: 2014-08-14T07:36:24Z
add comments
commit 4ef9e7fd53f6cefc998def2be2e08e0c22ec9e7a
Author: qiping.lqp <[email protected]>
Date: 2014-08-14T07:51:33Z
fix indentation
commit f1a20e3b347fb3df77da9e51d50a7e0d54c0a75e
Author: qiping.lqp <[email protected]>
Date: 2014-08-14T07:57:57Z
fix indentation too
----