GitHub user chouqin opened a pull request:
https://github.com/apache/spark/pull/2780
[SPARK-3207][MLLIB]Choose splits for continuous features in DecisionTree
more adaptively
DecisionTree splits on continuous features by choosing an array of values
from a subsample of the data.
Currently, it does not check for identical values in the subsample, so it
can end up with multiple copies of the same split. In this PR, we choose
splits for a continuous feature in 3 steps:
1. Sort the sample values for this feature.
2. Count the occurrences of each distinct value.
3. Iterate over the value-count array computed in step 2 to choose splits.
After the splits are found, `numSplits` and `numBins` in the metadata are
updated accordingly.
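
The three steps above can be sketched roughly as follows. This is only an
illustration, not the code in this PR: the object name `FindSplitsSketch`,
the method name `findSplitsForContinuousFeature`, and the stride-based rule
used in step 3 are all assumptions. It also assumes that when there are no
more distinct values than requested splits, every distinct value becomes a
split.

```scala
import scala.collection.mutable.ArrayBuffer

object FindSplitsSketch {
  /**
   * Hypothetical helper: choose at most `numSplits` distinct thresholds
   * for one continuous feature from its sampled values.
   */
  def findSplitsForContinuousFeature(
      samples: Array[Double],
      numSplits: Int): Array[Double] = {
    require(numSplits > 0, "numSplits must be positive")

    // Step 1: sort the sampled values.
    val sorted = samples.sorted

    // Step 2: count occurrences of each distinct value, in ascending order.
    val valueCounts = ArrayBuffer.empty[(Double, Int)]
    for (v <- sorted) {
      if (valueCounts.nonEmpty && valueCounts.last._1 == v) {
        valueCounts(valueCounts.length - 1) = (v, valueCounts.last._2 + 1)
      } else {
        valueCounts += ((v, 1))
      }
    }

    // Step 3: iterate over the value counts to choose splits.
    if (valueCounts.length <= numSplits) {
      // Few distinct values: use each one once (assumed behavior).
      valueCounts.map(_._1).toArray
    } else {
      // Aim for roughly equal numbers of samples between splits.
      val stride = sorted.length.toDouble / (numSplits + 1)
      val splits = ArrayBuffer.empty[Double]
      var seen = 0
      var target = stride
      for ((value, count) <- valueCounts if splits.length < numSplits) {
        seen += count
        if (seen >= target) {
          splits += value
          // Skip targets already covered by a heavily repeated value.
          while (target <= seen) target += stride
        }
      }
      splits.toArray
    }
  }
}
```

Because each distinct value is visited once, the result cannot contain the
same split twice, and it may hold fewer than `numSplits` entries. That is
presumably why `numSplits` and `numBins` need updating afterwards.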
CC: @mengxr @manishamde @jkbradley, please help me review this, thanks.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/chouqin/spark dt-findsplits
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/2780.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2780
----
commit af7cb7962ff9f5041981ea5e4fe2465eceb6f0e5
Author: Qiping Li <[email protected]>
Date: 2014-10-09T11:47:09Z
Choose splits for continuous features in DecisionTree more adaptively
commit 365282375ce3d1a26664695893ebad13d1b3bc47
Author: Qiping Li <[email protected]>
Date: 2014-10-09T12:40:55Z
fix bug
commit 0cd744a4e710463591324b36f01d9dab028e79ef
Author: liqi <[email protected]>
Date: 2014-10-10T04:33:24Z
fix bug
commit 1b25a3530f5429b245a50d4c706ebad2d2875726
Author: Qiping Li <[email protected]>
Date: 2014-10-11T01:36:38Z
Merge branch 'master' of https://github.com/apache/spark into dt-findsplits
commit 9e7138e09dfe27c41d8d20ba6fcf9cb59d64a46b
Author: Qiping Li <[email protected]>
Date: 2014-10-13T01:11:31Z
Merge branch 'dt-findsplits' of https://github.com/chouqin/spark into
dt-findsplits
Conflicts:
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala
commit 8f46af6b57149fefd1e32120947ebe3291730af0
Author: Qiping Li <[email protected]>
Date: 2014-10-13T03:48:42Z
add comments and unit test
commit 369f812a9ffce7dd10fc37e4a937158f2fa93e1c
Author: Qiping Li <[email protected]>
Date: 2014-10-13T03:53:07Z
fix style
commit c339a614362f3045ee95975f99b6fde884657d48
Author: Qiping Li <[email protected]>
Date: 2014-10-13T04:31:23Z
fix bug
commit 2a8267ab9bd8853fa1f638b69373dbbbf0d1a329
Author: Qiping Li <[email protected]>
Date: 2014-10-13T04:43:44Z
fix bug
commit af6dc974258a9b07020e233e16cbbb584f501122
Author: Qiping Li <[email protected]>
Date: 2014-10-13T05:03:43Z
fix bug
commit ab303a4ab1931b0c1a90ae2c3923f25d8f266178
Author: Qiping Li <[email protected]>
Date: 2014-10-13T06:10:33Z
fix bug
commit f69f47f25f292995aa8710da6384bf631787711a
Author: Qiping Li <[email protected]>
Date: 2014-10-13T06:12:10Z
fix bug
commit 092efcb89c4113eba8374e47587c6f1272aa7125
Author: Qiping Li <[email protected]>
Date: 2014-10-13T06:31:58Z
fix bug
----