GitHub user chouqin opened a pull request:

    https://github.com/apache/spark/pull/2780

    [SPARK-3207][MLLIB] Choose splits for continuous features in DecisionTree more adaptively

    DecisionTree splits on continuous features by choosing an array of values
    from a subsample of the data. Currently, it does not check for identical
    values in the subsample, so it can end up with multiple copies of the same
    split. In this PR, we choose splits for a continuous feature in three steps:
    
    1. Sort the sample values for this feature.
    2. Count the occurrences of each distinct value.
    3. Iterate over the value-count array computed in step 2 to choose splits.
    
    After the splits are found, `numSplits` and `numBins` in the metadata are updated.
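
The three steps above can be sketched roughly as follows. This is an illustrative Python sketch, not the Scala code in this PR: the function name `find_splits`, the parameter `max_splits`, and the stride-based rule for picking thresholds are all assumptions made for the example. It only aims to show why counting distinct values first prevents duplicate splits.

```python
from itertools import groupby


def find_splits(samples, max_splits):
    """Return at most max_splits distinct thresholds for one continuous feature.

    Hypothetical helper for illustration; the names and the selection rule
    are assumptions, not the PR's actual implementation.
    """
    if max_splits < 1 or not samples:
        return []

    # Step 1: sort the sampled values for this feature.
    ordered = sorted(samples)

    # Step 2: count the occurrences of each distinct value.
    value_counts = [(value, sum(1 for _ in group))
                    for value, group in groupby(ordered)]

    # With few distinct values, each one becomes a threshold exactly once.
    if len(value_counts) <= max_splits:
        return [value for value, _ in value_counts]

    # Step 3: walk the value-count array and emit a threshold whenever the
    # running count before the current value is closer to the target than
    # the running count after it.
    stride = len(ordered) / (max_splits + 1)
    target = stride
    cumulative = value_counts[0][1]
    splits = []
    for i in range(1, len(value_counts)):
        previous = cumulative
        cumulative += value_counts[i][1]
        if abs(previous - target) < abs(cumulative - target):
            splits.append(value_counts[i - 1][0])
            target += stride
    return splits


# Half of the sample is the value 1.0, yet it is chosen only once.
print(find_splits([1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 2))
```

In a sketch like this, the number of thresholds returned can be smaller than the number requested (for example, when the sample has only one distinct value), which is presumably why the split and bin counts in the metadata have to be updated afterwards.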
    
    
    CC: @mengxr @manishamde @jkbradley, please help me review this, thanks.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/chouqin/spark dt-findsplits

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2780.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2780
    
----
commit af7cb7962ff9f5041981ea5e4fe2465eceb6f0e5
Author: Qiping Li <[email protected]>
Date:   2014-10-09T11:47:09Z

    Choose splits for continuous features in DecisionTree more adaptively

commit 365282375ce3d1a26664695893ebad13d1b3bc47
Author: Qiping Li <[email protected]>
Date:   2014-10-09T12:40:55Z

    fix bug

commit 0cd744a4e710463591324b36f01d9dab028e79ef
Author: liqi <[email protected]>
Date:   2014-10-10T04:33:24Z

    fix bug

commit 1b25a3530f5429b245a50d4c706ebad2d2875726
Author: Qiping Li <[email protected]>
Date:   2014-10-11T01:36:38Z

    Merge branch 'master' of https://github.com/apache/spark into dt-findsplits

commit 9e7138e09dfe27c41d8d20ba6fcf9cb59d64a46b
Author: Qiping Li <[email protected]>
Date:   2014-10-13T01:11:31Z

    Merge branch 'dt-findsplits' of https://github.com/chouqin/spark into dt-findsplits
    
    Conflicts:
        mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala

commit 8f46af6b57149fefd1e32120947ebe3291730af0
Author: Qiping Li <[email protected]>
Date:   2014-10-13T03:48:42Z

    add comments and unit test

commit 369f812a9ffce7dd10fc37e4a937158f2fa93e1c
Author: Qiping Li <[email protected]>
Date:   2014-10-13T03:53:07Z

    fix style

commit c339a614362f3045ee95975f99b6fde884657d48
Author: Qiping Li <[email protected]>
Date:   2014-10-13T04:31:23Z

    fix bug

commit 2a8267ab9bd8853fa1f638b69373dbbbf0d1a329
Author: Qiping Li <[email protected]>
Date:   2014-10-13T04:43:44Z

    fix bug

commit af6dc974258a9b07020e233e16cbbb584f501122
Author: Qiping Li <[email protected]>
Date:   2014-10-13T05:03:43Z

    fix bug

commit ab303a4ab1931b0c1a90ae2c3923f25d8f266178
Author: Qiping Li <[email protected]>
Date:   2014-10-13T06:10:33Z

    fix bug

commit f69f47f25f292995aa8710da6384bf631787711a
Author: Qiping Li <[email protected]>
Date:   2014-10-13T06:12:10Z

    fix bug

commit 092efcb89c4113eba8374e47587c6f1272aa7125
Author: Qiping Li <[email protected]>
Date:   2014-10-13T06:31:58Z

    fix bug

----


