GitHub user dikejiang opened a pull request:

    https://github.com/apache/spark/pull/3583

    [mllib] [random forest] functions returning the category with weights

    This version adds two functions, 1) 
predictByVotingWithWeight(features: Vector) and 2) predictWithWeight(features: 
Vector), and modifies the existing function predictByVoting(features: Vector).
    
    There are at least two reasons for this change:
    
    1) In practice, we want to find the top N samples for a given category. 
However, in version 1.3.0 the predict function returns only the predicted 
category, without a weight.
    
    2) In addition, our positive and negative samples are highly imbalanced: 
there are far fewer positive samples than negative ones. Under majority 
voting, very few samples are predicted as positive. If the weights are also 
returned, users can choose a suitable threshold to adjust the predictions and 
improve performance.
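
The two use cases above (thresholding on the vote weight and ranking the top N samples) can be illustrated with a minimal, self-contained Scala sketch. It is not the code in this pull request: the `Tree` type alias, the object name, and the helpers `categoryWeight`, `predictWithThreshold`, and `topN` are hypothetical, features are plain `Array[Double]` rather than MLlib `Vector`, and the "weight" is assumed to be the fraction of trees voting for a category.

```scala
object WeightedVotingSketch {
  // Hypothetical stand-in for a trained decision tree: features => class label.
  type Tree = Array[Double] => Double

  // Majority vote that also returns the winner's weight,
  // assumed here to be the fraction of trees that voted for it.
  def predictByVotingWithWeight(trees: Seq[Tree], features: Array[Double]): (Double, Double) = {
    require(trees.nonEmpty, "at least one tree is required")
    val votes = trees.map(t => t(features)).groupBy(identity).map {
      case (label, vs) => (label, vs.size)
    }
    val (label, count) = votes.maxBy(_._2)
    (label, count.toDouble / trees.size)
  }

  // Fraction of trees voting for one specific category.
  def categoryWeight(trees: Seq[Tree], features: Array[Double], category: Double): Double = {
    require(trees.nonEmpty, "at least one tree is required")
    trees.count(t => t(features) == category).toDouble / trees.size
  }

  // Binary prediction with a user-chosen threshold on the positive weight,
  // useful when positive samples are rare and seldom win a majority vote.
  def predictWithThreshold(trees: Seq[Tree], features: Array[Double],
                           threshold: Double, positive: Double = 1.0,
                           negative: Double = 0.0): Double =
    if (categoryWeight(trees, features, positive) >= threshold) positive else negative

  // Top N samples for a category, ranked by descending weight.
  def topN(trees: Seq[Tree], samples: Seq[Array[Double]], n: Int,
           category: Double = 1.0): Seq[(Array[Double], Double)] =
    samples.map(s => (s, categoryWeight(trees, s, category)))
      .sortWith((a, b) => a._2 > b._2)
      .take(n)
}
```

With three toy trees where only one votes positive, plain majority voting returns the negative class, while `predictWithThreshold` with a threshold of 0.3 returns the positive class, which is the kind of adjustment the second reason describes.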

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dikejiang/spark 20141203

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3583.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3583
    
----
commit c45247094016ff89829ce3ded74e8c29a7eeb878
Author: dikejiang <[email protected]>
Date:   2014-12-03T12:23:24Z

    functions returning the category with weights
    
    This version adds two functions, 1) 
predictByVotingWithWeight(features: Vector) and 2) predictWithWeight(features: 
Vector), and modifies the existing function predictByVoting(features: Vector).
    
    There are at least two reasons for this change:
    
    1) In practice, we want to find the top N samples for a given category. 
However, in version 1.3.0 the predict function returns only the predicted 
category, without a weight.
    
    2) In addition, our positive and negative samples are highly imbalanced: 
there are far fewer positive samples than negative ones. Under majority 
voting, very few samples are predicted as positive. If the weights are also 
returned, users can choose a suitable threshold to adjust the predictions and 
improve performance.

----


