Hi Andy,

Please find attached the version of the code in which I fixed the refit issue.

Thanks!
Shalu

On Wed, Feb 25, 2015 at 11:35 PM, shalu jhanwar <shalu.jhanwa...@gmail.com>
wrote:

> Hi Andy,
>
> Please see the code. I am attaching the following files:
> i) Code: RandomForest_IndependentDataset_prabability_values.py
> ii) Test dataset: test.txt
> iii) Training dataset: training_data.txt
>
> Please use this command to run the code:
> python RandomForest_IndependentDataset_prabability_values.py -d training_data.txt -D <output_dir name> -C "3,4,5,6,7,8,15,17" -c "1" -g test.txt
>
> When you run the code, 2 output files will be generated in the output
> directory; the one to check is named output.txt
>
> In that file, look at these entries, which show the discrepancy:
>   chr3_125709142_125709481  chr19_32769611_32770111  chr18_3593848_3594348
> chr19_49466802_49467527  chr12_860254_860664  chr19_49465555_49466264
> chr2_64836549_64836646
>
> thanks!
> Shalu
>
>
> On Wed, Feb 25, 2015 at 11:13 PM, Andy <t3k...@gmail.com> wrote:
>
>>  Please show the code.
>>
>>
>>
>> On 02/25/2015 04:51 PM, shalu jhanwar wrote:
>>
>> Hi guys!
>>
>>  I removed the second fit, but didn't set random_state explicitly.
>> The same problem persists. Look at these few examples:
>>
>>  Y_true       Y_predict      Class0_prob.     Class1_prob.
>>    1                  0                     0.28                  0.72
>>    0                  0                     0.32                  0.68
>>    0                  0                     0.41                  0.59
>>    1                  0                     0.41                  0.59
>>    1                  0                     0.48                  0.52
>>    1                  1                     0.57                  0.42
>>
>>  Please let me know if I am still missing something.
>> thanks!
>> Shalu
>>
>>
>>
>>
>>
>> On Wed, Feb 25, 2015 at 9:53 PM, shalu jhanwar <shalu.jhanwa...@gmail.com> wrote:
>>
>>>  Hi guys!
>>>
>>>  Ahh, ok, I will check it and confirm.
>>>
>>>  thanks!
>>>  Shalu
>>>
>>> On Wed, Feb 25, 2015 at 9:32 PM, Andy <t3k...@gmail.com> wrote:
>>>
>>>>  You fit the data again before calling predict_proba.
>>>> You did not fix the random seed, so the outcome of the fit will be
>>>> different and you can't expect it to be consistent.
>>>> Just remove the second call to fit.
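
A minimal sketch of the point above, assuming a synthetic dataset from make_classification in place of the real training and test files (the data, the split sizes, and the random_state values are illustrative assumptions, not taken from this thread). In current scikit-learn releases, a fitted RandomForestClassifier returns from predict the class with the highest predict_proba value, so the two should agree as long as both calls use the same fitted forest.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real training/test files (assumption).
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, y_train = X[:300], y[:300]
X_test = X[300:]

# Fit once, then reuse the same fitted forest for both calls.
clf = RandomForestClassifier(n_estimators=40)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_score = clf.predict_proba(X_test)

# Labels derived from the probabilities match y_pred row for row.
labels_from_proba = clf.classes_[np.argmax(y_score, axis=1)]
consistent = bool(np.array_equal(y_pred, labels_from_proba))
print("predict matches argmax(predict_proba):", consistent)

# Refitting without a seed generally builds a different forest, so
# probabilities from the second fit need not agree with the labels
# produced by the first fit.
y_score_refit = clf.fit(X_train, y_train).predict_proba(X_test)
labels_refit = clf.classes_[np.argmax(y_score_refit, axis=1)]
print("rows where the refit disagrees with y_pred:",
      int(np.sum(labels_refit != y_pred)))

# With random_state fixed, repeated fits are reproducible.
seeded = RandomForestClassifier(n_estimators=40, random_state=0)
proba_1 = seeded.fit(X_train, y_train).predict_proba(X_test)
proba_2 = seeded.fit(X_train, y_train).predict_proba(X_test)
reproducible = bool(np.array_equal(proba_1, proba_2))
print("seeded refits identical:", reproducible)
```

The number of disagreeing rows after the unseeded refit varies from run to run, which is why no fixed value is shown for it.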
>>>>
>>>>
>>>>
>>>> On 02/25/2015 06:35 AM, shalu jhanwar wrote:
>>>>
>>>>  Hey Guys,
>>>>
>>>>  I am using a random forest classifier to perform binary classification
>>>> on my dataset. I wanted a confidence value for both classes for each
>>>> sample, so I used the "predict_proba" method to predict class
>>>> probabilities for the X samples.
>>>> I saw 2-3 strange observations in my samples, as below:
>>>>
>>>>  S.No.  Y_true  Y_predicted_forest  Class_0_prob  Class_1_prob
>>>>  1.     1       0                   0.28          0.72
>>>>  2.     0       1                   0.56          0.44
>>>>
>>>>  Here, based on the class probabilities, the algorithm should have
>>>> predicted the true class. But it gave wrong predictions even though the
>>>> probability of the true class was the higher one.
>>>>
>>>>  Can anyone please explain this strange observation, where the predicted
>>>> probability of class 0 is higher than that of class 1 but the output is
>>>> class 1, and vice versa?
>>>>
>>>>  For further details, here is the relevant chunk of my code:
>>>>    # For Random Forest
>>>>    clf = RandomForestClassifier(n_estimators=40)
>>>>    scores = clf.fit(X_train, y_train).score(X_test, y_test)
>>>>    y_pred = clf.predict(X_test)
>>>>    # Get proba for each class:
>>>>    y_score = clf.fit(X_train, y_train).predict_proba(X_test)
>>>>    # Get value of each class as:
>>>>    #   y_score[:,0]  for class 0
>>>>    #   y_score[:,1]  for class 1
>>>>
>>>>  thanks!
>>>>  Shalu
>>>>
>>>>
>>>>   
>>>> ------------------------------------------------------------------------------
>>>> Dive into the World of Parallel Programming The Go Parallel Website, sponsored
>>>> by Intel and developed in partnership with Slashdot Media, is your hub for all
>>>> things parallel software development, from weekly thought leadership blogs to
>>>> news, videos, case studies, tutorials and more. Take a look and join the
>>>> conversation now. http://goparallel.sourceforge.net/
>>>> _______________________________________________
>>>> Scikit-learn-general mailing list
>>>> Scikit-learn-general@lists.sourceforge.net
>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>
>>
>>
>
#!/usr/bin/python

import sys
import getopt
import numpy as np
import pylab as pl
import random
from sklearn import datasets
from StringIO import StringIO
from scipy import interp
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn import preprocessing
from sklearn import linear_model
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score
from sklearn import cross_validation
from sklearn.cross_validation import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import fbeta_score
from sklearn.ensemble import ExtraTreesClassifier
import os
os.getcwd()

def main(argv):
    # Get options passed at the command line:
    #   -d training file, -D output directory, -c label column,
    #   -C comma-separated feature columns, -g test file
    try:
        opts, args = getopt.getopt(argv, "d:D:c:C:g:")
    except getopt.GetoptError:
        #print helpString
        sys.exit(2)
    #print opts
    for opt, arg in opts:
        if opt == '-d':
            data_file = arg
        elif opt == '-D':
            out_folder = arg
        elif opt == '-c':
            label_col = int(arg)
        elif opt == '-C':
            data_cols = arg
        elif opt == '-g':
            test_file = arg

    print "hiiiii", data_file, out_folder, label_col, data_cols, test_file
    data_cols = [int(x) for x in data_cols.split(",")]
    x_data = np.loadtxt(data_file, usecols=data_cols, delimiter = "\t")
    y_data = np.genfromtxt(data_file,  usecols = label_col, delimiter = "\t")
    
    test_x_data = np.loadtxt(test_file, usecols=data_cols, delimiter = "\t")
    test_y_data = np.genfromtxt(test_file,  usecols = label_col, delimiter = "\t")
    
    #same Scaling on both test and train data (min max scaling)
    #min_max_scaler = preprocessing.MinMaxScaler()
    #x_data = min_max_scaler.fit_transform(x_data)
    #test_x_data = min_max_scaler.transform(test_x_data)
    
    #same Scaling on both test and train data (centering the data scaling)
    scaler = preprocessing.StandardScaler()
    x_data = scaler.fit_transform(x_data)
    test_x_data = scaler.transform(test_x_data)
    
    np.random.seed(0)
    indices = np.random.permutation(len(x_data))
    X_train = x_data[indices]
    y_train = y_data[indices]
    
    np.random.seed(0)
    indices = np.random.permutation(len(test_x_data))
    X_test = test_x_data[indices]
    y_test = test_y_data[indices]
    
    #Take enhancer names for the testset:
    cols = 0
    with open (test_file,"r") as f:
        a =  '\n'.join(line.strip("\n") for line in f)
        b = np.genfromtxt(StringIO(a), usecols = cols, delimiter="\t", dtype=None)
        enhancer_names = b[indices]
    #enhancer_name = np.genfromtxt(test_file,  usecols = colnames, delimiter = " ")
    #enhancer_name = enhancer_name[indices]
    #print enhancer_name, "\n"
    
    #i) Perform the analysis with all the dataset
    
    #For Random Forest
    clf = RandomForestClassifier(n_estimators=40)
    #clf = RandomForestClassifier() #For Poeya as default n_estimator is best suited
    scores = clf.fit(X_train, y_train).score(X_test, y_test)
    print "train length", len(X_train), "\n"
    print "ytrain", len(y_train), "\n"
    print "X_test", len(X_test), "\n"
    print "y_test", len(y_test), "\n"
    #Perform model evaluation by calculating the accuracy
    #For Random Forest
    
    y_pred = clf.predict(X_test)
    print "y_pred", y_pred
    y_score = clf.predict_proba(X_test)
    #NOTE: rounding to 2 decimals here also affects the ROC/AUC values computed below
    y_score = np.around(y_score, decimals=2)
    
    accurate = accuracy_score(y_test, y_pred)
    print "Accuracy All dataset: ", accurate
    prec = precision_score(y_test, y_pred, average='micro')
    rec = recall_score(y_test, y_pred, average='micro')
    fscore = fbeta_score(y_test, y_pred, average='micro', beta=0.5)
    areaRoc = roc_auc_score(y_test, y_score[:,1])
    
    #Generate the ROC curve for the independent test set
    fpr, tpr, thresholds = roc_curve(y_test, y_score[:,1], pos_label = 1)  #pos_label marks the positive class
    precision, recall, threshold = precision_recall_curve(y_test, y_score[:,1])
    
    random_mean_auc_10 = auc(fpr, tpr)
    
    print "######################################################\n"
    print "#####################\n"
    print "Score All dataset: accuracy\tprecision\trecall\tfscore\tareaRoc\n"
    print "Score All dataset:", accurate, "\t", prec, "\t", rec, "\t", fscore, "\t", areaRoc, "\n"
    
    print "Roc area", areaRoc, "mean AUC", np.mean(areaRoc)
    
    
    #Print y_pred and enhancer_names in a file
    #combined = zip(enhancer_names, y_test, y_pred)
    combined = zip(enhancer_names, y_test, y_pred, y_score[:,0], y_score[:,1])
    #Use a with-block so the file is flushed and closed before the plot window opens
    with open(out_folder + "/output.txt", 'w') as f:
        f.write("Enhancer_name\tY_true_labels\tY_predicted_labels\tProb_Class0\tProb_class1\n")
        for i in combined:
            line = '\t'.join(str(x) for x in i)
            f.write(line + '\n')
    
    plt.plot([0, 1], [0, 1], '--', color=(0.6, 0.6, 0.6), label='Standard')
    plt.plot(fpr, tpr, '--', label='RF_ROC_all_data (area = %0.2f)' % random_mean_auc_10, lw=2, color=(0.45, 0.42, 0.18)) #Plot the ROC curve for the test set
    
    plt.xlim([-0.05, 1.05])
    plt.ylim([-0.05, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    
    pl.savefig(out_folder + "/117_misclassified_RF_FANTOM_validation_featureSelected_top5_tssDist_H3K27me3_H3K36me3.png", transparent=True, bbox_inches='tight', pad_inches=0.2)
    
    plt.show()


if __name__ == "__main__":
    main(sys.argv[1:])
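
As a rough way to locate discrepant entries, the sketch below scans a file with the same tab-separated header the script writes to output.txt and reports rows whose predicted label differs from the class with the larger probability. It assumes the labels are 0 and 1 and that Prob_Class0 and Prob_class1 correspond to those classes in that order; the helper name find_mismatches and the three sample rows are hypothetical and are not taken from the real output.

```python
import csv
import io


def find_mismatches(handle):
    """Yield rows whose predicted label is not the more probable class."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        p0 = float(row["Prob_Class0"])
        p1 = float(row["Prob_class1"])
        if p0 == p1:
            # A tie (possibly created by rounding) is not evidence either way.
            continue
        implied = 1 if p1 > p0 else 0
        # Labels may be written as floats such as "1.0", hence float() first.
        predicted = int(float(row["Y_predicted_labels"]))
        if predicted != implied:
            yield row["Enhancer_name"], predicted, p0, p1


# Made-up rows in the output.txt layout (hypothetical data).
sample = (
    "Enhancer_name\tY_true_labels\tY_predicted_labels\tProb_Class0\tProb_class1\n"
    "region_a\t1.0\t0.0\t0.28\t0.72\n"
    "region_b\t0.0\t0.0\t0.68\t0.32\n"
    "region_c\t1.0\t1.0\t0.5\t0.5\n"
)

mismatches = list(find_mismatches(io.StringIO(sample)))
for name, predicted, p0, p1 in mismatches:
    print(name, predicted, p0, p1)
```

To run it on a real file, open the file and pass the handle to find_mismatches in place of the io.StringIO sample.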