Something like

    your_df['prediction'] = pd.Series(clf.predict(X_test), index=X_test.index)

should handle all the alignment, as long as X_test is still a pandas object that carries the original row index.
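In the pipeline quoted below, though, X is a scipy sparse matrix built with hstack, so X_test has no pandas index of its own; the row labels have to be carried through the split explicitly. A minimal sketch, assuming M keeps its default integer index from read_csv (the idx_train / idx_test names are just illustrative, not from the original code):

    # split the row labels of M together with X and Y, so we know which
    # rows of M ended up in the test set
    X_train, X_test, y_train, y_test, idx_train, idx_test = tts(
        X, Y, M.index, test_size=0.2)

    # ... scaling and model training exactly as in the quoted code ...

    # write the test-set predictions back into M; rows used for training
    # simply stay NaN in the new column
    M.loc[idx_test, 'prediction'] = clf.predict(X_test)

That keeps everything in one DataFrame, so predicted and actual labels can be compared row by row.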
On Thu, Jul 20, 2017 at 11:04 AM, Ruchika Nayyar <ruchika.w...@gmail.com> wrote:

> The original dataset contains both training/testing; I have predictions
> only on the testing dataset. If I do what you suggest, will it preserve
> the indexing?
>
> Thanks,
> Ruchika
>
> On Thu, Jul 20, 2017 at 11:37 AM, Julio Antonio Soto de Vicente
> <ju...@esbet.es> wrote:
>
>> Hi Ruchika,
>>
>> The predictions output by all sklearn models are just 1-d NumPy arrays,
>> so it should be trivial to add them to any existing DataFrame:
>>
>> your_df["prediction"] = clf.predict(X_test)
>>
>> --
>> Julio
>>
>> On 20 Jul 2017, at 17:23, Ruchika Nayyar <ruchika.w...@gmail.com> wrote:
>>
>> Hi Scikit-learn Users,
>>
>> I am analyzing some proxy logs and want to use machine learning to
>> classify the events recorded as either "OBSERVED" or "BLOCKED". This is
>> a little snippet of my code. The input file is a csv with tokenized
>> string fields.
>>
>> **************
>> # load the file
>> M = pd.read_csv("output100k.csv").fillna('')
>>
>> # define the fields to use
>> min_df = 0.001
>> max_df = .7
>> TxtCols = ['request__tokens', 'requestClientApplication__tokens',
>>            'destinationZoneURI__tokens', 'cs-categories__tokens',
>>            'fileType__tokens', 'requestMethod__tokens', 'tcp_status1',
>>            'app', 'tcp_status2', 'dhost']
>> NumCols = ['rt', 'out', 'in', 'time-taken', 'rt_length', 'dt_length']
>>
>> # vectorize the fields
>> TfidfModels = [TfidfVectorizer(min_df=min_df, max_df=max_df).fit(M[t])
>>                for t in TxtCols]
>>
>> # build the sparse feature matrix: tf-idf blocks plus numeric columns
>> X = hstack([m.transform(M[n].fillna('')) for m, n in zip(TfidfModels, TxtCols)] +
>>            [csr_matrix(pd.to_numeric(M[n]).fillna(-1).values).T for n in NumCols])
>>
>> # target variable
>> Y = M.act.values
>>
>> # define train/test parts and scale them
>> X_train, X_test, y_train, y_test = tts(X, Y, test_size=0.2)
>> scaler = StandardScaler(with_mean=False, with_std=True)
>> scaler.fit(X_train)
>> X_train = scaler.transform(X_train)
>> X_test = scaler.transform(X_test)
>>
>> # define the model and train
>> clf = MLPClassifier(activation='logistic', solver='lbfgs').fit(X_train, y_train)
>>
>> # use the model to predict on X_test and convert into a data frame
>> df = pd.DataFrame(clf.predict(X_test))
>> **
>>
>> 199845    OBSERVED
>> 199846    OBSERVED
>>
>> [199847 rows x 1 columns]
>> **
>>
>> Now at the end I have a DataFrame with 20K entries with just one column
>> "Label". How do I connect it to the main dataframe M, since I want to do
>> some investigations on this outcome?
>>
>> Any help?
>>
>> Thanks,
>> Ruchika
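And once the predictions are back in M (for example via the idx_test approach sketched above), the follow-up investigation can stay in pandas too. A quick confusion table of actual vs. predicted labels on the test rows might look like this (column and variable names are just illustrative, matching the sketch above):

    # compare actual labels ('act', the target column) with the
    # 'prediction' column on the held-out rows only
    test_rows = M.loc[idx_test]
    print(pd.crosstab(test_rows['act'], test_rows['prediction']))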
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn