Hi Tom,

This was also the first thing that came to my mind, but I thought that since your_df is X_train + X_test, it might complain that the values do not match the given indices.
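
For reference, a minimal sketch of the alignment behaviour in question, using toy data rather than the proxy logs. The names test_idx and y_pred are illustrative only, and y_pred stands in for clf.predict(X_test). It assumes the row labels of the test rows were kept at split time, since a sparse X_test carries no .index of its own. Assigning a Series aligns on the index and leaves the training rows as NaN, whereas a bare array shorter than the frame would raise a length-mismatch ValueError.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the full frame M (train + test rows together).
M = pd.DataFrame({"act": ["OBSERVED", "BLOCKED", "OBSERVED", "BLOCKED", "OBSERVED"]})

# Hypothetical split: remember the row labels of the held-out rows.
# In the thread X is a sparse matrix, so it has no .index to fall back on.
rng = np.random.default_rng(0)
shuffled = rng.permutation(M.index.to_numpy())
test_idx = shuffled[:2]  # labels of the test rows

# Stand-in for clf.predict(X_test): one label per test row, in test_idx order.
y_pred = np.array(["OBSERVED", "BLOCKED"])

# A Series carries its own index, so the assignment aligns on labels;
# rows whose label is not in test_idx are left as NaN.
M["prediction"] = pd.Series(y_pred, index=test_idx)

# Only the test rows have a prediction to investigate.
test_rows = M[M["prediction"].notna()]
```

With scikit-learn, train_test_split accepts any number of arrays, so the frame's index could be passed alongside X and Y to obtain the test labels in the same shuffled order.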

Thanks,
Ruchika

On Thu, Jul 20, 2017 at 12:19 PM, Tom Augspurger <tom.augspurge...@gmail.com> wrote:

> Something like
>
>     your_df['prediction'] = pd.Series(clf.predict(X_test),
>                                       index=X_test.index)
>
> should handle all the alignment.
>
> On Thu, Jul 20, 2017 at 11:04 AM, Ruchika Nayyar <ruchika.w...@gmail.com>
> wrote:
>
>> The original dataset contains both training and testing rows; I have
>> predictions only on the testing dataset. If I do what you suggest,
>> will it preserve indexing?
>>
>> Thanks,
>> Ruchika
>>
>>
>> On Thu, Jul 20, 2017 at 11:37 AM, Julio Antonio Soto de Vicente <
>> ju...@esbet.es> wrote:
>>
>>> Hi Ruchika,
>>>
>>> The predictions outputted by all sklearn models are just 1-d Numpy
>>> arrays, so it should be trivial to add it to any existing DataFrame:
>>>
>>>     your_df["prediction"] = clf.predict(X_test)
>>>
>>> --
>>> Julio
>>>
>>> On 20 Jul 2017, at 17:23, Ruchika Nayyar <ruchika.w...@gmail.com>
>>> wrote:
>>>
>>> Hi Scikit-learn Users,
>>>
>>> I am analyzing some proxy logs to use Machine learning to classify the
>>> events recorded as either "OBSERVED" or "BLOCKED". This is a little snippet
>>> of my code:
>>> The input file is a csv with tokenized string fields.
>>>
>>> **************
>>> # imports assumed by this snippet (not shown in the original message)
>>> import pandas as pd
>>> from scipy.sparse import hstack, csr_matrix
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.model_selection import train_test_split as tts
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.neural_network import MLPClassifier
>>>
>>> # load the file
>>> M = pd.read_csv("output100k.csv").fillna('')
>>>
>>> # define the fields to use
>>> min_df = 0.001
>>> max_df = .7
>>> TxtCols = ['request__tokens', 'requestClientApplication__tokens',
>>>            'destinationZoneURI__tokens', 'cs-categories__tokens',
>>>            'fileType__tokens', 'requestMethod__tokens', 'tcp_status1',
>>>            'app', 'tcp_status2', 'dhost'
>>>            ]
>>> NumCols = ['rt', 'out', 'in', 'time-taken', 'rt_length', 'dt_length']
>>>
>>> # vectorize the fields
>>> TfidfModels = [TfidfVectorizer(min_df=min_df, max_df=max_df).fit(M[t])
>>>                for t in TxtCols]
>>>
>>> # define the columns of sparse matrix
>>> X = hstack([m.transform(M[n].fillna('')) for m, n in zip(TfidfModels, TxtCols)] +
>>>            [csr_matrix(pd.to_numeric(M[n]).fillna(-1).values).T for n in NumCols])
>>>
>>> # target variable
>>> Y = M.act.values
>>>
>>> ## Define train/test parts and scale them
>>> X_train, X_test, y_train, y_test = tts(X, Y, test_size=0.2)
>>> scaler = StandardScaler(with_mean=False, with_std=True)
>>> scaler.fit(X_train)
>>> X_train = scaler.transform(X_train)
>>> X_test = scaler.transform(X_test)
>>>
>>>
>>> # define the model and train
>>> clf = MLPClassifier(activation='logistic',
>>>                     solver='lbfgs').fit(X_train, y_train)
>>> # use the model to predict on X_test and convert into a data frame
>>> df = pd.DataFrame(clf.predict(X_test))
>>>
>>> **
>>>
>>> 199845 OBSERVED
>>> 199846 OBSERVED
>>>
>>> [199847 rows x 1 columns]>
>>>
>>> **
>>>
>>> Now at the end I have a DataFrame with 20K entries with just one column,
>>> "Label". How do I connect it to the main dataframe M, since I want to do
>>> some investigations on this outcome?
>>>
>>> Any help?
>>>
>>> Thanks,
>>> Ruchika
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn