Hi scikit-learn users,

I am analyzing proxy logs and want to use machine learning to classify each recorded event as either "OBSERVED" or "BLOCKED". The input file is a CSV with tokenized string fields. Here is a snippet of my code:
**************
# imports (assumed from the names used below)
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split as tts
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# load the file
M = pd.read_csv("output100k.csv").fillna('')

# define the fields to use
min_df = 0.001
max_df = .7
TxtCols = ['request__tokens', 'requestClientApplication__tokens',
           'destinationZoneURI__tokens', 'cs-categories__tokens',
           'fileType__tokens', 'requestMethod__tokens',
           'tcp_status1', 'app', 'tcp_status2', 'dhost']
NumCols = ['rt', 'out', 'in', 'time-taken', 'rt_length', 'dt_length']

# vectorize the text fields, one TF-IDF model per column
TfidfModels = [TfidfVectorizer(min_df=min_df, max_df=max_df).fit(M[t])
               for t in TxtCols]

# define the columns of the sparse matrix
X = hstack([m.transform(M[n].fillna('')) for m, n in zip(TfidfModels, TxtCols)] +
           [csr_matrix(pd.to_numeric(M[n]).fillna(-1).values).T for n in NumCols])

# target variable
Y = M.act.values

# define train/test parts and scale them
X_train, X_test, y_train, y_test = tts(X, Y, test_size=0.2)
scaler = StandardScaler(with_mean=False, with_std=True)
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# define the model and train
clf = MLPClassifier(activation='logistic', solver='lbfgs').fit(X_train, y_train)

# use the model to predict on X_test and convert into a data frame
df = pd.DataFrame(clf.predict(X_test))
**************

**
199845 OBSERVED
199846 OBSERVED
[199847 rows x 1 columns]>
**

Now at the end I have a DataFrame with 20K entries and just one column, "Label". How do I connect it back to the main DataFrame M? I want to do some investigation on this outcome.

Any help?

Thanks,
Ruchika
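One possible way to keep the link between predictions and rows of M, sketched here under assumptions rather than tested on the real data: `train_test_split` accepts any number of equal-length arrays, so the row index of M can be split together with X and Y. The toy frame, the `DummyClassifier`, and the names `idx_train`, `idx_test`, `pred`, and `test_rows` are hypothetical stand-ins; the same pattern should apply to the sparse X and the `MLPClassifier` above. Note that `pd.DataFrame(clf.predict(X_test))` gets a fresh 0..n-1 index, which is why the result no longer lines up with M.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Toy stand-in for the real frame M (hypothetical data, not the proxy logs).
M = pd.DataFrame({
    "rt": np.arange(10, dtype=float),
    "act": ["OBSERVED", "BLOCKED"] * 5,
})
X = M[["rt"]].values
Y = M["act"].values

# Split the row labels of M alongside X and Y so that each test row
# can be traced back to its original position in M.
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
    X, Y, M.index.values, test_size=0.2, random_state=0
)

# Any classifier can be used here; DummyClassifier keeps the sketch fast.
clf = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# Give the predictions the original row labels as their index, then join
# them onto the matching rows of M.
pred = pd.Series(clf.predict(X_test), index=idx_test, name="Label")
test_rows = M.loc[idx_test].join(pred)
print(test_rows)
```

With the predictions indexed this way, `test_rows` holds the original columns of M next to the predicted "Label", so misclassified events can be filtered with something like `test_rows[test_rows["act"] != test_rows["Label"]]`.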
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn