I am trying to use a Pipeline that combines CountVectorizer, TfidfTransformer and a random forest. However, the output of the second step is a sparse matrix, and the random forest requires a dense one. How can I add a step that converts the matrix from sparse to dense, using something along the lines of data.toarray()? Additionally, I would like to add some extra features to the dataset after the text has been processed. How can I create a step for this (normally I would use something like hstack)? My code is as follows:

    pipeline = Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', OneVsRestClassifier(SVC(probability=True))),
    ])

I would like to adjust this somehow to the following:

    pipeline = Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('change_to_dense', SOME HOW CHANGE TO DENSE),
        ('add_more_data', SOME HOW ADD FEATURES),
        ('clf', OneVsRestClassifier(SVC(probability=True))),
    ])

My first dataset, let's call it data1, is just an array of sentences. Below is an example:

    data1 = ['This is the first sentence',
             'This is the second sentence',
             'This is the third sentence']

The second dataset is numerical data of the following form:

    data2 = array([[0], [1], [0]])

Thanks!

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
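
One possible way to get both steps is sketched below. It is not the only approach, and it rests on an assumption that is not in the question: each sample is passed into the pipeline as a (sentence, extra_features) pair, so that the extra columns travel with the text through any train/test split. The helper names get_text, get_extra and to_dense are made up for this sketch. FeatureUnion does the hstack, and a FunctionTransformer wraps the toarray() call. The order differs slightly from the one asked for: the extra features are stacked first and the dense conversion comes last, so the whole matrix is converted once.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.svm import SVC


def get_text(rows):
    """Hypothetical helper: pull the sentence out of each (text, extra) pair."""
    return [text for text, _ in rows]


def get_extra(rows):
    """Hypothetical helper: stack the numeric part of each pair into a 2-D array."""
    return np.array([extra for _, extra in rows], dtype=float)


def to_dense(X):
    """Convert a scipy sparse matrix to a dense ndarray (no-op if already dense)."""
    return X.toarray() if sparse.issparse(X) else np.asarray(X)


# Text branch: the two steps from the original pipeline, fed only the sentences.
text_features = Pipeline([
    ('select', FunctionTransformer(get_text)),
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
])

pipeline = Pipeline([
    # FeatureUnion stacks the outputs of its branches column-wise.
    ('features', FeatureUnion([
        ('text', text_features),
        ('extra', FunctionTransformer(get_extra)),
    ])),
    ('change_to_dense', FunctionTransformer(to_dense, accept_sparse=True)),
    ('clf', OneVsRestClassifier(SVC(probability=True))),
])

data1 = ['This is the first sentence',
         'This is the second sentence',
         'This is the third sentence']
data2 = np.array([[0], [1], [0]])

# Assumed input layout: one (sentence, extra_features) pair per sample.
rows = list(zip(data1, data2))

# Run every step except the classifier to inspect the feature matrix.
dense = pipeline[:-1].fit_transform(rows)
print(type(dense), dense.shape)
```

With labels y (none are given in the question), pipeline.fit(rows, y) would then train the classifier on the combined dense matrix. The dense step is kept because the question asks for it; some estimators accept sparse input directly, in which case it can be dropped.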