I am trying to use a Pipeline that combines a CountVectorizer, a
TfidfTransformer and a random forest. However, the output of the second step is
a sparse matrix, and the random forest requires a dense one. How can I add a
step that converts the matrix from sparse to dense, using something along the
lines of data.toarray()? Additionally, I would like to add some extra features
to the dataset after the text has been processed. How can I create a step for
this (normally I would use something like hstack)? My code is as follows:
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(SVC(probability=True))),
])
I would like to adjust this somehow to the following:
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('change_to_dense', SOME HOW CHANGE TO DENSE),
    ('add_more_data', SOME HOW ADD FEATURES),
    ('clf', OneVsRestClassifier(SVC(probability=True))),
])
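For the dense step, something like the following is roughly what I have in
mind. This is only a sketch: DenseTransformer is a name I made up (it is not
part of scikit-learn), and I am assuming that a small class with fit and
transform methods is enough for the Pipeline to accept it as a step. It does
not cover the add_more_data step.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC


class DenseTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical helper: converts a scipy sparse matrix to a dense array."""

    def fit(self, X, y=None):
        # Nothing to learn; the step is stateless.
        return self

    def transform(self, X):
        # Assumes X is a scipy sparse matrix, as produced by TfidfTransformer.
        return X.toarray()


pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('change_to_dense', DenseTransformer()),
    ('clf', OneVsRestClassifier(SVC(probability=True))),
])
```

Note that toarray() materializes the full matrix, so this may not be practical
for a large vocabulary.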
My first dataset, let's call it data1, is just an array of sentences. Below is
an example:
data1 = ['This is the first sentence',
         'This is the second sentence',
         'This is the third sentence']
The second dataset is numerical data of the following form:
data2 = array([[0],
               [1],
               [0]])
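Outside of a pipeline, the hstack approach I mentioned looks roughly like the
sketch below. It uses the two example datasets above; the variable names
counts, tfidf and combined are only for illustration, and I am assuming that
scipy.sparse.hstack is the right tool for mixing the sparse text features with
the dense numeric column.

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

data1 = ['This is the first sentence',
         'This is the second sentence',
         'This is the third sentence']
data2 = np.array([[0],
                  [1],
                  [0]])

# Text features: token counts followed by tf-idf weighting (both sparse).
counts = CountVectorizer().fit_transform(data1)
tfidf = TfidfTransformer().fit_transform(counts)

# Append the numeric column to the text features, then convert to dense.
combined = hstack([tfidf, data2]).toarray()
```

What I am missing is a way to express these last two operations as pipeline
steps.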
Thanks!