I am trying to build a Pipeline that combines CountVectorizer,
TfidfTransformer and a random forest. However, the output of the second step
is a sparse matrix, and the random forest requires a dense one. How can I add a
step that converts the matrix from sparse to dense, using something along the
lines of data.toarray()? Additionally, I would like to append some extra
features to the dataset after the text has been processed. How can I create a
step for this (normally I would use something like hstack)? My code is as
follows:

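from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(SVC(probability=True))),
])

I would like to adjust this somehow to the following: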

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('change_to_dense', SOME HOW CHANGE TO DENSE),
    ('add_more_data', SOME HOW ADD FEATURES),
    ('clf', OneVsRestClassifier(SVC(probability=True))),
])
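
To make the 'change_to_dense' placeholder concrete, here is a minimal sketch of
what such a step might look like. It assumes a scikit-learn version that ships
FunctionTransformer; the helper name to_dense and the variable names are
hypothetical, and the classifier step is left out so that only the conversion
is shown.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def to_dense(X):
    # Hypothetical helper: densify a SciPy sparse matrix, pass dense input through.
    return X.toarray() if hasattr(X, "toarray") else X


text_to_dense = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    # accept_sparse=True lets the sparse tf-idf output reach to_dense.
    ('change_to_dense', FunctionTransformer(to_dense, accept_sparse=True)),
])

docs = ['This is the first sentence',
        'This is the second sentence',
        'This is the third sentence']

# One row per sentence, one column per vocabulary term, as a NumPy array.
dense = text_to_dense.fit_transform(docs)
print(dense.shape)
```

A dense copy of a large vocabulary can use a lot of memory, so this is probably
only practical for small or moderately sized feature spaces.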

My first dataset, let's call it data1, is just an array of sentences. Below is
an example:

data1 = ['This is the first sentence',
         'This is the second sentence',
         'This is the third sentence']

The second dataset is numerical data of the following form:

from numpy import array

data2 = array([[0],
               [1],
               [0]])
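
Since a pipeline step only ever receives X, one possible way to get the second
dataset into the pipeline is to carry both datasets in a single table and let a
ColumnTransformer stack the results. The sketch below is only an illustration
under some assumptions: pandas is available, scikit-learn is recent enough to
provide sklearn.compose.ColumnTransformer, and the column names 'text' and
'extra' as well as the variable names are made up for this example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

data1 = ['This is the first sentence',
         'This is the second sentence',
         'This is the third sentence']
data2 = [[0], [1], [0]]

# Hypothetical layout: one text column and one numeric column per row.
frame = pd.DataFrame({'text': data1, 'extra': [row[0] for row in data2]})

text_steps = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
])

features = ColumnTransformer(
    transformers=[
        # A plain string selects a 1-D column, which CountVectorizer expects.
        ('text', text_steps, 'text'),
        # A list selects a 2-D block that is passed through unchanged.
        ('add_more_data', 'passthrough', ['extra']),
    ],
    sparse_threshold=0,  # always return a dense array
)

# tf-idf columns first, followed by the extra numeric column.
combined = features.fit_transform(frame)
print(combined.shape)
```

With sparse_threshold=0 the stacked output is already dense, so in this variant
a separate 'change_to_dense' step would not be needed, and features could be
placed directly in front of the 'clf' step.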


Thanks!