Hello!

I have a DataFrame with a column of text, and I would like to vectorize the
text using CountVectorizer. However, the column contains missing values, so
I would like to impute a constant value in their place before vectorizing.

My initial thought was to create a Pipeline of SimpleImputer (with
strategy='constant') and CountVectorizer. However, SimpleImputer outputs a
2D array of shape (n_samples, 1), whereas CountVectorizer requires 1D input
(an iterable of strings).
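To make the mismatch concrete, here is a minimal sketch of the shapes
involved. The names imputed and flat are only illustrative, and dtype=object
is my assumption for keeping the column as plain Python strings:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer

# Single text column with one missing value (object dtype is an assumption)
df = pd.DataFrame({'text': pd.Series(['abc def', 'abc ghi', np.nan], dtype=object)})

# With strategy='constant' and no fill_value, strings are filled with 'missing_value'
imp = SimpleImputer(strategy='constant')
imputed = imp.fit_transform(df[['text']])
print(imputed.shape)  # (3, 1): one row per sample, one column

# Flattening gives the 1D sequence of strings that CountVectorizer expects
flat = imputed.ravel()
print(flat.shape)  # (3,)

vect = CountVectorizer()
dtm = vect.fit_transform(flat)
print(sorted(vect.vocabulary_))
```

The imputed placeholder ends up as a token in the vocabulary, so rows that
were missing are still distinguishable after vectorizing.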

The only solution I have found is to insert a transformer into the Pipeline
that reshapes the output of SimpleImputer from 2D to 1D before it is passed
to CountVectorizer. (You can find my code at the bottom of this message.)

My question: Is there a more elegant solution to this problem than what I'm
currently doing?

Notes:

- I realize that the missing values could be filled in pandas. However, I
would like to accomplish all preprocessing in scikit-learn so that the same
preprocessing can be applied via Pipeline to out-of-sample data.

- I recall seeing a GitHub issue in which Andy proposed that
CountVectorizer should allow 2D input as long as the second dimension is 1
(in other words: a single column of data). This modification to
CountVectorizer would be a great long-term solution to my problem. However,
I'm looking for a solution that would work in the current version of
scikit-learn.

Thank you so much for any feedback or ideas!

Kevin

== START OF CODE EXAMPLE ==

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

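# One text column with a missing value
df = pd.DataFrame({'text':['abc def', 'abc ghi', np.nan]})

# Fill missing values with a constant; output is 2D with shape (n_samples, 1)
imp = SimpleImputer(strategy='constant')
# Reshape the 2D imputer output to 1D, shape (n_samples,)
one_dim = FunctionTransformer(np.reshape, kw_args={'newshape':-1})
vect = CountVectorizer()

# Impute, then flatten, then vectorize
pipe = make_pipeline(imp, one_dim, vect)

# Pass a one-column DataFrame (2D), since SimpleImputer requires 2D input
pipe.fit_transform(df[['text']]).toarray()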

== END OF CODE EXAMPLE ==
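
For completeness, here is a minimal sketch of why I want everything inside
the Pipeline: the fitted pipeline can be applied as-is to out-of-sample data.
This variant swaps np.reshape for np.ravel as the flattening step, which is
my assumption of an equivalent alternative, and new_df is a hypothetical
out-of-sample DataFrame:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

# Training data (dtype=object is an assumption to keep plain Python strings)
df = pd.DataFrame({'text': pd.Series(['abc def', 'abc ghi', np.nan], dtype=object)})

pipe = make_pipeline(
    SimpleImputer(strategy='constant'),  # fills NaN with 'missing_value'
    FunctionTransformer(np.ravel),       # (n_samples, 1) -> (n_samples,)
    CountVectorizer(),
)
pipe.fit(df[['text']])

# Hypothetical out-of-sample data, including an unseen word and a missing value
new_df = pd.DataFrame({'text': pd.Series(['abc xyz', np.nan], dtype=object)})
new_dtm = pipe.transform(new_df[['text']]).toarray()
print(new_dtm)
```

The unseen word 'xyz' is ignored because it is not in the fitted vocabulary,
and the missing value is counted under the imputed placeholder token.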

-- 
Kevin Markham
Founder, Data School
https://www.dataschool.io
https://www.youtube.com/dataschool
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
