Hello! I have a DataFrame with a column of text, and I would like to vectorize the text using CountVectorizer. However, the text includes missing values, and so I would like to impute a constant value (for any missing values) before vectorizing.
My initial thought was to create a Pipeline of SimpleImputer (with strategy='constant') and CountVectorizer. However, SimpleImputer outputs a 2D array and CountVectorizer requires 1D input.

The only solution I have found is to insert a transformer into the Pipeline that reshapes the output of SimpleImputer from 2D to 1D before it is passed to CountVectorizer. (You can find my code at the bottom of this message.)

My question: Is there a more elegant solution to this problem than what I'm currently doing?

Notes:

- I realize that the missing values could be filled in pandas. However, I would like to accomplish all preprocessing in scikit-learn so that the same preprocessing can be applied via Pipeline to out-of-sample data.

- I recall seeing a GitHub issue in which Andy proposed that CountVectorizer should allow 2D input as long as the second dimension is 1 (in other words: a single column of data). This modification to CountVectorizer would be a great long-term solution to my problem. However, I'm looking for a solution that would work in the current version of scikit-learn.

Thank you so much for any feedback or ideas!

Kevin

== START OF CODE EXAMPLE ==

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# One text column with a missing value in the last row
df = pd.DataFrame({'text':['abc def', 'abc ghi', np.nan]})

# Fill missing values with a constant
imp = SimpleImputer(strategy='constant')

# Reshape the 2D output of SimpleImputer to 1D for CountVectorizer
one_dim = FunctionTransformer(np.reshape, kw_args={'newshape':-1})

vect = CountVectorizer()

pipe = make_pipeline(imp, one_dim, vect)

# Pass a single-column DataFrame (2D), as SimpleImputer expects
pipe.fit_transform(df[['text']]).toarray()

== END OF CODE EXAMPLE ==

--
Kevin Markham
Founder, Data School
https://www.dataschool.io
https://www.youtube.com/dataschool
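
P.S. In case it helps anyone reproduce the issue, here is a minimal sketch of the shape mismatch described above. It is only an illustration, not a proposed fix: it runs the imputation and the flattening as separate steps outside of a Pipeline so that the intermediate shapes are visible. The fill_value='missing' placeholder and the variable names are my own choices for this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer

# One text column with a missing value in the last row
df = pd.DataFrame({'text': ['abc def', 'abc ghi', np.nan]})

# SimpleImputer expects 2D input and returns a 2D array.
# 'missing' is an arbitrary placeholder chosen for this sketch.
imp = SimpleImputer(strategy='constant', fill_value='missing')
imputed = imp.fit_transform(df[['text']])
print(imputed.shape)   # 2D: one row per sample, one column

# CountVectorizer expects a 1D iterable of strings, so flatten first
flat = imputed.reshape(-1)
print(flat.shape)      # 1D: one entry per sample

# With 1D input, the vectorizer builds one row of counts per document
counts = CountVectorizer().fit_transform(flat)
print(counts.shape)
```

The explicit reshape here plays the same role as the FunctionTransformer step in my Pipeline: it is the only thing standing between the imputer's single-column 2D output and the 1D input the vectorizer needs.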
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn