Hello,

First, thanks for the fantastic scikit-learn library.

I have the following use case: for a classification problem, I have a
list of sentences and use word2vec and a method (e.g. mean, weighted
mean, or attention and mean) to transform sentences to vectors. Because
my dataset is very noisy, I may end up with sentences made up entirely
of words that are not part of word2vec, hence I can't vectorize them.
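
To make the problem concrete, here is a minimal sketch of the "mean"
variant. The function name mean_sentence_vector and the plain dict used
as the embedding table are hypothetical stand-ins for a real word2vec
model, not part of any library:

```python
def mean_sentence_vector(sentence, embeddings):
    """Average the vectors of the known words; return None if no word is known."""
    # Keep only the words that have an entry in the (toy) embedding table.
    vectors = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    if not vectors:
        # Nothing to average: this is the "bad sample" case.
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Toy embedding table standing in for word2vec (assumption for illustration).
embeddings = {"good": [1.0, 0.0], "movie": [0.0, 1.0]}

vec_ok = mean_sentence_vector("Good movie", embeddings)
vec_bad = mean_sentence_vector("xyzzy qwerty", embeddings)
```

Here vec_bad is None because none of its words are in the table, which
is exactly the kind of sample I cannot feed to a classifier.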

I would like to remove those sentences from my dataset X, but this would
mean also removing the corresponding target classes in y. AFAIK,
scikit-learn does not support this. I've seen a couple of
issues about that, but they all seem stalled:
https://github.com/scikit-learn/scikit-learn/issues/9630,
https://github.com/scikit-learn/scikit-learn/issues/3855,
https://github.com/scikit-learn/scikit-learn/pull/4552,
https://github.com/scikit-learn/scikit-learn/issues/4143

I would like to be able to search for hyper-parameters in a simple way,
so I really would like to be able to use a single pipeline taking text
as input.

My current conclusion is this:

  * the vectorizer should return None for bad samples (or a specific
    vector, like numpy.zeros, or add an extra column marking
    valid/invalid samples)
  * make all my transformers down the pipeline accept those entries
    and leave them untouched (can be done with a generic wrapper class)
  * have a wrapper around my classifier, to avoid fitting on those, like
    jnothman suggested here
    
https://github.com/scikit-learn/scikit-learn/issues/9630#issuecomment-325202441
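
A rough sketch of what I mean by the classifier wrapper, assuming bad
samples are marked with None. The names SkipInvalidClassifier and
default_label are hypothetical, and MajorityClassifier is only a toy
stand-in for a real scikit-learn classifier so that the example runs on
its own:

```python
class SkipInvalidClassifier:
    """Hypothetical wrapper: ignores samples whose vector is None."""

    def __init__(self, classifier, default_label=None):
        self.classifier = classifier
        # Label returned at predict time for samples that could not be vectorized.
        self.default_label = default_label

    def fit(self, X, y):
        # Drop invalid rows from X together with their targets in y.
        kept = [(x, label) for x, label in zip(X, y) if x is not None]
        if not kept:
            raise ValueError("no valid samples to fit on")
        X_valid, y_valid = zip(*kept)
        self.classifier.fit(list(X_valid), list(y_valid))
        return self

    def predict(self, X):
        # Predict only on valid rows; invalid rows keep the default label.
        valid_idx = [i for i, x in enumerate(X) if x is not None]
        predictions = [self.default_label] * len(X)
        if valid_idx:
            valid_preds = self.classifier.predict([X[i] for i in valid_idx])
            for i, pred in zip(valid_idx, valid_preds):
                predictions[i] = pred
        return predictions


class MajorityClassifier:
    """Toy stand-in for a real classifier (illustration only)."""

    def fit(self, X, y):
        self.label_ = max(set(y), key=list(y).count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


X = [[0.1, 0.2], None, [0.3, 0.1], [0.9, 0.8]]
y = ["a", "b", "a", "b"]

model = SkipInvalidClassifier(MajorityClassifier(), default_label="unknown")
model.fit(X, y)
predictions = model.predict([[0.2, 0.2], None])
```

To use something like this inside GridSearchCV, I believe the wrapper
would also have to inherit from sklearn.base.BaseEstimator so that
get_params and set_params are available, with the inner parameters
then addressed as classifier__<name>.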

It's a bit tedious, but I can see it working.

Is there a better suggestion?


_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
