You might use the new FunctionSampler from imblearn, which takes your heuristic as input and does the sampling for you.

http://contrib.scikit-learn.org/imbalanced-learn/dev/auto_examples/plot_outlier_rejections.html#sphx-glr-auto-examples-plot-outlier-rejections-py

It is compatible with the imblearn Pipeline, which handles samplers: it applies the resampling at fit time and does nothing at predict time.

Would it help?

Guillaume Lemaitre 
INRIA Saclay Ile-de-France / Equipe PARIETAL
guillaume.lemai...@inria.fr - https://glemaitre.github.io/
From: Alex Garel
Sent: Wednesday, 4 April 2018 13:35
To: scikit-learn@python.org
Reply To: Scikit-learn mailing list
Subject: [scikit-learn] Outliers removal

Hello,

First, thanks for the fantastic scikit-learn library.

I have the following use case: for a classification problem, I have a list of sentences and use word2vec plus an aggregation method (e.g. mean, weighted mean, or attention and mean) to transform sentences into vectors. Because my dataset is very noisy, I may end up with sentences made up entirely of words that are not in the word2vec vocabulary, so I can't vectorize them.

I would like to remove those sentences from my dataset X, but this would also mean removing the corresponding target classes in y. Afaik, scikit-learn does not implement this possibility. I've seen a couple of issues about that, but they all seem stalled: https://github.com/scikit-learn/scikit-learn/issues/9630, https://github.com/scikit-learn/scikit-learn/issues/3855, https://github.com/scikit-learn/scikit-learn/pull/4552, https://github.com/scikit-learn/scikit-learn/issues/4143

I would like to be able to search for hyper-parameters in a simple way, so I really would like to be able to use a single pipeline taking text as input.

My current conclusion is this:

  • vectorizer should return None for bad samples (or a specific vector, like numpy.zeros, or add an extra column marking valid/invalid samples)
  • make all my transformers down the pipeline accept those entries and leave them untouched (can be done with a generic wrapper class)
  • have a wrapper around my classifier, to avoid fitting on those, like jnothman suggested here https://github.com/scikit-learn/scikit-learn/issues/9630#issuecomment-325202441
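
The last point could look roughly like the sketch below. It assumes invalid samples are marked as all-zero vectors (one of the options from the first point); SkipInvalidClassifier is a hypothetical name, not an existing scikit-learn class:

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.linear_model import LogisticRegression


class SkipInvalidClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical wrapper that fits the inner classifier on valid rows only.

    Assumption: an invalid sample is marked by an all-zero feature vector.
    """

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        valid = X.any(axis=1)  # rows with at least one non-zero value
        # Fit a fresh copy so the estimator passed to __init__ stays unfitted.
        self.estimator_ = clone(self.estimator).fit(X[valid], y[valid])
        self.classes_ = self.estimator_.classes_
        return self

    def predict(self, X):
        # Invalid rows are not treated specially here; how to handle them
        # at predict time is left open in this sketch.
        return self.estimator_.predict(np.asarray(X))


# Toy data: rows 1 and 4 are "invalid" and carry arbitrary labels.
X = np.array([[0.0, 0.1], [0.0, 0.0], [0.2, 0.1],
              [1.0, 0.9], [0.0, 0.0], [0.9, 1.1]])
y = np.array([0, 1, 0, 1, 0, 1])

clf = SkipInvalidClassifier(LogisticRegression()).fit(X, y)
pred = clf.predict(np.array([[0.1, 0.0], [1.0, 1.0]]))
```

Because the wrapper inherits from BaseEstimator, the inner classifier's parameters stay reachable for a hyper-parameter search under names such as estimator__C.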

It's a bit tedious, but I can see it working.

Is there a better suggestion?


_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
