[scikit-learn] problem with check_estimator for distance metric learning

wdevazel Thu, 29 Mar 2018 04:52:29 -0700

(Sorry, I sent this mail as a reply instead of starting a new thread. Here is the new thread.)


Hi all,

We are currently trying to add a feature to the metric-learn package (https://github.com/metric-learn/metric-learn) that would allow cross-validation of Weakly Supervised Metric Learners using scikit-learn's cross-validation routines.

Distance Metric Learning algorithms learn distance metrics between samples, using some supervised information about similarity between training samples. Some Metric Learning algorithms are weakly supervised (Weakly Supervised Metric Learners), i.e. they do not train on labeled samples but, for instance, on labeled *pairs* of samples (the label telling whether the pair consists of similar or dissimilar samples).

To cross-validate these algorithms, we build a train set and a test set by splitting on the pairs. Indeed, a use case of metric learning is to classify unseen pairs as similar or dissimilar at test time (those pairs can involve already seen samples). For that, we made a dataset representation that makes it easy to slice on pairs of samples: we mock a 3D array containing pairs of samples, which would be of shape (n_constraints, 2, n_features) (each row is a pair of samples). We do so with an object that we called ConstrainedDataset, which is more memory efficient than the described array (because samples would be duplicated across pairs).
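To make the idea concrete, here is a minimal sketch of such a representation. It is not the actual constraints.ConstrainedDataset from the PR: the class name PairsView, its attributes X and c, and the toarray method are all hypothetical, chosen only to illustrate storing the samples once and slicing on an index array of pairs.

```python
import numpy as np


class PairsView:
    """Hypothetical stand-in for ConstrainedDataset (not the PR's code).

    Stores each sample once plus an (n_constraints, 2) array of sample
    indices, and behaves like a 3D array of pairs when sliced.
    """

    def __init__(self, X, pair_indices):
        self.X = np.asarray(X)                                     # (n_samples, n_features)
        self.c = np.atleast_2d(np.asarray(pair_indices, dtype=int))  # (n_constraints, 2)

    @property
    def shape(self):
        # Shape of the 3D array that this object mocks.
        return (self.c.shape[0], 2, self.X.shape[1])

    def __len__(self):
        return self.c.shape[0]

    def __getitem__(self, item):
        # Slicing selects pairs; the sample matrix is shared, not copied.
        return PairsView(self.X, self.c[item])

    def toarray(self):
        # Materialize the (n_constraints, 2, n_features) array on demand.
        return self.X[self.c]


X = np.arange(12, dtype=float).reshape(4, 3)    # 4 samples, 3 features
pairs = PairsView(X, [[0, 1], [1, 2], [2, 3], [0, 3]])
y = np.array([1, 1, -1, -1])                    # one label per pair

# A train/test split is just a split over pair indices.
train_idx, test_idx = np.array([0, 1, 2]), np.array([3])
pairs_train, y_train = pairs[train_idx], y[train_idx]
pairs_test, y_test = pairs[test_idx], y[test_idx]
```

Because the object supports len() and indexing with integer arrays, index arrays produced by a splitter can be applied to it directly. Note that the test pair (0, 3) reuses samples that also appear in training pairs, which matches the use case described above.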

Now we have a problem when running scikit-learn's *check_estimator* on these algorithms, because it launches a series of tests where the estimator takes regular arrays as input, whereas Weakly Supervised Metric Learners always learn on ConstrainedDatasets (or more generally on pairs, or tuples for some other algorithms).

We therefore thought of two main possibilities (that could be combined) to solve this problem:

- taking the maximum number of tests yielded by check_estimator that pass in our setting, and modifying the others by replacing array inputs with ConstrainedDatasets
- wrapping a Weakly Supervised Metric Learner into a MockSklearnEstimator that would transform any array given as input into a ConstrainedDataset before passing it to the underlying Weakly Supervised Metric Learner
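As a rough illustration of the second option, here is a sketch of what such a wrapper might look like. Everything in it is an assumption made for the example: ToyPairsLearner and ArrayToPairsWrapper are hypothetical names, and the rule used to build pairs from a plain array (pair each sample with the next one, labeled +1 if their class labels match and -1 otherwise) is arbitrary.

```python
import numpy as np


class ToyPairsLearner:
    """Hypothetical learner that only accepts 3D arrays of pairs."""

    def fit(self, pairs, y):
        pairs = np.asarray(pairs)
        if pairs.ndim != 3 or pairs.shape[1] != 2:
            raise ValueError("expected shape (n_constraints, 2, n_features)")
        self.n_pairs_ = pairs.shape[0]
        self.n_features_ = pairs.shape[2]
        return self


class ArrayToPairsWrapper:
    """Hypothetical wrapper: turns (X, y) arrays into labeled pairs."""

    def __init__(self, learner):
        self.learner = learner

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        # Arbitrary pairing rule (an assumption): sample i with sample i + 1.
        idx = np.column_stack([np.arange(len(X) - 1), np.arange(1, len(X))])
        # +1 for pairs with equal class labels, -1 otherwise.
        pair_labels = np.where(y[idx[:, 0]] == y[idx[:, 1]], 1, -1)
        self.pair_labels_ = pair_labels
        self.learner.fit(X[idx], pair_labels)
        return self


X = np.random.RandomState(0).rand(5, 3)
y = np.array([0, 0, 1, 1, 0])

wrapped = ArrayToPairsWrapper(ToyPairsLearner()).fit(X, y)

# The bare learner rejects the same 2D input that the wrapper accepts.
try:
    ToyPairsLearner().fit(X, y)
    rejected_2d = False
except ValueError:
    rejected_2d = True
```

In this sketch, any array-based test would pass through the wrapper's pairing rule first, so the learner's own input handling on pairs is never exercised directly.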

However, these options are not really satisfying: the first one will create a lot of code, after which one cannot see at a glance whether the estimator passes scikit-learn's check_estimator, and the second adds so much wrapping that we are not even really testing the Weakly Supervised Metric Learner.

For more information, see this PR where the new feature is being implemented, including the constraints.ConstrainedDataset object, as well as a comment on what is problematic when using scikit-learn's check_estimator:

https://github.com/metric-learn/metric-learn/pull/85#issuecomment-375659820

Any advice about how to design the weakly supervised algorithms, the data structure containing the pairs of samples, or how to use scikit-learn's check_estimator anyway would be appreciated!


Thanks!

Best regards,

William

_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
