(Sorry, I sent this mail as a reply instead of starting a new thread... Here is the new thread.)

Hi all,

We are currently trying to add a feature to the metric-learn package (https://github.com/metric-learn/metric-learn) that would allow cross-validating Weakly Supervised Metric Learners with scikit-learn's cross-validation routines.

Distance Metric Learning algorithms learn a distance metric between samples, using supervised information about the similarity between training samples. Some Metric Learning algorithms are weakly supervised (Weakly Supervised Metric Learners), i.e. they do not train on labeled samples but, for instance, on labeled *pairs* of samples (the label tells whether the two samples of the pair are similar or dissimilar).

To cross-validate these algorithms, we build a train set and a test set by splitting on the pairs. Indeed, one use case of metric learning is to classify unseen pairs as similar or dissimilar at test time (those pairs can involve already seen samples). For that, we made a dataset representation that makes it easy to slice on pairs of samples: it mimics a 3D array of pairs of samples of shape (n_constraints, 2, n_features), where each row is a pair of samples. We do so with an object that we called ConstrainedDataset, which is more memory efficient than the actual 3D array (in which a sample would be duplicated for every pair it appears in).
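
To make the idea concrete, here is a minimal sketch of such a representation and of splitting on pairs with scikit-learn's KFold. It is a simplified stand-in, not the actual ConstrainedDataset from the PR: the class name PairsView, its attributes X and c, and the toy data are all made up for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold


class PairsView:
    """Hypothetical stand-in for ConstrainedDataset (not the metric-learn code).

    Behaves like an array of shape (n_constraints, 2, n_features) but stores
    each sample only once, plus an integer index array describing the pairs.
    """

    def __init__(self, X, c):
        self.X = np.asarray(X)  # (n_samples, n_features), shared between slices
        self.c = np.asarray(c)  # (n_constraints, 2), indices into X

    def __len__(self):
        return len(self.c)

    @property
    def shape(self):
        return (self.c.shape[0], self.c.shape[1], self.X.shape[1])

    def __getitem__(self, item):
        # Slicing on pairs only slices the index array; X is not copied.
        return PairsView(self.X, np.atleast_2d(self.c[item]))

    def toarray(self):
        # Materialize the full 3D array (samples are duplicated here).
        return self.X[self.c]


# Toy data: 6 samples with 3 features, 6 labeled pairs (+1 similar, -1 dissimilar).
rng = np.random.RandomState(0)
X = rng.randn(6, 3)
c = np.array([[0, 1], [1, 2], [2, 3], [3, 4], [4, 5], [0, 5]])
y = np.array([1, -1, 1, -1, 1, -1])
pairs = PairsView(X, c)

# Split on pairs: KFold only needs something with n_constraints rows.
fold_shapes = []
for train_idx, test_idx in KFold(n_splits=3).split(pairs.c):
    pairs_train, y_train = pairs[train_idx], y[train_idx]
    pairs_test, y_test = pairs[test_idx], y[test_idx]
    fold_shapes.append((pairs_train.shape, pairs_test.shape))
```

Note that a sample can appear in both the train and the test pairs of a fold, which matches the use case described above (test pairs can involve already seen samples).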

Now we have a problem when running scikit-learn's *check_estimator* on these algorithms, because it launches a series of tests where the estimator takes as input regular arrays, whereas Weakly Supervised Metric Learners always learn on ConstrainedDatasets (or more generally on pairs, or tuples for some other algorithms).

We therefore thought of two main possibilities (that could be combined) to solve this problem:

- taking the maximum number of tests yielded by check_estimator that pass in our setting, and modifying the others by replacing array inputs with ConstrainedDatasets
- wrapping a Weakly Supervised Metric Learner into a MockSklearnEstimator that would transform any input array into a ConstrainedDataset before passing it to the underlying Weakly Supervised Metric Learner

However, neither option is really satisfying. The first one will create a lot of code, and afterwards one cannot see at a glance whether the estimator passes scikit-learn's check_estimator. The second one adds so much wrapping that we are not even really testing the Weakly Supervised Metric Learner itself.

For more information, see this PR where the new feature is being implemented, including the constraints.ConstrainedDataset object, as well as a comment on what is problematic when using scikit-learn's check_estimator:
https://github.com/metric-learn/metric-learn/pull/85#issuecomment-375659820

Any advice on how to design the weakly supervised algorithms or the data structure containing the pairs of samples, or on how to use scikit-learn's check_estimator anyway, would be appreciated!

Thanks!

Best regards,

William

_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
