[scikit-learn] Domain adaptation and cross-validation

Johan Mazel Thu, 20 Feb 2020 05:40:37 -0800

Hello
I am working on a binary classification task using machine learning.
One class, C0, is built from public data called D0. The other class, C1,
is made of data named D1 that I generated. I actually generated two
datasets, D1a and D1b, with slightly different parameters.


I would like to evaluate the domain adaptation of a model in the context
of D0, D1a and D1b. That is, I would like to train a model on data
from D0 and D1a, and then test its performance on D0 and D1b.
I plan to perform a k-fold cross-validation on D0 and bootstrapping on
D1a and D1b. For example, at each iteration, I would build training data
from k-1 folds of D0 and bootstrapped data from D1a. Testing data would
be built from the single remaining fold of D0 and bootstrapped data from
D1b.
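
A minimal sketch of the procedure described above, written as a plain
loop. It assumes D0, D1a and D1b are NumPy arrays with the same number
of columns, that C0 is labelled 0 and C1 is labelled 1, and that each
bootstrap sample has the same size as the dataset it is drawn from. The
function name iter_domain_splits and its parameters are hypothetical,
not part of scikit-learn.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.utils import resample


def iter_domain_splits(X0, X1a, X1b, n_splits=5, seed=0):
    """Yield (X_train, y_train, X_test, y_test), one tuple per fold of D0."""
    rng = np.random.RandomState(seed)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in kf.split(X0):
        # Bootstrap (sample with replacement) the class-1 data:
        # D1a only ever feeds training, D1b only ever feeds testing.
        X1_train = resample(X1a, replace=True, random_state=rng)
        X1_test = resample(X1b, replace=True, random_state=rng)

        # Training set: k-1 folds of D0 plus bootstrapped D1a.
        X_train = np.vstack([X0[train_idx], X1_train])
        y_train = np.concatenate([
            np.zeros(len(train_idx), dtype=int),
            np.ones(len(X1_train), dtype=int),
        ])

        # Testing set: the remaining fold of D0 plus bootstrapped D1b.
        X_test = np.vstack([X0[test_idx], X1_test])
        y_test = np.concatenate([
            np.zeros(len(test_idx), dtype=int),
            np.ones(len(X1_test), dtype=int),
        ])
        yield X_train, y_train, X_test, y_test
```

Each yielded tuple can be passed to any estimator's fit and score
methods; the per-fold scores can then be averaged as in ordinary
cross-validation.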

I was thinking of using a class similar to those in
sklearn.model_selection (e.g. KFold or StratifiedKFold) to perform the
method described above.
The constructor of the class would take the KFold/StratifiedKFold and
bootstrapping parameters.
The split function of this class would be generic enough to handle
several datasets of each type (D0-like, D1a-like or D1b-like). Its
parameters would be the usual X and y for data and targets, along with
information about the structure of X. This additional information would
be propagated from the fit function in the same way as the optional
groups parameter is passed to group-related split functions in classes
such as GroupKFold.
Here X would be the concatenation of train-test datasets (i.e. D0-like),
train only datasets (i.e. D1a-like) and test only datasets (i.e.
D1b-like). y would be built in a similar way.
The additional parameters may thus be two tuples for the start and end
indexes of train only datasets (i.e. D1a-like) and test only datasets
(i.e. D1b-like). These values would allow the split function to properly
operate on X and y by taking into account boundaries between dataset
types when building folds and performing bootstrapping.

As far as I know, I cannot perform such a procedure with scikit-learn
using the functions in sklearn.model_selection
(https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/model_selection/_split.py).
Did I miss something? Maybe somewhere else in the code?
If there is no implementation in scikit-learn, would you be interested
in a pull-request with such a function?

Best regards,
Johan Mazel
