Hello,

I am working on a binary classification task using machine learning. One class, C0, is built from public data called D0. The other class, C1, is made of data named D1 that I generated. I actually generated two datasets, D1a and D1b, with slightly different parameters.
I would like to evaluate the domain adaptation of a model in the context of D0, D1a and D1b. That is, I would like to train a model on data from D0 and D1a, and then test its performance on D0 and D1b. I plan to perform k-fold cross-validation on D0 and bootstrapping on D1a and D1b. For example, at each iteration, I would build the training data from k-1 folds of D0 and bootstrapped data from D1a. The testing data would be built from the single remaining fold of D0 and bootstrapped data from D1b.

I was thinking of using a class similar to those in sklearn.model_selection (e.g. KFold or StratifiedKFold) to perform the method described above. The __init__ function of the class would take the KFold/StratifiedKFold and bootstrapping parameters. The split function of this class would be generic enough to handle several datasets of each kind (D0-like, D1a-like or D1b-like). This function's parameters would be the usual X and y for data and targets, along with information about the structure of X. This additional information would be propagated from the fit function in the same way as the optional groups parameter is passed to group-related split functions in classes such as GroupKFold.

Here X would be the concatenation of train-test datasets (i.e. D0-like), train-only datasets (i.e. D1a-like) and test-only datasets (i.e. D1b-like). y would be built in the same way. The additional parameters could thus be two tuples holding the start and end indexes of the train-only datasets (i.e. D1a-like) and of the test-only datasets (i.e. D1b-like). These values would allow the split function to operate properly on X and y by taking into account the boundaries between dataset types when building folds and performing bootstrapping.

As far as I know, I cannot perform such a procedure with scikit-learn using the functions in sklearn.model_selection (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/model_selection/_split.py). Did I miss something? Maybe somewhere else in the code?
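To make the idea concrete, here is a minimal sketch of what such a splitter could look like. It is only an illustration under several assumptions: the class name KFoldWithBootstrap, the n_bootstrap parameter and the train_only/test_only arguments are hypothetical and are not part of scikit-learn. Plain KFold is used on the D0-like block because that block holds a single class, a single contiguous block is assumed for each dataset type, and the (start, end) pairs are treated as half-open ranges.

```python
import numpy as np
from sklearn.model_selection import KFold


class KFoldWithBootstrap:
    """Hypothetical splitter (not a scikit-learn class).

    K-fold on the train-test (D0-like) rows, bootstrap on the
    train-only (D1a-like) and test-only (D1b-like) rows.
    """

    def __init__(self, n_splits=5, n_bootstrap=None, shuffle=True,
                 random_state=None):
        self.n_splits = n_splits
        # None means: draw as many samples as the block contains.
        self.n_bootstrap = n_bootstrap
        self.shuffle = shuffle
        self.random_state = random_state

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits

    def split(self, X, y, train_only, test_only):
        """Yield (train_idx, test_idx) index arrays into X.

        train_only and test_only are (start, end) pairs, end exclusive,
        marking the D1a-like and D1b-like rows of X. y is accepted for
        API symmetry but is not used here.
        """
        indices = np.arange(len(X))
        train_only_idx = indices[train_only[0]:train_only[1]]
        test_only_idx = indices[test_only[0]:test_only[1]]
        # Every row outside the two reserved blocks is treated as D0-like.
        reserved = np.concatenate([train_only_idx, test_only_idx])
        shared_idx = np.setdiff1d(indices, reserved)

        rng = np.random.RandomState(self.random_state)
        kfold = KFold(n_splits=self.n_splits, shuffle=self.shuffle,
                      random_state=rng if self.shuffle else None)
        n_train = self.n_bootstrap or len(train_only_idx)
        n_test = self.n_bootstrap or len(test_only_idx)

        for fold_train, fold_test in kfold.split(shared_idx):
            # Sample with replacement inside each reserved block only.
            boot_train = rng.choice(train_only_idx, size=n_train, replace=True)
            boot_test = rng.choice(test_only_idx, size=n_test, replace=True)
            yield (np.concatenate([shared_idx[fold_train], boot_train]),
                   np.concatenate([shared_idx[fold_test], boot_test]))


# Toy layout: rows 0-19 are D0, rows 20-29 are D1a, rows 30-39 are D1b.
X = np.random.RandomState(0).normal(size=(40, 3))
y = np.array([0] * 20 + [1] * 20)
cv = KFoldWithBootstrap(n_splits=5, random_state=0)
for train_idx, test_idx in cv.split(X, y, train_only=(20, 30), test_only=(30, 40)):
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]
```

Because split takes two extra arguments here, this sketch would have to be driven by a manual loop as above. Passing the boundaries through the existing groups argument instead would be one possible way to stay compatible with helpers such as cross_val_score, but that is a design choice I have not settled on.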
If there is no implementation in scikit-learn, would you be interested in a pull request with such a function?

Best regards,
Johan Mazel