Dear Thomas,

Your strategy for model development is built on the assumption that the SAR (structure-activity relationship) is a continuous manifold over your compound descriptor space. However, the SARs of many proteins in drug discovery or chemical biology are not continuous (consider kinase inhibitors).
Therefore, you must assess the training-set SAR to check for the prevalence of activity cliffs. There are at least two ways you can go about this (rough code sketches of both checks are appended below the quoted message):

(1) Simply compute all pairwise similarities by your choice of descriptor+metric, then identify pairs that are similar (e.g., MACCS-Tanimoto > 0.7) yet have large activity differences (e.g., a K_i or IC50 difference of more than 10/50/100-fold; again, the biology of your problem determines the right thresholds).

(2) Perform many repetitions of train-test splitting on the 709 reference molecules, look at the distribution of your evaluation metric, and see whether there is a limit to your ability to predict. If you are hitting a wall in predictive performance, it is a likely sign that there is an activity cliff, and no amount of machine learning is going to overcome it. Further, trace the predictability of individual compounds to identify those that are consistently mispredicted. If you combine this with analysis (1), you will know exactly which of your chemistries are unmodelable.

If you find that there are no activity cliffs in your dataset, then your assumption that chemical similarity implies biological-endpoint similarity holds, and your experimental design is validated by the presence of a continuous manifold. However, if you do have activity cliffs, then, as awesome as sklearn is, it still cannot make the computational chemistry any better.

Hope this helps you contextualize your work. Don't hesitate to contact me if I can be of further help.

Sincerely,
J.B. Brown
Kyoto University Graduate School of Medicine

2018-07-16 8:51 GMT+09:00 Thomas Evangelidis <teva...@gmail.com>:
>
> Hello,
>
> I am kind of confused about the use of the sample_weight parameter in the
> fit() function of RandomForestRegressor. Here is my problem:
>
> I am trying to predict the binding affinity of small molecules to a
> protein. I have a training set of 709 molecules and a blind test set of
> 180 molecules. I want to find those features that are most important for
> the correct prediction of the binding affinity of those 180 molecules. My
> rationale is that if I give more emphasis to the similar molecules in the
> training set, then I will get higher importances for those features that
> have higher predictive ability for this specific blind test set. To this
> end, I weighted the 709 training-set molecules by their maximum
> similarity to the 180 test molecules, selected only those features with
> high importance, and trained a new RF with all 709 molecules. I got some
> results, but I am not satisfied. Is this the right way to use
> sample_weight in RF? I would appreciate any advice or a suggested
> workflow.
>
> --
> ======================================================================
> Dr Thomas Evangelidis
> Post-doctoral Researcher
> CEITEC - Central European Institute of Technology
> Masaryk University
> Kamenice 5/A35/2S049,
> 62500 Brno, Czech Republic
>
> email: tev...@pharm.uoa.gr
>        teva...@gmail.com
>
> website: https://sites.google.com/site/thomasevangelidishomepage/
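P.S. For concreteness, a minimal sketch of check (1), assuming RDKit is available, MACCS keys with Tanimoto similarity as the descriptor+metric, and affinities already on a log scale (pKi, so a 10-fold difference is 1.0 log unit). The SMILES and values here are placeholders for your own 709 molecules:

# Minimal activity-cliff scan: MACCS-Tanimoto vs. fold-change in affinity.
import itertools
from rdkit import Chem, DataStructs
from rdkit.Chem import MACCSkeys

# Placeholder data -- substitute your 709 SMILES and measured pKi values.
smiles = ["CCO", "CCN", "c1ccccc1O", "c1ccccc1N"]
pki    = [5.0,   7.5,   6.0,         6.2]

fps = [MACCSkeys.GenMACCSKeys(Chem.MolFromSmiles(s)) for s in smiles]

cliffs = []
for i, j in itertools.combinations(range(len(fps)), 2):
    sim = DataStructs.TanimotoSimilarity(fps[i], fps[j])
    # Similar pair with a >10-fold affinity gap (>1.0 pKi unit) = cliff candidate.
    if sim > 0.7 and abs(pki[i] - pki[j]) > 1.0:
        cliffs.append((i, j, sim, abs(pki[i] - pki[j])))

for i, j, sim, diff in cliffs:
    print("potential cliff: %s vs %s (sim %.2f, delta-pKi %.1f)"
          % (smiles[i], smiles[j], sim, diff))

Any pair flagged here is a chemistry where the similarity-implies-similar-activity assumption already fails within your own training set.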
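And a sketch of check (2) with scikit-learn, again on placeholder descriptors and affinities; the per-compound error bookkeeping is what lets you trace the consistently mispredicted molecules:

# Repeated train/test splits: distribution of performance, plus
# per-compound error tracking to find consistently mispredicted molecules.
import numpy as np
from collections import defaultdict
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import ShuffleSplit

rng = np.random.RandomState(0)
X = rng.rand(709, 50)            # placeholder descriptor matrix
y = 4.0 + 6.0 * rng.rand(709)    # placeholder pKi values

scores = []
abs_errors = defaultdict(list)   # compound index -> errors across splits
for train_idx, test_idx in ShuffleSplit(n_splits=100, test_size=0.2,
                                        random_state=0).split(X):
    rf = RandomForestRegressor(n_estimators=100, random_state=0, n_jobs=-1)
    rf.fit(X[train_idx], y[train_idx])
    pred = rf.predict(X[test_idx])
    scores.append(r2_score(y[test_idx], pred))
    for k, p in zip(test_idx, pred):
        abs_errors[k].append(abs(p - y[k]))

print("R^2 over %d splits: %.2f +/- %.2f"
      % (len(scores), np.mean(scores), np.std(scores)))

# Compounds mispredicted in nearly every split are cliff suspects;
# cross-reference them against the pairs flagged by check (1).
mean_err = {k: np.mean(v) for k, v in abs_errors.items()}
worst = sorted(mean_err, key=mean_err.get, reverse=True)[:10]
print("most consistently mispredicted compound indices:", worst)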
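Finally, for reference, the weighting workflow described in the quoted message maps onto the sample_weight argument of RandomForestRegressor.fit() roughly as below. The similarity matrix here is a random placeholder; in practice it would be, e.g., the 709 x 180 Tanimoto matrix between training and test fingerprints:

# The quoted workflow: weight each training molecule by its maximum
# similarity to the blind test set, fit with sample_weight, then
# inspect feature importances.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X_train = rng.rand(709, 50)        # placeholder training descriptors
y_train = 4.0 + 6.0 * rng.rand(709)
X_test  = rng.rand(180, 50)        # placeholder blind-test descriptors

# Placeholder similarity matrix (709 x 180).
sim = rng.rand(709, 180)
weights = sim.max(axis=1)          # max similarity of each training molecule
                                   # to any molecule in the blind test set

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_train, y_train, sample_weight=weights)
top = np.argsort(rf.feature_importances_)[::-1][:20]
print("top-20 feature indices by importance:", top)

Whether up-weighting by test-set similarity actually helps will depend on the cliff analysis above: the weights only rebalance the training loss, they cannot create signal that the descriptors do not carry.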
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn