2017-09-07 15:57 GMT+02:00 Thomas Evangelidis <teva...@gmail.com>:

> On 7 September 2017 at 15:29, Maciek Wójcikowski <mac...@wojcikowski.pl>
> wrote:
>
>> I think StandardScaler is what you want. For each assay you will get a
>> mean and a variance. The average mean would be the "optimal" shift and
>> the average variance the spread. But would this value make any physical
>> sense?
>
> I think you missed my point. The problem is scaling with restraints: the
> RMSD between the binding affinities of the common ligands must be
> minimized upon scaling. Anyway, I managed to work it out using
> scipy.optimize.

Yes, I meant the common ligands, which would probably lead you to a similar
solution. Out of curiosity: is there a connection between the optimal shift
and the type of assay in your case?

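For reference, here is a minimal sketch (not the actual code mentioned above)
of this kind of shift-and-scale fit: each assay gets one shift Sh and one
scale Sc, and scipy.optimize.minimize adjusts them so that the rescaled
affinities of ligands shared between assays agree as closely as possible
(lowest RMSD). The dictionary layout, the ligand names and the choice of
assay "A" as the fixed reference are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical input: {assay_name: {ligand_name: measured_affinity}}.
assays = {
    "A": {"lig1": -7.0, "lig2": -9.0, "lig3": -8.5},
    "B": {"lig1": -10.0, "lig2": -12.0, "lig4": -6.0},
    "C": {"lig2": -11.0, "lig3": -10.5, "lig5": -7.5},
}
names = sorted(assays)  # the first assay is kept fixed as the reference frame

def rmsd_of_overlaps(params):
    """RMSD between rescaled affinities of ligands shared by any two assays."""
    # params = [Sh_B, Sc_B, Sh_C, Sc_C, ...]; the reference assay gets (0, 1).
    shift_scale = {names[0]: (0.0, 1.0)}
    for i, name in enumerate(names[1:]):
        shift_scale[name] = (params[2 * i], params[2 * i + 1])
    sq_diffs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            for lig in set(assays[a]) & set(assays[b]):  # overlapping ligands
                ya = shift_scale[a][1] * assays[a][lig] + shift_scale[a][0]
                yb = shift_scale[b][1] * assays[b][lig] + shift_scale[b][0]
                sq_diffs.append((ya - yb) ** 2)
    return np.sqrt(np.mean(sq_diffs))

x0 = np.tile([0.0, 1.0], len(names) - 1)  # start from "no shift, no rescaling"
res = minimize(rmsd_of_overlaps, x0, method="Nelder-Mead")
print(res.x, res.fun)
```

Fixing one assay at (Sh = 0, Sc = 1) pins down the overall frame; without a
reference, shrinking every scale towards zero would trivially reduce the RMSD.
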
>> Considering RF-Score-VS: in fact it is a regressor and it predicts a real
>> value, not a class. Although it is validated mostly using the enrichment
>> factor, the last figure shows the top results for regression vs. Vina.
>
> To my understanding, you trained the RF using class information (active,
> inactive) and the prediction was a probability value. If the probability
> is above 0.5 then the compound is an active, otherwise it is an inactive.
> This is how sklearn.ensemble.RandomForestClassifier works.

We trained a RandomForestRegressor with the binding affinities of the DUD-E
actives. The decoys were arbitrarily assigned a pK activity of 5.95.

> In contrast, I train MLPRegressors using binding affinities (scalar
> values) and the predictions are also binding affinities (scalar values).

We will have a chance to talk it through in Berlin, see you there!

>> ----
>> Pozdrawiam,  |  Best regards,
>> Maciek Wójcikowski
>> mac...@wojcikowski.pl
>>
>> 2017-09-06 20:48 GMT+02:00 Thomas Evangelidis <teva...@gmail.com>:
>>
>>> After some thought about this problem today, I think it is an
>>> objective-function minimization problem, where the objective function
>>> can be the root mean square deviation (RMSD) between the affinities of
>>> the common molecules in the two data sets. I could work iteratively:
>>> first rescale and fit assay B to match A, then proceed to assay C, and
>>> so forth. Alternatively, for each assay I need to find two unknown
>>> variables, the optimum shift Sh and scale Sc. So if I have three assays
>>> A, B and C, say, I am looking for the optimum values of Sh_A, Sc_A,
>>> Sh_B, Sc_B, Sh_C, Sc_C that minimize the RMSD between the binding
>>> affinities of the overlapping molecules. Any idea how I can do that
>>> with scikit-learn?
>>>
>>> On 6 September 2017 at 00:29, Thomas Evangelidis <teva...@gmail.com>
>>> wrote:
>>>
>>>> Thanks Jason, Sebastian and Maciek!
>>>>
>>>> I believe that, of all the suggestions, the most feasible solution is
>>>> to look for experimental assays which overlap by at least two
>>>> compounds, and then adjust the binding affinities of one of them based
>>>> on their difference in both assays. Sebastian mentioned the simplest
>>>> scenario, where the shift for both compounds is 2 kcal/mol. However,
>>>> he neglected to mention that the ratio between the affinities of the
>>>> two compounds in each assay also matters. Specifically, the ratio
>>>> Ka/Kb = -7/-9 = 0.78 in assay A but -10/-12 = 0.83 in assay B. Ideally
>>>> that should also be taken into account to select the right
>>>> transformation function for the values from assay B. Is anybody aware
>>>> of a clever algorithm to select the right transformation function for
>>>> such a case? I am sure one exists.
>>>>
>>>> The other approach would be to train different predictors from each
>>>> assay and then apply a data fusion technique (e.g. min rank), but that
>>>> wouldn't be as elegant.
>>>>
>>>> @Maciek: to my understanding, the paper you cited addresses a
>>>> classification problem (actives, inactives) by implementing Random
>>>> Forest classifiers. My case is a regression problem.
>>>>
>>>> best,
>>>> Thomas
>>>>
>>>> On 5 September 2017 at 20:33, Maciek Wójcikowski
>>>> <mac...@wojcikowski.pl> wrote:
>>>>
>>>>> Hi Thomas and others,
>>>>>
>>>>> It also really depends on how many data points you have for each
>>>>> compound. If you have more than a few, then there are a few options.
>>>>> If you get two very distinct activities for one ligand,
>>>>> I'd discard such samples as ambiguous, or decide on one of the
>>>>> assays/experiments (the one with the lower error). The exact same
>>>>> problem was faced by the PDBbind creators; I'd also look there for
>>>>> details on what they did with their activities.
>>>>>
>>>>> To follow up on Sebastian's suggestion: have you checked how
>>>>> different the ranks/Z-scores you get are? Check out the Kendall tau.
>>>>>
>>>>> Anyhow, you could build local models for specific experimental
>>>>> methods. In our recent publication in a slightly different area
>>>>> (protein-ligand scoring functions), we show that an RF built on one
>>>>> target is just slightly better than an RF built on many targets (we
>>>>> used the DUD-E database); check out the "horizontal" and "per-target"
>>>>> splits: https://www.nature.com/articles/srep46710. Unfortunately,
>>>>> this may change for different models, and also with the molecular
>>>>> descriptors used, which we know nothing about.
>>>>>
>>>>> I hope that helped a bit.
>>>>>
>>>>> ----
>>>>> Pozdrawiam,  |  Best regards,
>>>>> Maciek Wójcikowski
>>>>> mac...@wojcikowski.pl
>>>>>
>>>>> 2017-09-05 19:35 GMT+02:00 Sebastian Raschka <se.rasc...@gmail.com>:
>>>>>
>>>>>> Another approach would be to pose this as a "ranking" problem and
>>>>>> predict relative affinities rather than absolute affinities. E.g.,
>>>>>> if you have data from one (or more) molecules that has/have been
>>>>>> tested under two or more experimental conditions, you can rank the
>>>>>> other molecules accordingly, or normalize. E.g., if you observe that
>>>>>> the binding affinity of molecule A is -7 kcal/mol in assay A and
>>>>>> -9 kcal/mol in assay B, and say the binding affinities of molecule B
>>>>>> are -10 and -12 kcal/mol, respectively, that should give you some
>>>>>> information for normalizing the values from assay B (e.g., by adding
>>>>>> 2 kcal/mol). Of course this is not a perfect solution and might be
>>>>>> error prone, but so are experimental assays ... (when I sometimes
>>>>>> look at the std error/CI of the data I get from collaborators ...
>>>>>> well, it seems that absolute binding affinities have always been
>>>>>> taken with a grain of salt anyway).
>>>>>>
>>>>>> Best,
>>>>>> Sebastian
>>>>>>
>>>>>> > On Sep 5, 2017, at 1:02 PM, Jason Rudy <jcr...@gmail.com> wrote:
>>>>>> >
>>>>>> > Thomas,
>>>>>> >
>>>>>> > This is sort of related to the problem I did my M.S. thesis on
>>>>>> > years ago: cross-platform normalization of gene expression data.
>>>>>> > If you google that term you'll find some papers. The situation is
>>>>>> > somewhat different, though, because with microarrays or RNA-seq
>>>>>> > you get thousands of data points for each experiment, which makes
>>>>>> > it easier to estimate the batch effect. The principle is similar,
>>>>>> > however.
>>>>>> >
>>>>>> > If I were in your situation, I would consider whether I have any
>>>>>> > of the following advantages:
>>>>>> >
>>>>>> > 1. Some molecules that appear in multiple data sets
>>>>>> > 2. Detailed information about the different experimental conditions
>>>>>> > 3. Physical/chemical models of how experimental conditions
>>>>>> >    influence binding affinity
>>>>>> >
>>>>>> > If you have any of the above, you can potentially use them to
>>>>>> > improve your estimates. You could also consider using the
>>>>>> > experiment ID as a categorical predictor in a sufficiently general
>>>>>> > regression method.

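A minimal illustration of the "experiment ID as a categorical predictor" idea
above, in scikit-learn terms. The molecular descriptors, assay labels and
affinities below are made up, and the choice of RandomForestRegressor is just
one possibility; the point is that one-hot encoding the assay ID lets a single
model absorb per-assay offsets while still sharing information across assays.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.RandomState(0)
n_mols, n_descr = 120, 10
X = pd.DataFrame(rng.rand(n_mols, n_descr),
                 columns=[f"descr_{i}" for i in range(n_descr)])
X["assay"] = rng.choice(["A", "B", "C"], size=n_mols)  # which assay measured each molecule
y = -6.0 - 6.0 * rng.rand(n_mols)                      # fake affinities (kcal/mol)

preprocess = ColumnTransformer(
    [("assay", OneHotEncoder(handle_unknown="ignore"), ["assay"])],
    remainder="passthrough",  # keep the numeric descriptors unchanged
)
model = Pipeline([
    ("preprocess", preprocess),
    ("regressor", RandomForestRegressor(n_estimators=200, random_state=0)),
])
model.fit(X, y)
```
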
>>>>>> >
>>>>>> > Lastly, you may already know this, but the term "meta-analysis" is
>>>>>> > relevant here, and you can google for specific techniques. Most of
>>>>>> > these would be more limited than what you are envisioning, I think.
>>>>>> >
>>>>>> > Best,
>>>>>> >
>>>>>> > Jason
>>>>>> >
>>>>>> > On Tue, Sep 5, 2017 at 6:39 AM, Thomas Evangelidis
>>>>>> > <teva...@gmail.com> wrote:
>>>>>> >
>>>>>> > Greetings,
>>>>>> >
>>>>>> > I am working on a problem that involves predicting the binding
>>>>>> > affinity of small molecules on a receptor structure (it is a
>>>>>> > regression problem, not classification). I have multiple small
>>>>>> > datasets of molecules with measured binding affinities on a
>>>>>> > receptor, but each dataset was measured under different
>>>>>> > experimental conditions and therefore I cannot use them all
>>>>>> > together as a training set. So, instead of using them
>>>>>> > individually, I was wondering whether there is a method to combine
>>>>>> > them all into a super training set. The first way I could think of
>>>>>> > is to convert the binding affinities to Z-scores and then combine
>>>>>> > all the small datasets of molecules. But this would be inaccurate
>>>>>> > because, firstly, the datasets are very small (10-50 molecules
>>>>>> > each), and secondly, the range of binding affinities differs in
>>>>>> > each experiment (some datasets contain really strong binders,
>>>>>> > while others do not, etc.). Is there any other approach to combine
>>>>>> > datasets with values coming from different sources? Maybe if
>>>>>> > someone points me to the right reference I could read it and
>>>>>> > understand whether it is applicable to my case.
>>>>>> >
>>>>>> > Thanks,
>>>>>> > Thomas
>>>>>> >
>>>>>> > --
>>>>>> > ======================================================================
>>>>>> > Dr Thomas Evangelidis
>>>>>> > Post-doctoral Researcher
>>>>>> > CEITEC - Central European Institute of Technology
>>>>>> > Masaryk University
>>>>>> > Kamenice 5/A35/2S049,
>>>>>> > 62500 Brno, Czech Republic
>>>>>> >
>>>>>> > email: tev...@pharm.uoa.gr
>>>>>> >        teva...@gmail.com
>>>>>> >
>>>>>> > website: https://sites.google.com/site/thomasevangelidishomepage/

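For completeness, a minimal sketch of the per-assay Z-score normalization
mentioned in the original question: standardize each assay's affinities
separately with StandardScaler and then pool the molecules into one training
set. The assay names and values are invented, and, as the question itself
notes, with 10-50 molecules per assay and different affinity ranges this is
only a rough normalization.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

assay_affinities = {  # hypothetical per-assay measurements
    "A": np.array([-7.0, -9.0, -8.5, -6.2]),
    "B": np.array([-10.0, -12.0, -6.0, -11.3]),
}

pooled = []
for name, y in assay_affinities.items():
    z = StandardScaler().fit_transform(y.reshape(-1, 1)).ravel()  # per-assay Z-scores
    pooled.append(z)
pooled_y = np.concatenate(pooled)  # combined target vector for training
```
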
_______________________________________________ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn