On 7 September 2017 at 15:29, Maciek Wójcikowski <mac...@wojcikowski.pl> wrote:
> I think StandardScaler is what you want. For each assay you will get mean
> and var. Average mean would be the "optimal" shift and average variance the
> spread. But would this value make any physical sense?

I think you missed my point. The problem was scaling with restraints: the
RMSD between the binding affinities of the common ligands must be minimized
upon scaling. Anyway, I managed to work it out using scipy.optimize.

> Considering the RF-Score-VS: In fact it's a regressor and it predicts a
> real value, not a class. Although it is validated mostly using Enrichment
> Factor, the last figure shows top results for regression vs Vina.

To my understanding you trained the RF using class information (active,
inactive) and the prediction was a probability value. If the probability is
above 0.5 then the compound is an active, otherwise it is an inactive. This
is how sklearn.ensemble.RandomForestClassifier works. In contrast, I train
MLPRegressors using binding affinities (scalar values) and the predictions
are binding affinities (scalar values).

> ----
> Pozdrawiam,  |  Best regards,
> Maciek Wójcikowski
> mac...@wojcikowski.pl
>
> 2017-09-06 20:48 GMT+02:00 Thomas Evangelidis <teva...@gmail.com>:
>
>> After some thought about this problem today, I think it is an objective
>> function minimization problem, where the objective function can be the root
>> mean square deviation (RMSD) between the affinities of the common molecules
>> in the two data sets. I could work iteratively, first rescale and fit assay
>> B to match A, then proceed to assay C and so forth. Or alternatively, for
>> each assay I need to find two missing variables, the optimum shift Sh and
>> the scale Sc. So if I have 3 assays A, B, C, let's say, I am looking for the
>> optimum values of Sh_A, Sc_A, Sh_B, Sc_B, Sh_C, Sc_C that minimize the RMSD
>> between the binding affinities of the overlapping molecules. Any idea how I
>> can do that with scikit-learn?
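The shift/scale fit described above can be sketched with scipy.optimize, which is the module Thomas says he used. This is only an illustration under stated assumptions, not the code from the thread: the affinity values and molecule names are made up, `least_squares` is just one possible routine, and assay A is held fixed as a reference (Sc = 1, Sh = 0), because with every scale left free the trivial solution Sc = 0 would give an RMSD of zero.

```python
import numpy as np
from scipy.optimize import least_squares

# Hypothetical data: assay name -> {molecule id: binding affinity in kcal/mol}.
assays = {
    "A": {"m1": -7.0, "m2": -10.0, "m3": -8.5},
    "B": {"m1": -9.0, "m2": -12.0, "m4": -11.0},
    "C": {"m2": -5.1, "m3": -4.3, "m4": -4.7},
}
reference = "A"  # assumption: kept fixed so the problem has no trivial solution
free = [name for name in assays if name != reference]


def unpack(params):
    """Map the flat parameter vector to {assay: (Sc, Sh)}."""
    transform = {reference: (1.0, 0.0)}
    for i, name in enumerate(free):
        transform[name] = (params[2 * i], params[2 * i + 1])
    return transform


def residuals(params):
    """Differences between rescaled affinities of molecules shared by two assays."""
    transform = unpack(params)
    names = list(assays)
    res = []
    for i, p in enumerate(names):
        for q in names[i + 1:]:
            sc_p, sh_p = transform[p]
            sc_q, sh_q = transform[q]
            for mol in sorted(assays[p].keys() & assays[q].keys()):
                res.append((sc_p * assays[p][mol] + sh_p)
                           - (sc_q * assays[q][mol] + sh_q))
    return np.array(res)


# Start from the identity transform (Sc = 1, Sh = 0) for every free assay.
x0 = np.tile([1.0, 0.0], len(free))
fit = least_squares(residuals, x0)

transform = unpack(fit.x)
rmsd = np.sqrt(np.mean(fit.fun ** 2))  # RMSD over all overlapping pairs
print(transform, rmsd)
```

Minimizing the sum of squared residuals is equivalent to minimizing the RMSD, since the number of overlapping pairs is fixed. Whether a linear transformation is physically meaningful for a given pair of assays is a separate question that the sketch does not address.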
>>
>> On 6 September 2017 at 00:29, Thomas Evangelidis <teva...@gmail.com>
>> wrote:
>>
>>> Thanks Jason, Sebastian and Maciek!
>>>
>>> I believe that, of all the suggestions, the most feasible solution is to
>>> look for experimental assays which overlap by at least two compounds, and
>>> then adjust the binding affinities of one of them by looking at their
>>> difference in both assays. Sebastian mentioned the simplest scenario, where
>>> the shift for both compounds is 2 kcal/mol. However, he neglected to mention
>>> that the ratio between the affinities of the two compounds in each assay
>>> also matters. Specifically, the ratio Ka/Kb=-7/-9=0.78 in assay A but
>>> -10/-12=0.83 in assay B. Ideally that should also be taken into account to
>>> select the right transformation function for the values from assay B. Is
>>> anybody aware of any clever algorithm to select the right transformation
>>> function for such a case? I am sure one exists.
>>>
>>> The other approach would be to train different predictors from each
>>> assay and then apply a data fusion technique (e.g. min rank). But that
>>> wouldn't be that elegant.
>>>
>>> @Maciek To my understanding, the paper you cited addresses a
>>> classification problem (actives, inactives) by implementing Random Forest
>>> Classifiers. My case is a regression problem.
>>>
>>> best,
>>> Thomas
>>>
>>> On 5 September 2017 at 20:33, Maciek Wójcikowski <mac...@wojcikowski.pl>
>>> wrote:
>>>
>>>> Hi Thomas and others,
>>>>
>>>> It also really depends on how many data points you have on each
>>>> compound. If you have more than a few then there are a few options. If you
>>>> get two very distinct activities for one ligand, I'd discard such samples
>>>> as ambiguous or decide on one of the assays/experiments (the one with lower
>>>> error). The exact problem was faced by the PDBbind creators; I'd also look
>>>> there for details on what they did with their activities.
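The two-compound example in Thomas's message above (-7 and -10 kcal/mol in one assay, -9 and -12 kcal/mol in the other) can be checked numerically. The following is a minimal sketch, assuming the two assays are related by a linear map; that assumption is not established anywhere in the thread. With only two shared compounds, a line through the two points always fits exactly, so at least a third shared compound would be needed to test the assumption.

```python
import numpy as np

# Values from the example in the thread (kcal/mol) for the two shared molecules.
assay_a = np.array([-7.0, -10.0])
assay_b = np.array([-9.0, -12.0])

# Fit assay_a = scale * assay_b + shift; with two points the fit is exact.
scale, shift = np.polyfit(assay_b, assay_a, deg=1)

# Per-molecule ratios between the two assays, as quoted in the message.
ratios = assay_a / assay_b

print(f"scale={scale:.3f}, shift={shift:.3f}, ratios={np.round(ratios, 2)}")
```

For these numbers the fitted scale is 1 and the fitted shift is 2 kcal/mol, so a pure shift already reproduces both values. The ratios differ (0.78 and 0.83) because adding a constant does not preserve ratios.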
>>>>
>>>> To follow up on Sebastian's suggestion: have you checked how different
>>>> the ranks/Z-scores you get are? Check out the Kendall Tau.
>>>>
>>>> Anyhow, you could build local models for specific experimental
>>>> methods. In our recent publication in a slightly different area
>>>> (protein-ligand scoring functions), we show that the RF built on one target
>>>> is just slightly better than the RF built on many targets (we've used the
>>>> DUD-E database); check out the "horizontal" and "per-target" splits:
>>>> https://www.nature.com/articles/srep46710. Unfortunately, this may
>>>> change for different models. Plus the molecular descriptors used, which we
>>>> know nothing about.
>>>>
>>>> I hope that helped a bit.
>>>>
>>>> ----
>>>> Pozdrawiam,  |  Best regards,
>>>> Maciek Wójcikowski
>>>> mac...@wojcikowski.pl
>>>>
>>>> 2017-09-05 19:35 GMT+02:00 Sebastian Raschka <se.rasc...@gmail.com>:
>>>>
>>>>> Another approach would be to pose this as a "ranking" problem to
>>>>> predict relative affinities rather than absolute affinities. E.g., if you
>>>>> have data from one (or more) molecules that has/have been tested under 2 or
>>>>> more experimental conditions, you can rank the other molecules accordingly
>>>>> or normalize. E.g., if you observe that the binding affinity of molecule a
>>>>> is -7 kcal/mol in assay A and -9 kcal/mol in assay B, and say the binding
>>>>> affinities of molecule b are -10 and -12 kcal/mol, respectively, that
>>>>> should give you some information for normalizing the values from assay B
>>>>> (e.g., by adding 2 kcal/mol). Of course this is not a perfect solution and
>>>>> might be error prone, but so are experimental assays ... (when I sometimes
>>>>> look at the std error/CI of the data I get from collaborators ...
>>>>> well, it seems that absolute binding affinities always have to be taken
>>>>> with a grain of salt anyway)
>>>>>
>>>>> Best,
>>>>> Sebastian
>>>>>
>>>>> > On Sep 5, 2017, at 1:02 PM, Jason Rudy <jcr...@gmail.com> wrote:
>>>>> >
>>>>> > Thomas,
>>>>> >
>>>>> > This is sort of related to the problem I did my M.S. thesis on years
>>>>> > ago: cross-platform normalization of gene expression data. If you google
>>>>> > that term you'll find some papers. The situation is somewhat different,
>>>>> > though, because with microarrays or RNA-seq you get thousands of data
>>>>> > points for each experiment, which makes it easier to estimate the batch
>>>>> > effect. The principle is similar, however.
>>>>> >
>>>>> > If I were in your situation, I would consider whether I have any of
>>>>> > the following advantages:
>>>>> >
>>>>> > 1. Some molecules that appear in multiple data sets
>>>>> > 2. Detailed information about the different experimental conditions
>>>>> > 3. Physical/chemical models of how experimental conditions influence
>>>>> >    binding affinity
>>>>> >
>>>>> > If you have any of the above, you can potentially use them to
>>>>> > improve your estimates. You could also consider using experiment ID as a
>>>>> > categorical predictor in a sufficiently general regression method.
>>>>> >
>>>>> > Lastly, you may already know this, but the term "meta-analysis" is
>>>>> > relevant here, and you can google for specific techniques. Most of these
>>>>> > would be more limited than what you are envisioning, I think.
>>>>> >
>>>>> > Best,
>>>>> >
>>>>> > Jason
>>>>> >
>>>>> > On Tue, Sep 5, 2017 at 6:39 AM, Thomas Evangelidis
>>>>> > <teva...@gmail.com> wrote:
>>>>> > Greetings,
>>>>> >
>>>>> > I am working on a problem that involves predicting the binding
>>>>> > affinity of small molecules on a receptor structure (it is a regression
>>>>> > problem, not classification).
>>>>> > I have multiple small datasets of molecules with
>>>>> > measured binding affinities on a receptor, but each dataset was measured
>>>>> > under different experimental conditions and therefore I cannot use them
>>>>> > all together as a training set. So, instead of using them individually, I
>>>>> > was wondering whether there is a method to combine them all into a super
>>>>> > training set. The first way I could think of is to convert the binding
>>>>> > affinities to Z-scores and then combine all the small datasets of
>>>>> > molecules. But this would be inaccurate because, firstly, the datasets
>>>>> > are very small (10-50 molecules each), and secondly, the range of binding
>>>>> > affinities differs in each experiment (some datasets contain really strong
>>>>> > binders, while others do not, etc.). Is there any other approach to
>>>>> > combine datasets with values coming from different sources? Maybe if
>>>>> > someone points me to the right reference I could read and understand if
>>>>> > it is applicable to my case.
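The second objection to per-dataset Z-scores raised above (different affinity ranges in each experiment) can be illustrated with a small sketch. The numbers and molecule names are hypothetical and chosen only to make the effect visible: the molecule "shared" has the same affinity in both datasets, yet it lands on opposite sides of zero after standardization, because one dataset contains stronger binders than the other.

```python
import numpy as np

# Hypothetical affinities (kcal/mol); "shared" has the same value in both sets.
strong_binders = {"s1": -12.0, "s2": -11.0, "shared": -10.0}
weak_binders = {"w1": -8.0, "w2": -9.0, "shared": -10.0}


def zscores(dataset):
    """Standardize one dataset on its own mean and standard deviation."""
    values = np.array(list(dataset.values()))
    mean, std = values.mean(), values.std()
    return {mol: (val - mean) / std for mol, val in dataset.items()}


z_strong = zscores(strong_binders)
z_weak = zscores(weak_binders)

# The same molecule receives a different Z-score depending on its companions.
print(z_strong["shared"], z_weak["shared"])
```

A combined training set built this way would assign the same molecule two contradictory target values, which is the inaccuracy the message is worried about.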
>>>>> >
>>>>> > Thanks,
>>>>> > Thomas
>>>>> >
>>>>> > --
>>>>> > ======================================================================
>>>>> > Dr Thomas Evangelidis
>>>>> > Post-doctoral Researcher
>>>>> > CEITEC - Central European Institute of Technology
>>>>> > Masaryk University
>>>>> > Kamenice 5/A35/2S049,
>>>>> > 62500 Brno, Czech Republic
>>>>> >
>>>>> > email: tev...@pharm.uoa.gr
>>>>> >        teva...@gmail.com
>>>>> >
>>>>> > website: https://sites.google.com/site/thomasevangelidishomepage/
>>>>> >
>>>>> > _______________________________________________
>>>>> > scikit-learn mailing list
>>>>> > scikit-learn@python.org
>>>>> > https://mail.python.org/mailman/listinfo/scikit-learn