Re: [Rdkit-discuss] Butina clustering with additional output
On 21/09/2018 16:53, Chris Earnshaw wrote:
> Hi
>
> I'm afraid I can't help with an RDKit solution to your question, but there are a couple of issues which should be borne in mind:
>
> 1) The centroid of a cluster is a vector mean of the fingerprints of all the members of the cluster and probably will not be represented _exactly_ by any member of the cluster; in this case no structures will have a distance of 0.0 from the centroid. Do you want to calculate the distances from the true centroid or from the structure(s) closest to the centroid?

I have seen 'clustroid' used in the literature to mean the cluster member nearest to the centroid of that cluster.

> 2) The Tanimoto metric doesn't obey the triangle inequality and is therefore sub-optimal for this kind of analysis. It's better to use an alternative which does obey the triangle inequality - e.g. the Cosine metric.

The opposite is true.

Sven Kosub. A note on the triangle inequality for the Jaccard distance. CoRR, abs/1612.02696, 2016.

Alan H. Lipkus. A proof of the triangle inequality for the Tanimoto distance. Journal of Mathematical Chemistry, 26(1):263–265, Oct 1999.

Cosine similarity, by contrast, is not a metric, according to Wikipedia.

I'm not a mathematician, but I think (1 - Tanimoto) is a proper distance as long as the molecules are encoded with only non-negative values. So Boolean fingerprints are OK, and counted unfolded fingerprints as well.

Regards,
Francois.

> Regards,
> Chris Earnshaw
>
> On Thu, 20 Sep 2018 at 21:55, James T. Metz via Rdkit-discuss wrote:
>> RDKit Discussion Group,
>>
>> I note that RDKit can perform Butina clustering. Given an SDF of small molecules I would like to cluster the ligands, but obtain additional information from the clustering algorithm. In particular, I would like to obtain the cluster number and Tanimoto distance from the centroid for every ligand in the SDF. The centroid would obviously have a distance of 0.00.
>>
>> Has anyone written additional RDKit code to extract this additional information? Thank you.
>>
>> Regards,
>> Jim Metz

___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
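For the original question (a cluster number and a Tanimoto distance from the centroid for every ligand), the idea can be sketched in a few lines. This is only an illustration, not RDKit code: it assumes fingerprints are plain Python sets of on-bit indices, the names `tanimoto_distance` and `butina_with_distances` are made up for this example, and "centroid" means the seed molecule of each cluster (so its distance is 0.0), not the vector mean discussed above.

```python
def tanimoto_distance(a, b):
    """1 - Tanimoto similarity for fingerprints stored as sets of on-bit indices."""
    union = len(a | b)
    return 0.0 if union == 0 else 1.0 - len(a & b) / union


def butina_with_distances(fps, cutoff):
    """Return one (cluster_number, distance_to_centroid) pair per fingerprint.

    Simplified sphere-exclusion sketch: the unassigned fingerprint with the
    most neighbours (distance <= cutoff) seeds the next cluster.
    """
    n = len(fps)
    dist = [[tanimoto_distance(fps[i], fps[j]) for j in range(n)] for i in range(n)]
    neighbours = [
        {j for j in range(n) if j != i and dist[i][j] <= cutoff} for i in range(n)
    ]
    # Candidate centroids, from most to fewest neighbours (ties: lowest index first).
    order = sorted(range(n), key=lambda i: (-len(neighbours[i]), i))
    result = [None] * n
    cluster_number = 0
    for seed in order:
        if result[seed] is not None:
            continue  # already claimed by an earlier, larger cluster
        result[seed] = (cluster_number, 0.0)
        for j in sorted(neighbours[seed]):
            if result[j] is None:
                result[j] = (cluster_number, dist[seed][j])
        cluster_number += 1
    return result


# Toy fingerprints (hypothetical data, not taken from the thread).
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 4, 5}, {10, 11, 12}, {10, 11, 13}, {20, 21}]
assignments = butina_with_distances(fps, cutoff=0.55)
for idx, (cluster, d) in enumerate(assignments):
    print(f"molecule {idx}: cluster {cluster}, distance to centroid {d:.2f}")
```

With RDKit itself, the same post-processing should apply to the tuples returned by `Butina.ClusterData` in `rdkit.ML.Cluster`, whose documentation states that the first element of each cluster is its centroid.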
Re: [Rdkit-discuss] Saving mol file
Dear Colin,

this is a specific problem I stumbled upon some time ago.[1] I also mentioned it to the rDock mailing list.[2]

Maybe there is a better work-around, but in the meantime I wrote the attached function. It takes a Mol Block as input; in my case these are stored in a dataframe.

Hope that helps!

Cheers,
Jose Manuel

Refs:
[1] https://sourceforge.net/p/rdkit/mailman/message/34740124/
[2] https://sourceforge.net/p/rdock/mailman/message/34741112/

2018-09-25 17:27 GMT+02:00 Colin Bournez :
> Well yes, I do have this line; I did not include the whole file, for clarity. The thing is, tools such as MOE and PyMOL read it without problems, but rDock, for example, can't read it properly and returns a neutral N, which is wrong. And if I open it with PyMOL and save it back in mol format, the 3 appears on the N line and rDock has no trouble anymore...
> I was just wondering if there was a trick in RDKit to also save it this way.
>
> On 25/09/18 17:18, Greg Landrum wrote:
>
> Hi Colin,
> The RDKit outputs charge information to mol blocks using the CHG line:
>
> In [3]: m = Chem.MolFromSmiles('C[NH3+]')
>
> In [4]: print(Chem.MolToMolBlock(m))
>
>      RDKit          2D
>
>   2  1  0  0  0  0  0  0  0  0999 V2000
>     0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
>     1.2990    0.7500    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
>   1  2  1  0
> M  CHG  1   2   1
> M  END
>
> I expect that you will find one of those in your mol file and that it should be properly read in by other tools.
> Is this not the case for you?
>
> Best,
> -greg
>
> On Tue, Sep 25, 2018 at 4:39 PM Colin Bournez <colin.bour...@univ-orleans.fr> wrote:
>
>> Hey everyone,
>>
>> I have a question concerning the Chem.MolToMolFile() function.
>> When I open this file containing an N+ (here is the corresponding line in the mol file):
>>
>>    11.3700    3.4360  -11.8300 N   0  3  0  0  0  0  0  0  0  0  0  0
>>
>> and just save it back without any modification, the line is then:
>>
>>    11.3700    3.4360  -11.8300 N   0  0  0  0  0  0  0  0  0  0  0  0
>>
>> The problem is that for some software this mol file causes trouble and the N+ is transformed to N with 4 bonds.
>> I tried several tricks but I was not able to save it as the original line. Does anyone have a suggestion?
>>
>> Thanks,
>>
>> --
>> Colin Bournez
>> PhD Student, Structural Bioinformatics & Chemoinformatics
>> Institut de Chimie Organique et Analytique (ICOA), UMR CNRS-Université d'Orléans 7311
>> Rue de Chartres, 45067 Orléans, France
>> T. +33 238 494 577

--
José-Manuel Gally
PhD Student
Structural Bioinformatics & Chemoinformatics
Institut de Chimie Organique et Analytique (ICOA)
UMR CNRS-Université d'Orléans 7311
Université d'Orléans
Rue de Chartres
F-45067 Orléans
phone: +33 238 494 577

def UpdateChargeFlagInAtomBlock(mb):
    """
    Take a Mol Block (string), read the charges listed on its M  CHG line and
    update the charge flags in the atom block accordingly.
    """
    f = "{:>10s}"*3 + "{:>2}{:>4s}" + "{:>3s}"*11
    chgs = []  # list of charges
    lines = mb.split("\n")
    # drop the leading empty line, if any (mb[0] == '' could never be true)
    if mb.startswith("\n"):
        del lines[0]
    CTAB = lines[2]
    # the counts line is fixed-width: the atom count sits in columns 1-3
    atomCount = int(CTAB[0:3])
    # parse mb line per line
    for l in lines:
        # look for M  CHG property
        if l[0:6] == "M  CHG":
            records = l.split()[3:]  # M  CHG X is not needed for parsing, the info we want comes afterwards
            # record each charge into a list
            for i in range(0, len(records), 2):
                idx = records[i]
                chg = records[i+1]
                chgs.append((int(idx), int(chg)))  # sort tuples by first element?
            break  # stop iterating
    # sort by idx in order to parse the molblock only once more
    chgs = sorted(chgs, key=lambda x: x[0])
    # now that we have a list for the current molblock, assign each charge
    for chg in chgs:
        i = 3
        while i < 3+atomCount:  # do not read from beginning each time, rather continue parsing mb!
            # when finding the idx of the atom we want to update, extract all fields and rewrite whole sequence
            if i-2 == chg[0]:  # -4 to take into account the CTAB headers, +1
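The attached function is cut off in this archive copy, so here is a minimal, self-contained sketch of the same idea. It is not the author's code. It assumes a V2000 mol block with its three header lines present and uses the charge codes of the CTfile atom block (1 = +3, 2 = +2, 3 = +1, 5 = -1, 6 = -2, 7 = -3), which matches the "3" Colin sees on the N+ line. The name `set_atom_block_charge_flags` is hypothetical.

```python
# V2000 atom-block charge codes: formal charge -> flag value.
CHARGE_TO_FLAG = {3: 1, 2: 2, 1: 3, -1: 5, -2: 6, -3: 7}


def set_atom_block_charge_flags(mol_block):
    """Copy the charges listed on 'M  CHG' lines into the atom-block charge column."""
    lines = mol_block.split("\n")
    # Assumption: the counts line is the 4th line (3 header lines come first).
    n_atoms = int(lines[3][0:3])
    charges = {}
    for line in lines:
        if line.startswith("M  CHG"):
            fields = line.split()[3:]  # skip 'M', 'CHG' and the entry count
            for k in range(0, len(fields), 2):
                charges[int(fields[k])] = int(fields[k + 1])
    for atom_idx, charge in charges.items():
        flag = CHARGE_TO_FLAG.get(charge)
        if flag is None or not 1 <= atom_idx <= n_atoms:
            continue  # not representable in the atom block; leave the line untouched
        row = 3 + atom_idx  # atom 1 is the line right after the counts line
        line = lines[row]
        # Characters 37-39 (1-based) of a V2000 atom line hold the charge flag.
        lines[row] = line[:36] + f"{flag:>3d}" + line[39:]
    return "\n".join(lines)


example = "\n".join([
    "",
    "     RDKit          2D",
    "",
    "  2  1  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.2990    0.7500    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",
    "M  CHG  1   2   1",
    "M  END",
])
updated = set_atom_block_charge_flags(example)
print(updated)
```

The `M  CHG` line is left in place, so readers that rely on the property block and readers that only look at the atom block should see the same charge.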
Re: [Rdkit-discuss] Butina clustering with additional output
On Sep 26, 2018, at 20:26, Peter S. Shenkin wrote:
> Ah, David, but how do you define a "real" singleton?

There can be many different definitions of what a '"real" singleton' might be, but we are specifically talking about Butina clustering. The Butina paper defines the term "false singleton", which Dave quoted.

The relevant text from DOI: 10.1021/ci9803381 is:

"""The molecules that have not been flagged by the end of the clustering process, either as a cluster centroid or as a cluster member, become singletons. It is important to emphasize at this stage that one of the consequences of this approach is that some molecules defined as singletons may have neighbors at the given Tanimoto similarity index, but those neighbors have been excluded by a ‘stronger’ cluster centroid, i.e., one with more neighbors in its list. the problem with the creation of a number of false singletons that do in fact have similar compounds within the set is easily offset by the final quality of the clusters that this approach generates."""

As you can see, there are two types of singletons, and one is called "false singleton". No specific name is used for the other type of singleton, but it's easy to see how they can be called "real" singletons, without confusion or misunderstanding.

(FWIW, my implementation, mentioned in an earlier email, uses the term "true singleton" as the singleton which is not a "false singleton", but the difference is only in the label.)

To confirm that this is what Dave means, I'll quote from his paper:

Blomberg, N., Cosgrove, D. A., Kenny, P. W., & Kolmodin, K. (2009). Design of compound libraries for fragment screening. Journal of Computer-Aided Molecular Design, 23(8), 513–525. doi:10.1007/s10822-009-9264-5

"""The clustering program flush_clus is an implementation of the sphere-exclusion algorithm of Taylor [41], which has also been reported independently by Butina ...

One consequence of the algorithm is the production of ‘false singleton clusters.’ The final clusters in the output are invariably singleton clusters, where the only member is the seed. Some of these will be true singletons, i.e. molecules lacking neighbors within the clustering threshold, but others (the false singletons) will be singletons by virtue of the fact that their neighbors were placed in other larger clusters in a previous iteration of the algorithm. The flush_clus program offers the opportunity of performing a final sweep through the clusters using a larger similarity threshold and placing the singleton molecules within the cluster for which it has the greatest similarity with the seed, so long as this is within the threshold."""

Cheers,

Andrew
da...@dalkescientific.com
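The true/false distinction quoted above can be checked after the fact by counting, for each singleton, its neighbours in the whole data set. The sketch below is an illustration under stated assumptions: fingerprints are Python sets of on-bit indices, `clusters` is a list of index lists with the seed first (as produced by whatever Taylor-Butina implementation was used), and `classify_singletons` is a made-up name, not part of RDKit or flush_clus.

```python
def tanimoto_similarity(a, b):
    """Tanimoto similarity of two fingerprints stored as sets of on-bit indices."""
    union = len(a | b)
    return 1.0 if union == 0 else len(a & b) / union


def classify_singletons(fps, clusters, threshold):
    """Map each singleton's index to 'true' or 'false'.

    A singleton is 'false' when it has at least one neighbour at the clustering
    threshold somewhere in the data set (that neighbour ended up in another
    cluster), and 'true' when it has none.
    """
    labels = {}
    for members in clusters:
        if len(members) != 1:
            continue
        seed = members[0]
        n_neighbours = sum(
            1
            for j, fp in enumerate(fps)
            if j != seed and tanimoto_similarity(fps[seed], fp) >= threshold
        )
        labels[seed] = "false" if n_neighbours else "true"
    return labels


# Toy data: molecule 4 is a neighbour of molecule 1 only, but molecule 1
# belongs to the larger cluster seeded by molecule 0.
fps = [
    {1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 4, 6}, {1, 3, 4, 7},
    {2, 3, 5, 8}, {20, 21, 22},
]
clusters = [[0, 1, 2, 3], [4], [5]]
labels = classify_singletons(fps, clusters, threshold=0.55)
print(labels)
```

Recording the neighbour counts during clustering, as Dave suggests, avoids this second pass; the recount is used here only to keep the example independent of any particular clustering code.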
Re: [Rdkit-discuss] Butina clustering with additional output
Ah, David, but how do you define a "real" singleton?

-P.

On Wed, Sep 26, 2018 at 1:30 PM David Cosgrove wrote:
> Slightly off topic, but a minor issue with the Taylor-Butina algorithm is that it generates “false singletons”. These are molecules just outside the clustering cutoff that are stranded when their neighbours are put in a different, larger cluster. We used to find it convenient to have a sweep of these, at a slightly looser cutoff, and drop them into the cluster whose centroid/seed they were nearest to. This could be added to Andrew’s code quite easily. At the very least, it’s worth keeping track of the initial number of neighbours within the cluster cutoff that each fingerprint had so as to distinguish real singletons from these artefactual ones.
> Dave
>
> On Tue, 25 Sep 2018 at 19:56, Peter S. Shenkin wrote:
>
>> Well, I'm not really familiar with the Taylor-Butina clustering method, so I'm proposing a methodology based on generalizing something that I found to be useful in a somewhat different clustering context.
>>
>> Presuming that what you are clustering is the fingerprints of structures, and that you know which structures are in each cluster, you'd compute the average of all the fingerprints. That is, each bit position would be given a floating point number that is the average of the 0s and 1s at that position computed over the structures in the cluster. Then you'd compute the distance (say, Manhattan or Euclidean) between the fingerprint of each structure in the cluster and the average so computed. The "most representative structure" would be the cluster member whose fingerprint is closest to the cluster's average fingerprint. (Some additional mileage could be gained by seeing just how far away from the average the "most representative structures" are. They might be more representative (i.e., closer) for some clusters than for others.)
>>
>> It would make sense to try this (since it's easy enough) and see whether the resulting "most representative structures" from the clusters really are at least roughly representative, by comparing them with viewable random subsets of structures from the clusters.
>>
>> -P.
>>
>> On Tue, Sep 25, 2018 at 2:36 PM, Andrew Dalke wrote:
>>
>>> On Sep 25, 2018, at 17:13, Peter S. Shenkin wrote:
>>> > FWIW, in work on conformational clustering, I used the “most representative” molecule; that is, the real molecule closest to the mathematical centroid. This would probably be the best way of displaying a single molecule that typifies what is in the cluster.
>>>
>>> In some sense I'm rephrasing Chris Earnshaw's earlier question - how does one do that with Taylor-Butina clustering? And does it make sense?
>>>
>>> The algorithm starts by picking a centroid based on the fingerprints with the highest number of neighbors, so none of the other cluster members should have more neighbors within that cutoff.
>>>
>>> I am far from an expert on this topic, but any alternative I can think of makes me think I should have started with something other than Taylor-Butina.
>>>
>>> Andrew
>>> da...@dalkescientific.com
>
> --
> David Cosgrove
> Freelance computational chemistry and chemoinformatics developer
> http://cozchemix.co.uk
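Peter's "most representative structure" proposal quoted above (average the fingerprints of a cluster bit by bit, then pick the member nearest to that average) can be sketched as follows. The assumptions are mine: fingerprints are Python sets of on-bit indices below `n_bits`, the Manhattan distance is used, and `most_representative` is a hypothetical helper name.

```python
def most_representative(fps, members, n_bits):
    """Return (best_index, distances) for one cluster.

    fps: list of fingerprints as sets of on-bit indices (all < n_bits).
    members: indices into fps that belong to the cluster.
    distances: Manhattan distance of every member to the mean fingerprint.
    """
    size = len(members)
    # Mean fingerprint: fraction of cluster members with each bit set.
    mean = [0.0] * n_bits
    for m in members:
        for bit in fps[m]:
            mean[bit] += 1.0 / size

    def manhattan(fp):
        return sum(abs((1.0 if bit in fp else 0.0) - mean[bit]) for bit in range(n_bits))

    distances = {m: manhattan(fps[m]) for m in members}
    # Ties are broken by the lower index, purely for reproducibility.
    best = min(members, key=lambda m: (distances[m], m))
    return best, distances


# Toy cluster of four fingerprints over 8 bits (hypothetical data).
fps = [{0, 1, 2}, {0, 1, 2, 3}, {0, 1, 2, 4}, {0, 1, 5}]
best, distances = most_representative(fps, members=[0, 1, 2, 3], n_bits=8)
print(best, distances)
```

The returned distances also give the "additional mileage" Peter mentions: a large minimum distance suggests that even the best member represents its cluster poorly.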
Re: [Rdkit-discuss] Butina clustering with additional output
Slightly off topic, but a minor issue with the Taylor-Butina algorithm is that it generates “false singletons”. These are molecules just outside the clustering cutoff that are stranded when their neighbours are put in a different, larger cluster. We used to find it convenient to have a sweep of these, at a slightly looser cutoff, and drop them into the cluster whose centroid/seed they were nearest to. This could be added to Andrew’s code quite easily. At the very least, it’s worth keeping track of the initial number of neighbours within the cluster cutoff that each fingerprint had so as to distinguish real singletons from these artefactual ones.

Dave

On Tue, 25 Sep 2018 at 19:56, Peter S. Shenkin wrote:
> Well, I'm not really familiar with the Taylor-Butina clustering method, so I'm proposing a methodology based on generalizing something that I found to be useful in a somewhat different clustering context.
>
> Presuming that what you are clustering is the fingerprints of structures, and that you know which structures are in each cluster, you'd compute the average of all the fingerprints. That is, each bit position would be given a floating point number that is the average of the 0s and 1s at that position computed over the structures in the cluster. Then you'd compute the distance (say, Manhattan or Euclidean) between the fingerprint of each structure in the cluster and the average so computed. The "most representative structure" would be the cluster member whose fingerprint is closest to the cluster's average fingerprint. (Some additional mileage could be gained by seeing just how far away from the average the "most representative structures" are. They might be more representative (i.e., closer) for some clusters than for others.)
>
> It would make sense to try this (since it's easy enough) and see whether the resulting "most representative structures" from the clusters really are at least roughly representative, by comparing them with viewable random subsets of structures from the clusters.
>
> -P.
>
> On Tue, Sep 25, 2018 at 2:36 PM, Andrew Dalke wrote:
>
>> On Sep 25, 2018, at 17:13, Peter S. Shenkin wrote:
>> > FWIW, in work on conformational clustering, I used the “most representative” molecule; that is, the real molecule closest to the mathematical centroid. This would probably be the best way of displaying a single molecule that typifies what is in the cluster.
>>
>> In some sense I'm rephrasing Chris Earnshaw's earlier question - how does one do that with Taylor-Butina clustering? And does it make sense?
>>
>> The algorithm starts by picking a centroid based on the fingerprints with the highest number of neighbors, so none of the other cluster members should have more neighbors within that cutoff.
>>
>> I am far from an expert on this topic, but any alternative I can think of makes me think I should have started with something other than Taylor-Butina.
>>
>> Andrew
>> da...@dalkescientific.com

--
David Cosgrove
Freelance computational chemistry and chemoinformatics developer
http://cozchemix.co.uk
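The sweep Dave describes (dropping stranded singletons into the cluster whose seed they are nearest to, at a looser cutoff) might look roughly like this. It is a sketch under assumptions, not Dave's or Andrew's code: fingerprints are Python sets of on-bit indices, each cluster is a list of indices with the seed first, and `sweep_singletons` and `loose_threshold` are names invented for the example.

```python
def tanimoto_similarity(a, b):
    """Tanimoto similarity of two fingerprints stored as sets of on-bit indices."""
    union = len(a | b)
    return 1.0 if union == 0 else len(a & b) / union


def sweep_singletons(fps, clusters, loose_threshold):
    """Move each singleton into the multi-member cluster with the most similar seed.

    A singleton is moved only if its similarity to that seed reaches
    loose_threshold; otherwise it stays a singleton. Returns a new list of
    clusters (multi-member clusters first, remaining singletons last).
    """
    merged = [list(c) for c in clusters if len(c) > 1]
    leftover = []
    for c in clusters:
        if len(c) != 1:
            continue
        single = c[0]
        best_sim, best_cluster = 0.0, None
        for target in merged:
            # target[0] is the seed; appended members never displace it.
            sim = tanimoto_similarity(fps[single], fps[target[0]])
            if sim >= loose_threshold and sim > best_sim:
                best_sim, best_cluster = sim, target
        if best_cluster is None:
            leftover.append([single])
        else:
            best_cluster.append(single)
    return merged + leftover


# Toy data: clustered at similarity 0.6, then swept at a looser 0.45.
fps = [{1, 2, 3, 4, 5, 6}, {1, 2, 3, 4, 5, 7}, {1, 2, 3, 4, 8, 9}, {20, 21}]
clusters = [[0, 1], [2], [3]]
swept = sweep_singletons(fps, clusters, loose_threshold=0.45)
print(swept)
```

Molecule 2 has similarity 0.5 to the seed of the first cluster, so it is absorbed at the looser threshold, while molecule 3 shares no bits with any seed and remains a singleton.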