Is it correct to use Morgan fingerprints for this type of analysis, given that individual bit positions don't correspond to specific substructures/features? The original work used keyed fingerprints (MACCS and PubChem).
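One way to look at this concern in RDKit is sketched below. It is an illustration only, not something taken from the paper: `bitInfo` records which atom environments set each bit of a folded Morgan fingerprint, and folding to a deliberately tiny `nBits=8` forces several distinct environments onto the same bit. The aspirin SMILES and the bit count are arbitrary choices made for the demonstration; at `nBits=2048` collisions are much rarer but can still occur.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Aspirin: an arbitrary example molecule, not taken from the thread.
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")

# Unfolded (sparse) Morgan fingerprint: one identifier per distinct atom environment.
sparse_fp = AllChem.GetMorganFingerprint(mol, 2)
n_environments = len(sparse_fp.GetNonzeroElements())

# Folded fingerprint with a deliberately tiny bit count to force collisions.
bit_info = {}
folded_fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=8, bitInfo=bit_info)
n_on_bits = folded_fp.GetNumOnBits()

print(f"{n_environments} distinct environments are folded onto {n_on_bits} bits")
for bit, hits in sorted(bit_info.items()):
    # Each hit is a (center atom index, radius) pair that set this bit.
    print(bit, hits)
```

A bit listed with several hits is not necessarily a collision, since equivalent atoms (for example the aromatic CH carbons) produce the same environment; the comparison of `n_environments` with `n_on_bits` is what shows that different environments must share bits.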
On Wed, Sep 15, 2021 at 11:25 AM Patrick Walters <wpwalt...@gmail.com> wrote:

> numpy!
>
> import pandas as pd
> import numpy as np
> from rdkit import Chem, DataStructs
> from rdkit.Chem import AllChem
>
> def smi2fp(smi):
>     # Morgan fingerprint (radius 2, 2048 bits) as a numpy array of 0/1 values
>     mol = Chem.MolFromSmiles(smi)
>     fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
>     arr = np.zeros((0,), dtype=np.int8)
>     DataStructs.ConvertToNumpyArray(fp, arr)
>     return arr
>
> df = pd.read_csv("chembl_drugs.smi", sep=" ", names=["SMILES", "Name"])
> df['fp'] = df.SMILES.apply(smi2fp)
> # Column-wise sum: number of molecules that set each bit
> db_fp = np.stack(df.fp).sum(axis=0)
>
> On Wed, Sep 15, 2021 at 9:32 AM Giovanni Tricarico <giovanni.tricar...@glpg.com> wrote:
>
>> Hello,
>>
>> based on this article:
>>
>> https://jcheminf.biomedcentral.com/articles/10.1186/s13321-017-0195-1
>>
>> I have been trying to make what they call a ‘database fingerprint’.
>>
>> The first step seems to require obtaining the frequencies of each fingerprint bit in a database of molecules.
>>
>> To do that, I calculated the fingerprints of a list of molecules (much larger than the one below; this is just an example):
>>
>> from rdkit import Chem
>> from rdkit.Chem import rdMolDescriptors
>>
>> ms = [Chem.MolFromSmiles(s) for s in ['c1ccccc1','CCC','CCCO']]
>> fps = [rdMolDescriptors.GetMorganFingerprint(m, 3, useCounts = False) for m in ms]
>>
>> My first attempt to obtain the database fingerprint was by looping through the fps and summing (+=), as that is reported to be an allowed operation for these fingerprints.
>>
>> This worked, but was very slow.
>>
>> My next attempt was to convert each fingerprint to a dictionary, and build the dictionary corresponding to the database fingerprint:
>>
>> database_fp_new = dict()
>>
>> for fp in fps:
>>     for fpbit in fp.GetNonzeroElements():
>>         if fpbit in database_fp_new:
>>             database_fp_new[fpbit] += 1
>>         else:
>>             database_fp_new[fpbit] = 1
>>
>> This worked, too, gave the same result as the ‘+=’ approach, and was much faster.
>>
>> {98513984: 1,
>> 2763854213: 1,
>> 3218693969: 1,
>> 3741631696: 1,
>> 2068133184: 1,
>> 2245384272: 2,
>> 2246728737: 2,
>> 3542456614: 2,
>> 864662311: 1,
>> 1173125914: 1,
>> 1365892349: 1,
>> 1535166686: 1,
>> 4023654873: 1}
>>
>> However, then I have a dictionary.
>>
>> But I need a fingerprint, because I want to do operations like similarity calculations (e.g. https://www.rdkit.org/docs/source/rdkit.DataStructs.cDataStructs.html?highlight=bulktanimoto#rdkit.DataStructs.cDataStructs.BulkTanimotoSimilarity ).
>>
>> Would anyone be able to suggest if and how the dictionary can be turned back into a fingerprint, or perhaps advise how to make the database fingerprint in a different way, if the one I figured out is not optimal?
>>
>> Thank you

--
Rajarshi Guha | http://blog.rguha.net | @rguha <https://twitter.com/rguha>
_______________________________________________ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
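On the question in the quoted message about turning the count dictionary back into a fingerprint, one possibility is to write the counts into an empty `UIntSparseIntVect`, which is the type `GetMorganFingerprint` returns. This is a minimal sketch under stated assumptions, not a recipe checked on a large database: the three SMILES are the ones from the original post, `Counter` stands in for the hand-written dictionary update, and whether a count vector is the right representation for the paper's database fingerprint is left open.

```python
from collections import Counter

from rdkit import Chem, DataStructs
from rdkit.Chem import rdMolDescriptors

ms = [Chem.MolFromSmiles(s) for s in ["c1ccccc1", "CCC", "CCCO"]]
fps = [rdMolDescriptors.GetMorganFingerprint(m, 3, useCounts=False) for m in ms]

# Count in how many molecules each environment identifier occurs.
counts = Counter()
for fp in fps:
    counts.update(fp.GetNonzeroElements().keys())

# Empty sparse vector with the same length as the per-molecule fingerprints.
db_fp = DataStructs.UIntSparseIntVect(fps[0].GetLength())
for bit_id, n in counts.items():
    db_fp[bit_id] = n

# Because db_fp has the same type and length as the molecule fingerprints,
# the similarity functions for sparse vectors accept it directly.
sims = [DataStructs.TanimotoSimilarity(db_fp, fp) for fp in fps]
print(sims)
```

Writing the counts in one pass avoids the repeated `+=` on fingerprint objects that the original post found slow, while still ending with an RDKit object instead of a plain dictionary.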