numpy!

import pandas as pd
from descriptor_gen import DescriptorGen  # not used in the snippet below
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

# Convert a SMILES string to a 2048-bit Morgan fingerprint as a NumPy array
def smi2fp(smi):
    mol = Chem.MolFromSmiles(smi)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
    arr = np.zeros((0,), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

df = pd.read_csv("chembl_drugs.smi", sep=" ", names=["SMILES", "Name"])
df['fp'] = df.SMILES.apply(smi2fp)
# Sum over molecules: one count per bit position
db_fp = np.stack(df.fp).sum(axis=0)

On Wed, Sep 15, 2021 at 9:32 AM Giovanni Tricarico <giovanni.tricar...@glpg.com> wrote:

> Hello,
>
> based on this article:
>
> https://jcheminf.biomedcentral.com/articles/10.1186/s13321-017-0195-1
>
> I have been trying to make what they call a ‘database fingerprint’.
>
> The first step seems to require obtaining the frequency of each
> fingerprint bit in a database of molecules.
>
> To do that, I calculated the fingerprints of a list of molecules (much
> larger than the one below; this is just an example):
>
> ms = [Chem.MolFromSmiles(s) for s in ['c1ccccc1','CCC','CCCO']]
> fps = [rdMolDescriptors.GetMorganFingerprint(m, 3, useCounts = False) for m in ms]
>
> My first attempt to obtain the database fingerprint was by looping through
> the fps and summing (+=), as that is reported to be an allowed operation
> for these fingerprints.
> This worked, but was very slow.
>
> My next attempt was to convert each fingerprint to a dictionary, and build
> the dictionary corresponding to the database fingerprint:
>
> database_fp_new = dict()
>
> for i,fp in enumerate(fps):
>     for fpbit in fp.GetNonzeroElements():
>         if fpbit in database_fp_new:
>             database_fp_new[fpbit] += 1
>         else:
>             database_fp_new[fpbit] = 1
>
> This worked, too, gave the same result as the ‘+=’ approach, and was much
> faster.
>
> {98513984: 1,
> 2763854213: 1,
> 3218693969: 1,
> 3741631696: 1,
> 2068133184: 1,
> 2245384272: 2,
> 2246728737: 2,
> 3542456614: 2,
> 864662311: 1,
> 1173125914: 1,
> 1365892349: 1,
> 1535166686: 1,
> 4023654873: 1}
>
> However, then I have a dictionary.
>
> But I need a fingerprint, because I want to do operations like similarity
> calculations (e.g.
> https://www.rdkit.org/docs/source/rdkit.DataStructs.cDataStructs.html?highlight=bulktanimoto#rdkit.DataStructs.cDataStructs.BulkTanimotoSimilarity
> ).
>
> Would anyone be able to suggest if and how the dictionary can be turned
> back into a fingerprint, or perhaps advise how to make the database
> fingerprint in a different way, if the one I figured out is not optimal?
>
> Thank you
>
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
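
On the question of getting from the dictionary of counts back to something you can compare against, here is a rough pure-Python sketch of one possible route. It assumes the database fingerprint keeps only the bits that occur in at least some fraction of the molecules. The 0.5 cutoff, the function names, and the toy bit ids are all illustrative assumptions, not taken from the article or from RDKit.

```python
from collections import Counter


def bit_frequencies(fingerprints):
    """Count in how many fingerprints each bit id is set.

    Each fingerprint is an iterable of on-bit ids, e.g. the keys
    returned by GetNonzeroElements() in the quoted message.
    """
    counts = Counter()
    for on_bits in fingerprints:
        counts.update(set(on_bits))  # set(): count each bit once per molecule
    return counts


def database_fingerprint(counts, n_mols, min_fraction=0.5):
    """Keep bits present in at least min_fraction of the molecules.

    min_fraction is a hypothetical cutoff chosen for illustration.
    """
    return {bit for bit, n in counts.items() if n / n_mols >= min_fraction}


def tanimoto(a, b):
    """Tanimoto similarity between two collections of on-bit ids."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


# Toy on-bit sets standing in for real fingerprints (made-up ids)
fps = [{1, 2, 3}, {2, 3, 4}, {3, 5}]
counts = bit_frequencies(fps)
dfp = database_fingerprint(counts, len(fps), min_fraction=0.5)
sims = [tanimoto(fp, dfp) for fp in fps]
```

With folded, fixed-length fingerprints, the summed array db_fp from the snippet at the top plays the same role as counts here, and the same kind of cutoff could be applied to it with a NumPy comparison.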