numpy!

import numpy as np
import pandas as pd
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def smi2fp(smi):
    # Parse the SMILES and compute a 2048-bit Morgan fingerprint (radius 2)
    mol = Chem.MolFromSmiles(smi)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
    # ConvertToNumpyArray resizes arr to the fingerprint length
    arr = np.zeros((0,), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

# Space-separated file: one SMILES column and one name column
df = pd.read_csv("chembl_drugs.smi", sep=" ", names=["SMILES", "Name"])
df['fp'] = df.SMILES.apply(smi2fp)
# Stack the per-molecule bit arrays and sum column-wise to get per-bit counts
db_fp = np.stack(df.fp).sum(axis=0)
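
Once the per-bit counts are in a numpy array, the similarity part can stay in numpy as well. Below is a minimal sketch on toy data, not a definitive implementation. It assumes the database fingerprint is obtained by switching a bit on when its frequency exceeds a cutoff, which is my reading of the approach in the article. The helper names (database_fingerprint, tanimoto) and the 0.5 threshold are hypothetical choices for illustration only.

```python
import numpy as np

def database_fingerprint(bit_counts, n_mols, threshold=0.5):
    """Binarize per-bit counts: a bit is on if its frequency exceeds threshold.

    The threshold value is an assumption for illustration only.
    """
    freq = np.asarray(bit_counts, dtype=float) / n_mols
    return (freq > threshold).astype(np.uint8)

def tanimoto(a, b):
    """Tanimoto similarity between two binary vectors of equal length."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0  # convention chosen here for two all-zero vectors
    return float(np.logical_and(a, b).sum() / union)

# Toy stand-in for np.stack(df.fp): 3 molecules, 4 bits each
fps = np.array([[1, 1, 0, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1]], dtype=np.int8)

counts = fps.sum(axis=0)  # per-bit counts, same idea as db_fp above
dfp = database_fingerprint(counts, n_mols=len(fps), threshold=0.5)

# Similarity of each molecule to the database fingerprint
sims = [tanimoto(dfp, fp) for fp in fps]
```

With the real data, counts would be db_fp and n_mols would be len(df). The cutoff is something to tune for the database at hand.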

On Wed, Sep 15, 2021 at 9:32 AM Giovanni Tricarico <
giovanni.tricar...@glpg.com> wrote:

> Hello,
>
> based on this article:
>
>
>
> https://jcheminf.biomedcentral.com/articles/10.1186/s13321-017-0195-1
>
>
>
> I have been trying to make what they call a ‘database fingerprint’.
>
>
>
> The first step seems to require obtaining the frequencies of each
> fingerprint bit in a database of molecules.
>
> To do that, I calculated the fingerprints of a list of molecules (much
> larger than the one below; this is just an example):
>
>
>
> ms = [Chem.MolFromSmiles(s) for s in ['c1ccccc1','CCC','CCCO']]
>
> fps = [rdMolDescriptors.GetMorganFingerprint(m, 3, useCounts=False) for m in ms]
>
>
>
> My first attempt to obtain the database fingerprint was by looping through
> the fps and summing (+=), as that is reported to be an allowed operation
> for these fingerprints.
>
> This worked, but was very slow.
>
>
>
> My next attempt was to convert each fingerprint to a dictionary, and build
> the dictionary corresponding to the database fingerprint:
>
>
>
> database_fp_new = dict()
>
> for i,fp in enumerate(fps):
>     for fpbit in fp.GetNonzeroElements():
>         if fpbit in database_fp_new:
>             database_fp_new[fpbit] += 1
>         else:
>             database_fp_new[fpbit] = 1
>
>
>
> This worked too, gave the same result as the ‘+=’ approach, and was much
> faster.
>
>
>
> {98513984: 1,
> 2763854213: 1,
> 3218693969: 1,
> 3741631696: 1,
> 2068133184: 1,
> 2245384272: 2,
> 2246728737: 2,
> 3542456614: 2,
> 864662311: 1,
> 1173125914: 1,
> 1365892349: 1,
> 1535166686: 1,
> 4023654873: 1}
>
>
>
> However, then I have a dictionary.
>
> But I need a fingerprint, because I want to do operations like similarity
> calculations (e.g.
> https://www.rdkit.org/docs/source/rdkit.DataStructs.cDataStructs.html?highlight=bulktanimoto#rdkit.DataStructs.cDataStructs.BulkTanimotoSimilarity
> ).
>
>
>
> Would anyone be able to suggest if and how the dictionary can be turned back
> into a fingerprint, or perhaps advise how to make the database fingerprint
> in a different way, if the one I figured out is not optimal?
>
>
>
> Thank you
>
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
