Re: [Rdkit-discuss] how to make a database fingerprint

Giovanni Tricarico Wed, 15 Sep 2021 23:54:31 -0700

Thank you all for your feedback!

@ Rajarshi Guha : you’re right; however, I am using unfolded fingerprints, as 
you can see in my original code, and my analysis in fact goes beyond what the 
article describes. The concept of a ‘global fingerprint’ of a set was already 
known well before that article appeared, and there is no strict need to use a 
specific type of fingerprint, as long as the same type is used consistently.


@ Patrick Walters : sure, I could use numpy for fast summing of binary vectors, 
but then 1) I would have to use folded fingerprints, which I can’t do; 2) I 
would still end up with something that is not a fingerprint in the format I 
need. Could I use for instance the BulkTanimotoSimilarity function between two 
db_fp’s like the one you defined?
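
For reference, a rough sketch of what a Tanimoto comparison between two summed count arrays such as db_fp might look like in plain numpy. This is not an RDKit function: count_tanimoto is a hypothetical helper name, the sum(min)/sum(max) form of Tanimoto for count vectors is assumed, and it only applies to folded arrays of equal length.

```python
import numpy as np

def count_tanimoto(a, b):
    """Tanimoto for non-negative count vectors: sum(min) / sum(max)."""
    a = np.asarray(a, dtype=np.int64)
    b = np.asarray(b, dtype=np.int64)
    denom = np.maximum(a, b).sum()
    if denom == 0:
        # Both vectors are all zeros; similarity is defined as 0 here (assumption)
        return 0.0
    return float(np.minimum(a, b).sum() / denom)

# Toy count arrays standing in for two db_fp vectors
db_fp_a = np.array([2, 0, 1, 3])
db_fp_b = np.array([1, 2, 1, 0])
print(count_tanimoto(db_fp_a, db_fp_b))
```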

@ Andrew Dalke : I am sure the solution you suggest is very efficient; for now, 
however, I would like to try to stick to standard rdkit functionality, which 
several scripts I have already written rely on. I just need to be able to use 
the same scripts with a fingerprint that is not only for a single compound, but 
for a set of compounds.

In essence, let’s assume I consider the dictionary method satisfactory for 
combining multiple unfolded bit fp’s into a count fingerprint for a database.
Now I need to know how to reverse the GetNonzeroElements() function, i.e. 
instead of going from unfolded fingerprint to dictionary, take a dictionary and 
turn it into an unfolded fingerprint of a type that can be handled by 
DataStructs’ similarity functions.

I see some information about the ‘construction’ of the fp here:

https://www.rdkit.org/docs/source/rdkit.DataStructs.cDataStructs.html#rdkit.DataStructs.cDataStructs.UIntSparseIntVect

but frankly I have no idea how to use it. Hence my post here.
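
A minimal sketch of how that constructor might be used, assuming the Python wrapper accepts a length argument and supports item assignment. dict_to_sparse_fp is a hypothetical helper name, and the length of 2**32 - 1 is an assumption that should be checked against GetLength() on a real unfolded fingerprint, since the similarity functions expect both vectors to have the same length.

```python
from rdkit import DataStructs

# Assumption: unfolded Morgan fingerprints use a 32-bit index space, so the
# vector length is 2**32 - 1; verify with fp.GetLength() on a real fingerprint.
ASSUMED_LENGTH = 2**32 - 1

def dict_to_sparse_fp(counts, length=ASSUMED_LENGTH):
    """Build a UIntSparseIntVect from a {bit_id: count} dictionary."""
    fp = DataStructs.UIntSparseIntVect(length)
    for bit_id, count in counts.items():
        fp[bit_id] = count
    return fp

# Two small illustrative dictionaries (bit IDs taken from the example output)
db_a = dict_to_sparse_fp({98513984: 1, 2245384272: 2, 2246728737: 2})
db_b = dict_to_sparse_fp({98513984: 1, 2245384272: 1, 864662311: 1})

print(db_a.GetNonzeroElements())
print(DataStructs.TanimotoSimilarity(db_a, db_b))
print(DataStructs.BulkTanimotoSimilarity(db_a, [db_a, db_b]))
```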

I also just found this webpage, where a numpy array is converted to a fp using 
DataStructs.cDataStructs.CreateFromBitString:

https://iwatobipen.wordpress.com/2019/02/08/convert-fingerprint-to-numpy-array-and-conver-numpy-array-to-fingerprint-rdkit-memorandum/

but again, folded fingerprints are used; I need unfolded ones :/

Thanks

PS
In fact, at the moment I am calculating similarities between dictionaries, as 
very conveniently one can find the overlap between the keys of two dictionaries 
by a simple ‘&’ operation.
I only suspect that this is much less efficient than the built-in 
similarity operations defined in rdkit. Hence my attempt to go back to fp.
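
A small sketch of the dictionary-based similarity, using only the standard library. Plain dict objects do not support ‘&’ directly; their key views and collections.Counter do. dict_tanimoto is a hypothetical helper name, and the count-based Tanimoto (sum of per-bit minima over sum of per-bit maxima) is assumed to be the measure of interest.

```python
from collections import Counter

def dict_tanimoto(counts_a, counts_b):
    """Count-based Tanimoto between two {bit_id: count} dictionaries."""
    a, b = Counter(counts_a), Counter(counts_b)
    shared = sum((a & b).values())  # Counter '&' keeps the per-key minimum
    total = sum((a | b).values())   # Counter '|' keeps the per-key maximum
    return shared / total if total else 0.0

d1 = {98513984: 1, 2245384272: 2, 2246728737: 2}
d2 = {98513984: 1, 2245384272: 1, 864662311: 1}

print(d1.keys() & d2.keys())   # bit IDs present in both dictionaries
print(dict_tanimoto(d1, d2))
```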

From: Rajarshi Guha <rajarshi.g...@gmail.com>
Sent: 15 September 2021 17:39
To: Patrick Walters <wpwalt...@gmail.com>
Cc: Giovanni Tricarico <giovanni.tricar...@glpg.com>; 
rdkit-discuss@lists.sourceforge.net
Subject: Re: [Rdkit-discuss] how to make a database fingerprint


Is it correct to use Morgan fingerprints for this type of analysis, given that 
individual bit positions don't correspond to specific substructures/features? 
The original work used key fp's (MACCS and Pubchem)

On Wed, Sep 15, 2021 at 11:25 AM Patrick Walters 
<wpwalt...@gmail.com> wrote:
numpy!

import numpy as np
import pandas as pd
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def smi2fp(smi):
    # Folded 2048-bit Morgan fingerprint (radius 2) as a numpy array of 0/1
    mol = Chem.MolFromSmiles(smi)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
    arr = np.zeros((0,), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(fp, arr)  # resizes arr to the fp length
    return arr

df = pd.read_csv("chembl_drugs.smi", sep=" ", names=["SMILES", "Name"])
df['fp'] = df.SMILES.apply(smi2fp)
# Per-bit counts over the whole set
db_fp = np.stack(df.fp).sum(axis=0)

On Wed, Sep 15, 2021 at 9:32 AM Giovanni Tricarico 
<giovanni.tricar...@glpg.com> wrote:
Hello,
based on this article:

https://jcheminf.biomedcentral.com/articles/10.1186/s13321-017-0195-1

I have been trying to make what they call a ‘database fingerprint’.

The first step seems to require obtaining the frequencies of each fingerprint 
bit in a database of molecules.
To do that, I calculated the fingerprints of a list of molecules (much larger 
than the one below; this is just an example):

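from rdkit import Chem
from rdkit.Chem import rdMolDescriptors

ms = [Chem.MolFromSmiles(s) for s in ['c1ccccc1', 'CCC', 'CCCO']]
# Unfolded Morgan fingerprints (radius 3), one sparse vector per molecule
fps = [rdMolDescriptors.GetMorganFingerprint(m, 3, useCounts=False) for m in ms]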

My first attempt to obtain the database fingerprint was by looping through the 
fps and summing (+=), as that is reported to be an allowed operation for these 
fingerprints.
This worked, but was very slow.

My next attempt was to convert each fingerprint to a dictionary, and build the 
dictionary corresponding to the database fingerprint:

database_fp_new = dict()

# Count, for each bit ID, how many molecules have that bit set
for fp in fps:
    for fpbit in fp.GetNonzeroElements():
        database_fp_new[fpbit] = database_fp_new.get(fpbit, 0) + 1

This worked, too, gave the same result as the ‘+=’ approach, and was much 
faster.

{98513984: 1,
2763854213: 1,
3218693969: 1,
3741631696: 1,
2068133184: 1,
2245384272: 2,
2246728737: 2,
3542456614: 2,
864662311: 1,
1173125914: 1,
1365892349: 1,
1535166686: 1,
4023654873: 1}
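
The same counting step can also be sketched with collections.Counter. To keep the example self-contained, it takes a list of per-molecule {bit_id: value} dictionaries, such as those returned by GetNonzeroElements(), instead of RDKit fingerprint objects; build_database_counts is a hypothetical helper name and the bit IDs are toy values.

```python
from collections import Counter

def build_database_counts(per_molecule_bits):
    """Count, for each bit ID, how many molecules have that bit set."""
    db_counts = Counter()
    for bits in per_molecule_bits:
        # set(bits) yields the dictionary keys, so each molecule
        # contributes at most 1 per bit ID
        db_counts.update(set(bits))
    return dict(db_counts)

# Toy stand-ins for [fp.GetNonzeroElements() for fp in fps]
example = [{10: 1, 20: 1}, {20: 1, 30: 1}, {20: 1}]
print(build_database_counts(example))
```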

However, then I have a dictionary.
But I need a fingerprint, because I want to do operations like similarity 
calculations (e.g. 
https://www.rdkit.org/docs/source/rdkit.DataStructs.cDataStructs.html?highlight=bulktanimoto#rdkit.DataStructs.cDataStructs.BulkTanimotoSimilarity
 ).

Would anyone be able to suggest if and how the dictionary can be turned back 
into a fingerprint, or perhaps advise how to make the database fingerprint in a 
different way, if the one I figured out is not optimal?

Thank you

_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


--
Rajarshi Guha | http://blog.rguha.net | @rguha<https://twitter.com/rguha>
