Is it appropriate to use Morgan fingerprints for this type of analysis, given
that individual bit positions in a hashed fingerprint don't correspond to
specific substructures/features? The original work used keyed fingerprints
(MACCS and PubChem).
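
To make the concern concrete, here is a rough sketch of what folding does to
the sparse Morgan feature IDs quoted further down in this thread. It assumes
that folding maps each ID to a bit by taking it modulo the vector length, and
it uses a deliberately tiny, hypothetical length of 8 bits so that collisions
are guaranteed. With 2048 bits collisions are rarer, but the same ambiguity
applies.

```python
from collections import defaultdict

# Sparse Morgan feature IDs copied from the dictionary quoted later in this thread.
feature_ids = [
    98513984, 2763854213, 3218693969, 3741631696, 2068133184,
    2245384272, 2246728737, 3542456614, 864662311, 1173125914,
    1365892349, 1535166686, 4023654873,
]

# Hypothetical, deliberately tiny bit-vector length: 13 IDs cannot fit into
# 8 positions without at least two of them sharing a bit.
N_BITS = 8

# Assumption: folding sends each feature ID to position (ID mod N_BITS).
buckets = defaultdict(list)
for fid in feature_ids:
    buckets[fid % N_BITS].append(fid)

# Bits that stand for more than one distinct feature.
collisions = {bit: ids for bit, ids in buckets.items() if len(ids) > 1}
for bit, ids in sorted(collisions.items()):
    print(f"bit {bit} is shared by {len(ids)} distinct feature IDs")
```

A bit that is frequent in the database may therefore reflect several unrelated
environments, which is not the case for a keyed fingerprint.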

On Wed, Sep 15, 2021 at 11:25 AM Patrick Walters <wpwalt...@gmail.com>
wrote:

> numpy!
>
> import numpy as np
> import pandas as pd
> from rdkit import Chem, DataStructs
> from rdkit.Chem import AllChem
>
> def smi2fp(smi):
>     # Morgan fingerprint (radius 2, folded to 2048 bits) as a numpy array
>     mol = Chem.MolFromSmiles(smi)
>     fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
>     arr = np.zeros((0,), dtype=np.int8)
>     DataStructs.ConvertToNumpyArray(fp, arr)
>     return arr
>
> df = pd.read_csv("chembl_drugs.smi", sep=" ", names=["SMILES", "Name"])
> df['fp'] = df.SMILES.apply(smi2fp)
> # Column-wise sum: the number of molecules that set each bit
> db_fp = np.stack(df.fp).sum(axis=0)
>
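
As far as I understand the article cited below, the per-bit counts are then
turned into a binary 'database fingerprint' by keeping the bits whose
frequency reaches some cutoff. A minimal sketch of that step follows. The toy
3 x 8 array stands in for the stacked fingerprints, and the 0.5 cutoff is an
assumption chosen for illustration, not a value taken from the paper.

```python
import numpy as np

# Toy stand-in for np.stack(df.fp): 3 molecules x 8 bits (made-up values).
fps = np.array([
    [1, 0, 1, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 1, 0, 1],
    [1, 1, 0, 0, 0, 0, 0, 0],
], dtype=np.uint8)

# Number of molecules that set each bit (same operation as db_fp above).
counts = fps.sum(axis=0)

# Fraction of molecules that set each bit.
freq = counts / fps.shape[0]

# Hypothetical cutoff: keep a bit if at least half of the molecules set it.
THRESHOLD = 0.5
dfp = (freq >= THRESHOLD).astype(np.uint8)

print(counts)
print(dfp)
```

The result is an ordinary 0/1 vector, so it can be compared with
single-molecule fingerprints of the same length.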
> On Wed, Sep 15, 2021 at 9:32 AM Giovanni Tricarico <
> giovanni.tricar...@glpg.com> wrote:
>
>> Hello,
>>
>> based on this article:
>>
>>
>>
>> https://jcheminf.biomedcentral.com/articles/10.1186/s13321-017-0195-1
>>
>>
>>
>> I have been trying to make what they call a ‘database fingerprint’.
>>
>>
>>
>> The first step seems to require obtaining the frequencies of each
>> fingerprint bit in a database of molecules.
>>
>> To do that, I calculated the fingerprints of a list of molecules (much
>> larger than the one below; this is just an example):
>>
>>
>>
>> from rdkit import Chem
>> from rdkit.Chem import rdMolDescriptors
>>
>> ms = [Chem.MolFromSmiles(s) for s in ['c1ccccc1', 'CCC', 'CCCO']]
>> fps = [rdMolDescriptors.GetMorganFingerprint(m, 3, useCounts=False) for m in ms]
>>
>>
>>
>> My first attempt to obtain the database fingerprint was by looping through
>> the fps and summing (+=), as that is reported to be an allowed operation
>> for these fingerprints.
>>
>> This worked, but was very slow.
>>
>>
>>
>> My next attempt was to convert each fingerprint to a dictionary, and
>> build the dictionary corresponding to the database fingerprint:
>>
>>
>>
>> database_fp_new = dict()
>>
>> for fp in fps:
>>     for fpbit in fp.GetNonzeroElements():
>>         if fpbit in database_fp_new:
>>             database_fp_new[fpbit] += 1
>>         else:
>>             database_fp_new[fpbit] = 1
>>
>>
>>
>> This worked, too, gave the same result as the ‘+=’ approach, and was much
>> faster.
>>
>>
>>
>> {98513984: 1,
>>  2763854213: 1,
>>  3218693969: 1,
>>  3741631696: 1,
>>  2068133184: 1,
>>  2245384272: 2,
>>  2246728737: 2,
>>  3542456614: 2,
>>  864662311: 1,
>>  1173125914: 1,
>>  1365892349: 1,
>>  1535166686: 1,
>>  4023654873: 1}
>>
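
For what it's worth, the counting loop above can probably be shortened with
collections.Counter from the standard library. The sketch below uses made-up
per-molecule dictionaries as a stand-in for fp.GetNonzeroElements(), so the
IDs and counts are illustrative only.

```python
from collections import Counter

# Hypothetical stand-in for [fp.GetNonzeroElements() for fp in fps]:
# one {feature_id: count} dict per molecule (toy values).
per_molecule = [
    {98513984: 6, 2763854213: 6, 3218693969: 6},
    {2245384272: 1, 2246728737: 2, 3542456614: 1},
    {2245384272: 2, 2246728737: 1, 3542456614: 1, 864662311: 1},
]

database_fp = Counter()
for elements in per_molecule:
    # Passing only the keys adds 1 per molecule that contains the feature.
    # Passing the dict itself would add the within-molecule counts instead.
    database_fp.update(elements.keys())

print(dict(database_fp))
```

Counter is a dict subclass, so the result can be used wherever the plain
dictionary was used.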
>>
>>
>> However, then I have a dictionary.
>>
>> But I need a fingerprint, because I want to do operations like similarity
>> calculations (e.g.
>> https://www.rdkit.org/docs/source/rdkit.DataStructs.cDataStructs.html?highlight=bulktanimoto#rdkit.DataStructs.cDataStructs.BulkTanimotoSimilarity
>> ).
>>
>>
>>
>> Would anyone be able to suggest if and how the dictionary can be turned back
>> into a fingerprint, or perhaps advise how to make the database fingerprint
>> in a different way, if the one I figured out is not optimal?
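
One possible way to go from the dictionary back to an RDKit object is to fill
an empty sparse vector entry by entry. This is only a sketch. It assumes that
GetMorganFingerprint returns a UIntSparseIntVect of length 2**32 - 1 (reading
the length from a real fingerprint with GetLength() would be safer), and the
query vector is a hypothetical example, not a real molecule.

```python
from rdkit import DataStructs

# Per-feature molecule counts, copied from the dictionary quoted above.
database_fp_new = {
    98513984: 1, 2763854213: 1, 3218693969: 1, 3741631696: 1,
    2068133184: 1, 2245384272: 2, 2246728737: 2, 3542456614: 2,
    864662311: 1, 1173125914: 1, 1365892349: 1, 1535166686: 1,
    4023654873: 1,
}

# Assumption: this matches the length of the sparse Morgan vectors.
VECT_LENGTH = 2**32 - 1


def dict_to_sparse_vect(counts, length=VECT_LENGTH):
    """Build a UIntSparseIntVect from a {feature_id: count} dictionary."""
    vect = DataStructs.UIntSparseIntVect(length)
    for feature_id, count in counts.items():
        vect[feature_id] = count
    return vect


db_vect = dict_to_sparse_vect(database_fp_new)

# Hypothetical query sharing three features with the database vector.
query = dict_to_sparse_vect({2245384272: 1, 2246728737: 1, 3542456614: 1})

sim = DataStructs.TanimotoSimilarity(db_vect, query)
print(sim)
```

Note that the similarity here is computed on counts, so a thresholded 0/1
version of the database fingerprint would give different values.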
>>
>>
>>
>> Thank you
>>
>> _______________________________________________
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>


-- 
Rajarshi Guha | http://blog.rguha.net | @rguha <https://twitter.com/rguha>