Thank you all for your feedback!

@ Rajarshi Guha: you’re right; however, I am using unfolded fingerprints, as you can see in my original code, and my analysis in fact goes beyond what the article describes. The concept of a ‘global fingerprint’ of a set was known well before that article appeared, and there is no strict need to use a specific type of fingerprint, as long as the same type is used consistently.

@ Patrick Walters: sure, I could use numpy for fast summing of binary vectors, but then 1) I would have to use folded fingerprints, which I can’t do; 2) I would still end up with something that is not a fingerprint in the format I need. Could I, for instance, use the BulkTanimotoSimilarity function between two db_fp’s like the one you defined?

@ Andrew Dalke: I am sure the solution you suggest is very efficient; for now, however, I would like to stick to standard rdkit functionality, which several scripts I have already written rely on. I just need to be able to use the same scripts with a fingerprint that represents not a single compound but a set of compounds.

In essence, let’s assume I consider the dictionary method satisfactory for combining multiple unfolded bit fp’s into a count fingerprint for a database. Now I need to know how to reverse the GetNonzeroElements() function, i.e. instead of going from unfolded fingerprint to dictionary, take a dictionary and turn it into an unfolded fingerprint of a type that can be handled by DataStructs’ similarity functions.

I see some information about the ‘construction’ of the fp here:
https://www.rdkit.org/docs/source/rdkit.DataStructs.cDataStructs.html#rdkit.DataStructs.cDataStructs.UIntSparseIntVect
but frankly I have no idea how to use it. Hence my post here.

I also just found this webpage, where a numpy array is converted to a fp using DataStructs.cDataStructs.CreateFromBitString:
https://iwatobipen.wordpress.com/2019/02/08/convert-fingerprint-to-numpy-array-and-conver-numpy-array-to-fingerprint-rdkit-memorandum/
but again, folded fingerprints are used; I need unfolded ones :/

Thanks

PS In fact, at the moment I am calculating similarities between dictionaries, as one can very conveniently find the overlap between dictionaries with a simple ‘&’ operation. I only suspect that this is much less efficient than the built-in similarity operations defined in rdkit. Hence my attempt to go back to fp.
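
For illustration, the dictionary-based comparison mentioned in the PS could look roughly like the sketch below. It assumes the database fingerprints are held as collections.Counter objects keyed by bit ID (a plain dict does not support ‘&’, whereas Counter objects and dict.keys() views do). The function name count_tanimoto and the sample counts are hypothetical, and the count-based Tanimoto shown here (sum of per-bit minima over sum of per-bit maxima) is one common definition, not necessarily the one used in the scripts discussed in this thread.

```python
from collections import Counter


def count_tanimoto(a: Counter, b: Counter) -> float:
    """Count-based Tanimoto: sum of per-bit minima over sum of per-bit maxima."""
    shared = sum((a & b).values())  # Counter & Counter keeps min(count) per key
    total = sum((a | b).values())   # Counter | Counter keeps max(count) per key
    return shared / total if total else 0.0


# Hypothetical database fingerprints: bit ID -> number of molecules setting it
db_a = Counter({2245384272: 2, 2246728737: 2, 98513984: 1})
db_b = Counter({2245384272: 1, 3542456614: 2, 98513984: 1})

print(count_tanimoto(db_a, db_b))
```

Only the bits present in at least one of the two Counters are visited, so the cost scales with the number of distinct set bits rather than with the nominal length of the unfolded fingerprint.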

From: Rajarshi Guha <rajarshi.g...@gmail.com>
Sent: 15 September 2021 17:39
To: Patrick Walters <wpwalt...@gmail.com>
Cc: Giovanni Tricarico <giovanni.tricar...@glpg.com>; rdkit-discuss@lists.sourceforge.net
Subject: Re: [Rdkit-discuss] how to make a database fingerprint

Is it correct to use Morgan fingerprints for this type of analysis, given that individual bit positions don't correspond to specific substructures/features? The original work used key fp's (MACCS and Pubchem)

On Wed, Sep 15, 2021 at 11:25 AM Patrick Walters <wpwalt...@gmail.com> wrote:

numpy!

import pandas as pd
from descriptor_gen import DescriptorGen
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def smi2fp(smi):
    mol = Chem.MolFromSmiles(smi)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
    arr = np.zeros((0,), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(fp,arr)
    return arr

df = pd.read_csv("chembl_drugs.smi",sep=" ",names=["SMILES","Name"])
df['fp'] = df.SMILES.apply(smi2fp)
db_fp = np.stack(df.fp).sum(axis=0)

On Wed, Sep 15, 2021 at 9:32 AM Giovanni Tricarico <giovanni.tricar...@glpg.com> wrote:

Hello,

based on this article:
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-017-0195-1
I have been trying to make what they call a ‘database fingerprint’. The first step seems to require obtaining the frequencies of each fingerprint bit in a database of molecules.

To do that, I calculated the fingerprints of a list of molecules (much larger than the one below; this is just an example):

from rdkit import Chem
from rdkit.Chem import rdMolDescriptors

ms = [Chem.MolFromSmiles(s) for s in ['c1ccccc1','CCC','CCCO']]
fps = [rdMolDescriptors.GetMorganFingerprint(m, 3, useCounts = False) for m in ms]

My first attempt to obtain the database fingerprint was by looping through the fps and summing (+=), as that is reported to be an allowed operation for these fingerprints. This worked, but was very slow.

My next attempt was to convert each fingerprint to a dictionary, and build the dictionary corresponding to the database fingerprint:

database_fp_new = dict()
for i,fp in enumerate(fps):
    for fpbit in fp.GetNonzeroElements():
        if fpbit in database_fp_new:
            database_fp_new[fpbit] += 1
        else:
            database_fp_new[fpbit] = 1

This worked, too, gave the same result as the ‘+=’ approach, and was much faster.

{98513984: 1, 2763854213: 1, 3218693969: 1, 3741631696: 1, 2068133184: 1, 2245384272: 2, 2246728737: 2, 3542456614: 2, 864662311: 1, 1173125914: 1, 1365892349: 1, 1535166686: 1, 4023654873: 1}

However, then I have a dictionary. But I need a fingerprint, because I want to do operations like similarity calculations (e.g. https://www.rdkit.org/docs/source/rdkit.DataStructs.cDataStructs.html?highlight=bulktanimoto#rdkit.DataStructs.cDataStructs.BulkTanimotoSimilarity ).

Would anyone be able to suggest if and how the dictionary can be turned back into a fingerprint, or perhaps advise how to make the database fingerprint in a different way, if the one I figured out is not optimal?

Thank you
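
As a side note on the dictionary-building loop in the original question: the same tally could probably be written more compactly with collections.Counter. The sketch below is only an illustration under stated assumptions. The plain dicts stand in for what fp.GetNonzeroElements() returns (one {bit ID: count} dict per molecule), the bit IDs are made up, and the variable names are hypothetical.

```python
from collections import Counter

# Stand-ins for fp.GetNonzeroElements(): one {bit ID: count} dict per molecule.
# The bit IDs below are made up for illustration.
per_molecule_bits = [
    {101: 1, 202: 1},
    {202: 1, 303: 1},
    {202: 1, 303: 1, 404: 1},
]

database_fp = Counter()
for bits in per_molecule_bits:
    # Passing the keys view (not the dict itself) counts each set bit
    # once per molecule, regardless of the value stored for it.
    database_fp.update(bits.keys())

print(dict(database_fp))
```

With real fingerprints, the loop body would iterate over fp.GetNonzeroElements() for each fp in fps instead of the stand-in dicts.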

--
Rajarshi Guha | http://blog.rguha.net | @rguha <https://twitter.com/rguha>

_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss