Hello,
based on this article:

https://jcheminf.biomedcentral.com/articles/10.1186/s13321-017-0195-1

I have been trying to make what they call a 'database fingerprint'.

The first step seems to require obtaining the frequencies of each fingerprint 
bit in a database of molecules.
To do that, I calculated the fingerprints of a list of molecules (much larger 
than the one below; this is just an example):

from rdkit import Chem
from rdkit.Chem import rdMolDescriptors

ms = [Chem.MolFromSmiles(s) for s in ['c1ccccc1', 'CCC', 'CCCO']]
fps = [rdMolDescriptors.GetMorganFingerprint(m, 3, useCounts=False) for m in ms]

My first attempt to obtain the database fingerprint was to loop through the 
fps and sum them (+=), as that is reported to be an allowed operation for 
these fingerprints.
This worked, but was very slow.

My next attempt was to convert each fingerprint to a dictionary, and build the 
dictionary corresponding to the database fingerprint:

database_fp_new = dict()

# Count, for each bit ID, the number of fingerprints in which it is set
for fp in fps:
    for fpbit in fp.GetNonzeroElements():
        database_fp_new[fpbit] = database_fp_new.get(fpbit, 0) + 1

This also worked, gave the same result as the '+=' approach, and was much 
faster.

{98513984: 1,
2763854213: 1,
3218693969: 1,
3741631696: 1,
2068133184: 1,
2245384272: 2,
2246728737: 2,
3542456614: 2,
864662311: 1,
1173125914: 1,
1365892349: 1,
1535166686: 1,
4023654873: 1}
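
For illustration, the same counting can be sketched more compactly with 
collections.Counter, and the counts turned into per-bit frequencies by 
dividing by the number of molecules. This is only a sketch under assumptions: 
the plain dictionaries below are made-up stand-ins for what 
fp.GetNonzeroElements() returns, and the names nonzero_elements, bit_counts 
and bit_freqs are hypothetical, not part of RDKit.

```python
from collections import Counter

# Hypothetical stand-ins for fp.GetNonzeroElements() of three molecules:
# each dict maps a bit ID to its value (1 when useCounts=False).
nonzero_elements = [
    {11: 1, 22: 1},
    {22: 1, 33: 1},
    {22: 1, 33: 1, 44: 1},
]

# Count in how many molecules each bit is set.
bit_counts = Counter()
for elements in nonzero_elements:
    bit_counts.update(elements.keys())

# Relative frequency of each bit across the (toy) database.
n_mols = len(nonzero_elements)
bit_freqs = {bit: count / n_mols for bit, count in bit_counts.items()}

print(dict(bit_counts))
print(bit_freqs)
```

With real fingerprints, the list would presumably be built as 
[fp.GetNonzeroElements() for fp in fps].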

However, I now have a dictionary, whereas I need a fingerprint, because I want 
to do operations like similarity calculations (e.g. 
https://www.rdkit.org/docs/source/rdkit.DataStructs.cDataStructs.html?highlight=bulktanimoto#rdkit.DataStructs.cDataStructs.BulkTanimotoSimilarity
 ).

Would anyone be able to suggest if and how the dictionary can be turned back 
into a fingerprint, or perhaps advise how to make the database fingerprint in 
a different way, if the approach I figured out is not optimal?

Thank you
