Re: [Rdkit-discuss] Folding count vectors

Francois Berenger Wed, 20 Nov 2019 00:25:13 -0800

On 20/11/2019 02:00, Benjamin Datko wrote:

Hello Francois,


I am trying to replicate some of the functionality of
CreateDifferenceFingerprintForReaction [Ref 1] to understand
how the code works. The function
CreateDifferenceFingerprintForReaction supports three fingerprint
types for the difference fingerprint: AtomPair, Morgan, and
TopologicalTorsion [Ref 2]. All three are count vectors [Ref 3], and
the function allows for variable fingerprint size output.


Personally, I wouldn't try to fold a count vector.
They are sparse vectors, so they don't take a lot of memory.
Also, they lose less information than binary fingerprints.

But maybe Greg has some workaround, if you are really forced to do this.

I was following this post [Ref 4] describing how to create reaction
difference fingerprints using different fingerprint representations.
Using the code from the post I can create reaction difference
fingerprints using either Morgan or AtomPair, but comparing the output
of the code from the post [Ref 4] to that of
CreateDifferenceFingerprintForReaction gives fingerprints of different
sizes, with different values and different densities. I am assuming
this is due to folding the count vector down to the default
fingerprint size of 2048.


Example code snippet:

# The defs below are from the post
# https://sourceforge.net/p/rdkit/mailman/message/35240736/

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit import DataStructs
import copy

def _createFP(mol,maxSize,fpType='AP'):
    mol.UpdatePropertyCache(False)
    if fpType == 'AP':
        return AllChem.GetAtomPairFingerprint(mol, minLength=1,
                                              maxLength=maxSize)
    else:
        # Initialize ring info, which the Morgan fingerprint needs
        Chem.GetSSSR(mol)
        return AllChem.GetMorganFingerprint(mol, radius=maxSize)

def getSumFps(fps):
    summedFP = copy.deepcopy(fps[0])
    for fp in fps[1:]:
        summedFP += fp
    return summedFP

def buildReactionFP(rxn, maxSize=3, fpType='AP'):
    reactants = rxn.GetReactants()
    products = rxn.GetProducts()
    rFP = getSumFps([_createFP(mol, maxSize, fpType=fpType)
                     for mol in reactants])
    pFP = getSumFps([_createFP(mol, maxSize, fpType=fpType)
                     for mol in products])
    return pFP-rFP

rxn1 = AllChem.ReactionFromSmarts('[C:1]C1CCCCC1>>[N:1]C1CCCCC1',
                                  useSmiles=True)

rxfp1 = buildReactionFP(rxn1,maxSize=2)

rxfp1.GetNonzeroElements()

{558114: -2, 574497: -1, 1066050: 2, 1066081: 1}

rxfp1.GetLength()

8388608

# Same reaction now using CreateDifferenceFingerprintForReaction

rxn1_fp = AllChem.CreateDifferenceFingerprintForReaction(rxn1)

rxn1_fp.GetNonzeroElements()


{1048: 10,
 1310: -20,
 1325: 20,
 1372: -10,
 1390: 20,
 1692: -10,
 1757: -20,
 1772: 10}

print(rxn1_fp.GetLength(),rxfp1.GetLength())

2048 8388608
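
One simple way to fold a sparse count vector can be sketched as follows,
assuming that "folding" means mapping each index to its value modulo the
target size and summing the counts that collide. fold_counts is a
hypothetical helper, not an RDKit function; it takes the dictionary
returned by GetNonzeroElements(). This is not claimed to be what
CreateDifferenceFingerprintForReaction does internally, so the folded
indices need not match its output.

```python
from collections import defaultdict

def fold_counts(nonzero, n_bits=2048):
    """Fold a sparse count vector (index -> count) down to n_bits slots.

    Assumption: counts of indices that collide modulo n_bits are summed.
    """
    folded = defaultdict(int)
    for idx, count in nonzero.items():
        folded[idx % n_bits] += count
    # Drop slots whose counts cancelled out (possible with signed differences).
    return {idx: c for idx, c in folded.items() if c != 0}

# Non-zero elements of rxfp1 as printed above.
rxfp1_elements = {558114: -2, 574497: -1, 1066050: 2, 1066081: 1}
folded = fold_counts(rxfp1_elements, n_bits=2048)
print(sorted(folded.items()))
```

With signed counts, two colliding entries of opposite sign cancel each
other, which is one more way folding can lose information.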

References
1. https://www.rdkit.org/docs/source/rdkit.Chem.rdChemReactions.html#rdkit.Chem.rdChemReactions.CreateDifferenceFingerprintForReaction
2. https://www.rdkit.org/docs/cppapi/structRDKit_1_1ReactionFingerprintParams.html
3. https://www.rdkit.org/docs/GettingStartedInPython.html#morgan-fingerprints-circular-fingerprints
4. https://sourceforge.net/p/rdkit/mailman/message/35240736/

v/r,

Ben

On Mon, Nov 18, 2019 at 10:13 PM Francois Berenger <mli...@ligand.eu>
wrote:

On 19/11/2019 03:34, Benjamin Datko wrote:

Hello all,

I am curious about how to fold a count vector fingerprint. I understand
that when folding bit vectors the most common way is to split the
vector in half and apply a bitwise OR operation. I think this is how
the function rdkit.DataStructs.FoldFingerprint works in RDKit; correct
me if I am wrong.
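
The split-in-half-and-OR idea described above can be sketched in plain
Python on a list of 0/1 values. fold_bits is a hypothetical helper for
illustration only and is not RDKit's implementation.

```python
def fold_bits(bits, fold_factor=2):
    """Fold a list of 0/1 values by OR-ing fold_factor equal-sized chunks."""
    if len(bits) % fold_factor != 0:
        raise ValueError("length must be divisible by fold_factor")
    size = len(bits) // fold_factor
    folded = [0] * size
    for i, bit in enumerate(bits):
        # Bit i lands on slot i modulo the folded size.
        folded[i % size] |= bit
    return folded

print(fold_bits([1, 0, 0, 0, 0, 1, 0, 0]))  # [1, 1, 0, 0]
```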

How does RDKit fold count vectors such as AtomPair, Morgan, and
TopologicalTorsion, and/or what is the appropriate way to do so?


Can you give us some context? Why do you want to do that?

Maybe, you can use the following in order to create
shorter "fingerprints" for which the Tanimoto distance is
still computable (despite becoming approximate then):

---
Shrivastava, A. (2016).
Simple and efficient weighted minwise hashing.
In Advances in Neural Information Processing Systems (pp. 1498-1506).

https://papers.nips.cc/paper/6472-simple-and-efficient-weighted-minwise-hashing.pdf

---
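
For reference, the similarity that weighted minwise hashing
approximates is the weighted (generalized) Jaccard, i.e. Tanimoto on
counts. A minimal sketch on sparse dictionaries, assuming non-negative
counts (so it does not apply directly to signed difference
fingerprints); weighted_jaccard is a hypothetical helper, not an RDKit
call.

```python
def weighted_jaccard(a, b):
    """Weighted Jaccard of two sparse non-negative count vectors (dicts)."""
    keys = set(a) | set(b)
    num = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    den = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    # Convention chosen here: two empty vectors count as identical.
    return num / den if den else 1.0

print(weighted_jaccard({1: 2, 2: 1}, {1: 1, 3: 1}))  # 0.25
```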

Regards,
F.

I thought about turning the fingerprints into bit vectors using their
respective "AsBitVect" methods and then folding with
rdkit.DataStructs.FoldFingerprint, but topological torsion doesn't
have an "AsBitVect" method
[https://www.rdkit.org/docs/GettingStartedInPython.html].

For an explicit example using the AtomPair fingerprint, we can see the
fingerprint is extremely sparse. Could this AtomPair fingerprint be
folded to increase the density?

from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles('CC1CCCCC1')
ap_fp = AllChem.GetAtomPairFingerprint(mol, minLength=1, maxLength=3)
number_of_nonzero_elements = len(ap_fp.GetNonzeroElements().values())
print((ap_fp.GetLength(), number_of_nonzero_elements))

(8388608,9)

Very Respectfully,

Ben
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss



