Hi Francois, I agree with your suggestion. I am also CCing Greg on this response.
I searched Google for the source code of the CreateDifferenceFingerprintForReaction method, but the most relevant pages I could find only describe what the code does: [here](https://www.rdkit.org/docs/cppapi/structRDKit_1_1ReactionFingerprintParams.html) and [here](https://www.rdkit.org/docs/source/rdkit.Chem.rdChemReactions.html#rdkit.Chem.rdChemReactions.CreateDifferenceFingerprintForReaction). I don't mind if the source is only in C++, but where can I find it? If I could view the source code, I could understand how folding a count vector is implemented. For now I am assuming the implementation is similar to folding a bit vector, just applying a SUM instead of a logical OR.

v/r,

Ben

On Wed, Nov 20, 2019 at 3:23 AM Francois Berenger <mli...@ligand.eu> wrote:

> On 20/11/2019 02:00, Benjamin Datko wrote:
> > Hello Francois,
> >
> > I am trying to replicate some of the functionality of
> > CreateDifferenceFingerprintForReaction [Ref 1] for my own
> > understanding on how the code works. The function
> > CreateDifferenceFingerprintForReaction allows for three difference
> > fingerprint representation of the molecules: AtomPair, Morgan, and
> > TopologicalTorsion [Ref 2]. All three are count vectors [Ref 3], and
> > the function allows for variable fingerprint size output.
>
> Personally, I wouldn't try to fold a count vector.
> They are sparse vectors, so they don't take a lot of memory.
> Also, they are less information lossy than binary fingerprints.
>
> But, maybe Greg has some hack around, if you are really forced to do
> this.
>
> > I was following this post [Ref 4] describing how to create reaction
> > difference fingerprints using different fingerprints representation.
> > Using the code from the post I can create reaction difference
> > fingerprints using either Morgan or AtomPair, but comparing the output
> > from the post [Ref 4] to CreateDifferenceFingerprintForReaction
> > results in different size fingerprints, with different values within
> > the fingerprint, and different densities. I am assuming this due to
> > folding the count vector down to the default fingerprint size of 2048.
> >
> > Example code snippet:
> >
> > # The below defs are from the post
> > # https://sourceforge.net/p/rdkit/mailman/message/35240736/
> >
> > from rdkit import Chem
> > from rdkit.Chem import AllChem
> > from rdkit import DataStructs
> > import copy
> >
> > def _createFP(mol,maxSize,fpType='AP'):
> >     mol.UpdatePropertyCache(False)
> >     if fpType == 'AP':
> >         return AllChem.GetAtomPairFingerprint(mol, minLength=1, maxLength=maxSize)
> >     else:
> >         Chem.GetSSSR(mol)
> >         rinfo = mol.GetRingInfo()
> >         return AllChem.GetMorganFingerprint(mol, radius=maxSize)
> >
> > def getSumFps(fps):
> >     summedFP = copy.deepcopy(fps[0])
> >     for fp in fps[1:]:
> >         summedFP += fp
> >     return summedFP
> >
> > def buildReactionFP(rxn, maxSize=3, fpType='AP'):
> >     reactants = rxn.GetReactants()
> >     products = rxn.GetProducts()
> >     rFP = getSumFps([_createFP(mol,maxSize,fpType=fpType) for mol in reactants])
> >     pFP = getSumFps([_createFP(mol,maxSize,fpType=fpType) for mol in products])
> >     return pFP-rFP
> >
> > >>> rxn1 = AllChem.ReactionFromSmarts('[C:1]C1CCCCC1>>[N:1]C1CCCCC1', useSmiles=True)
> >
> > >>> rxfp1 = buildReactionFP(rxn1,maxSize=2)
> >
> > >>> rxfp1.GetNonzeroElements()
> > {558114: -2, 574497: -1, 1066050: 2, 1066081: 1}
> >
> > >>> rxfp1.GetLength()
> > 8388608
> >
> > # Same reaction now using CreateDifferenceFingerprintForReaction
> > >>> rxn1_fp = AllChem.CreateDifferenceFingerprintForReaction(rxn1)
> >
> > >>> rxn1_fp.GetNonzeroElements()
> >
> > {1048: 10,
> >  1310: -20,
> >  1325: 20,
> >  1372: -10,
> >  1390: 20,
> >  1692: -10,
> >  1757: -20,
> >  1772: 10}
> >
> > >>> print(rxn1_fp.GetLength(),rxfp1.GetLength())
> > 2048 8388608
> >
> > References
> > 1. https://www.rdkit.org/docs/source/rdkit.Chem.rdChemReactions.html#rdkit.Chem.rdChemReactions.CreateDifferenceFingerprintForReaction
> > 2. https://www.rdkit.org/docs/cppapi/structRDKit_1_1ReactionFingerprintParams.html
> > 3. https://www.rdkit.org/docs/GettingStartedInPython.html#morgan-fingerprints-circular-fingerprints
> > 4. https://sourceforge.net/p/rdkit/mailman/message/35240736/
> >
> > v/r,
> >
> > Ben
> >
> > On Mon, Nov 18, 2019 at 10:13 PM Francois Berenger <mli...@ligand.eu> wrote:
> >
> >> On 19/11/2019 03:34, Benjamin Datko wrote:
> >>> Hello all,
> >>>
> >>> I am curious on how to fold a count vector fingerprint. I understand
> >>> when folding bit vectors the most common way is to split the vector in
> >>> half, and apply a bitwise OR operation. I think this is how the
> >>> function rdkit.DataStructs.FoldFingerprint works in RDKit, correct me
> >>> if I am wrong.
> >>>
> >>> How does RDKit and or what is the appropriate way to fold count
> >>> vectors such as AtomPair, Morgan, and Topological torsion?
> >>
> >> Can you give us some context? Why do you want to do that?
> >>
> >> Maybe, you can use the following in order to create
> >> shorter "fingerprints" for which the Tanimoto distance is
> >> still computable (despite becoming approximate then):
> >>
> >> ---
> >> Shrivastava, A. (2016).
> >> Simple and efficient weighted minwise hashing.
> >> In Advances in Neural Information Processing Systems (pp. 1498-1506).
> >>
> >> https://papers.nips.cc/paper/6472-simple-and-efficient-weighted-minwise-hashing.pdf
> >> ---
> >>
> >> Regards,
> >> F.
> >>
> >>> I thought about turning the fingerprint into a bit vector using their
> >>> respected "AsBitVect" method then folding using
> >>> rdkit.DataStructs.FoldFingerprint, but topological torsion doesn't
> >>> have a "AsBitVect" method
> >>> [https://www.rdkit.org/docs/GettingStartedInPython.html].
> >>>
> >>> For an explicit example using AtomPair fingerprint we can see the
> >>> fingerprint is extremely sparse. Could this AtomPair fingerprint be
> >>> folded to increase the density?
> >>>
> >>> >>> from rdkit import Chem
> >>>
> >>> >>> from rdkit.Chem import AllChem
> >>>
> >>> >>> mol = Chem.MolFromSmiles('CC1CCCCC1')
> >>> >>> ap_fp = AllChem.GetAtomPairFingerprint(mol, minLength=1, maxLength=3)
> >>>
> >>> >>> number_of_nonzero_elements = len(ap_fp.GetNonzeroElements().values())
> >>>
> >>> >>> print((ap_fp.GetLength(),number_of_nonzero_elements))
> >>> (8388608,9)
> >>>
> >>> Very Respectfully,
> >>>
> >>> Ben
> >>> _______________________________________________
> >>> Rdkit-discuss mailing list
> >>> Rdkit-discuss@lists.sourceforge.net
> >>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
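
For reference, a minimal sketch of the SUM-based fold assumed at the top of this message. It uses a plain Python dict in place of an RDKit sparse vector, and the function name `fold_count_vector` and the modulo mapping are hypothetical illustrations of that assumption. It has not been checked against the RDKit source.

```python
from collections import defaultdict


def fold_count_vector(counts, length, n_folded):
    """Fold a sparse count vector {index: count} down to n_folded slots.

    Assumption (not confirmed against the RDKit source): slot i of the
    original vector maps to slot i % n_folded, and colliding counts are
    summed, in place of the OR used when folding a bit vector.
    """
    if length % n_folded != 0:
        raise ValueError("length must be a multiple of n_folded")
    folded = defaultdict(int)
    for index, count in counts.items():
        folded[index % n_folded] += count
    # Drop slots whose counts cancelled, which can happen in a
    # difference fingerprint that holds negative values.
    return {i: c for i, c in folded.items() if c != 0}


# Nonzero elements of the unfolded difference fingerprint quoted in this thread.
rxfp1 = {558114: -2, 574497: -1, 1066050: 2, 1066081: 1}
folded = fold_count_vector(rxfp1, length=8388608, n_folded=2048)
print(folded)
```

Applied to the unfolded values quoted above, this fold keeps four nonzero slots with counts of ±1 and ±2, whereas the quoted CreateDifferenceFingerprintForReaction output has eight nonzero slots with counts of ±10 and ±20, so a simple fold of this kind would not by itself explain the difference.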
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss