Hi Thomas,

I think you are making your life more complicated than it needs to be while
testing things. You're certainly making it harder for us to follow what
you're doing. Please try and keep questions as simple as possible. In the
case below, the calls to getNumpyArray() aren't relevant to your question
and just make things confusing for those of us who want to help.

What exactly are you trying to figure out here? Do you for some reason want
to re-implement the standard connectivity invariants yourself in Python? I
doubt it, but in case you did want to do this, you've made a mistake in
your implementation. The invariants use whether or not an atom is in a
ring, not the number of rings it's in. You're also not using the periodic
table object efficiently. Here's an updated version:

def generateECFPAtomInvariant(mol, discrete_charges=False):
    pt = Chem.GetPeriodicTable()
    num_atoms = mol.GetNumAtoms()
    invariants = [0]*num_atoms
    ring_info = mol.GetRingInfo()
    for i,a in enumerate(mol.GetAtoms()):
        descriptors=[]
        descriptors.append(a.GetAtomicNum())
        descriptors.append(a.GetTotalDegree())
        descriptors.append(a.GetTotalNumHs())
        descriptors.append(a.GetFormalCharge())
        descriptors.append(a.GetMass() - pt.GetAtomicWeight(a.GetSymbol()))
        if(ring_info.NumAtomRings(i)):
            descriptors.append(1)
        invariants[i]=hash(tuple(descriptors))& 0xffffffff
    return invariants

To compare things I would keep it as simple as possible and do something
like this:

for SMILES in ['Cc1ncccc1',
'CS(=O)(=O)N1CCc2c(C1)c(nn2CCCN1CCOCC1)c1ccc(Cl)c(C#Cc2ccc3C[C@H
](NCc3c2)C(=O)N2CCCCC2)c1']:
    mol = Chem.MolFromSmiles(SMILES)
    mol = Chem.AddHs(mol)
    invariants = generateECFPAtomInvariant(mol)
    bi1={}
    bi2={}
    fp1 = rdMolDescriptors.GetMorganFingerprint(mol,radius=3,bitInfo=bi1)
    fp2 =
rdMolDescriptors.GetMorganFingerprint(mol,radius=3,invariants=invariants,bitInfo=bi2)
    print("========")
    print(SMILES)
    nz1 = fp1.GetNonzeroElements()
    nz2 = fp2.GetNonzeroElements()
    print(len(nz1),len(nz1))
    print(nz1==nz2)


The first print function outputs the same result for both fingerprints, as
you'd hope.[1]
This, as you point out, will not generate the same bits since the
invariants are hashed to different values, so the last print function
outputs "False". There's also no good way to compare the two bitinfo
structures to each other: since the bit IDs are different, you have no way
of knowing which entries to compare to which. So, though each of the two
dictionaries should contain the same values (not keys), they will be in a
different order. If you want to compare just that the two sets of values is
the same (which isn't a lot more informative than comparing the number of
set bits), you could do:
    print(sorted(bi1.values())==sorted(bi2.values()))

Again, I don't think you actually want to do any of this. What exactly are
you trying to accomplish?

-greg




-greg
[1] This isn't strictly guaranteed. Due to true hash collisions while
generating the fingerprint, there is a chance that you'll get different
numbers of bits being set.

On Sun, Dec 1, 2019 at 5:41 PM Thomas Evangelidis <teva...@gmail.com> wrote:

> Hi Paolo,
>
> Many thanks for the detailed explanation! Standing by your statement "If
> the invariants are provided by the user, they will be used instead", I
> attempted to reproduce the default ECFP fingerprint for a small and a large
> molecule. Here is the code:
>
> import numpy as np
> from rdkit import DataStructs
> from rdkit.Chem import PeriodicTable, GetPeriodicTable, AllChem
> from rdkit import Chem
>
> def getNumpyArray(fp):
>     arr = np.zeros((1,), np.float32)
>     DataStructs.ConvertToNumpyArray(fp, arr)
>     return arr
>
> def generateECFPAtomInvariant(mol, discrete_charges=False):
>     num_atoms = mol.GetNumAtoms()
>     invariants = [0]*num_atoms
>     ring_info = mol.GetRingInfo()
>     for i,a in enumerate(mol.GetAtoms()):
>         descriptors=[]
>         descriptors.append(a.GetAtomicNum())
>         descriptors.append(a.GetTotalDegree())
>         descriptors.append(a.GetTotalNumHs())
>         descriptors.append(a.GetFormalCharge())
>         descriptors.append(a.GetMass() - 
> PeriodicTable.GetAtomicWeight(GetPeriodicTable(), a.GetSymbol()))
>         descriptors.append(ring_info.NumAtomRings(i))
>         invariants[i]=hash(tuple(descriptors))& 0xffffffff
>     return invariants
>
> for SMILES in ['Cc1ncccc1', 
> 'CS(=O)(=O)N1CCc2c(C1)c(nn2CCCN1CCOCC1)c1ccc(Cl)c(C#Cc2ccc3C[C@H](NCc3c2)C(=O)N2CCCCC2)c1']:
>     mol = Chem.MolFromSmiles(SMILES)
>     mol = Chem.AddHs(mol)
>     invariants = generateECFPAtomInvariant(mol)
>     info, infoi = {}, {}
>     fp = getNumpyArray(AllChem.GetMorganFingerprintAsBitVect(mol, radius=3, 
> nBits=8192, invariants=[], bitInfo=info))
>     fpi = getNumpyArray(AllChem.GetMorganFingerprintAsBitVect(mol, radius=3, 
> nBits=8192, invariants=invariants, bitInfo=infoi))
>
>     print("Do the substructures extracted by default invariants and 
> user-defined invariants match?", set(info.values()) == set(infoi.values()))
>     print("Number of mis-matching bits between fp and fpi=", fp.shape[0] - 
> np.count_nonzero(np.equal(fp, fpi)))
>
>
> To assess whether the fingerprints match, I compared the values of the
> 'bitInfo' dictionaries. The keys are hash codes, which of course do not
> match, but the values are pairs of (atomID, radius), which should be the
> same. As you will the (atomID, radius) pairs for the small molecule match *but
> not for the large one*. I also compared the two bitstrings per se, but I
> suppose, due to the usage if different (?) hash functions, the bits don't
> match neither for the small nor for the large molecule. Moreover, when I
> implement the 'generateECFPAtomInvariant()' function on a large scale,
> namely generating fingerprint to train an ML model, the number of invariant
> bits is 2360 using the default ECFP atom invariants, while with
> user-defined invariants are much less (795) and the performance of the ML
> model is significantly different. Could someone point out what I am doing
> wrong?
>
> ~Thomas
>
> --
>
> ======================================================================
>
> Dr. Thomas Evangelidis
>
> Research Scientist
>
> IOCB - Institute of Organic Chemistry and Biochemistry of the Czech
> Academy of Sciences <https://www.uochb.cz/web/structure/31.html?lang=en>, 
> Prague,
> Czech Republic
>   &
> CEITEC - Central European Institute of Technology <https://www.ceitec.eu/>
> , Brno, Czech Republic
>
> email: teva...@gmail.com, Twitter: tevangelidis
> <https://twitter.com/tevangelidis>, LinkedIn: Thomas Evangelidis
> <https://www.linkedin.com/in/thomas-evangelidis-495b45125/>
>
> website: https://sites.google.com/site/thomasevangelidishomepage/
>
>
>
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Reply via email to