Hi Thomas, I think you are making your life more complicated than it needs to be while testing things. You're certainly making it harder for us to follow what you're doing. Please try and keep questions as simple as possible. In the case below, the calls to getNumpyArray() aren't relevant to your question and just make things confusing for those of us who want to help.
What exactly are you trying to figure out here? Do you for some reason want to re-implement the standard connectivity invariants yourself in Python? I doubt it, but in case you did want to do this, you've made a mistake in your implementation. The invariants use whether or not an atom is in a ring, not the number of rings it's in. You're also not using the periodic table object efficiently. Here's an updated version: def generateECFPAtomInvariant(mol, discrete_charges=False): pt = Chem.GetPeriodicTable() num_atoms = mol.GetNumAtoms() invariants = [0]*num_atoms ring_info = mol.GetRingInfo() for i,a in enumerate(mol.GetAtoms()): descriptors=[] descriptors.append(a.GetAtomicNum()) descriptors.append(a.GetTotalDegree()) descriptors.append(a.GetTotalNumHs()) descriptors.append(a.GetFormalCharge()) descriptors.append(a.GetMass() - pt.GetAtomicWeight(a.GetSymbol())) if(ring_info.NumAtomRings(i)): descriptors.append(1) invariants[i]=hash(tuple(descriptors))& 0xffffffff return invariants To compare things I would keep it as simple as possible and do something like this: for SMILES in ['Cc1ncccc1', 'CS(=O)(=O)N1CCc2c(C1)c(nn2CCCN1CCOCC1)c1ccc(Cl)c(C#Cc2ccc3C[C@H ](NCc3c2)C(=O)N2CCCCC2)c1']: mol = Chem.MolFromSmiles(SMILES) mol = Chem.AddHs(mol) invariants = generateECFPAtomInvariant(mol) bi1={} bi2={} fp1 = rdMolDescriptors.GetMorganFingerprint(mol,radius=3,bitInfo=bi1) fp2 = rdMolDescriptors.GetMorganFingerprint(mol,radius=3,invariants=invariants,bitInfo=bi2) print("========") print(SMILES) nz1 = fp1.GetNonzeroElements() nz2 = fp2.GetNonzeroElements() print(len(nz1),len(nz1)) print(nz1==nz2) The first print function outputs the same result for both fingerprints, as you'd hope.[1] This, as you point out, will not generate the same bits since the invariants are hashed to different values, so the last print function outputs "False". There's also no good way to compare the two bitinfo structures to each other: since the bit IDs are different, you have no way of knowing which entries to compare to which. So, though each of the two dictionaries should contain the same values (not keys), they will be in a different order. If you want to compare just that the two sets of values is the same (which isn't a lot more informative than comparing the number of set bits), you could do: print(sorted(bi1.values())==sorted(bi2.values())) Again, I don't think you actually want to do any of this. What exactly are you trying to accomplish? -greg -greg [1] This isn't strictly guaranteed. Due to true hash collisions while generating the fingerprint, there is a chance that you'll get different numbers of bits being set. On Sun, Dec 1, 2019 at 5:41 PM Thomas Evangelidis <teva...@gmail.com> wrote: > Hi Paolo, > > Many thanks for the detailed explanation! Standing by your statement "If > the invariants are provided by the user, they will be used instead", I > attempted to reproduce the default ECFP fingerprint for a small and a large > molecule. Here is the code: > > import numpy as np > from rdkit import DataStructs > from rdkit.Chem import PeriodicTable, GetPeriodicTable, AllChem > from rdkit import Chem > > def getNumpyArray(fp): > arr = np.zeros((1,), np.float32) > DataStructs.ConvertToNumpyArray(fp, arr) > return arr > > def generateECFPAtomInvariant(mol, discrete_charges=False): > num_atoms = mol.GetNumAtoms() > invariants = [0]*num_atoms > ring_info = mol.GetRingInfo() > for i,a in enumerate(mol.GetAtoms()): > descriptors=[] > descriptors.append(a.GetAtomicNum()) > descriptors.append(a.GetTotalDegree()) > descriptors.append(a.GetTotalNumHs()) > descriptors.append(a.GetFormalCharge()) > descriptors.append(a.GetMass() - > PeriodicTable.GetAtomicWeight(GetPeriodicTable(), a.GetSymbol())) > descriptors.append(ring_info.NumAtomRings(i)) > invariants[i]=hash(tuple(descriptors))& 0xffffffff > return invariants > > for SMILES in ['Cc1ncccc1', > 'CS(=O)(=O)N1CCc2c(C1)c(nn2CCCN1CCOCC1)c1ccc(Cl)c(C#Cc2ccc3C[C@H](NCc3c2)C(=O)N2CCCCC2)c1']: > mol = Chem.MolFromSmiles(SMILES) > mol = Chem.AddHs(mol) > invariants = generateECFPAtomInvariant(mol) > info, infoi = {}, {} > fp = getNumpyArray(AllChem.GetMorganFingerprintAsBitVect(mol, radius=3, > nBits=8192, invariants=[], bitInfo=info)) > fpi = getNumpyArray(AllChem.GetMorganFingerprintAsBitVect(mol, radius=3, > nBits=8192, invariants=invariants, bitInfo=infoi)) > > print("Do the substructures extracted by default invariants and > user-defined invariants match?", set(info.values()) == set(infoi.values())) > print("Number of mis-matching bits between fp and fpi=", fp.shape[0] - > np.count_nonzero(np.equal(fp, fpi))) > > > To assess whether the fingerprints match, I compared the values of the > 'bitInfo' dictionaries. The keys are hash codes, which of course do not > match, but the values are pairs of (atomID, radius), which should be the > same. As you will the (atomID, radius) pairs for the small molecule match *but > not for the large one*. I also compared the two bitstrings per se, but I > suppose, due to the usage if different (?) hash functions, the bits don't > match neither for the small nor for the large molecule. Moreover, when I > implement the 'generateECFPAtomInvariant()' function on a large scale, > namely generating fingerprint to train an ML model, the number of invariant > bits is 2360 using the default ECFP atom invariants, while with > user-defined invariants are much less (795) and the performance of the ML > model is significantly different. Could someone point out what I am doing > wrong? > > ~Thomas > > -- > > ====================================================================== > > Dr. Thomas Evangelidis > > Research Scientist > > IOCB - Institute of Organic Chemistry and Biochemistry of the Czech > Academy of Sciences <https://www.uochb.cz/web/structure/31.html?lang=en>, > Prague, > Czech Republic > & > CEITEC - Central European Institute of Technology <https://www.ceitec.eu/> > , Brno, Czech Republic > > email: teva...@gmail.com, Twitter: tevangelidis > <https://twitter.com/tevangelidis>, LinkedIn: Thomas Evangelidis > <https://www.linkedin.com/in/thomas-evangelidis-495b45125/> > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > > _______________________________________________ > Rdkit-discuss mailing list > Rdkit-discuss@lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/rdkit-discuss >
_______________________________________________ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss