Re: [Open Babel] Report weird behaviour of Tanimoto / readfile with Pybel

Santi Villalba Fri, 25 Feb 2011 05:30:59 -0800

A molecule created from a SMILES string lacks explicit hydrogen information, while the hydrogens are present when reading from the PubChem file. Explicit hydrogens are used when computing the fingerprints, so the substructures found differ between the two cases, even though we are talking about the same molecule. Try this:

molcan = pybel.readstring('can', 'C1=CC2=C(C=C1C3=CC=C(O3)CO)C(=NC=N2)NCC4=NC=CN4')
molpub = next(pybel.readfile('sdf', '44968247.sdf'))
print(molcan.calcfp('MACCS') | molpub.calcfp('MACCS'))  # != 1.0, weirdly
molcan.addh()
print(molcan.calcfp('MACCS') | molpub.calcfp('MACCS'))  # == 1.0, as expected

I guess this is the intended behavior of the fingerprint computation. However, it can be confusing at first and lacks logical consistency, as we get different fingerprints for the very same molecule. Would it make sense to either explicitly add or explicitly remove the hydrogens inside the fingerprint computation?
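
Until something like that happens inside the library, the normalization can be done by the caller. The following is a minimal sketch, not part of Open Babel: the helper name tanimoto_explicit_h and the example() wrapper are made up for illustration, and it relies only on the Pybel calls already used in this thread (readstring, readfile, addh, calcfp, and the | operator on fingerprints). Note that addh() changes the molecules in place.

```python
def tanimoto_explicit_h(mol_a, mol_b, fptype="MACCS"):
    """Tanimoto coefficient after making hydrogens explicit on both molecules.

    Hypothetical helper: mol_a and mol_b are expected to behave like
    pybel.Molecule objects. addh() modifies them in place.
    """
    for mol in (mol_a, mol_b):
        mol.addh()
    # In Pybel, "|" between two fingerprints returns their Tanimoto coefficient.
    return mol_a.calcfp(fptype) | mol_b.calcfp(fptype)


def example(sdf_path="44968247.sdf"):
    """Usage sketch; needs Open Babel and the SDF file from this thread."""
    try:
        from openbabel import pybel  # Open Babel 3.x layout
    except ImportError:
        import pybel  # Open Babel 2.x layout

    molcan = pybel.readstring(
        "can", "C1=CC2=C(C=C1C3=CC=C(O3)CO)C(=NC=N2)NCC4=NC=CN4"
    )
    molpub = next(pybel.readfile("sdf", sdf_path))
    return tanimoto_explicit_h(molcan, molpub)
```

Calling removeh() on both molecules instead of addh() would be the other way to put them on the same footing; which of the two gives the more meaningful MACCS keys is not something this sketch settles.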


On 25/02/11 09:52, Floriane Montanari wrote:

Hi all,
I'm using Pybel to compute the similarity between a query molecule and a database of molecules.
Doing some simple tests, I found the following:
Let's say that my query is compound 44968246 from PubChem. Its SMILES string is:
'CC(C1=CC=C(O1)C2=CC3=C(C=C2)N=CN=C3NCCC4=CN=CN4)O'
In my program, I find a list of similar compounds, and out of curiosity I wanted to check that the Tanimoto values are the same if I compute them one by one using Pybel. One of the hits is compound 44968247 from PubChem, whose SMILES string is:
'C1=CC2=C(C=C1C3=CC=C(O3)CO)C(=NC=N2)NCC4=NC=CN4'

The computation of Tanimoto for MACCS fingerprints gives me 0.822 using
>>> mols = ['CC(C1=CC=C(O1)C2=CC3=C(C=C2)N=CN=C3NCCC4=CN=CN4)O', 'C1=CC2=C(C=C1C3=CC=C(O3)CO)C(=NC=N2)NCC4=NC=CN4']
>>> molec = [pybel.readstring("smi", x) for x in mols]
>>> fps = [x.calcfp('MACCS') for x in molec]
>>> print(fps[0] | fps[1])
0.822222222222
But when I save molecule 44968247 into an SDF file (find attached) and read the molecule from the file using
>>> mol2 = next(pybel.readfile("sdf", "/mmb/data/Medicahead/WP2/activePubchemCompound/44968247.sdf"))
the computation then gives me
>>> fp2 = mol2.calcfp("MACCS")
>>> print(fps[0] | fp2)
0.711538461538
I have compared the lists of on bits given by 1/ the SMILES string and 2/ the SDF file, and they are definitely different:
1/ [8, 11, 38, 54, 57, 62, 65, 72, 77, 79, 80, 82, 83, 96, 100, 104, 105, 109, 111, 120, 121, 131, 132, 133, 135, 137, 138, 139, 142, 151, 152, 153, 155, 156, 157, 158, 159, 161, 162, 164, 165]
2/ [8, 11, 38, 54, 57, 62, 65, 72, 75, 77, 79, 80, 82, 83, 96, 100, 104, 105, 109, 111, 112, 120, 121, 122, 126, 131, 132, 133, 135, 137, 138, 139, 142, 144, 148, 150, 151, 152, 153, 155, 156, 157, 158, 159, 161, 162, 164, 165]
So... Is it a problem of OpenBabel reading the sdf file?
Is it a problem of me not reading it properly?
Is it a problem of PubChem giving SMILES strings and SDF files that do not match?
I would be glad if someone could help me with that.

Regards,
Floriane
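
The two on-bit lists quoted in the message above can be compared directly with plain Python sets, without Open Babel. The sketch below is only an illustration: the variable names and the tanimoto helper are made up here, and returning 0.0 for two empty bit sets is an arbitrary convention. It shows that every bit set for the SMILES input is also set for the SDF input, and that the SDF input sets seven additional bits. That is consistent with the explicit-hydrogen explanation given in the reply, although the sketch does not check what those seven MACCS keys encode.

```python
# On bits reported for compound 44968247, copied from the message above.
smiles_bits = {
    8, 11, 38, 54, 57, 62, 65, 72, 77, 79, 80, 82, 83, 96, 100, 104, 105,
    109, 111, 120, 121, 131, 132, 133, 135, 137, 138, 139, 142, 151, 152,
    153, 155, 156, 157, 158, 159, 161, 162, 164, 165,
}
sdf_bits = {
    8, 11, 38, 54, 57, 62, 65, 72, 75, 77, 79, 80, 82, 83, 96, 100, 104,
    105, 109, 111, 112, 120, 121, 122, 126, 131, 132, 133, 135, 137, 138,
    139, 142, 144, 148, 150, 151, 152, 153, 155, 156, 157, 158, 159, 161,
    162, 164, 165,
}


def tanimoto(a, b):
    """Tanimoto coefficient of two collections of on bits: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    # Assumption: two empty fingerprints are given a similarity of 0.0.
    return len(a & b) / float(len(union)) if union else 0.0


only_in_sdf = sorted(sdf_bits - smiles_bits)
only_in_smiles = sorted(smiles_bits - sdf_bits)

print(only_in_sdf)     # [75, 112, 122, 126, 144, 148, 150]
print(only_in_smiles)  # []
print(tanimoto(smiles_bits, sdf_bits))  # 41 shared bits out of 48 in the union
```

The last value compares the two representations of 44968247 with each other, so it is a different quantity from the 0.822 and 0.712 figures above, which both compare against the query compound 44968246.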
