A molecule created from a SMILES string has no explicit hydrogens,
whereas the molecule read from the PubChem SDF file does. Explicit
hydrogens are taken into account when the fingerprint is computed, so
the substructures matched differ between the two cases, even though both
describe the same molecule. Try this:
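import pybel  # with Open Babel 3.x: from openbabel import pybel
molcan = pybel.readstring('can',
    'C1=CC2=C(C=C1C3=CC=C(O3)CO)C(=NC=N2)NCC4=NC=CN4')
molpub = next(pybel.readfile('sdf', '44968247.sdf'))
# Implicit-H (SMILES) vs. explicit-H (SDF) molecule: != 1.0, weirdly
print(molcan.calcfp('MACCS') | molpub.calcfp('MACCS'))
molcan.addh()  # make the hydrogens explicit, in place
# Same comparison after addh(): == 1.0, as expected
print(molcan.calcfp('MACCS') | molpub.calcfp('MACCS'))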
I guess this is the intended behavior of the fingerprint computation.
However, it can be confusing at first and it is inconsistent, as we get
different fingerprints for the very same molecule. Would it make sense
to either explicitly add or explicitly remove the hydrogens inside the
fingerprint computation?
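In the meantime, one workaround is to normalize the hydrogens yourself
before fingerprinting. Below is a minimal sketch, assuming Pybel
molecules: it relies only on their addh() and calcfp() methods and on
the | operator of Pybel fingerprints returning the Tanimoto coefficient.
The helper name tanimoto_explicit_h is made up for illustration, and
note that addh() modifies the molecules in place.

```python
def tanimoto_explicit_h(mol_a, mol_b, fptype="MACCS"):
    """Tanimoto coefficient after making hydrogens explicit on both molecules.

    Hypothetical helper, not part of Pybel. Both arguments are expected to
    behave like pybel.Molecule objects. addh() changes them in place.
    """
    for mol in (mol_a, mol_b):
        mol.addh()  # no-op for atoms whose hydrogens are already explicit
    # For Pybel fingerprints, `fp1 | fp2` evaluates to the Tanimoto coefficient.
    return mol_a.calcfp(fptype) | mol_b.calcfp(fptype)


# Intended use (requires Open Babel; file name taken from the example above):
#   molcan = pybel.readstring('can', 'C1=CC2=C(C=C1C3=CC=C(O3)CO)C(=NC=N2)NCC4=NC=CN4')
#   molpub = next(pybel.readfile('sdf', '44968247.sdf'))
#   print(tanimoto_explicit_h(molcan, molpub))
```

Removing the hydrogens on both sides with removeh() instead of adding
them should make the two inputs comparable as well, though the resulting
fingerprints need not be the same as the explicit-hydrogen ones.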
On 25/02/11 09:52, Floriane Montanari wrote:
Hi all,
I'm using Pybel to compute the similarity between a query molecule and
a database of molecules.
While running some simple tests, I found the following.
Let's say that my query is compound 44968246 from PubChem. Its
SMILES string is:
'CC(C1=CC=C(O1)C2=CC3=C(C=C2)N=CN=C3NCCC4=CN=CN4)O'
In my program, I find a list of similar compounds, and out of curiosity
I wanted to check that the Tanimoto values are the same if I compute
them one by one using Pybel. One of the hits is compound 44968247 from
PubChem, whose SMILES string is
'C1=CC2=C(C=C1C3=CC=C(O3)CO)C(=NC=N2)NCC4=NC=CN4'
Computing the Tanimoto coefficient on MACCS fingerprints gives me 0.822 using
>>> mols = ['CC(C1=CC=C(O1)C2=CC3=C(C=C2)N=CN=C3NCCC4=CN=CN4)O',
...         'C1=CC2=C(C=C1C3=CC=C(O3)CO)C(=NC=N2)NCC4=NC=CN4']
>>> molec = [pybel.readstring("smi", x) for x in mols]
>>> fps = [x.calcfp('MACCS') for x in molec]
>>> print(fps[0] | fps[1])
0.822222222222
But when I save molecule 44968247 to an SDF file (attached)
and read the molecule from the file using
>>> mol2 = next(pybel.readfile("sdf",
...     "/mmb/data/Medicahead/WP2/activePubchemCompound/44968247.sdf"))
the computation then gives me
>>> fp2 = mol2.calcfp("MACCS")
>>> print(fps[0] | fp2)
0.711538461538
I have compared the lists of set bits obtained from 1/ the SMILES
string and 2/ the SDF file, and they are definitely different:
1/ [8, 11, 38, 54, 57, 62, 65, 72, 77, 79, 80, 82, 83, 96, 100, 104,
105, 109, 111, 120, 121, 131, 132, 133, 135, 137, 138, 139, 142, 151,
152, 153, 155, 156, 157, 158, 159, 161, 162, 164, 165]
2/ [8, 11, 38, 54, 57, 62, 65, 72, 75, 77, 79, 80, 82, 83, 96, 100,
104, 105, 109, 111, 112, 120, 121, 122, 126, 131, 132, 133, 135, 137,
138, 139, 142, 144, 148, 150, 151, 152, 153, 155, 156, 157, 158, 159,
161, 162, 164, 165]
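The difference between the two lists can be made explicit with plain
Python sets. This is only a sketch for inspecting the lists above; the
variable names are illustrative, and the Tanimoto coefficient is
computed by hand as the size of the intersection over the size of the
union.

```python
# Set bits copied from the two lists above (1/ SMILES string, 2/ SDF file).
bits_smiles = {8, 11, 38, 54, 57, 62, 65, 72, 77, 79, 80, 82, 83, 96, 100,
               104, 105, 109, 111, 120, 121, 131, 132, 133, 135, 137, 138,
               139, 142, 151, 152, 153, 155, 156, 157, 158, 159, 161, 162,
               164, 165}
bits_sdf = {8, 11, 38, 54, 57, 62, 65, 72, 75, 77, 79, 80, 82, 83, 96, 100,
            104, 105, 109, 111, 112, 120, 121, 122, 126, 131, 132, 133, 135,
            137, 138, 139, 142, 144, 148, 150, 151, 152, 153, 155, 156, 157,
            158, 159, 161, 162, 164, 165}

# Bits that appear in only one of the two fingerprints.
only_in_sdf = sorted(bits_sdf - bits_smiles)
only_in_smiles = sorted(bits_smiles - bits_sdf)

# Tanimoto coefficient: shared bits divided by bits set in either fingerprint.
tanimoto = len(bits_smiles & bits_sdf) / float(len(bits_smiles | bits_sdf))

print("only in SDF:", only_in_sdf)
print("only in SMILES:", only_in_smiles)
print("Tanimoto between the two:", tanimoto)
```

With these lists, every bit set for the SMILES string is also set for
the SDF file, and the SDF file sets seven additional bits (75, 112, 122,
126, 144, 148, 150). That gives a Tanimoto of 41/48, about 0.854,
between the two representations of compound 44968247.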
So... Is it a problem with Open Babel reading the SDF file?
Is it a problem with me not reading it properly?
Is it a problem with PubChem providing SMILES strings and SDF files that
do not match?
I would be glad if someone could help me with that.
Regards,
Floriane
_______________________________________________
OpenBabel-discuss mailing list
OpenBabel-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openbabel-discuss