Thanks Greg, that's exactly what I needed.
I now have a question though: using modulo arithmetic means each environment only sets one bit in the FP (checking the code, this looks true). Is there a reason why we only set a single bit with each environment? I mean "we" in the greater sense: has anyone ever looked at denser Morgan FPs (with a fixed radius)?

Best,
Nick

Nicholas C. Firth | PhD Student | Cancer Therapeutics
The Institute of Cancer Research | 15 Cotswold Road | Belmont | Sutton | Surrey | SM2 5NG
T 020 8722 4033 | E nicholas.fi...@icr.ac.uk | W www.icr.ac.uk | Twitter @ICRnews
________________________________________
From: Greg Landrum [greg.land...@gmail.com]
Sent: 17 July 2014 05:52
To: Nicholas Firth
Cc: RDKit Discuss
Subject: Re: [Rdkit-discuss] Unhashed info in hashed fingerprint

On Wed, Jul 16, 2014 at 6:40 PM, Nicholas Firth <nicholas.fi...@icr.ac.uk> wrote:

> Hi RDKitters,
>
> I might be being stupid here, but I'm trying to marry up the bitinfo from a hashed fingerprint to the actual fingerprint and I can't seem to do it.
>
>     from rdkit import Chem, DataStructs
>     from rdkit.Chem import rdMolDescriptors as rdMD
>     info = {}
>     mol = Chem.MolFromSmiles('CCCCC')
>     print rdMD.GetHashedMorganFingerprint(mol, radius=2, nBits = 1024, bitInfo = info).GetNonzeroElements()
>     print '\n',info
>
>     {33: 2, 294: 2, 591: 2, 80: 3, 887: 1, 794: 2, 381: 1}
>
>     {2246728737: ((0, 0), (4, 0)), 3542456614: ((0, 1), (4, 1)), 1685248591: ((1, 2), (3, 2)), 2245384272: ((1, 0), (2, 0), (3, 0)), 1510461303: ((2, 1),), 1173125914: ((1, 1), (3, 1)), 2738269565: ((2, 2),)}
>
> The indices on the bitinfo appear to be the unhashed values. What I'd expect to see is something similar to the bit vector version of this code

Sure enough, that's a bug.
The values are the indices for the non-hashed (really non-folded, but it's too late to rename that function now) version of the fingerprint:

    In [7]: info = {}

    In [8]: print rdMD.GetMorganFingerprint(mol, radius=2, bitInfo = info).GetNonzeroElements()
    {2246728737: 2, 3542456614: 2, 1685248591: 2, 2245384272: 3, 1510461303: 1, 1173125914: 2, 2738269565: 1}

    In [9]: print '\n',info

    {2246728737: ((0, 0), (4, 0)), 3542456614: ((0, 1), (4, 1)), 1685248591: ((1, 2), (3, 2)), 2245384272: ((1, 0), (2, 0), (3, 0)), 1510461303: ((2, 1),), 1173125914: ((1, 1), (3, 1)), 2738269565: ((2, 2),)}

Fortunately it's easy to fix this. The bits are hashed/folded into the smaller fingerprint using integer modulo:

    In [10]: info = {}

    In [11]: print rdMD.GetHashedMorganFingerprint(mol, radius=2, nBits = 1024, bitInfo = info).GetNonzeroElements()
    {33: 2, 294: 2, 591: 2, 80: 3, 887: 1, 794: 2, 381: 1}

    In [12]: for k,v in info.items(): print k%1024,v
    33 ((0, 0), (4, 0))
    294 ((0, 1), (4, 1))
    591 ((1, 2), (3, 2))
    80 ((1, 0), (2, 0), (3, 0))
    887 ((2, 1),)
    794 ((1, 1), (3, 1))
    381 ((2, 2),)

I'll fix the bug, but this workaround should hopefully cover things in the short term.

-greg
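For anyone who wants the modulo workaround from the reply above as a reusable function, here is a minimal sketch in plain Python. It does not call RDKit: the `info` dictionary is copied from the output posted in this thread, and `fold_bit_info` is a hypothetical helper name, not part of the RDKit API. It also assumes that two unfolded ids landing on the same folded position should simply have their environment lists concatenated.

```python
from collections import defaultdict


def fold_bit_info(bit_info, n_bits):
    """Re-key an unfolded bitInfo-style dict by folded bit position.

    Hypothetical helper for illustration only; not an RDKit function.
    """
    folded = defaultdict(tuple)
    for unfolded_id, environments in bit_info.items():
        # Each unfolded id maps to exactly one position: id modulo n_bits.
        # Ids that collide after folding share a position, so their
        # (atom index, radius) environments are concatenated.
        folded[unfolded_id % n_bits] += tuple(environments)
    return dict(folded)


# bitInfo as posted in this thread for 'CCCCC' with radius=2
info = {
    2246728737: ((0, 0), (4, 0)),
    3542456614: ((0, 1), (4, 1)),
    1685248591: ((1, 2), (3, 2)),
    2245384272: ((1, 0), (2, 0), (3, 0)),
    1510461303: ((2, 1),),
    1173125914: ((1, 1), (3, 1)),
    2738269565: ((2, 2),),
}

folded = fold_bit_info(info, 1024)

# The number of environments per folded position should match the counts
# reported by GetNonzeroElements() on the hashed fingerprint.
counts = {bit: len(envs) for bit, envs in folded.items()}
print(sorted(counts.items()))
# [(33, 2), (80, 3), (294, 2), (381, 1), (591, 2), (794, 2), (887, 1)]
```

The counts match the hashed fingerprint posted above, and each unfolded id ends up at exactly one folded position, which is the single-bit-per-environment behaviour the follow-up question asks about.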