Wow - that was quick. Thanks again Greg - it's much appreciated. If I ever get around to publishing the algorithm, I'll make sure I open-source it and contribute it to RDKit.
Thanks
Jameed

-----Original Message-----
From: Greg Landrum [mailto:greg.land...@gmail.com]
Sent: 12 March 2013 04:48
To: Jameed Hussain
Cc: rdkit-discuss@lists.sourceforge.net
Subject: Re: [Rdkit-discuss] RDKit fingerprint enhancement request

Dear Jameed,

On Mon, Mar 11, 2013 at 7:10 PM, Jameed Hussain <jameed.x.huss...@gsk.com> wrote:
> <snip>
>
> I remember chatting to you at the UGM about this. It works okay - but
> it is slow (as you need to generate an fp for every atom you need the
> partial fp for) and can suffer from issues related to symmetry. Hence,
> I was wondering if you could add an option/enhancement to the
> topological fingerprinting code.
>
> Would it be possible to record the bits set for every atom in a given
> molecule as you generate the fingerprint. So something like a
> dictionary keyed on atom id with a value containing an array/set of
> the bits that get set for the atom. So as you hash a path, record the
> bits that are set to on for the ids of the atoms in the path.
> Hopefully, this isn't a large piece of work.

It's not.

> It would make the partial_fp generation much quicker as I would just
> need to generate the fp once and the data structure would contain all
> the information needed to generate the partial fp for any
> atom/substructure in the molecule (without the symmetry issues). It
> would also have the benefit of providing a data structure to explain
> the bits for the topological fingerprint like you have for the Morgan
> fingerprint. I hope that is enough to convince you :-)

You had me already... this is just a nice extra bit. :-)

> Lastly, there isn't an urgency as I have a slow implementation - I
> just want to make it quicker.

I just checked in an initial implementation. This will slow the fingerprinter down somewhat when you're using the option, but it shouldn't be that bad compared to the general slowness of the fingerprinter.
From the Python side it looks like this:

In [1]: from rdkit import Chem

In [2]: l = []

In [3]: fp=Chem.RDKFingerprint(Chem.MolFromSmiles('CCCO'),minPath=1,maxPath=3,nBitsPerHash=1,atomBits=l)

In [4]: list(fp.GetOnBits())
Out[4]: [242, 591, 718, 820, 1485]

In [6]: l
Out[6]:
[[718, 820, 1485],
 [718, 820, 591, 1485],
 [718, 242, 820, 591, 1485],
 [242, 591, 1485]]

-greg
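For anyone following along, here is a minimal sketch of how the per-atom bit lists might be used to build a partial fingerprint without regenerating the fingerprint for each atom. It works on the `l` output from the session above, copied in as plain data so it runs without RDKit. The helper names `partial_fp_bits` and `bit_to_atoms` are hypothetical, and treating a "partial fp" as the union of the bits recorded for the chosen atoms is an assumption; the actual algorithm discussed in this thread is not shown in it.

```python
# Per-atom bit lists copied from the `l` output in the session above
# (molecule 'CCCO', atoms indexed 0..3).
atom_bits = [
    [718, 820, 1485],
    [718, 820, 591, 1485],
    [718, 242, 820, 591, 1485],
    [242, 591, 1485],
]


def partial_fp_bits(atom_bits, atom_ids):
    """Hypothetical helper: union of the bits recorded for the given atoms.

    Assumes a "partial fp" is simply every bit set by a path that
    touches at least one of the selected atoms.
    """
    bits = set()
    for idx in atom_ids:
        bits.update(atom_bits[idx])
    return sorted(bits)


def bit_to_atoms(atom_bits):
    """Hypothetical helper: invert the per-atom lists into bit -> atom indices."""
    mapping = {}
    for idx, bits in enumerate(atom_bits):
        for bit in bits:
            mapping.setdefault(bit, []).append(idx)
    return mapping


# Bits touched by the oxygen (atom 3) alone.
print(partial_fp_bits(atom_bits, [3]))

# Selecting every atom gives back the same bits as fp.GetOnBits() above.
print(partial_fp_bits(atom_bits, range(len(atom_bits))))

# Which atoms contribute to bit 242?
print(bit_to_atoms(atom_bits)[242])
```

The inverted mapping is the sort of "explain the bits" structure mentioned in the request: it shows which atoms lie on the paths that set a given bit.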