Dear Isidro,

On Mon, Aug 5, 2013 at 12:37 PM, Isidro Cortés <isidrolausc...@gmail.com> wrote:

> Hi Greg, Hi All,
>
> Concerning the Morgan fingerprints in RDKit, I have several questions:
>
> - I am using
>
> fp = AllChem.GetMorganFingerprintAsBitVect(mol,2,512,bitInfo=info)
>
> to calculate the fingerprints. I need them for machine learning. Therefore,
> I would like to confirm the following: each bit corresponds to a particular
> chemical substructure, which can be mapped back to a sketch (plot) of the
> substructure within the molecule.
> If I calculate the fingerprint for a dataset, will each bit correspond to
> the same chemical substructure for all the compounds? And does each bit
> correspond to a unique chemical substructure? I mean, are there no
> clashes?
>
> In that case, what is the procedure for selecting which features will
> finally appear in the fixed-length fingerprint? Is there a numerical or
> chemical criterion?
>

The code that generates the Morgan fingerprints identifies the atom
environments ("circular" substructures) using the algorithm described in
this paper:
 Rogers, D. & Hahn, M. Extended-Connectivity Fingerprints. JCIM 50:742-54
(2010)  http://dx.doi.org/10.1021/ci100050t
Each environment is then hashed to give an unsigned 32-bit integer. These
hashes are used as the bit ids if you call AllChem.GetMorganFingerprint().
If you are using AllChem.GetMorganFingerprintAsBitVect(), the unsigned
32-bit bit id is divided by the bit vector size and the remainder is used
as the new bit id (in pseudocode: newBitId = bitId % numBits).
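
The folding step can be sketched in a few lines of plain Python. This is only an illustration of the modulo operation described above, not the RDKit source; the function name fold_bit_id and the example ids are made up for the sketch.

```python
def fold_bit_id(bit_id: int, num_bits: int) -> int:
    """Fold an unsigned 32-bit environment hash into a bit vector of num_bits bits."""
    if num_bits <= 0:
        raise ValueError("num_bits must be positive")
    # The remainder of the division by the vector size is the new bit id.
    return bit_id % num_bits


# Hypothetical environment hashes, invented for illustration only.
example_ids = [98513984, 2246728737, 3217380708]

# The same id lands on a different bit depending on the fingerprint size.
folded_512 = [fold_bit_id(i, 512) for i in example_ids]
folded_1024 = [fold_bit_id(i, 1024) for i in example_ids]
```

Every folded id is guaranteed to lie in the range 0 to num_bits - 1, which is what makes the result usable as an index into a fixed-length bit vector.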

The original hashing process can certainly generate collisions (different
substructures that map to the same bit), but I'm not aware of examples of
it happening and I haven't actively gone looking for collisions. Hashing
into the smaller space of a bit vector is much more likely to yield
collisions. I have seen a couple of specific examples of these at a
fingerprint size of 1024 bits. A size of 512 bits, as you are using above,
is definitely going to have collisions.
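
To get a rough feel for how the collision risk grows as the vector shrinks, here is a back-of-the-envelope estimate. It assumes that the folded bit ids are spread uniformly over the vector, which is an idealization and not something I have measured for the Morgan hashes; the function names are invented for this sketch.

```python
def expected_set_bits(num_environments: int, num_bits: int) -> float:
    """Expected number of distinct bits set when num_environments distinct
    substructures are folded uniformly at random into num_bits bits."""
    return num_bits * (1.0 - (1.0 - 1.0 / num_bits) ** num_environments)


def expected_collisions(num_environments: int, num_bits: int) -> float:
    """Expected number of substructures that share a bit with an earlier one."""
    return num_environments - expected_set_bits(num_environments, num_bits)


# Compare two fingerprint sizes for a hypothetical 200 distinct environments.
loss_512 = expected_collisions(200, 512)
loss_1024 = expected_collisions(200, 1024)
```

Under this assumption, halving the fingerprint size roughly doubles the expected number of clashes as long as the vector is sparsely populated.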

To answer your specific questions:
1) The same substructure will always set the same bit, regardless of which
molecule it comes from. Which bit it sets depends on the size of the
fingerprint.
2) Because of the hashing, it is possible that different substructures can
set the same bit. The risk of this goes up as you hash into a smaller space.
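
Both points can be checked on a set of unfolded bit ids by grouping them by their folded position. The sketch below uses toy integers rather than real environment hashes; in practice the ids could be taken, for example, from the keys of the bitInfo dictionary filled by AllChem.GetMorganFingerprint(). The helper name find_collisions is made up.

```python
from collections import defaultdict


def find_collisions(bit_ids, num_bits):
    """Map each folded bit to the distinct unfolded ids that share it,
    keeping only bits hit by more than one id."""
    buckets = defaultdict(set)
    for bit_id in bit_ids:
        buckets[bit_id % num_bits].add(bit_id)
    return {bit: sorted(ids) for bit, ids in buckets.items() if len(ids) > 1}


# Toy ids chosen so that the clashes are easy to verify by hand.
toy_ids = [5, 517, 1029, 40]

collisions_512 = find_collisions(toy_ids, 512)    # {5: [5, 517, 1029]}
collisions_1024 = find_collisions(toy_ids, 1024)  # {5: [5, 1029]}
```

A given id always folds to the same bit for a fixed size, and the larger vector separates some, but not necessarily all, of the ids that clash in the smaller one.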

The RDKit implementation of the Morgan fingerprint is definitely well
suited to machine learning; several examples have been posted here. If you
are not happy with the hashing and want to have a pre-defined space of
substructures to use for learning, the RDKit offers another possibility
using the molecular fragmenter. There's documentation for this in the
"Getting Started" guide:
http://www.rdkit.org/docs/GettingStartedInPython.html#molecular-fragments

I hope this helps,
-greg