Re: [Rdkit-discuss] Molecular dis / similarity using fingerprints

Greg Landrum Mon, 25 May 2015 19:59:37 -0700

Hi JP.

On Mon, May 25, 2015 at 4:10 PM, JP <[email protected]> wrote:

>
> I have a set of molecules (~30,000) which I would like to get a structural
> "diversity index" for.  So I thought easy - generate some fingerprint I
> fancy (ECFP-like, rad 2), take a threshold I fancy (0.7), select a
> similarity metric I fancy (Tanimoto) and apply these to the set in a
> pairwise fashion (you can only do this for a small-ish number of
> molecules).  The resulting distribution of Tanimoto scores defines the
> similarity (or dissimilarity) of the set.
>
> First of all is there a better way to do this? Does anyone have a feel for
> the numbers to use (fingerprint type, radius, no of bits)?  Is there some
> 'Industry standard'?  Which method should I use
> GetMorganFingerprintAsBitVect or GetMorganFingerprint (considering I wanted
> ECFP like fingerprints) ?  What determines when to use one over the other?
>

The two functions use the same algorithm for identifying features in the
molecule, but they return different object types. GetMorganFingerprint()
returns a sparse int vector 2^32 elements long containing the counts of the
number of times each feature appears. GetMorganFingerprintAsBitVect()
returns a bit vector (standard fingerprint) nBits long (nBits is an
argument) that indicates whether or not a particular feature is present.
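
To make the difference concrete, here is a minimal pure-Python sketch of how Tanimoto similarity behaves on count vectors versus presence/absence bits. It is not RDKit code: the feature IDs and counts are invented for illustration, and the helper names `tanimoto_counts` and `tanimoto_bits` are hypothetical. It only mimics the kind of objects the two functions return.

```python
from collections import Counter


def tanimoto_counts(a, b):
    """Tanimoto on count vectors: sum of per-feature minima over sum of maxima."""
    keys = set(a) | set(b)
    shared = sum(min(a[k], b[k]) for k in keys)
    total = sum(max(a[k], b[k]) for k in keys)
    # Returning 0.0 for two empty vectors is a convention chosen for this sketch.
    return shared / total if total else 0.0


def tanimoto_bits(a, b):
    """Tanimoto on presence/absence only: |intersection| / |union|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Made-up feature IDs mapped to made-up occurrence counts (assumptions).
mol_a = Counter({101: 4, 202: 2, 303: 1})
mol_b = Counter({101: 1, 202: 2, 404: 1})

count_sim = tanimoto_counts(mol_a, mol_b)  # counts matter here
bit_sim = tanimoto_bits(mol_a, mol_b)      # only presence matters here

print(count_sim, bit_sim)
```

The two molecules share the same two features, so the bit-based score ignores the fact that feature 101 occurs four times in one molecule and once in the other; the count-based score penalizes that difference.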

Similarities calculated using the two fingerprints are highly correlated,
but certainly not identical; see
http://rdkit.blogspot.ch/2013/10/comparing-fingerprints-to-each-other.html

> All my scores are rather low even for relatively similar structures -- so
> I think one of my parameters must be off.  Just adding (or removing) a
> carbonyl drops my score to 0.43.
> I made this notebook example:
> http://nbviewer.ipython.org/gist/malteseunderdog/6af446c0dbb1ac9840e7
>

As Tim pointed out, if you change the aromaticity of a system (which
adding/removing the carbonyl does), it can have a dramatic impact on the
similarity. That is what's going on here.

> To the RDKit question: GetMorganFingerprintAsBitVect and
> GetMorganFingerprint give different tanimoto scores (with same radius: 2).
> This is of course because for the explicit bit vector we can set the length
> of the vector/fingerprint.  Is there an equivalence between the two? (say
> using n bits gives same results as GetMorganFingerprint).  How come the
> GetMorganFingerprint method has no user-defined length for the
> fingerprint?  What are the hashed equivalents of these fingerprints (e.g.
> GetHashedMorganFingerprint) ?
>

The other two were explained above; GetHashedMorganFingerprint() returns a
count vector of a user-specified length (instead of being 2^32 long).
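
As a rough illustration of what a user-specified length means, the sketch below folds a sparse count vector into a fixed number of slots. It assumes simple modulo folding, which is an assumption made for this example and not a statement about RDKit's internals; the feature IDs and the helper names `fold_counts` and `fold_bits` are hypothetical.

```python
from collections import Counter


def fold_counts(sparse_counts, n_bits):
    """Fold sparse feature counts into n_bits slots, summing counts that collide."""
    folded = Counter()
    for feature_id, count in sparse_counts.items():
        folded[feature_id % n_bits] += count
    return folded


def fold_bits(sparse_counts, n_bits):
    """Fold to presence/absence only: the set of slots that are switched on."""
    return {feature_id % n_bits for feature_id in sparse_counts}


# Made-up sparse fingerprint: feature ID -> number of occurrences (assumption).
sparse = {4101: 2, 17: 1, 1041: 3}

folded_counts = fold_counts(sparse, 1024)
folded_bits = fold_bits(sparse, 1024)

print(dict(folded_counts), sorted(folded_bits))
```

Features 17 and 1041 land in the same slot when folded to 1024 positions, so their counts are summed in the folded count vector and they share a single bit in the folded bit vector. Collisions like this are one reason scores from shorter fingerprints can differ from those computed on the unfolded vector.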

> ps A small suggestion, if I am allowed.  The fingerprint classes could do
> with an informative toString (or non Java equivalent) - I know there is
> ToBitString, but you need to call that explicitly when printing
>

Do you mean that you'd like "print fingerprint" from Python to show
something about the value of the fingerprint instead of just its type?
That would be inconsistent with the rest of the RDKit objects, but it
could make sense to revisit how all of that is done.

-greg
