On 17/03/2020 17:14, Chris Earnshaw wrote:
A quick comment on the cosine metric. Unlike Tanimoto it obeys the
triangle inequality, so in cases where it's used essentially as a
distance metric (e.g. some clustering applications) the results are
probably more mathematically correct.

The Tanimoto _distance_ is a valid metric, under certain conditions
(like vectors of only positive values).

For bitstrings, the formula is:
d = 1 - |AnB|/|AuB|

For float or integer vectors:

d = 1 - sum_i(min(a_i, b_i))/sum_i(max(a_i, b_i))
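The two formulas above can be sketched in plain C++ as follows. This is only an illustration that does not use RDKit: the function names are made up for this example, and returning a distance of 0 for two all-zero vectors is an assumed convention, not something the formulas prescribe.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Tanimoto distance for bit vectors: d = 1 - |A n B| / |A u B|.
// Assumed convention: two all-zero vectors have distance 0.
double tanimotoDistanceBits(const std::vector<bool> &a,
                            const std::vector<bool> &b) {
  assert(a.size() == b.size());
  std::size_t both = 0;    // bits set in A and in B
  std::size_t either = 0;  // bits set in A or in B
  for (std::size_t i = 0; i < a.size(); ++i) {
    if (a[i] && b[i]) ++both;
    if (a[i] || b[i]) ++either;
  }
  return either == 0 ? 0.0 : 1.0 - static_cast<double>(both) / either;
}

// Generalized form for non-negative count or float vectors:
// d = 1 - sum_i min(a_i, b_i) / sum_i max(a_i, b_i).
double tanimotoDistanceCounts(const std::vector<double> &a,
                              const std::vector<double> &b) {
  assert(a.size() == b.size());
  double minSum = 0.0;
  double maxSum = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) {
    // The metric property is only claimed for non-negative values.
    assert(a[i] >= 0.0 && b[i] >= 0.0);
    minSum += std::min(a[i], b[i]);
    maxSum += std::max(a[i], b[i]);
  }
  return maxSum == 0.0 ? 0.0 : 1.0 - minSum / maxSum;
}
```

For 0/1 vectors the second function gives the same value as the first, since min acts as the intersection and max as the union.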

For the mathematical details, cf.

A proof of the triangle inequality for the Tanimoto distance
https://link.springer.com/article/10.1023%2FA%3A1019154432472

and

A note on the triangle inequality for the Jaccard distance
https://www.sciencedirect.com/science/article/pii/S0167865518309188

If you are used to the Tanimoto score, there is no reason not to
switch to the Tanimoto distance if a true metric is required by the
underlying algorithm/method.

Regards,
F.

I used it a lot in that context.
Whether it makes any real difference in practical terms is of course
questionable as the fingerprints themselves are only very approximate
descriptors.

All the best,
Chris

On Tue, 17 Mar 2020 at 07:28, Greg Landrum <greg.land...@gmail.com>
wrote:

Hi Jason,

On Mon, Mar 16, 2020 at 1:26 PM Jason Biggs <jasondbi...@gmail.com>
wrote:

Thank you again Greg.  If you have time to get this in the
upcoming release great, do not rush on my account.

I spent some time looking at this and it's not going to be
a quick one: it'll require some thought and refactoring to make the
bit info work in all circumstances. That means that it won't make it
into the upcoming release.

I have another couple of questions regarding fingerprints in
general and the fingerprint generators in particular.

* To what degree do people use the different fingerprint types?
Is it more common to use the RDKit fingerprint, for example, as a
bit vector, and the Morgan fingerprint as a counts vector?  Does
it depend on the application or is it more how a particular
fingerprint was historically used?

It's hard to be really sure, but I would guess that the Morgan
fingerprints are the most used. I'm also going to guess, and this is
based on even more of a gut feeling, that people are using bit
vectors, not count vectors. As for why... good question. Probably
because the Morgan fingerprints tend to work well in general (though
there definitely is no "best" fingerprint....
https://link.springer.com/article/10.1186/1758-2946-5-26,
http://pubs.acs.org/doi/abs/10.1021/ci400466r), give results that
look "chemically similar" and there's lots of sample code around.

* I notice there is a wider variety of distance measures
available for bit vectors than for count vectors. Is this because
these measures, the McConnaughey similarity for example, aren't
extendable to multisets in the same way that Tversky similarity
can? Or is it just that there hasn't been any demand for
non-bitvector versions of the measures in BitOps.h?

Aeons ago when I wrote that code I wanted to be sure to have as many
metrics as possible available. Since then it's become clear
that Tanimoto/Dice (you can prove that these rank results exactly
the same) and Tversky (because it allows you to do asymmetric
measures) cover most every need. I've also seen people use cosine
similarity for comparing molecules of different sizes (though
asymmetric Tversky lets you do the same).
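A small sketch of the relationships mentioned above, written in plain C++ rather than against BitOps.h. The struct and function names are hypothetical, the Tversky form used is the standard textbook one, c / (alpha * onlyA + beta * onlyB + c), and returning 0 when the denominator is zero is an assumed convention.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts of bits set only in A, only in B, and in both.
struct BitCounts {
  std::size_t onlyA;
  std::size_t onlyB;
  std::size_t both;
};

BitCounts countBits(const std::vector<bool> &a, const std::vector<bool> &b) {
  assert(a.size() == b.size());
  BitCounts n{0, 0, 0};
  for (std::size_t i = 0; i < a.size(); ++i) {
    if (a[i] && b[i]) ++n.both;
    else if (a[i]) ++n.onlyA;
    else if (b[i]) ++n.onlyB;
  }
  return n;
}

// Tversky similarity; alpha and beta weight the bits unique to A and to B.
// Assumed convention: 0 when the denominator is 0.
double tversky(const BitCounts &n, double alpha, double beta) {
  const double denom = alpha * n.onlyA + beta * n.onlyB + n.both;
  return denom == 0.0 ? 0.0 : n.both / denom;
}

// Tanimoto is Tversky with alpha = beta = 1.
double tanimoto(const BitCounts &n) { return tversky(n, 1.0, 1.0); }

// Dice is Tversky with alpha = beta = 0.5.
double dice(const BitCounts &n) { return tversky(n, 0.5, 0.5); }
```

With T the Tanimoto value and D the Dice value, D = 2T / (1 + T), which increases monotonically with T; that is why the two give identical rankings. Choosing alpha != beta makes the measure asymmetric: with alpha = 0 and beta = 1, a fingerprint B whose bits are all contained in A scores 1 regardless of how many extra bits A has.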

* Would it be useful to people for the FingerprintGenerator class
to return the list of atom invariants (or environments) used?  Or
is that what the BitInfo is used for?

There are generators available for the atom and bond invariants of
each of the fingerprint types. The fingerprint generators don't have
a method available that allows you to retrieve the atom/bond
invariant generators that they are using, but we could add this if
it would be useful.

-greg

Best,
Jason

On Fri, Mar 13, 2020 at 11:13 PM Greg Landrum
<greg.land...@gmail.com> wrote:

Unfortunately it looks like the additional outputs for the Morgan and
RDKit fingerprints are parts that weren't finished:

https://github.com/rdkit/rdkit/blob/master/Code/GraphMol/Fingerprints/MorganGenerator.cpp#L143


https://github.com/rdkit/rdkit/blob/master/Code/GraphMol/Fingerprints/RDKitFPGenerator.cpp#L99

I will take a look and see if it's possible to get these into the
next release. In the meantime, if you want that info it looks like
you'll need to use the older fingerprinting functions.

-greg

On Fri, Mar 13, 2020 at 11:10 PM Jason Biggs <jasondbi...@gmail.com>
wrote:

Thank you Greg.

I am working in C++.  I could poke around with this if I knew which
members of the AdditionalOutput struct are used by which fingerprint
generators.  I just wanted to make sure there wasn't an explanation
somewhere I missed.

I can see that with the AtomPairs fingerprints I can do the
following

// mol is an ROMol* and fpg is a FingerprintGenerator*

RDKit::AdditionalOutput ao;

// one vector of bit ids per atom, filled in by the generator
std::vector<std::vector<std::uint64_t>>
atomtobits(mol->getNumAtoms());
ao.atomToBits = &atomtobits;

auto res = fpg->getSparseCountFingerprint(*mol, nullptr, nullptr, -1,
&ao);

after which atomtobits contains a list of bits for each atom.  From
the comments I think the bitInfo member should be used by the
RDKitFingerprintGenerator, but I don't see where it is used in the
code.  Is that the part that wasn't finished?  Is it possible to get
information about the atoms/environments that set particular bits in
the Morgan or RDKit fingerprints using the new API?

Jason Biggs

On Fri, Mar 13, 2020 at 10:20 AM Greg Landrum
<greg.land...@gmail.com> wrote:

Hi Jason,

At the moment there's nothing available here except what's in the
C++ tests. This part of the code didn't end up being completely
finished before the GSoC project ended and it's never bubbled up on
my priority list to finish it.

I haven't spent much time with this code, but I can probably put
together an example.
Are you working from C++?

-greg

On Thu, Mar 12, 2020 at 10:42 PM Jason Biggs <jasondbi...@gmail.com>
wrote:

I am taking a look at the FingerprintGenerator class and I really
like this unified interface for these four types of fingerprints.  I
have very limited experience with the fingerprint code before the
generator API was introduced.

What I'm not sure about is how to get information about the
atoms/environments that set the bits.  I believe I need to use the
AdditionalOutput struct,

https://www.rdkit.org/docs/cppapi/structRDKit_1_1AdditionalOutput.html,
but I'm not exactly sure how to do so.  I normally would look at the
c++ test files to see how it is used, and from that I see the
atomToBits member is used in the atom pairs fingerprints, but I'm
not sure about the other members of this struct.  For example there
is a bitInfo member, is this where I would find information for the
RDKit and Morgan fingerprints?

Are there any examples somewhere that I could follow to find out
more information?

Thank you

Jason
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss