A quick comment on the cosine metric. Unlike Tanimoto it obeys the triangle
inequality, so in cases where it's used essentially as a distance metric
(e.g. some clustering applications) the results are probably more
mathematically correct. I used it a lot in that context. Whether it makes
any real difference in practical terms is of course questionable as the
fingerprints themselves are only very approximate descriptors.
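
For concreteness, here is a minimal sketch of the two measures on plain bit
vectors. It uses std::bitset rather than the RDKit classes, and the names
Fingerprint, kNumBits, tanimotoSimilarity and cosineSimilarity are made up
for this illustration. The return values for empty fingerprints are an
arbitrary convention, not something taken from the RDKit code.

```cpp
#include <bitset>
#include <cmath>
#include <cstddef>

// Hypothetical fingerprint length, chosen only for this illustration.
constexpr std::size_t kNumBits = 64;
using Fingerprint = std::bitset<kNumBits>;

// Tanimoto similarity on bit vectors: |A & B| / |A | B|.
double tanimotoSimilarity(const Fingerprint &a, const Fingerprint &b) {
  const std::size_t unionCount = (a | b).count();
  if (unionCount == 0) {
    return 1.0;  // assumption: two empty fingerprints count as identical
  }
  return static_cast<double>((a & b).count()) /
         static_cast<double>(unionCount);
}

// Cosine similarity on bit vectors: |A & B| / sqrt(|A| * |B|).
double cosineSimilarity(const Fingerprint &a, const Fingerprint &b) {
  const std::size_t na = a.count();
  const std::size_t nb = b.count();
  if (na == 0 || nb == 0) {
    return 0.0;  // assumption: undefined case mapped to zero similarity
  }
  return static_cast<double>((a & b).count()) /
         std::sqrt(static_cast<double>(na) * static_cast<double>(nb));
}
```

With Fingerprint(0x0F) and Fingerprint(0x3C), which have four bits each and
share two, this gives a Tanimoto of 2/6 and a cosine of 2/4. For clustering,
a distance is commonly taken as 1 minus either value.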

All the best,
Chris

On Tue, 17 Mar 2020 at 07:28, Greg Landrum <greg.land...@gmail.com> wrote:

> Hi Jason,
>
> On Mon, Mar 16, 2020 at 1:26 PM Jason Biggs <jasondbi...@gmail.com> wrote:
>
>> Thank you again Greg.  If you have time to get this into the upcoming
>> release, great, but do not rush on my account.
>>
>
> I spent some time looking at this and it's not going to be a quick one:
> it'll require some thought and refactoring to make the bit info work in
> all circumstances. That means that it won't make it into the upcoming
> release.
>
>
>> I have another couple of questions regarding fingerprints in general and
>> the fingerprint generators in particular.
>>
>>
>>    - To what degree do people use the different fingerprint types? Is it
>>    more common to use the RDKit fingerprint, for example, as a bit vector,
>>    and the Morgan fingerprint as a counts vector?  Does it depend on the
>>    application or is it more how a particular fingerprint was historically
>>    used?
>>
> It's hard to be really sure, but I would guess that the Morgan
> fingerprints are the most used. I'm also going to guess, and this is based
> on even more of a gut feeling, that people are using bit vectors, not count
> vectors. As for why... good question. Probably because the Morgan
> fingerprints tend to work well in general (though there definitely is no
> "best" fingerprint; see
> https://link.springer.com/article/10.1186/1758-2946-5-26 and
> http://pubs.acs.org/doi/abs/10.1021/ci400466r), give results that look
> "chemically similar", and have lots of sample code around.
>
>>
>>    - I notice there is a wider variety of distance measures available
>>    for bit vectors than for count vectors. Is this because these measures,
>>    the McConnaughey similarity for example, aren't extendable to multisets
>>    in the same way that Tversky similarity is? Or is it just that there
>>    hasn't been any demand for non-bitvector versions of the measures in
>>    BitOps.h?
>>
> Aeons ago when I wrote that code I wanted to be sure to have as many
> metrics as possible available. Since then it's become clear that
> Tanimoto/Dice (you can prove that these rank results exactly the same) and
> Tversky (because it allows you to do asymmetric measures) cover most every
> need. I've also seen people use cosine similarity for comparing molecules
> of different sizes (though asymmetric Tversky lets you do the same).
>
>>
>>    - Would it be useful to people for the FingerprintGenerator class to
>>    return the list of atom invariants (or environments) used?  Or is that
>>    what the BitInfo is used for?
>>
> There are generators available for the atom and bond invariants of each
> of the fingerprint types. The fingerprint generators don't have a method
> available that allows you to retrieve the atom/bond invariant generators
> that they are using, but we could add this if it would be useful.
>
> -greg
>
>
>
>
>> Best,
>> Jason
>>
>>
>>
>> On Fri, Mar 13, 2020 at 11:13 PM Greg Landrum <greg.land...@gmail.com>
>> wrote:
>>
>>> Unfortunately it looks like the additional outputs for the Morgan and
>>> RDKit fingerprints are parts that weren't finished:
>>>
>>> https://github.com/rdkit/rdkit/blob/master/Code/GraphMol/Fingerprints/MorganGenerator.cpp#L143
>>>
>>> https://github.com/rdkit/rdkit/blob/master/Code/GraphMol/Fingerprints/RDKitFPGenerator.cpp#L99
>>>
>>> I will take a look and see if it's possible to get these into the next
>>> release. In the meantime, if you want that info it looks like you'll need
>>> to use the older fingerprinting functions.
>>>
>>> -greg
>>>
>>> On Fri, Mar 13, 2020 at 11:10 PM Jason Biggs <jasondbi...@gmail.com>
>>> wrote:
>>>
>>>> Thank you Greg.
>>>>
>>>> I am working in C++.  I could poke around with this if I knew which
>>>> members of the AdditionalOutput struct are used by which fingerprint
>>>> generators.  I just wanted to make sure there wasn't an explanation
>>>> somewhere that I missed.
>>>>
>>>> I can see that with the AtomPairs fingerprints I can do the following
>>>>
>>>> // mol is an ROMol* and fpg is a FingerprintGenerator*
>>>> RDKit::AdditionalOutput ao;
>>>>
>>>> // one (initially empty) list of bit ids per atom
>>>> std::vector<std::vector<std::uint64_t>> atomtobits(mol->getNumAtoms());
>>>> ao.atomToBits = &atomtobits;
>>>>
>>>> auto res = fpg->getSparseCountFingerprint(*mol, nullptr, nullptr, -1,
>>>>                                           &ao);
>>>>
>>>> after which atomtobits contains a list of bits for each atom.  From the
>>>> comments I think the bitInfo member should be used by the
>>>> RDKitFingerprintGenerator, but I don't see where it is used in the code.
>>>> Is that the part that wasn't finished?  Is it possible to get information
>>>> about the atoms/environments that set particular bits in the Morgan or
>>>> RDKit fingerprints using the new API?
>>>>
>>>> Jason Biggs
>>>>
>>>>
>>>>
>>>> On Fri, Mar 13, 2020 at 10:20 AM Greg Landrum <greg.land...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Jason,
>>>>>
>>>>> At the moment there's nothing available here except what's in the C++
>>>>> tests. This part of the code didn't end up being completely finished
>>>>> before the GSoC project ended and it's never bubbled up on my priority
>>>>> list to finish it.
>>>>>
>>>>> I haven't spent much time with this code, but I can probably put
>>>>> together an example.
>>>>> Are you working from C++?
>>>>>
>>>>> -greg
>>>>>
>>>>>
>>>>> On Thu, Mar 12, 2020 at 10:42 PM Jason Biggs <jasondbi...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I am taking a look at the FingerprintGenerator class and I really
>>>>>> like this unified interface for these four types of fingerprints.  I have
>>>>>> very limited experience with the fingerprint code before the generator
>>>>>> API was introduced.
>>>>>>
>>>>>> What I'm not sure about is how to get information about the
>>>>>> atoms/environments that set the bits.  I believe I need to use the
>>>>>> AdditionalOutput struct,
>>>>>> https://www.rdkit.org/docs/cppapi/structRDKit_1_1AdditionalOutput.html,
>>>>>> but I'm not exactly sure how to do so.  I normally would look at the C++
>>>>>> test files to see how it is used, and from that I see the atomToBits
>>>>>> member is used in the atom pairs fingerprints, but I'm not sure about the
>>>>>> other members of this struct.  For example there is a bitInfo member, is
>>>>>> this where I would find information for the RDKit and Morgan fingerprints?
>>>>>>
>>>>>> Are there any examples somewhere that I could follow to find out more
>>>>>> information?
>>>>>>
>>>>>> Thank you
>>>>>>
>>>>>> Jason
>>>>>> _______________________________________________
>>>>>> Rdkit-discuss mailing list
>>>>>> Rdkit-discuss@lists.sourceforge.net
>>>>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>>>>>