I’ve just analysed the frequency of hash collisions for Morgan
fingerprints on a combinatorial library of 1M potential organic
semiconductors. To reduce collisions below 0.5% (meaning: >99.5% of
fingerprints are unique), the radius has to be at least 5 (corresponding to
ECFP10) and the number of bits needs to be at least 128. The same
conclusion was obtained on a random 250k subset. Folding to 256 bits or
more (tried up to 2048), or increasing the radius (up to 8), offered only
modest improvements. Below 100 bits, the frequency of collisions increased
dramatically. So it was possible to choose a small(ish) fingerprint that is
almost always unique, even though it is a combinatorial library of rather
similar compounds, containing many sets of isomers etc.
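
A minimal sketch of how such a collision count could be set up (not
necessarily how the numbers above were obtained). It assumes a recent RDKit
with the rdFingerprintGenerator module; the helper names collision_rate and
morgan_bitstrings are hypothetical, and "collision" is taken here to mean a
molecule whose folded fingerprint is shared with at least one other
molecule, which is only one possible reading of the definition above.

```python
from collections import Counter


def collision_rate(fingerprints):
    """Fraction of entries whose fingerprint is shared with another entry."""
    counts = Counter(fingerprints)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    shared = sum(n for n in counts.values() if n > 1)
    return shared / total


def morgan_bitstrings(smiles_list, radius=5, n_bits=128):
    """Yield folded Morgan fingerprints as bit strings, skipping bad SMILES."""
    # Imported lazily so collision_rate() can be used without RDKit installed.
    from rdkit import Chem
    from rdkit.Chem import rdFingerprintGenerator

    generator = rdFingerprintGenerator.GetMorganGenerator(
        radius=radius, fpSize=n_bits
    )
    for smiles in smiles_list:
        mol = Chem.MolFromSmiles(smiles)
        if mol is not None:
            yield generator.GetFingerprint(mol).ToBitString()
```

Usage would be along the lines of
collision_rate(morgan_bitstrings(smiles, radius=5, n_bits=128)), repeated
over a grid of radii and bit counts. The input SMILES should be
de-duplicated first, otherwise repeated molecules are counted as collisions.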
Regards,
Michal
On Fri, 20 Apr 2018 at 19:02, David Cosgrove <davidacosgrov...@gmail.com>
wrote:
> Hi Jeff,
> What you say is theoretically correct, in that it is probably not possible
> to go from the fingerprint directly to a structure. However, it is possible
> to generate structures and rapidly compare them to the target fingerprint.
> The fingerprints are of course able to tell you how close your structure is
> to the target fingerprint in a way that can drive an optimisation
> algorithm. Chemistry adds strong constraints to what structures are
> possible, which reduces the search space dramatically and if you know it’s
> a “drug-like” molecule you’re looking for, even more so.
> People forget that Daylight originally developed fingerprints to speed up
> substructural searching of databases. A structure can only be a
> substructure of another molecule if all the bits it sets are also set in
> the other molecule’s fingerprint. Fingerprints are specifically designed to
> encode the molecular structure, and that’s why a GA can be successful. As
> Peter says, the same
> fingerprint can be generated for different molecules, but this will be rare
> if the fingerprint is well designed. Try it on ChEMBL with an RDKit
> fingerprint and I’ll be surprised if you get more than 10 pairs that aren’t
> isomers of each other or something trivial like that.
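
The screening rule described above (every bit set by the query must also be
set by the target) can be sketched as follows. The helper names are
hypothetical, and using RDKit's PatternFingerprint rather than a Daylight
fingerprint is an assumption made for illustration.

```python
def passes_screen(query_bits, target_bits):
    """True if every bit set by the query is also set by the target."""
    return set(query_bits) <= set(target_bits)


def pattern_bits(smiles):
    """On-bits of RDKit's substructure-screening fingerprint for a SMILES."""
    # Imported lazily so passes_screen() can be used without RDKit installed.
    from rdkit import Chem

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    return frozenset(Chem.PatternFingerprint(mol).GetOnBits())
```

The screen is necessary but not sufficient: a target that passes still has
to be confirmed with a real substructure match (for example
HasSubstructMatch in RDKit), because different fragments can set the same
bit.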
> Regards,
> Dave
>
> On Fri, 20 Apr 2018 at 18:49, Peter S. Shenkin <shen...@gmail.com> wrote:
>
>> Well, @jeff, there's no law saying that hashes must collide, and in fact
>> some are designed to make collision extremely unlikely (can you say
>> "SHA-2"?). But the ones in question here do collide relatively frequently,
>> for at least some molecular fingerprint types.
>>
>> An interesting question (maybe only to me :-) ) would be how similar, in
>> general, the structures are that exhibit identical fingerprints, for the
>> well-known fingerprint types, for various fingerprint lengths. A
>> sufficiently complicated molecule will give lots of on bits, and for (say)
>> a 64-bit fingerprint, there can only be 64 possible fingerprints with all
>> but one bit turned on.
>>
>> I realize that most fingerprints in common use today are longer than
>> this, but still, looking back at 64- and 32-bit fingerprints with all but
>> one bit on might give some insight. How short does a fingerprint of some
>> particular type have to be for, say, 10% of ChEMBL molecules to exhibit an
>> all-on pattern? How short does it have to be for, say, 10% of ChEMBL
>> molecules to have an exact fingerprint match with some other molecule?
>>
>> -P
>>
>> On Fri, Apr 20, 2018 at 1:03 PM, jeff godden <jgod...@gmail.com> wrote:
>>
>>> Long ago molecular fingerprints were referred to in the literature as
>>> molecular hash functions (y'know, those crazy mathematical algorithms
>>> which permit rapid lookup of some string in a lookup table). As such, we
>>> expected there to be associated hash collisions (
>>> https://en.wikipedia.org/wiki/Hash_table#Collision_resolution ). All
>>> this by way of saying that to go from fingerprint to the molecular
>>> structure which produced it is traditionally impossible unless the
>>> fingerprint no longer amounts to a hash(ing) function.
>>> --
>>> j
>>>
>>>
>>> On Fri, Apr 20, 2018 at 9:56 AM, Peter S. Shenkin <shen...@gmail.com>
>>> wrote:
>>>
>>>> Isn't it the case that more than one molecule can share an identical
>>>> fingerprint? (Depending on the specific fingerprint.) Think biphenyl,
>>>> extended to p-terphenyl, p-quaterphenyl, etc. Still, a GA or SA method could
>>>> keep going and come up with multiple matches, plus multiple near-misses.
>>>>
>>>> -P.
>>>>
>>>> On Fri, Apr 20, 2018 at 10:58 AM, David Cosgrove <
>>>> davidacosgrov...@gmail.com> wrote:
>>>>
>>>>> Hi Brian,
>>>>> Dave Weininger once showed a fairly simple GA that could generally
>>>>> deduce a structure from a Daylight fingerprint by using SMILES strings as
>>>>> the chromosomes and Tanimoto distance to the target fingerprint as the
>>>>> fitness function. He may have done a talk about it for MUG or conceivably
>>>>> written it up. It’d be in JCICS if so, I expect.
>>>>>
>>>>> You could probably knock up a script to do that in a couple of hours, I
>>>>> would think, using a GA library to do the mechanics. If you’re not worried
>>>>> about high efficiency, you don’t need to do anything fancy with mutation
>>>>> and crossover of the SMILES strings to ensure you always get a valid
>>>>> molecule; you can just give a fitness of 0 if the SMILES parser doesn’t
>>>>> like what you give it.
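
The fitness function described above might look roughly like this. It is a
sketch under stated assumptions, not the original Daylight code: the names
tanimoto, fitness and rdkit_bits are hypothetical, fingerprints are held as
sets of on-bits, and Tanimoto similarity (higher is better) is used as the
score. The GA itself (selection, mutation, crossover) is left to whatever
library is used.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def fitness(smiles, target_bits, fingerprint):
    """Score a candidate SMILES against the target fingerprint.

    `fingerprint` maps a SMILES string to a set of on-bits, or to None
    when the string cannot be parsed; unparsable candidates score 0.
    """
    bits = fingerprint(smiles)
    if bits is None:
        return 0.0
    return tanimoto(bits, target_bits)


def rdkit_bits(smiles):
    """On-bits of the RDKit path fingerprint, or None for an invalid SMILES."""
    # Imported lazily so fitness() can be exercised without RDKit installed.
    from rdkit import Chem

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return frozenset(Chem.RDKFingerprint(mol).GetOnBits())
```

A GA would then maximise fitness(candidate, target_bits, rdkit_bits) over
mutated and crossed-over SMILES strings.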
>>>>> HTH,
>>>>> Dave
>>>>>
>>>>>
>>>>> On Fri, 20 Apr 2018 at 14:45, Nils Weskamp <nils.wesk...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Brian,
>>>>>>
>>>>>> in general, it might be difficult to come up with a deterministic
>>>>>> algorithm that generates exactly one structure for a given fingerprint
>>>>>> due to many ambiguities in the process. If you are happy with a more
>>>>>> "fuzzy" (approximate / probabilistic) approach, you might want to take a
>>>>>> look at
>>>>>>
>>>>>> https://pubs.acs.org/doi/abs/10.1021/ci600383v
>>>>>> https://link.springer.com/article/10.1007/s10822-005-9020-4
>>>>>>
>>>>>> Given this task, I would probably start with a large database of
>>>>>> known compounds (PubChem, UniChem, GDB17), calculate fingerprints and
>>>>>> then do a similarity search with my query fingerprint.
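
The database lookup suggested above can be sketched as a top-k Tanimoto
search. This is only an illustration: most_similar is a hypothetical
helper, the database is assumed to be a dict mapping a compound name to its
set of on-bits, and for millions of compounds RDKit's
DataStructs.BulkTanimotoSimilarity on bit vectors would be the more
practical route.

```python
import heapq


def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def most_similar(query_bits, database, k=5):
    """Return the k best (similarity, name) pairs, most similar first."""
    scored = (
        (tanimoto(query_bits, bits), name) for name, bits in database.items()
    )
    return heapq.nlargest(k, scored)
```

A similarity of 1.0 only shows that the fingerprints are identical, so any
exact hits still need to be checked for being the same molecule.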
>>>>>>
>>>>>> Hope this helps,
>>>>>> Nils
>>>>>>
>>>>>>
>>>>>> On Fri, Apr 20, 2018 at 3:13 PM, Brian Cole <col...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Chem-informaticians:
>>>>>>>
>>>>>>> I know it has been talked about in the community that fingerprints
>>>>>>> are not a way to obfuscate molecules for security, but I don't recall a
>>>>>>> paper actually demonstrating reverse engineering of a fingerprint into
>>>>>>> a chemical structure. Does anyone know if such a paper exists?
>>>>>>>
>>>>>>> Code using RDKit to demonstrate the functionality would be an
>>>>>>> obvious bonus as well. :-)
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Brian
>>>>>>>
>>>>>>>
>>>>>>> ------------------------------------------------------------------------------
>>>>>>> Check out the vibrant tech community on one of the world's most
>>>>>>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>>>>>>> _______________________________________________
>>>>>>> Rdkit-discuss mailing list
>>>>>>> Rdkit-discuss@lists.sourceforge.net
>>>>>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>> --
>>>>> David Cosgrove
>>>>> Freelance computational chemistry and chemoinformatics developer
>>>>> http://cozchemix.co.uk
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>> --
> David Cosgrove
> Freelance computational chemistry and chemoinformatics developer
> http://cozchemix.co.uk
>
>
>