On Apr 20, 2018, at 19:03, jeff godden <jgod...@gmail.com> wrote:
> 
> Long ago molecular fingerprints were referred to in the literature as 
> molecular hash functions. (y'know, those crazy mathematical algorithms which 
> permitted rapid lookup of some string in a lookup table)

Do you have a reference or any more detail for that?

As far as I can tell, effectively everyone used fragment-based screens from the 
punched-card era of the 1940s up until the late 1980s. It wasn't until the 
early 1990s when the word "fingerprint" appears in the cheminformatics 
literature, in a paper by John Barnard referencing the Daylight fingerprints.

I haven't come across any use of molecular hash functions for fingerprint-like 
descriptors before the Daylight work, and I looked pretty hard for one.

Instead, I get the feeling that there was some attempt in the 1990s to 
distinguish between these two approaches, and then over time the term 
"fingerprint" took on a much broader meaning than "Daylight's binary 
fingerprint based on hash-encoding of enumerated linear subgraphs." That is, I 
think that "molecular hash functions" post-date "fingerprint".

FWIW, a pubs.acs.org search for "molecular hash" found only three papers; one 
from 2005 and the other two from 2015.

The closest exception to to an earlier Daylight-like fingerprint is the 
superimposed coding created by Mooers in the 1940s, mentioned in Ray and Kirsch 
(1957), put to use in Feldman and Hodes (1975), and further used at a couple of 
places since then.

These *are* connected; Mooers in his "Codes and Coding" entry from 
"Encyclopedia of Library and Information Science" remarks: A rudimentary form 
of use of random patterns was discovered by computer people about 10 years 
later for fast look-up in tables. This simpler form, which effectively uses 
only a single descriptor, is known in the computer industry as "hash coding."" 
Weininger and Mooers both drew from the same information theory concepts to 
develop fingerprints and superimposed coding, respectively.

But they aren't the same.

I found hashing used for other molecular-related topics, like WLN and IR 
spectra lookups, but these didn't seem to be *molecular* hash functions.


>  As such, we expected for their to be the associated hash collisions  
> (https://en.wikipedia.org/wiki/Hash_table#Collision_resolution ).

As Peter Shenkin pointed out, this isn't a given.

In the MMP code I helped develop (mmpdb - https://github.com/rdkit/mmpdb ), one 
of the novel features is the ability to match not just the pairs but the local 
attachment environment, based on the circular environment of r=1 up to r=5 from 
the attachment point.

I created a fingerprint for that based on the fingerprints for the individual 
circular environments, concatenated together, and then SHA256'ed to get a 
unique characteristic.

Unlike the Daylight approach, this fingerprint can only be used to check for 
identity. The requirement that a fingerprint be used for similarity and 
substructure screens makes them much larger than needed for a simple identity 
check.

And as Dave Cosgrove rightly pointed out, this extra information makes it 
possible to reverse engineer a (Daylight-style hash fingerprint) to find a 
molecular graph which is at least isospectral to the original structure.

Hand-waving sketch: start with a carbon. Generate fingerprint. It should pass 
the screening test. If not, the structure contains no carbons, so repeat with 
other elements until you find an atom which passes. Successively either add an 
atom+bond or connect two existing atoms with a bond, fingerprint the result, 
and do the screening test. If it does not pass then that modification was not 
permitted. Use a breadth-first search which prioritizes branching and rings to 
avoid chains longer than the maximum enumeration size.

You'll also need to allow aromatic atoms in a non-ring so you can do the growth 
correctly. 

ECFP-style circular fingerprints are not designed for substructure screens so 
cannot be reverse-engineered this easily. It would be interested to try the GA 
method that Dave Cosgrove suggested.

I know of no papers concerning this topic, and I doubt that Dave Weininger ever 
published anything about it. He wasn't much into publishing in the scientific 
literature. 


Going back to the mmpdb environment fingerprint, it was also designed so that 
organizations can feel a bit more comfortable sharing MMP data with other 
organizations, since (like the InChI-Key) it's not possible to guess what an 
mmpdb environment fingerprint describes unless you 1) have it already, or 2) 
are willing to brute-force reasonable chemistry space to find it.

Interestingly, this use of "fingerprint" is more closely aligned with Rabin's 
1981 work on fingerprints - cryptographic hashes used for identify checking; 
see http://www.xmailserver.org/rabin.pdf - than with Daylight fingerprints.

When I asked a few years ago, Dave Weininger did not recall how they came up 
with the term "fingerprint".

Cheers,


                                Andrew
                                da...@dalkescientific.com



------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Reply via email to