Thanks Andrew, very interesting and useful script!
Unfortunately it doesn't work on circular/ECFP-like fingerprints. It has
the requirement that the fingerprint be a substructure fingerprint as you
described. It seems the evolutionary/genetic algorithm approach is the
current state-of-the-art for decoding circular/ECFP-like fingerprints.
Historical question for you since you're the closest we have to a
chem-informatician historian. :-) Why did these circular/ECFP fingerprints
come into existence? They lose the substructure screening property,
property #2 in the 3 properties you listed: identity, subgraph, similarity.
So they generally seem less powerful. (Good description of why that is the
case here:
https://nextmovesoftware.com/blog/2015/02/16/for-every-fingerprint-optimisation-there-is-an-equal-and-opposite-fingerprint-deterioration/
)
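To make the three properties concrete, here is a minimal sketch that treats a fingerprint as a plain Python set of on-bit indices. The bit values are made up for illustration and do not come from any real molecule or toolkit, and the function names are my own. The subset test is only a meaningful screen when the fingerprint is a substructure-style fingerprint.

```python
def is_identical(fp_a, fp_b):
    """Property 1 (identity): different fingerprints imply different structures."""
    return fp_a == fp_b


def passes_substructure_screen(query_fp, target_fp):
    """Property 2 (subgraph): every query bit must also be set in the target."""
    return query_fp <= target_fp


def tanimoto(fp_a, fp_b):
    """Property 3 (similarity): shared bits divided by the bits set in either."""
    union = len(fp_a | fp_b)
    # Returning 0.0 for two empty fingerprints is a convention chosen here.
    return len(fp_a & fp_b) / union if union else 0.0


# Hypothetical on-bit sets, for illustration only.
query = {3, 17, 42}
target = {3, 8, 17, 42, 99}

print(is_identical(query, target))                # False
print(passes_substructure_screen(query, target))  # True
print(tanimoto(query, target))                    # 0.6
```

A circular/ECFP-like fingerprint still supports the first and third tests; it is the second one that stops being a reliable screen.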
I suppose the argument could be made that circular/ECFP fingerprints are more
powerful for the similarity property, i.e., virtual screening. But my reading
of the current literature is that tree/dendritic fingerprints are statistically
just as good at virtual screening as circular/ECFP:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3686626/
https://pubs.acs.org/doi/abs/10.1021/ci100062n
A corollary to this question: why does all the machine learning research seem
to be done with circular/ECFP fingerprints? You would think the substructure
screening property would make the tree/dendritic fingerprints more
information rich.
Big thanks to everyone here, this has definitely been a very fruitful
discussion for me, hope it is for everyone.
Cheers,
Brian
On Sat, Apr 21, 2018 at 9:04 PM, Andrew Dalke <da...@dalkescientific.com>
wrote:
> On Apr 21, 2018, at 01:55, Andrew Dalke <da...@dalkescientific.com> wrote:
> > Hand-waving sketch: start with a carbon. Generate fingerprint. It should
> > pass the screening test. If not, the structure contains no carbons, so
> > repeat with other elements until you find an atom which passes.
> > Successively either add an atom+bond or connect two existing atoms with a
> > bond, fingerprint the result, and do the screening test. If it does not
> > pass then that modification was not permitted. Use a breadth-first search
> > which prioritizes branching and rings to avoid chains longer than the
> > maximum enumeration size.
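The grow-and-screen idea in that sketch can be illustrated with a toy that needs no chemistry toolkit. This is not the code in rev_eng_fp.py: here a "molecule" is just a string, the "fingerprint" is the set of its substrings up to a maximum length (a stand-in for path enumeration, with no hashing), and the search is a simple greedy extension rather than the breadth-first search described above. All names are hypothetical.

```python
def fingerprint(s, max_len=4):
    """Toy fingerprint: every substring of s with 1..max_len characters."""
    return {s[i:j]
            for i in range(len(s))
            for j in range(i + 1, min(i + max_len, len(s)) + 1)}


def passes_screen(candidate, target_fp, max_len):
    """Screening test: all features of the candidate occur in the target."""
    return fingerprint(candidate, max_len) <= target_fp


def grow(target_fp, alphabet, max_len=4, max_steps=100):
    """Greedily extend a seed one symbol at a time, keeping only
    extensions that still pass the screening test."""
    # Seed with the first symbol that passes the screen on its own.
    seed = next((a for a in alphabet
                 if passes_screen(a, target_fp, max_len)), None)
    if seed is None:
        return None
    current = seed
    for _ in range(max_steps):
        if fingerprint(current, max_len) == target_fp:
            return current  # reproduces every feature of the target
        # Try appending or prepending each symbol; take the first that passes.
        extensions = (c for a in alphabet for c in (current + a, a + current))
        accepted = next((c for c in extensions
                         if passes_screen(c, target_fp, max_len)), None)
        if accepted is None:
            break  # dead end: no single-symbol extension is permitted
        current = accepted
    return current  # best guess, possibly incomplete


target = "CCNCO"
print(grow(fingerprint(target), "CNO"))  # CCNCO
```

The toy shares the real limitation that features are capped in size: a chain of six "C" characters has the same toy fingerprint as a chain of four, so the search returns the shorter one.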
>
> Here's an implementation of that sketch, applied to the RDKit hash
> fingerprint:
>
> http://dalkescientific.com/rev_eng_fp.py
>
> It works well for small structures:
>
> % python rev_env_fp.py
> No SMILES given. Using caffeine.
> Current best guess is C=C with 2 bits of 759
> Current best guess is Cc=O with 6 bits of 759
> Found! Cn1c(=O)c2c(ncn2C)n(C)c1=O
>
> Here's aspirin:
>
> % python rev_env_fp.py 'O=C(C)Oc1ccccc1C(=O)O'
> Found! CC(=O)Oc1ccccc1C(=O)O
>
> Capsaicin is close, only missing a methyl in the tail.
>
> % python rev_env_fp.py 'O=C(NCc1cc(OC)c(O)cc1)CCCCC=CC(C)C'
> Current best guess is CNC(=O)CCCCC=CC(C)C with 100 bits of 384
> Current best guess is CC=CCCCCC(=O)NCc1ccc(O)c(c1)OC with 376 bits of 384
> Best guess is CC=CCCCCC(=O)NCc1ccc(O)c(c1)OC with 376 bits of 384
>
>
> For omeprazole it only finds half of the structure:
>
> % python rev_env_fp.py 'COc1ccc2nc([nH]c2c1)S(=O)Cc1ncc(C)c(OC)c1C'
> Current best guess is Cc1c(C[SH]=O)ncc(C)c1OC with 469 bits of 863
> Best guess is Cc1c(C[SH]=O)ncc(C)c1OC with 469 bits of 863
>
> For estradiol it gets stuck finding another cyclohexane instead of the
> cyclopentane:
>
> % python rev_env_fp.py 'C[C@]12CC[C@@H]3c4ccc(cc4CC[C@H]3[C@@H]1CC[C@@H]2O)O'
> Current best guess is CC1CCCC2CCC(O)C21C with 163 bits of 583
> Current best guess is CC12CCCC(C1)C1c3ccc(O)cc3CCC1C2 with 477 bits of 583
> Best guess is CC12CCCC(C1)C1c3ccc(O)cc3CCC1C2 with 477 bits of 583
>
>
> Note: it's currently set up to only consider the elements
> ["C", "c", "O", "o", "N", "n", "S", "s", "F", "Cl", "Br"]
>
> Edit the 'elements' list if you want to include more possibilities. Doing
> so makes the search more likely to run into a dead end.
>
>
> The current code assumes that when I grow by one atom, if fp(mol + new_atom)
> is a subset of the target fingerprint, then mol + new_atom is a subgraph of
> the target structure. That assumption does not always hold, so the search
> can commit to a wrong atom and get stuck.
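A small toy case shows why the subset test is a screen and not a proof of substructure. This uses strings in place of molecules and the set of substrings of length 1 or 2 as a stand-in fingerprint; it is an illustration under those assumptions, not a statement about how the RDKit fingerprint behaves on any particular molecule.

```python
def fingerprint(s, max_len=2):
    """Toy fingerprint: every substring of s with 1..max_len characters."""
    return {s[i:j]
            for i in range(len(s))
            for j in range(i + 1, min(i + max_len, len(s)) + 1)}


target = "CNCO"
candidate = "NCN"

# Every feature of the candidate (N, C, NC, CN) also occurs in the target...
print(fingerprint(candidate) <= fingerprint(target))  # True
# ...yet the candidate is not actually contained in the target.
print(candidate in target)                            # False
```

A search that accepts "NCN" here has taken a wrong turn it can only recover from by backtracking.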
>
> This can be resolved by setting up a search tree, but then it needs to be
> more careful about backtracking and pruning, and that's too much work for
> an evening of programming.
>
> Cheers,
>
>
> Andrew
> da...@dalkescientific.com
>
>
>
> ------------------------------------------------------------------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>