On Apr 21, 2018, at 01:55, Andrew Dalke <da...@dalkescientific.com> wrote:
> Hand-waving sketch: start with a carbon. Generate fingerprint. It should pass
> the screening test. If not, the structure contains no carbons, so repeat with
> other elements until you find an atom which passes. Successively either add
> an atom+bond or connect two existing atoms with a bond, fingerprint the
> result, and do the screening test. If it does not pass then that modification
> was not permitted. Use a breadth-first search which prioritizes branching and
> rings to avoid chains longer than the maximum enumeration size.
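For anyone who wants to play with the idea without RDKit, the quoted sketch can be tried on a toy model. What follows is only a minimal sketch under assumptions that are mine and not taken from the script: the names `fingerprint`, `grow`, `reverse_engineer`, `ELEMENTS` and `MAX_PATH` are hypothetical, the "fingerprint" is an unfolded set of atom-symbol paths rather than the RDKit hash fingerprint, bonds carry no bond order, and the search is greedy (take the passing modification that sets the most features) rather than breadth-first.

```python
from itertools import combinations

ELEMENTS = ("C", "N", "O")   # toy element list (assumption)
MAX_PATH = 4                 # longest enumerated path, in atoms (assumption)


def fingerprint(atoms, bonds):
    """Toy fingerprint: the set of all simple paths of up to MAX_PATH atoms,
    each written as a tuple of element symbols in a canonical direction."""
    nbrs = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        nbrs[a].append(b)
        nbrs[b].append(a)
    features = set()

    def walk(path):
        label = tuple(atoms[i] for i in path)
        features.add(min(label, label[::-1]))  # same feature in both directions
        if len(path) < MAX_PATH:
            for n in nbrs[path[-1]]:
                if n not in path:
                    walk(path + [n])

    for start in range(len(atoms)):
        walk([start])
    return frozenset(features)


def grow(atoms, bonds):
    """Yield every one-step modification: add an atom+bond, or close a ring."""
    n = len(atoms)
    for i in range(n):
        for element in ELEMENTS:
            yield atoms + (element,), bonds | {(i, n)}
    for i, j in combinations(range(n), 2):
        if (i, j) not in bonds:
            yield atoms, bonds | {(i, j)}


def reverse_engineer(target_fp, max_atoms=12):
    """Greedy growth guided only by the screening test fp(candidate) <= target."""
    for element in ELEMENTS:                     # find a seed atom that passes
        atoms, bonds = (element,), frozenset()
        if fingerprint(atoms, bonds) <= target_fp:
            break
    else:
        return None                              # no seed element passes
    while fingerprint(atoms, bonds) != target_fp:
        passing = [(a, b) for a, b in grow(atoms, bonds)
                   if len(a) <= max_atoms and fingerprint(a, b) <= target_fp]
        if not passing:
            break                                # dead end: return best guess
        atoms, bonds = max(passing, key=lambda m: len(fingerprint(*m)))
    return atoms, bonds


# A three-atom C-C-O chain as the hidden target; only its fingerprint is used.
target = fingerprint(("C", "C", "O"), frozenset({(0, 1), (1, 2)}))
atoms, bonds = reverse_engineer(target)
print(atoms, sorted(bonds))
```

Because the toy fingerprint records only which paths exist and not how many times, any chain of four or more carbons has the same fingerprint as butane, which is one reason the sketch talks about avoiding chains longer than the maximum enumeration size.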
Here's an implementation of the sketch quoted above, applied to the RDKit hash fingerprint:

  http://dalkescientific.com/rev_eng_fp.py

It works well for small structures:

  % python rev_env_fp.py
  No SMILES given. Using caffeine.
  Current best guess is C=C with 2 bits of 759
  Current best guess is Cc=O with 6 bits of 759
  Found! Cn1c(=O)c2c(ncn2C)n(C)c1=O

Here's aspirin:

  % python rev_env_fp.py 'O=C(C)Oc1ccccc1C(=O)O'
  Found! CC(=O)Oc1ccccc1C(=O)O

Capsaicin is close, only missing a methyl in the tail.

  % python rev_env_fp.py 'O=C(NCc1cc(OC)c(O)cc1)CCCCC=CC(C)C'
  Current best guess is CNC(=O)CCCCC=CC(C)C with 100 bits of 384
  Current best guess is CC=CCCCCC(=O)NCc1ccc(O)c(c1)OC with 376 bits of 384
  Best guess is CC=CCCCCC(=O)NCc1ccc(O)c(c1)OC with 376 bits of 384

For omeprazole it only finds half of the structure:

  % python rev_env_fp.py 'COc1ccc2nc([nH]c2c1)S(=O)Cc1ncc(C)c(OC)c1C'
  Current best guess is Cc1c(C[SH]=O)ncc(C)c1OC with 469 bits of 863
  Best guess is Cc1c(C[SH]=O)ncc(C)c1OC with 469 bits of 863

For estradiol it gets stuck finding another cyclohexane instead of the cyclopentane:

  % python rev_env_fp.py 'C[C@]12CC[C@@H]3c4ccc(cc4CC[C@H]3[C@@H]1CC[C@@H]2O)O'
  Current best guess is CC1CCCC2CCC(O)C21C with 163 bits of 583
  Current best guess is CC12CCCC(C1)C1c3ccc(O)cc3CCC1C2 with 477 bits of 583
  Best guess is CC12CCCC(C1)C1c3ccc(O)cc3CCC1C2 with 477 bits of 583

Note: it's currently set up to only consider the elements

  ["C", "c", "O", "o", "N", "n", "S", "s", "F", "Cl", "Br"]

Edit the 'elements' list if you want to include more possibilities, though
that makes it more likely to run into a dead end.

The current code assumes that when I grow by one atom, if fp(mol + new_atom)
is a subset of the target fingerprint, then mol + new_atom is a subgraph of the
target structure. When that assumption fails, the search gets stuck. This can
be resolved by setting up a search tree, but then it needs to be more careful
about backtracking and pruning, and that's too much work for an evening of
programming.
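To make the subset assumption concrete, here is a small illustration of the screening test on folded bit fingerprints and of how it can give a false positive. It is a sketch under stated assumptions, not how RDKit hashes anything: `fold` and `passes_screen` are hypothetical helpers, features are plain strings, and the "hash" is just the sum of character codes modulo the fingerprint size, chosen so the result can be checked by hand.

```python
def fold(features, nbits):
    """Fold a set of string features into an nbits-wide bitmask (toy hash)."""
    fp = 0
    for feature in features:
        fp |= 1 << (sum(map(ord, feature)) % nbits)
    return fp


def passes_screen(probe_fp, target_fp):
    """Screening test: every bit set in the probe is also set in the target."""
    return (probe_fp & target_fp) == probe_fp


target_features = {"C", "O", "CO"}   # the target has no nitrogen
probe_features = {"C", "N"}          # so this probe is not a substructure

for nbits in (64, 4):
    ok = passes_screen(fold(probe_features, nbits), fold(target_features, nbits))
    print(nbits, ok)
# prints:
# 64 False
# 4 True
```

With 64 bits the nitrogen lands on a bit the target does not set, so the probe is rejected. With only 4 bits "N" collides with "CO" and the probe passes even though it cannot be a subgraph of the target. A greedy search that accepts such a step has no way back, which is where a search tree with backtracking would help.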
Cheers,

  Andrew
  da...@dalkescientific.com

_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss