Hi all, I've spent the last while working on some techniques to improve the performance of SMARTS-based fingerprint generators. It's called "talus" and is available at https://hg.sr.ht/~dalke/talus .
It's able to improve the performance of Klekota-Roth fingerprint generation by about a factor of 12. These fingerprints have long been described as a slow to generate, eg, "PaDEL-descriptor: An open source software to calculate molecular descriptors and fingerprints" (2010) at https://onlinelibrary.wiley.com/doi/full/10.1002/jcc.21707 says "The slowest algorithms are the Klekota-Roth fingerprint and Klekota-Roth fingerprint count because they are matching 4860 SMARTS patterns for each molecule.", which they timed as taking 14x the time of MACCS-key generation. The fastest version is available at https://hg.sr.ht/~dalke/talus/browse/KlekotaRoth/kr_filtered_atomtypes.py?rev=tip , which takes a SMILES file and generates the fingerprints in chemfp's FPS format, as a standalone file which depends only on RDKit. ## How does it work? This effort comes from looking at the Klekota-Roth fingerprints (defined in the supplementary data for "Chemical substructures that enrich for biological activity", doi: 10.1093/bioinformatics/btn479, https://academic.oup.com/bioinformatics/article/24/21/2518/192573 and available with a few minor syntax changes in the CDK's Java sources), which contains 4,860 SMARTS strings including [!#1][CH2][CH]([!#1])[!#1] -and- OS(=O)(=O)c1ccc(NN=C(C=O)C=O)cc1 The direct translation into a set of if statements, conceptually like _pat123 = Chem.MolFromSmarts("[!#1][CH2][CH]([!#1])[!#1]") ... if mol.HasSubstructMatch(_pat123): fp.SetBit(123) takes 417 seconds in my standard benchmark of about 30,000 SMILES string, of which only 2.6 seconds is parsing the SMILES string and the rest is the SMARTS matches. I was able to speed this up by a factor of 12 using the following techniques: 1) create a filter based on atom types counts, something like: _at1_pat = Chem.MolFromSmarts("[!#1]") _at2_pat = Chem.MolFromSmarts("[CH2]") _at3_pat = Chem.MolFromSmarts("[CH]") ... num_at1 = len(mol.GetSubstructMatches(_at1_pat)) num_at2 = len(mol.GetSubstructMatches(_at2_pat)) num_at3 = len(mol.GetSubstructMatches(_at3_pat)) ... if (num_at1 >= 3 and num_at2 >= 1 and num_at3 >= 1 and mol.HasSubstructMatch(_pat123)): fp.SetBit(123) 2) Analyze the atom SMARTS to recognize that, for example, both "[CH2]" and "[CH]" will always match "[!#1]", so the minimum counts can be increased to: if (num_at1 >= 5 and num_at2 >= 1 and num_at3 >= 1 and mol.HasSubstructMatch(_pat123)): fp.SetBit(123) 3) Identify SMARTS prefixes, which provide a natural tree structure. For example, the last two SMARTS patterns in the Klekota-Roth keys are: SCCS SCCS(=O)=O There is no reason to test for "SCCS(=O)=O" if "SCCS" does not pass, in which case there's no need to repeat the check for "S" and "C" counts, resulting in something like: if (num_S >= 2 and num_C >= 2 and mol.HasSubstructMatch(_pat_4858)): fp.SetBit(4858) if (num_O >= 2 and mol.HasSubstructMatch(_pat_4859)): fp.SetBit(4859) 4) Improve the effectiveness of SMARTS prefixes The SMARTS patterns are generated by Daylight's canonicalization rules then sorted ASCII-betically, but the SMARTS prefix method works better if the SMARTS starts with a unlikely chain terminal. For example, bit 3994 (key 3995) is "COc1cccc(C=NNC(=O)CO)c1", but "OCC(NN=Cc1cccc(OC)c1)=O" is an equivalent SMARTS with a longer initial chain. 5) Identify SMARTS prefixes which can be inserted as a filter. Here's an example of what that looks like, with a bit number followed by the SMARTS pattern, where the "*" indicates that a pattern is only used for filtering: 2949 Br 3222 BrC * BrCC filter 7 patterns 3332 BrC(C)(Br)Br 3333 BrC(C)C(=O)N * BrCCC filter 3 patterns 3430 BrCC(C)O 2973 BrCCC=O 4683 BrCCC(NC(O)=O)=O 4227 BrC(C(N)NC=O)(Br)Br 4692 BrC(C(O)NC=O)(Br)Br This says the "Br" is one of the keys, so bits 3222, 3332, 3333, 3430, etc. will not be tested unless Br exists. It further notices that "BrCC" is a common prefix to 7 patterns, so on the assumption that the overhead of one rejection test (which should be usual case) saves the time needed to do 7 additional test, it adds that extra filter. The "BrCCC" is provide a further refinement. All told, this brought Klekota-Roth fingerprint generation down to 33.5 seconds, of which 2.3 seconds (7%) was for SMILES processing so another 10x performance gain may be possible. ## These gains are not necessarily portable These impressive performance gains are possible because of how the Klekota-Roth keys were generated. For the subset of the PubChem keys which can be handled by "HasSubstructMatch" to a SMARTS pattern, the overall performance is only 2x, not 12x. ## Possible future directions A clear direction for future improvement would be to build a decision tree based on all reasonable SMARTS subgraphs, tuned by match statistics from a representative selection of molecules. Another extension would be to handle minimum counts, like how "at least 2 rings of size 6", (expressed as "*~1~*~*~*~*~*~1" or "[R]@1@[R]@[R]@[R]@[R]@[R]@1") requires at least 7 ring atoms. Anyone thinking further along these lines may be interested in "Efficient matching of multiple chemical subgraphs" at https://www.nextmovesoftware.com/talks/Sayle_MultipleSmarts_ICCS_201106.pdf . I wanted a system which could generate a Python module, rather than a C/C++/Java library, resulting in different trade-offs. ## Methods to analyze atom and bond SMARTS terms Developing this package required building a parser for the atom and bond SMARTS terms so I could tell if one atom SMARTS is a ubset of another atoms SMARTS. (I let RDKit handle the full SMARTS parsing, then use QueryAtom.GetSmarts() or QueryBond.GetSmarts() to get the actual SMARTS terms). I think it may be of broader interest for anyone working with SMARTS as a syntax level. For example, the test driver takes a SMARTS string and gives a breakdown of the different components, and where that information came from in the SMARTS term: % python smarts_parse.py '[#6a]=@[PH+]' Pattern SMARTS: [#6a]=@[PH+] atoms[0]: [#6&a] -> [c;R;!X0] ^^ ^ elements: [c] ^ in_ring: [R] connectivities: [!X0] + from SMARTS topology atoms[1]: [P&H1&+] -> [P;H1;h0,h1;+1;!X0] ^ elements: [P] ^^ total_hcount: [H1] ^^ implicit_hcount: [h0,h1] ^ charges: [+1] connectivities: [!X0] + from SMARTS topology bonds[0]: =&@ (between atoms 0 and 1) -> '=;@' ^ bondtypes: [=] ^ in_ring: [@] This is able to figure out that "[#6a]" means it must be an aromatic carbon, which means it must be in a ring. It also knows from the SMARTS topology that there must be at least one bond (hence [!X0]). Were it a bit more clever, the "R" should tell it there are at least two bonds, both ring bonds, but that's for the future to fix. It also adds some additional constraints (which I conjectured would be useful atom typing) like how "H1" means the implicit hydrogen count must be only 0 or 1. Some of this work dates back to a SMARTS regular-expression based tokenizer I contributed to Brian Kelly's FROWNS project back in 2001 or so! See https://frowns.sourceforge.net/ . If you want to take this effort further, please contact me and I'll provide some help, thoughts, and advice! Andrew da...@dalkescientific.com _______________________________________________ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss