Hi all,

 I've spent the last while working on some techniques to improve the 
performance of SMARTS-based fingerprint generators. It's called "talus" and is 
available at https://hg.sr.ht/~dalke/talus .

It's able to improve the performance of Klekota-Roth fingerprint generation by 
about a factor of 12.

These fingerprints have long been described as a slow to generate, eg, 
"PaDEL-descriptor: An open source software to calculate molecular descriptors 
and fingerprints" (2010) at 
https://onlinelibrary.wiley.com/doi/full/10.1002/jcc.21707 says "The slowest 
algorithms are the Klekota-Roth fingerprint and Klekota-Roth fingerprint count 
because they are matching 4860 SMARTS patterns for each molecule.", which they 
timed as taking 14x the time of MACCS-key generation.

The fastest version is available at 
https://hg.sr.ht/~dalke/talus/browse/KlekotaRoth/kr_filtered_atomtypes.py?rev=tip
 , which takes a SMILES file and generates the fingerprints in chemfp's FPS 
format, as a standalone file which depends only on RDKit.

## How does it work?

This effort comes from looking at the Klekota-Roth fingerprints (defined in the 
supplementary data for "Chemical substructures that enrich for biological 
activity", doi: 10.1093/bioinformatics/btn479, 
https://academic.oup.com/bioinformatics/article/24/21/2518/192573 and available 
with a few minor syntax changes in the CDK's Java sources), which contains 
4,860 SMARTS strings including

[!#1][CH2][CH]([!#1])[!#1]
  -and-
OS(=O)(=O)c1ccc(NN=C(C=O)C=O)cc1

The direct translation into a set of if statements, conceptually like

  _pat123 = Chem.MolFromSmarts("[!#1][CH2][CH]([!#1])[!#1]")
     ...
  if mol.HasSubstructMatch(_pat123):
     fp.SetBit(123)

takes 417 seconds in my standard benchmark of about 30,000 SMILES string, of 
which only 2.6 seconds is parsing the SMILES string and the rest is the SMARTS 
matches.

I was able to speed this up by a factor of 12 using the following techniques:

1) create a filter based on atom types counts, something like:

  _at1_pat = Chem.MolFromSmarts("[!#1]")
  _at2_pat = Chem.MolFromSmarts("[CH2]")
  _at3_pat = Chem.MolFromSmarts("[CH]")
  ...
  num_at1 = len(mol.GetSubstructMatches(_at1_pat))
  num_at2 = len(mol.GetSubstructMatches(_at2_pat))
  num_at3 = len(mol.GetSubstructMatches(_at3_pat))
  ...
  if (num_at1 >= 3 and num_at2 >= 1 and num_at3 >= 1
      and mol.HasSubstructMatch(_pat123)):
     fp.SetBit(123)

2) Analyze the atom SMARTS to recognize that, for example, both "[CH2]" and 
"[CH]" will always match "[!#1]", so the minimum counts can be increased to:

  if (num_at1 >= 5 and num_at2 >= 1 and num_at3 >= 1
      and mol.HasSubstructMatch(_pat123)):
     fp.SetBit(123)

3) Identify SMARTS prefixes, which provide a natural tree structure.

For example, the last two SMARTS patterns in the Klekota-Roth keys are:

SCCS
SCCS(=O)=O

There is no reason to test for "SCCS(=O)=O" if "SCCS" does not pass, in which 
case there's no need to repeat the check for "S" and "C" counts, resulting in 
something like:

  if (num_S >= 2 and num_C >= 2
       and mol.HasSubstructMatch(_pat_4858)):
      fp.SetBit(4858)
      if (num_O >= 2
           and mol.HasSubstructMatch(_pat_4859)):
          fp.SetBit(4859)

4) Improve the effectiveness of SMARTS prefixes

The SMARTS patterns are generated by Daylight's canonicalization rules then 
sorted ASCII-betically, but the SMARTS prefix method works better if the SMARTS 
starts with a unlikely chain terminal. For example, bit 3994 (key 3995) is 
"COc1cccc(C=NNC(=O)CO)c1", but "OCC(NN=Cc1cccc(OC)c1)=O" is an equivalent 
SMARTS with a longer initial chain.

5) Identify SMARTS prefixes which can be inserted as a filter.

Here's an example of what that looks like, with a bit number followed by the 
SMARTS pattern, where the "*" indicates that a pattern is only used for 
filtering:

2949 Br
3222 BrC
* BrCC filter 7 patterns
3332 BrC(C)(Br)Br
3333 BrC(C)C(=O)N
* BrCCC filter 3 patterns
3430 BrCC(C)O
2973 BrCCC=O
4683 BrCCC(NC(O)=O)=O
4227 BrC(C(N)NC=O)(Br)Br
4692 BrC(C(O)NC=O)(Br)Br

This says the "Br" is one of the keys, so bits 3222, 3332, 3333, 3430, etc. 
will not be tested unless Br exists.

It further notices that "BrCC" is a common prefix to 7 patterns, so on the 
assumption that the overhead of one rejection test (which should be usual case) 
saves the time needed to do 7 additional test, it adds that extra filter.

The "BrCCC" is provide a further refinement.

All told, this brought Klekota-Roth fingerprint generation down to 33.5 
seconds, of which 2.3 seconds (7%) was for SMILES processing so another 10x 
performance gain may be possible.

## These gains are not necessarily portable

These impressive performance gains are possible because of how the Klekota-Roth 
keys were generated. For the subset of the PubChem keys which can be handled by 
"HasSubstructMatch" to a SMARTS pattern, the overall performance is only 2x, 
not 12x.

## Possible future directions

A clear direction for future improvement would be to build a decision tree 
based on all reasonable SMARTS subgraphs, tuned by match statistics from a 
representative selection of molecules. Another extension would be to handle 
minimum counts, like how "at least 2 rings of size 6", (expressed as 
"*~1~*~*~*~*~*~1" or "[R]@1@[R]@[R]@[R]@[R]@[R]@1") requires at least 7 ring 
atoms.

Anyone thinking further along these lines may be interested in "Efficient 
matching of multiple chemical subgraphs" at 
https://www.nextmovesoftware.com/talks/Sayle_MultipleSmarts_ICCS_201106.pdf .

I wanted a system which could generate a Python module, rather than a 
C/C++/Java library, resulting in different trade-offs.

## Methods to analyze atom and bond SMARTS terms

Developing this package required building a parser for the atom and bond SMARTS 
terms so I could tell if one atom SMARTS is a ubset of another atoms SMARTS. (I 
let RDKit handle the full SMARTS parsing, then use QueryAtom.GetSmarts() or 
QueryBond.GetSmarts() to get the actual SMARTS terms).

I think it may be of broader interest for anyone working with SMARTS as a 
syntax level.

For example, the test driver takes a SMARTS string and gives a breakdown of the 
different components, and where that information came from in the SMARTS term:

% python smarts_parse.py '[#6a]=@[PH+]'
Pattern SMARTS: [#6a]=@[PH+]
 atoms[0]: [#6&a] -> [c;R;!X0]
            ^^ ^  elements: [c]
               ^  in_ring: [R]
                  connectivities: [!X0]
                   + from SMARTS topology
 atoms[1]: [P&H1&+] -> [P;H1;h0,h1;+1;!X0]
            ^       elements: [P]
              ^^    total_hcount: [H1]
              ^^    implicit_hcount: [h0,h1]
                 ^  charges: [+1]
                    connectivities: [!X0]
                     + from SMARTS topology
 bonds[0]: =&@ (between atoms 0 and 1) -> '=;@'
           ^   bondtypes: [=]
             ^ in_ring: [@]

This is able to figure out that "[#6a]" means it must be an aromatic carbon, 
which means it must be in a ring. It also knows from the SMARTS topology that 
there must be at least one bond (hence [!X0]). Were it a bit more clever, the 
"R" should tell it there are at least two bonds, both ring bonds, but that's 
for the future to fix.

It also adds some additional constraints (which I conjectured would be useful 
atom typing) like how "H1" means the implicit hydrogen count must be only 0 or 
1.

Some of this work dates back to a SMARTS regular-expression based tokenizer I 
contributed to Brian Kelly's FROWNS project back in 2001 or so! See 
https://frowns.sourceforge.net/ .

If you want to take this effort further, please contact me and I'll provide 
some help, thoughts, and advice!

                                        Andrew
                                        da...@dalkescientific.com




_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Reply via email to