Re: [Rdkit-discuss] Extracting SMILES from text

Andrew Dalke Fri, 02 Dec 2016 11:37:44 -0800

On Dec 2, 2016, at 11:11 AM, Greg Landrum wrote:
> An initial start on some regexps that match SMILES is here: 
> https://gist.github.com/lsauer/1312860/264ae813c2bd2c27a769d261c8c6b38da34e22fb
> 
> that may also be useful



I've put together a gnarlier regular expression to find possible SMILES 
strings. It's configured to require at least 4 atom terms, but that's easy to 
change: the "{3,}" (three or more atoms after the first) can be set as desired.

It follows the SMILES specification a bit more closely, which means there 
should be fewer false positives than with the regular expression Greg pointed 
out.
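
As a rough sketch of how the minimum atom count could be made a parameter
instead of a hard-coded "{3,}": the ATOM and CONNECTOR strings below are
simplified stand-ins written for this illustration, and make_smiles_pattern
is a hypothetical helper name. Neither comes from the attached file, whose
patterns cover much more of the grammar.

```python
import re

# Simplified stand-ins for the atom and connector terms. These are
# assumptions for illustration only; the patterns in the attached file
# cover far more of the grammar.
ATOM = r"(?:Cl|Br|[BCNOSPFIbcnosp*]|\[[^]]+\])(?:[-=#$/\\]?(?:[0-9]|%[0-9][0-9]))*"
CONNECTOR = r"(?:\)*(?:\([-=#$/\\]?|[-.=#$/\\])?)"


def make_smiles_pattern(min_atoms=4):
    """Compile a pattern requiring at least `min_atoms` atom terms."""
    if min_atoms < 1:
        raise ValueError("min_atoms must be at least 1")
    # The first atom is matched on its own, so the repeated group
    # has to match (min_atoms - 1) more times.
    template = r"(?<!\w)%s(?:%s%s){%d,}(?!\w)"
    return re.compile(template % (ATOM, CONNECTOR, ATOM, min_atoms - 1))
```

With this sketch, make_smiles_pattern(3) accepts a three-atom term such as
"CCO", while the default of 4 skips it.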

The file that constructs the regular expression, along with an example driver, 
is attached. Here are its contents, followed by sample output:

from __future__ import print_function

# Identify likely SMILES strings embedded in a block of text.

# This constructs a rather complicated regular expression which does a
# decent job of matching the OpenSMILES grammar. It assumes that a
# SMILES string looks like:
#
#   atom ( connector? atom)*
#
# where the "atom" is the organic subset or a 'bracket' atom term from
# the SMILES followed by any optional closures, and where the "connector"
# is the possible combinations of open/close parentheses, dot disconnect,
# or bond.

# It does not attempt to balance parentheses, ensure matching ring
# closures, or handle aromaticity. Those cannot be done with a regular
# expression.

# Written in 2016 by Andrew Dalke <da...@dalkescientific.com>

import re

# Match the atom term and any closures
atom_re = r"""
(
(
 Cl? |             # Cl and Br are part of the organic subset
 Br? |
 [NOSPFIbcnosp*] |  # as are these single-letter elements

     # bracket atom
 \[\d*              # optional atomic mass
   (                # valid element names
    C[laroudsemf]? |
    Os?|N[eaibdpos]? |
    S[icernbmg]? |
    P[drmtboau]? |
    H[eofgas]? |
    c|n|o|s|p |
    A[lrsgutcm] |
    B[eraik]? |
    Dy|E[urs] |
    F[erm]? |
    G[aed] |
    I[nr]? |
    Kr? |
    L[iaur] |
    M[gnodt] |
    R[buhenaf] |
    T[icebmalh] |
    U|V|W|Xe |
    Yb?|Z[nr]
   )
   [^]]*   # ignore anything up to the ']'
\]
)
   # allow 0 or more closures directly after any atom
(
  [-=#$/\\]?  # optional bond type
  (
    [0-9] |        # single digit closure
    (%[0-9][0-9])  # two digit closure
  )
) *
)
"""

# Things that can go between atoms. This is complicated. Some
# of the patterns are:
#   C))C
#   C(C
#   C(=C)
#   C)))))=C
#   C=C
#   C.C

connection_re = r"""
(
 (
  \( [-=#$/\\]?   # a '(', which can have an optional bond (no dot)
 ) | (
   \)*   # any number of close parens, followed by
   (
     ( \( [-=#$/\\]? ) |  # an open parens and optional bond (no dot)
     [-.=#$/\\]?          # or a dot disconnect or bond ('-' first, so it is literal)
   )
 )
)
"""
# The full regular expression. Use zero-width assertions to ensure that
# the putative SMILES is not inside a larger "word".
smiles_re = r"""
(?<!\w)   # this isn't a SMILES if there are letters or numbers before the term

(
%(atom)s
(
  %(connection)s?
  %(atom)s
){3,}  # must have at least 4 atoms

(?!\w)   # this isn't a SMILES if there are letters or numbers after the term

)""" % dict(
    atom = atom_re,
    connection = connection_re)

possible_smiles_pat = re.compile(smiles_re, re.X)

def find_possible_smiles(text):
    return [(m.start(), m.end(), m.group())
                    for m in possible_smiles_pat.finditer(text)]

def process(filename, text):
    # Print one line per hit: the character offsets, then the matched text.
    # The filename prefix is only shown when one is given.
    if filename is None:
        for start, end, smiles in find_possible_smiles(text):
            print("%d:%d %r" % (start, end, smiles))
    else:
        for start, end, smiles in find_possible_smiles(text):
            print("%s:%d:%d %r" % (filename, start, end, smiles))
        
    
    
def main():
    import sys
    filenames = sys.argv[1:]
    if len(filenames) == 0:
        text = sys.stdin.read()
        process(None, text)
    elif len(filenames) == 1:
        with open(filenames[0]) as infile:
            text = infile.read()
            process(None, text)
    else:
        for filename in filenames:
            with open(filename) as infile:
                text = infile.read()
                process(filename, text)
    

if __name__ == "__main__":
    main()


% python detect_smiles.py ~/talks/*.txt
/Users/dalke/talks/ICCS_2014_paper.txt:528:532 'IOPS'
/Users/dalke/talks/ICCS_2014_paper.txt:30150:30183 'CC12CCC3C(CCC4=CC(O)CCC34C)C1CCC2'
/Users/dalke/talks/ICCS_2014_paper2.txt:3270:3274 'CBCC'
/Users/dalke/talks/ICCS_2014_paper2.txt:10229:10239 'CC(=O)[O-]'
/Users/dalke/talks/ICCS_2014_paper2.txt:32766:32770 'ISIS'
/Users/dalke/talks/Sheffield2013.txt:25002:25013 'C1=CC=CC=C1'
/Users/dalke/talks/Sheffield2013.txt:25039:25047 'c1ccccc1'
/Users/dalke/talks/Sheffield_2016.txt:2767:2771 'CBCC'
/Users/dalke/talks/Sheffield_2016.txt:10295:10301 'OOOOO0'
/Users/dalke/talks/Sheffield_2016_talk.txt:7302:7306 'CBCC'
/Users/dalke/talks/Sheffield_2016_talk.txt:7564:7568 'CBCC'
/Users/dalke/talks/Sheffield_2016_talk.txt:7716:7720 'CBCC'
/Users/dalke/talks/Sheffield_2016_v2.txt:2874:2878 'soon'
/Users/dalke/talks/Sheffield_2016_v2.txt:7312:7317 'OOOOO'
/Users/dalke/talks/Sheffield_2016_v2.txt:22770:22774 'ICCS'
/Users/dalke/talks/Sheffield_2016_v3.txt:2982:2986 'soon'
/Users/dalke/talks/Sheffield_2016_v3.txt:7627:7632 'OOOOO'
/Users/dalke/talks/Sheffield_2016_v3.txt:24546:24550 'ICCS'
/Users/dalke/talks/tdd_part_2.txt:7547:7551 'scop'
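
Since the regular expression cannot balance parentheses or pair up ring
closures, a small follow-up check can weed out some hits. The sketch below is
one possible way to do that and is not part of the attached file;
passes_structure_checks is a hypothetical name. It assumes ring closures
outside of bracket atoms are either a single digit or "%" plus two digits.

```python
DIGITS = "0123456789"


def passes_structure_checks(smiles):
    """Return True if parentheses balance and every ring closure is paired."""
    depth = 0
    open_rings = set()
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            # Skip the whole bracket atom so its digits (isotope, charge,
            # hydrogen count) are not mistaken for ring closures.
            end = smiles.find("]", i)
            if end == -1:
                return False
            i = end + 1
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch == "%":
            # Two-digit ring closure, e.g. %12
            label = smiles[i + 1:i + 3]
            if len(label) != 2 or any(c not in DIGITS for c in label):
                return False
            open_rings ^= {int(label)}  # open if new, close if already open
            i += 3
            continue
        elif ch in DIGITS:
            open_rings ^= {int(ch)}
        i += 1
    return depth == 0 and not open_rings
```

This rejects 'OOOOO0' from the output above, because ring closure 0 is never
closed, and keeps 'c1ccccc1'. Word-like hits such as 'IOPS' or 'soon' still
pass, so it is no substitute for handing the candidates to a real SMILES
parser.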

You can also modify the code to process the text line by line rather than as 
one entire block, as I did.
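
A minimal sketch of what the line-by-line variant might look like. The
function name find_possible_smiles_by_line and the demo_pat stand-in are
assumptions made for this example; in practice you would pass in the compiled
possible_smiles_pat from the script.

```python
import re


def find_possible_smiles_by_line(lines, pattern):
    """Yield (line_number, start, end, text) for each match, line by line."""
    for lineno, line in enumerate(lines, 1):
        for m in pattern.finditer(line):
            # start and end are offsets within this line, not the whole file
            yield lineno, m.start(), m.end(), m.group()


# Stand-in pattern for the demonstration only (an assumption); it is much
# cruder than the pattern built by the script.
demo_pat = re.compile(r"(?<!\w)(?:[CNOcno][0-9]?){4,}(?!\w)")

hits = list(find_possible_smiles_by_line(
    ["no hit here", "benzene: c1ccccc1"], demo_pat))
```

Because an open file iterates over its lines, the file object can be passed
directly as the lines argument. One side effect is that a bracket term can no
longer run across a line break, which the "[^]]*" in the atom pattern
otherwise allows.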


As others have pointed out, this is a well-trodden path. Follow their warnings 
and advice.

Also, I didn't fully test it.



                                Andrew
                                da...@dalkescientific.com


P.S.

Here's the regular expression:

(?<!\w)   # this isn't a SMILES if there are letters or numbers before the term

(

(
(
 Cl? |             # Cl and Br are part of the organic subset
 Br? |
 [NOSPFIbcnosp*] |  # as are these single-letter elements

     # bracket atom
 \[\d*              # optional atomic mass
   (                # valid element names
    C[laroudsemf]? |
    Os?|N[eaibdpos]? |
    S[icernbmg]? |
    P[drmtboau]? |
    H[eofgas]? |
    c|n|o|s|p |
    A[lrsgutcm] |
    B[eraik]? |
    Dy|E[urs] |
    F[erm]? |
    G[aed] |
    I[nr]? |
    Kr? |
    L[iaur] |
    M[gnodt] |
    R[buhenaf] |
    T[icebmalh] |
    U|V|W|Xe |
    Yb?|Z[nr]
   )
   [^]]*   # ignore anything up to the ']'
\]
)
   # allow 0 or more closures directly after any atom
(
  [-=#$/\\]?  # optional bond type
  (
    [0-9] |        # single digit closure
    (%[0-9][0-9])  # two digit closure
  )
) *
)

(
  
(
 (
  \( [-=#$/\\]?   # a '(', which can have an optional bond (no dot)
 ) | (
   \)*   # any number of close parens, followed by
   (
     ( \( [-=#$/\\]? ) |  # an open parens and optional bond (no dot)
     [-.=#$/\\]?          # or a dot disconnect or bond ('-' first, so it is literal)
   )
 )
)
?
  
(
(
 Cl? |             # Cl and Br are part of the organic subset
 Br? |
 [NOSPFIbcnosp*] |  # as are these single-letter elements

     # bracket atom
 \[\d*              # optional atomic mass
   (                # valid element names
    C[laroudsemf]? |
    Os?|N[eaibdpos]? |
    S[icernbmg]? |
    P[drmtboau]? |
    H[eofgas]? |
    c|n|o|s|p |
    A[lrsgutcm] |
    B[eraik]? |
    Dy|E[urs] |
    F[erm]? |
    G[aed] |
    I[nr]? |
    Kr? |
    L[iaur] |
    M[gnodt] |
    R[buhenaf] |
    T[icebmalh] |
    U|V|W|Xe |
    Yb?|Z[nr]
   )
   [^]]*   # ignore anything up to the ']'
\]
)
   # allow 0 or more closures directly after any atom
(
  [-=#$/\\]?  # optional bond type
  (
    [0-9] |        # single digit closure
    (%[0-9][0-9])  # two digit closure
  )
) *
)

){3,}  # must have at least 4 atoms

(?!\w)   # this isn't a SMILES if there are letters or numbers after the term

)
