Re: [Rdkit-discuss] Extracting SMILES from text

Brian Kelley Fri, 02 Dec 2016 13:09:56 -0800

Here is a very old version of Andrew's parser in code form:

http://frowns.cvs.sourceforge.net/viewvc/frowns/frowns/smiles_parsers/Smiles.py?revision=1.1.1.1&content-type=text%2Fplain


that I used in frowns more than a decade ago.  It was fairly well tested on the 
sigma catalog back in the day.  It might be fun to resurrect it and use it in 
some form.

----
Brian Kelley

> On Dec 2, 2016, at 2:36 PM, Andrew Dalke <[email protected]> wrote:
> 
>> On Dec 2, 2016, at 11:11 AM, Greg Landrum wrote:
>> An initial start on some regexps that match SMILES is here: 
>> https://gist.github.com/lsauer/1312860/264ae813c2bd2c27a769d261c8c6b38da34e22fb
>> 
>> that may also be useful
> 
> 
> I've put together a more gnarly regular expression to find possible SMILES 
> strings. It's configured for at least 4 atom terms, but that's easy to change 
> (there's a "{3,}" which can be changed as desired.)
> 
> It follows the SMILES specification a bit more closely, which means there 
> should be fewer false positives than the regular expression Greg pointed out.
> 
> The file which constructs the regular expression, and an example driver, is 
> attached. Here's what the output looks like:
> 
> <detect_smiles.py>
> 
> 
> % python detect_smiles.py ~/talks/*.txt
> /Users/dalke/talks/ICCS_2014_paper.txt:528:532 'IOPS'
> /Users/dalke/talks/ICCS_2014_paper.txt:30150:30183 'CC12CCC3C(CCC4=CC(O)CCC34C)C1CCC2'
> /Users/dalke/talks/ICCS_2014_paper2.txt:3270:3274 'CBCC'
> /Users/dalke/talks/ICCS_2014_paper2.txt:10229:10239 'CC(=O)[O-]'
> /Users/dalke/talks/ICCS_2014_paper2.txt:32766:32770 'ISIS'
> /Users/dalke/talks/Sheffield2013.txt:25002:25013 'C1=CC=CC=C1'
> /Users/dalke/talks/Sheffield2013.txt:25039:25047 'c1ccccc1'
> /Users/dalke/talks/Sheffield_2016.txt:2767:2771 'CBCC'
> /Users/dalke/talks/Sheffield_2016.txt:10295:10301 'OOOOO0'
> /Users/dalke/talks/Sheffield_2016_talk.txt:7302:7306 'CBCC'
> /Users/dalke/talks/Sheffield_2016_talk.txt:7564:7568 'CBCC'
> /Users/dalke/talks/Sheffield_2016_talk.txt:7716:7720 'CBCC'
> /Users/dalke/talks/Sheffield_2016_v2.txt:2874:2878 'soon'
> /Users/dalke/talks/Sheffield_2016_v2.txt:7312:7317 'OOOOO'
> /Users/dalke/talks/Sheffield_2016_v2.txt:22770:22774 'ICCS'
> /Users/dalke/talks/Sheffield_2016_v3.txt:2982:2986 'soon'
> /Users/dalke/talks/Sheffield_2016_v3.txt:7627:7632 'OOOOO'
> /Users/dalke/talks/Sheffield_2016_v3.txt:24546:24550 'ICCS'
> /Users/dalke/talks/tdd_part_2.txt:7547:7551 'scop'
> 
> You can also modify the code for line-by-line processing rather than an 
> entire block of text like I did.
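A minimal sketch of the line-by-line variant mentioned above. The attached detect_smiles.py is not reproduced in this message, so the pattern here is a deliberately simplified stand-in (organic-subset atoms only, no branches or bracket atoms), and the name scan_lines and the line:start:end report format are assumptions made for illustration.

```python
import re

# Simplified stand-in pattern (an assumption, not the full expression from
# this thread): organic-subset atoms with optional ring-closure digits,
# joined by optional -, = or # bonds, at least 4 atoms in total.
_ATOM = r"(?:Cl|Br|[BCNOSPFIbcnosp])[0-9]*"
SIMPLE_SMILES = re.compile(
    r"(?<!\w)" + _ATOM + r"(?:[-=#]?" + _ATOM + r"){3,}(?!\w)"
)


def scan_lines(lines, pattern=SIMPLE_SMILES):
    """Yield (line_number, start, end, text) for each candidate match.

    Line numbers start at 1; start and end are offsets within that line.
    """
    for lineno, line in enumerate(lines, 1):
        for m in pattern.finditer(line):
            yield lineno, m.start(), m.end(), m.group(0)


sample = [
    "Benzene is c1ccccc1 here.",
    "No structures on this line.",
    "Kekule form: C1=CC=CC=C1",
]
for lineno, start, end, text in scan_lines(sample):
    print("%d:%d:%d %r" % (lineno, start, end, text))
```

An open file object can be passed as `lines`, since iterating over a file yields one line at a time. Any compiled pattern can be swapped in through the `pattern` argument.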
> 
> 
> As others have pointed out, this is a well-trodden path. Follow their 
> warnings and advice.
> 
> Also, I didn't fully test it.
> 
> 
> 
>                Andrew
>                [email protected]
> 
> 
> P.S.
> 
> Here's the regular expression:
> 
> (?<!\w)   # this isn't a SMILES if there are letters or numbers before the term
> 
> (
> 
> (
> (
> Cl? |             # Cl and Br are part of the organic subset
> Br? |
> [NOSPFIbcnosp*] |  # as are these single-letter elements
> 
>     # bracket atom
> \[\d*              # optional atomic mass
>   (                # valid element names
>    C[laroudsemf]? |
>    Os?|N[eaibdpos]? |
>    S[icernbmg]? |
>    P[drmtboau]? |
>    H[eofgas]? |
>    c|n|o|s|p |
>    A[lrsgutcm] |
>    B[eraik]? |
>    Dy|E[urs] |
>    F[erm]? |
>    G[aed] |
>    I[nr]? |
>    Kr? |
>    L[iaur] |
>    M[gnodt] |
>    R[buhenaf] |
>    T[icebmalh] |
>    U|V|W|Xe |
>    Yb?|Z[nr]
>   )
>   [^]]*   # ignore anything up to the ']'
> \]
> )
>   # allow 0 or more closures directly after any atom
> (
>  [-=#$/\\]?  # optional bond type
>  (
>    [0-9] |        # single digit closure
>    (%[0-9][0-9])  # two digit closure
>  )
> ) *
> )
> 
> (
> 
> (
> (
>  \( [-=#$/\\]?   # a '(', which can have an optional bond (no dot)
> ) | (
>   \)*   # any number of close parens, followed by
>   (
>     ( \( [-=#$/\\]? ) |  # an open parens and optional bond (no dot)
>     [-.=#$/\\]?          # or a dot disconnect or bond
>   )
> )
> )
> ?
> 
> (
> (
> Cl? |             # Cl and Br are part of the organic subset
> Br? |
> [NOSPFIbcnosp*] |  # as are these single-letter elements
> 
>     # bracket atom
> \[\d*              # optional atomic mass
>   (                # valid element names
>    C[laroudsemf]? |
>    Os?|N[eaibdpos]? |
>    S[icernbmg]? |
>    P[drmtboau]? |
>    H[eofgas]? |
>    c|n|o|s|p |
>    A[lrsgutcm] |
>    B[eraik]? |
>    Dy|E[urs] |
>    F[erm]? |
>    G[aed] |
>    I[nr]? |
>    Kr? |
>    L[iaur] |
>    M[gnodt] |
>    R[buhenaf] |
>    T[icebmalh] |
>    U|V|W|Xe |
>    Yb?|Z[nr]
>   )
>   [^]]*   # ignore anything up to the ']'
> \]
> )
>   # allow 0 or more closures directly after any atom
> (
>  [-=#$/\\]?  # optional bond type
>  (
>    [0-9] |        # single digit closure
>    (%[0-9][0-9])  # two digit closure
>  )
> ) *
> )
> 
> ){3,}  # must have at least 4 atoms
> 
> (?!\w)   # this isn't a SMILES if there are letters or numbers after the term
> 
> )
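One way the expression above might be assembled and used from Python is sketched below. It is a restructured version, not the attached detect_smiles.py: the repeated atom term is factored into a shared piece, the two link alternatives are merged into one, and the minimum atom count (the "{3,}") becomes a parameter. The names BOND, ELEMENT, ATOM_TERM, LINK and build_smiles_regex are hypothetical, and like the original this has not been tested beyond a few strings.

```python
import re

# Pieces of the expression, written for re.VERBOSE (whitespace is ignored
# outside character classes).
BOND = r"[-=#$/\\]?"
ELEMENT = r"""
    C[laroudsemf]? | Os? | N[eaibdpos]? | S[icernbmg]? | P[drmtboau]? |
    H[eofgas]? | c | n | o | s | p | A[lrsgutcm] | B[eraik]? | Dy | E[urs] |
    F[erm]? | G[aed] | I[nr]? | Kr? | L[iaur] | M[gnodt] | R[buhenaf] |
    T[icebmalh] | U | V | W | Xe | Yb? | Z[nr]
"""
# Organic-subset atom or bracket atom, followed by 0 or more ring closures.
ATOM_TERM = (
    r"(?: Cl? | Br? | [NOSPFIbcnosp*] | \[ \d* (?:" + ELEMENT + r") [^]]* \] )"
    r"(?: " + BOND + r" (?: [0-9] | %[0-9][0-9] ) )*"
)
# What may sit between two atom terms: close parens, then either an open
# paren with an optional bond, or an optional dot or bond. The '-' comes
# first in the class so that it is a literal rather than a range.
LINK = r"(?: \)* (?: \( " + BOND + r" | [-.=#$/\\]? ) )"


def build_smiles_regex(min_atoms=4):
    """Compile a detector for runs of at least `min_atoms` atom terms."""
    if min_atoms < 1:
        raise ValueError("min_atoms must be at least 1")
    pattern = (
        r"(?<!\w)"  # no letter or digit directly before the term
        + ATOM_TERM
        + r"(?:" + LINK + ATOM_TERM + r"){%d,}" % (min_atoms - 1)
        + r"(?!\w)"  # no letter or digit directly after the term
    )
    return re.compile(pattern, re.VERBOSE)


text = "benzene is c1ccccc1 and acetate is CC(=O)[O-] ok"
for m in build_smiles_regex().finditer(text):
    print("%d:%d %r" % (m.start(), m.end(), m.group(0)))
```

Lowering `min_atoms` finds shorter strings such as 'CCO' at the cost of more false positives from ordinary words.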
> 
> ------------------------------------------------------------------------------
> Check out the vibrant tech community on one of the world's most 
> engaging tech sites, SlashDot.org! http://sdm.link/slashdot
> _______________________________________________
> Rdkit-discuss mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

