Here is a very old version of Andrew's parser in code form:
http://frowns.cvs.sourceforge.net/viewvc/frowns/frowns/smiles_parsers/Smiles.py?revision=1.1.1.1&content-type=text%2Fplain
that I used in frowns more than a decade ago. It was fairly well tested on the
Sigma catalog back in the day. It might be fun to resurrect it and use it in
some form.
----
Brian Kelley
> On Dec 2, 2016, at 2:36 PM, Andrew Dalke <da...@dalkescientific.com> wrote:
>
>> On Dec 2, 2016, at 11:11 AM, Greg Landrum wrote:
>> An initial start on some regexps that match SMILES is here:
>> https://gist.github.com/lsauer/1312860/264ae813c2bd2c27a769d261c8c6b38da34e22fb
>>
>> that may also be useful
>
>
> I've put together a more gnarly regular expression to find possible SMILES
> strings. It's configured for at least 4 atom terms, but that's easy to change
> (there's a "{3,}" which can be changed as desired.)
>
> It follows the SMILES specification a bit more closely, which means there
> should be fewer false positives than the regular expression Greg pointed out.
>
> The file which constructs the regular expression, and an example driver, is
> attached. Here's what the output looks like:
>
> <detect_smiles.py>
>
>
> % python detect_smiles.py ~/talks/*.txt
> /Users/dalke/talks/ICCS_2014_paper.txt:528:532 'IOPS'
> /Users/dalke/talks/ICCS_2014_paper.txt:30150:30183 'CC12CCC3C(CCC4=CC(O)CCC34C)C1CCC2'
> /Users/dalke/talks/ICCS_2014_paper2.txt:3270:3274 'CBCC'
> /Users/dalke/talks/ICCS_2014_paper2.txt:10229:10239 'CC(=O)[O-]'
> /Users/dalke/talks/ICCS_2014_paper2.txt:32766:32770 'ISIS'
> /Users/dalke/talks/Sheffield2013.txt:25002:25013 'C1=CC=CC=C1'
> /Users/dalke/talks/Sheffield2013.txt:25039:25047 'c1ccccc1'
> /Users/dalke/talks/Sheffield_2016.txt:2767:2771 'CBCC'
> /Users/dalke/talks/Sheffield_2016.txt:10295:10301 'OOOOO0'
> /Users/dalke/talks/Sheffield_2016_talk.txt:7302:7306 'CBCC'
> /Users/dalke/talks/Sheffield_2016_talk.txt:7564:7568 'CBCC'
> /Users/dalke/talks/Sheffield_2016_talk.txt:7716:7720 'CBCC'
> /Users/dalke/talks/Sheffield_2016_v2.txt:2874:2878 'soon'
> /Users/dalke/talks/Sheffield_2016_v2.txt:7312:7317 'OOOOO'
> /Users/dalke/talks/Sheffield_2016_v2.txt:22770:22774 'ICCS'
> /Users/dalke/talks/Sheffield_2016_v3.txt:2982:2986 'soon'
> /Users/dalke/talks/Sheffield_2016_v3.txt:7627:7632 'OOOOO'
> /Users/dalke/talks/Sheffield_2016_v3.txt:24546:24550 'ICCS'
> /Users/dalke/talks/tdd_part_2.txt:7547:7551 'scop'
>
> You can also modify the code for line-by-line processing rather than an
> entire block of text like I did.
>
>
> As others have pointed out, this is a well-trodden path. Follow their
> warnings and advice.
>
> Also, I didn't fully test it.
>
>
>
> Andrew
> da...@dalkescientific.com
>
>
> P.S.
>
> Here's the regular expression:
>
> (?<!\w) # this isn't a SMILES if there are letters or numbers before the term
>
> (
>
> (
> (
> Cl? | # Cl and Br are part of the organic subset
> Br? |
> [NOSPFIbcnosp*] | # as are these single-letter elements
>
> # bracket atom
> \[\d* # optional atomic mass
> ( # valid element names
> C[laroudsemf]? |
> Os?|N[eaibdpos]? |
> S[icernbmg]? |
> P[drmtboau]? |
> H[eofgas]? |
> c|n|o|s|p |
> A[lrsgutcm] |
> B[eraik]? |
> Dy|E[urs] |
> F[erm]? |
> G[aed] |
> I[nr]? |
> Kr? |
> L[iaur] |
> M[gnodt] |
> R[buhenaf] |
> T[icebmalh] |
> U|V|W|Xe |
> Yb?|Z[nr]
> )
> [^]]* # ignore anything up to the ']'
> \]
> )
> # allow 0 or more closures directly after any atom
> (
> [-=#$/\\]? # optional bond type
> (
> [0-9] | # single digit closure
> (%[0-9][0-9]) # two digit closure
> )
> ) *
> )
>
> (
>
> (
> (
> \( [-=#$/\\]? # a '(', which can have an optional bond (no dot)
> ) | (
> \)* # any number of close parens, followed by
> (
> ( \( [-=#$/\\]? ) | # an open parens and optional bond (no dot)
> [-.=#$/\\]? # or a dot disconnect or bond
> )
> )
> )
> ?
>
> (
> (
> Cl? | # Cl and Br are part of the organic subset
> Br? |
> [NOSPFIbcnosp*] | # as are these single-letter elements
>
> # bracket atom
> \[\d* # optional atomic mass
> ( # valid element names
> C[laroudsemf]? |
> Os?|N[eaibdpos]? |
> S[icernbmg]? |
> P[drmtboau]? |
> H[eofgas]? |
> c|n|o|s|p |
> A[lrsgutcm] |
> B[eraik]? |
> Dy|E[urs] |
> F[erm]? |
> G[aed] |
> I[nr]? |
> Kr? |
> L[iaur] |
> M[gnodt] |
> R[buhenaf] |
> T[icebmalh] |
> U|V|W|Xe |
> Yb?|Z[nr]
> )
> [^]]* # ignore anything up to the ']'
> \]
> )
> # allow 0 or more closures directly after any atom
> (
> [-=#$/\\]? # optional bond type
> (
> [0-9] | # single digit closure
> (%[0-9][0-9]) # two digit closure
> )
> ) *
> )
>
> ){3,} # must have at least 4 atoms
>
> (?!\w) # this isn't a SMILES if there are letters or numbers after the term
>
> )
>
> ------------------------------------------------------------------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, SlashDot.org! http://sdm.link/slashdot
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
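
For anyone who wants to experiment without the attachment, here is a minimal
sketch of how a pattern like the one quoted above could be assembled and run
with Python's re module. It is not Andrew's detect_smiles.py: the names ATOM,
CONNECT, build_pattern, find_smiles and find_smiles_by_line are my own
hypothetical choices, and the atom term is deliberately cut down (any bracket
atom is accepted without checking the element symbol), so it will likely give
more false positives than the full expression.

```python
import re

# Hypothetical, reduced atom term: an organic-subset atom or any bracket
# atom, followed by zero or more ring closures. The element list from the
# full expression is intentionally omitted here.
ATOM = r"""
(?:
    Cl? | Br? | [NOSPFIbcnosp*] |
    \[ [^]]+ \]
)
(?: [-=#$/\\]? (?: [0-9] | %[0-9][0-9] ) )*
"""

# Optional connector between atoms: close parens, then either an open
# paren with an optional bond, or an optional bond / dot disconnect.
CONNECT = r"""
(?: \)* (?: \( [-=#$/\\]? | [-.=#$/\\]? ) )
"""


def build_pattern(min_atoms=4):
    """Compile a verbose regex requiring at least `min_atoms` atom terms."""
    extra = min_atoms - 1  # first atom is matched separately
    return re.compile(
        rf"(?<!\w) ( {ATOM} (?: {CONNECT} {ATOM} ){{{extra},}} ) (?!\w)",
        re.VERBOSE,
    )


def find_smiles(text, pattern):
    """Scan one block of text; return (start, end, candidate) tuples."""
    return [(m.start(1), m.end(1), m.group(1)) for m in pattern.finditer(text)]


def find_smiles_by_line(lines, pattern):
    """Scan line by line; return (line_number, start, end, candidate)."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        for start, end, term in find_smiles(line, pattern):
            hits.append((lineno, start, end, term))
    return hits
```

With min_atoms=4 this picks up strings such as 'c1ccccc1' and 'CC(=O)[O-]',
but also ordinary words such as 'soon', which is the same kind of false
positive visible in the output above. Passing a smaller min_atoms plays the
role of editing the "{3,}" by hand.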