Re: [Rdkit-discuss] Extracting SMILES from text

Andrew Dalke Fri, 02 Dec 2016 14:03:57 -0800

On Dec 2, 2016, at 10:12 PM, George Papadatos wrote:
> If Alexis wants to search for valid SMILES strings representing typical 
> organic molecules among text of plain English words, would it not be safe to 
> assume that any word containing more than 4 'C' or 'c' characters would only 
> be a SMILES string?


Maybe. It depends on the text. That's the problem with any sort of text 
extraction.

If it contains entries like:

  The combination of phenol (c1ccccc1O) and ....
or
  The SMILES for phenol is c1ccccc1O.


then my code will extract the 'c1ccccc1O', even though the whitespace-delimited 
words "(c1ccccc1O)" and "c1ccccc1O." both cause RDKit to complain with a parse 
error.
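
For illustration only (this is not the extraction code referred to above, which is not shown in this message), a regexp-based extractor along the following lines pulls the embedded 'c1ccccc1O' out of both examples. The names ATOM, BOND, RING, SMILES_LIKE and extract_smiles_like, the reduced atom list, and the min_length cutoff are all assumptions made for this sketch.

```python
import re

# Assumed, deliberately small SMILES vocabulary: organic-subset atoms,
# aromatic atoms, and anything inside square brackets.
ATOM = r"(?:Cl|Br|[BCNOPSFI]|[bcnops]|\[[^\]\s]+\])"
BOND = r"[-=#:/\\]"
RING = r"(?:%\d\d|\d)"

# A candidate starts with an atom, continues with atoms, ring closures and
# branch parentheses, and must not be glued to a letter or digit on either side.
SMILES_LIKE = re.compile(
    r"(?<![A-Za-z0-9])"
    + ATOM
    + r"(?:" + BOND + r"?(?:" + ATOM + r"|" + RING + r")|[()])*"
    + r"(?![A-Za-z0-9])"
)

def extract_smiles_like(text, min_length=5):
    """Return (start, end, term) for each SMILES-like run found in text."""
    found = []
    for match in SMILES_LIKE.finditer(text):
        term = match.group()
        # Drop closing parentheses with no opening partner, e.g. "c1ccccc1O)".
        while term.endswith(")") and term.count(")") > term.count("("):
            term = term[:-1]
        # Very short runs ("I", "CO", ...) are usually ordinary words.
        if len(term) >= min_length:
            found.append((match.start(), match.start() + len(term), term))
    return found
```

The result is only a list of candidates; each one would still need to be checked by a real SMILES parser, and the pattern will accept some ordinary capitalized words made up of element letters.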

I implemented your heuristic as:

def find_possible_smiles(text):
    return [(0, 0, term) for term in text.split()
            if term.count("C") + term.count("c") >= 4]

Here are some of the matches:

/Users/dalke/talks/ICCS_2014_paper.txt:0:0 'CACTVS-specific'
/Users/dalke/talks/ICCS_2014_paper.txt:0:0 'CC12CCC3C(CCC4=CC(O)CCC34C)C1CCC2'
/Users/dalke/talks/ICCS_2014_paper2.txt:0:0 '[http://www.dalkescientific.com/writings/diary/archive/2005/03/02/faster_fingerprint_substructure_tests.html]'
/Users/dalke/talks/Sheffield2013.txt:0:0 '"C1=CC=CC=C1"'
/Users/dalke/talks/Sheffield2013.txt:0:0 '"c1ccccc1",'
/Users/dalke/talks/bugs.txt:0:0 'http://localhost:8080/files?responder=%3Cscript%3Ealert%28%22hi!%22%29%3C/script%3E'
/Users/dalke/talks/garfield.txt:0:0 'http://www.chemheritage.org/discover/collections/oral-histories/details/henderson-madeline-m.aspx'

You can see it grabbed a trailing comma and quote marks as part of a SMILES, 
as well as a bunch of URLs. Those could, of course, be easily post-filtered. 
But why not use a regexp?
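
As a rough sketch of what such a post-filter might look like (the function name post_filter and the choice of characters to strip are assumptions, not something from this thread), the matches can be cleaned of surrounding quotes and punctuation and anything containing "://" discarded:

```python
def post_filter(terms):
    """Clean up whitespace-delimited candidates and drop obvious URLs."""
    kept = []
    for term in terms:
        # Skip URLs, including ones wrapped in brackets such as "[http://...]".
        if "://" in term:
            continue
        # Remove quotes and sentence punctuation from both ends. Brackets and
        # parentheses are left alone because they are meaningful in SMILES.
        cleaned = term.strip("\"',.;")
        if cleaned:
            kept.append(cleaned)
    return kept

matches = [
    "CACTVS-specific",
    '"C1=CC=CC=C1"',
    '"c1ccccc1",',
    "http://localhost:8080/files?responder=%3Cscript%3E",
]
print(post_filter(matches))
```

Note that 'CACTVS-specific' survives this filter, so each new kind of false positive needs its own rule.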


Of course, another level on top of this would be de-hyphenation.

This is a well-trodden path, but not an easy one.
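
To make the de-hyphenation step concrete, here is a naive sketch (the function name dehyphenate is an assumption). It joins a word that was split with a hyphen at a line break, and it also shows why the problem is not easy: it cannot tell a line-break hyphen from a real one, so "well-" followed by "known" on the next line comes out as "wellknown".

```python
import re

def dehyphenate(text):
    """Join words split as 'hetero-' + newline + 'cyclic' into 'heterocyclic'."""
    # Only a hyphen between two lowercase letters, followed by a line break,
    # is treated as a line-break hyphen. This is a guess, not a rule.
    return re.sub(r"(?<=[a-z])-[ \t]*\n[ \t]*(?=[a-z])", "", text)
```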


BTW, I tested how many missed structures there might be using your heuristic:

>>> sum(1 for line in open("/Users/dalke/databases/pubchem.smi")
...     if line.count("C") >= 4)
68228954

% wc -l /Users/dalke/databases/pubchem.smi
 68413797 /Users/dalke/databases/pubchem.smi

I inverted the logic (that counts the lines that pass the test), so the number 
of missed structures is
  68413797-68228954 = 184843, or about 0.3%
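
The same check can be written to count the misses directly. This is a sketch on a tiny in-memory sample rather than the PubChem file; the names passes_heuristic and miss_rate are assumptions. Unlike the interactive test above, it counts both 'C' and 'c', as the heuristic was originally stated.

```python
def passes_heuristic(smiles, threshold=4):
    """True if the string has at least `threshold` 'C' or 'c' characters."""
    return smiles.count("C") + smiles.count("c") >= threshold

def miss_rate(smiles_lines, threshold=4):
    """Return (missed, total) for an iterable of SMILES strings."""
    total = missed = 0
    for line in smiles_lines:
        total += 1
        if not passes_heuristic(line, threshold):
            missed += 1
    return missed, total

sample = ["CCO", "c1ccccc1O", "O=C=O", "CC(C)CC"]
print(miss_rate(sample))

# The arithmetic for the PubChem numbers quoted above:
missed = 68413797 - 68228954
print(missed, round(100 * missed / 68413797, 2))
```

Small molecules such as ethanol ("CCO") and carbon dioxide ("O=C=O") are exactly the kind of structure the heuristic misses.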




                                Andrew
                                da...@dalkescientific.com



_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
