Re: [Rdkit-discuss] Extracting SMILES from text

Igor Filippov Mon, 05 Dec 2016 04:59:29 -0800

Alexis,

Nice, but it doesn't seem to take into account Andrew Dalke's comment that
a valid SMILES may be adjacent to a punctuation mark (e.g. a period or a
parenthesis).
Perhaps it is not an issue for your specific project, but maybe instead of
a simple "split()" it would be worthwhile to use something more sophisticated?
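
As a rough illustration of what "more sophisticated" could mean, here is a minimal sketch that still splits on whitespace but also proposes punctuation-stripped variants of each word. The function names and the choice of punctuation are my own assumptions, not part of the attached script; each variant would then be tried in turn with Chem.MolFromSmiles, most literal first.

```python
def candidate_strings(word):
    """Return the word plus punctuation-stripped variants, most literal first."""
    variants = [word]
    # Sentence punctuation glued to the end of a token, e.g. "CCO." or "CCO,".
    core = word.rstrip(".,;:!?")
    if core and core != word:
        variants.append(core)
    # One enclosing pair of brackets or quotes, e.g. "(CCO)".
    # Assumption: only a single enclosing layer is worth trying.
    pairs = {"(": ")", "[": "]", '"': '"', "'": "'"}
    if len(core) > 2 and pairs.get(core[0]) == core[-1]:
        variants.append(core[1:-1])
    return variants


def tokenize(text):
    """Split on whitespace and expand each token into its candidate strings."""
    return [candidate_strings(word) for word in text.split()]
```

Keeping the untouched word as the first variant means a bracket atom such as "[Na+]" is still tested as written before the stripped form is considered.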


Best regards,
Igor


On Mon, Dec 5, 2016 at 5:35 AM, Alexis Parenty <
alexis.parenty.h...@gmail.com> wrote:

> Dear All,
>
> Many thanks to everyone for your participation in that discussion. It was 
> very interesting and useful. I have written a small script that took on board 
> everyone’s input:
>
> This incorporates a few "text filters" before the RDKit function. First, I
> built a dictionary with every word in the text as a key and the number of
> times it appears in the text as the value. Then I removed from the list of
> unique keys (words) all those that appear more than once (because I know
> that my SMILES appear only once in each document). Then I removed all the
> words that are shorter than 5 characters, because I know that all my
> structures contain more than 5 atoms and I want to remove possible FPs
> coming from “I” or “CC”, for example. Then, with a regex, I removed all
> unique words that contain letters that are not in the main elements of the
> periodic table, and the words that contain the main English punctuation
> marks that never occur in SMILES.
>
> Placed one after the other, those filters take the 26 836 words of the book
> "Alice's Adventures in Wonderland" down to 780 words (97% of words filtered
> out).
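
The filters described above could be sketched roughly as follows. This is not the attached script: the function name, the exact character set and the default minimum length are assumptions made for illustration.

```python
import re
from collections import Counter

# Assumption: the characters a candidate may contain. The letters are those
# found in the organic-subset symbols (B, C, N, O, P, S, F, Cl, Br, I, H and
# aromatic b, c, n, o, p, s); the rest are SMILES bond, ring-closure, branch,
# charge, bracket, dot and wildcard symbols.
_ALLOWED = re.compile(r"[BCNOPSFIHbcnopslr0-9()\[\]=#@+\-/\\%.*]+")


def prefilter(text, min_length=5):
    """Return the words worth handing to the SMILES parser."""
    counts = Counter(text.split())
    kept = []
    for word, n in counts.items():
        if n > 1:                          # SMILES are assumed to appear once
            continue
        if len(word) < min_length:         # drops "I", "CC", ...
            continue
        if not _ALLOWED.fullmatch(word):   # drops ordinary words and "hello,"
            continue
        kept.append(word)
    return kept
```

An English word made only of allowed letters (e.g. "scorn") still passes these filters, so the survivors still need to go through the RDKit parser.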
>
>
> TEST RESULTS
>
> I have tested my script on:
> •     7900 unique SMILES for “drug-like molecules”
> •     Alice’s Adventures in Wonderland (I have never read the book, but I
> assumed it contains no SMILES!)
> •     A shuffled mixture of Alice’s Adventures in Wonderland and the 7900
> unique SMILES
>
> The performance is as follows:
>
>
> For Alice’s Adventures in Wonderland:
> 26836 words
> 26835 TN
> 0 TP
> 1 FP: “*****************************************************************” 
> (actually a valid SMILES…)
> 0 FN
>
> ==> Accuracy of 0.99996, in 0:00:00.112000
>
>
>
> For 7900 unique SMILES from unique drug-like molecules:
> 7900 TP
> 0 TN
> 0 FP
> 0 FN
> ==> Accuracy of 1.00000, in 0:00:04.200000
>
>
>
>
> 7900 unique SMILES from unique drug-like molecules shuffled within the
> 26836 words of Alice's Adventures in Wonderland (34736 words in total):
>
> 7900 TP
> 26835 TN
> 1 FP: “*****************************************************************”
> 0 FN
>
> ==> Accuracy of 0.99997 in 0:00:04.949000
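
For reference, the accuracy figures quoted for the Alice-only run and for the shuffled mixture follow from the usual definition, (TP + TN) / (TP + TN + FP + FN). A short check, using only the counts reported above (the function name is mine):

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of words classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Counts taken from the two result blocks above.
alice_only = accuracy(tp=0, tn=26835, fp=1, fn=0)
mixture = accuracy(tp=7900, tn=26835, fp=1, fn=0)

print(f"{alice_only:.5f}")  # 0.99996
print(f"{mixture:.5f}")     # 0.99997
```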
>
>
> Then I reprocessed the text mixture above without the text filters
> (feeding every word from the text directly into the RDKit function) and got
> the following result:
>
> 7900 TP
> 26835 TN
> 339 FP
> 0 FN
> ==> Accuracy of 0.97 in 0:00:07.893
>
>
> Therefore, as Brian pointed out, the function Chem.MolFromSmiles(SMILES) is
> extremely fast at detecting invalid SMILES, i.e. at returning None (about
> 240K/s on my computer). What takes the longest is the processing of valid
> SMILES into valid Mol objects (2K/s, i.e. 120 times slower).
>
> My conclusion is that the filters are mainly useful to prevent FPs from
> occurring, but there is no noticeable gain in processing time. The function
> Chem.MolFromSmiles is very quick to discard invalid SMILES but can let
> through a number of FPs if used without text filtering.
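
A comparison of this kind (rejections per second versus successful parses per second) can be reproduced with a small timing harness. The sketch below is an assumption about how one might measure it, not the code used for the numbers above; `throughput` and `toy_parse` are hypothetical names, and `toy_parse` merely stands in for Chem.MolFromSmiles so that the example runs without RDKit.

```python
import time


def throughput(strings, parse):
    """Return (accepted, rejected, strings per second) for one parser run."""
    start = time.perf_counter()
    # A parser such as Chem.MolFromSmiles returns None for invalid input.
    accepted = sum(1 for s in strings if parse(s) is not None)
    elapsed = time.perf_counter() - start
    rate = len(strings) / elapsed if elapsed > 0 else float("inf")
    return accepted, len(strings) - accepted, rate


def toy_parse(s):
    # Stand-in parser for illustration only: "valid" means all upper case.
    return s if s.isupper() else None


accepted, rejected, rate = throughput(["CCO", "hello", "CCN"], toy_parse)
```

With RDKit installed, passing Chem.MolFromSmiles as `parse` and running the function once on the list of SMILES and once on the list of ordinary words would give the two rates to compare.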
>
>
> The script is attached; comments are again welcome!
>
> Thanks again,
>
> Alexis
>
>
>
>
>
>
>
> On 4 December 2016 at 02:52, Andrew Dalke <da...@dalkescientific.com>
> wrote:
>
>> On Dec 2, 2016, at 5:46 PM, Brian Kelley wrote:
>> > I hacked a version of RDKit's smiles parser to compute heavy atom
>> > count, perhaps some version of this could be used to check smiles validity
>> > without making the actual molecule.
>>
>> FWIW, here's my regex code for it, which makes the assumption that only
>> "[H]" and anything with a "*" are not heavy.
>>
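>> import re
>>
>> # Matches one atom token: an organic-subset symbol or a bracket atom.
>> _atom_pat = re.compile(r"""
>> (
>>  Cl? |              # C or Cl
>>  Br? |              # B or Br
>>  [NOSPFIbcnosp] |   # other organic-subset atoms, aliphatic and aromatic
>>  \[[^]]*\]          # bracket atom, e.g. [Na+]
>> )
>> """, re.X)
>>
>> def get_num_heavies(smiles):
>>     """Count atom tokens, skipping "[H]" and anything containing "*"."""
>>     num_atoms = 0
>>     for m in _atom_pat.finditer(smiles):
>>         text = m.group()
>>         if text == "[H]" or "*" in text:
>>             continue
>>         num_atoms += 1
>>     return num_atoms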
>>
>> This turns out to be quite a handy piece of functionality.
>>
>>
>>                                 Andrew
>>                                 da...@dalkescientific.com
>>
>>
>>
>> _______________________________________________
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>
>
>
>
