Re: [Rdkit-discuss] Extracting SMILES from text

Alexis Parenty Mon, 05 Dec 2016 02:36:39 -0800

Dear All,

Many thanks to everyone for taking part in this discussion. It was very
interesting and useful. I have written a small script that takes
everyone’s input on board.

The script applies a few "text filters" before calling the RDKit
function. First, I build a dictionary with every word of the text as a
key and the number of times it appears in the text as the value. I then
keep only the words that occur exactly once (because I know that my
SMILES appear only once in each document). Next, I remove all the words
that are shorter than 5 characters, because I know that all my
structures contain more than 5 atoms and I want to avoid possible FPs
coming from “I” or “CC”, for example. Finally, with a regex, I remove
the words that contain letters that do not occur in the symbols of the
main elements of the periodic table, as well as the words that contain
the common English punctuation marks that never occur in SMILES.

Applied one after the other, these filters take the 26 836 words of the
book "Alice's Adventures in Wonderland" down to 780 words (97% of the
words are filtered out).
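
The filter chain described above can be sketched on a toy sentence,
without RDKit. This is only an illustration: the function name
candidate_words and the sample sentence are made up for the example,
and the character class is the one used in the attached script.

```python
import re
from collections import Counter

# Characters assumed never to occur in the SMILES of interest
# (same character class as in the attached script)
NON_SMILES_CHARS = re.compile(r"[EJQUjkmpqvwxyz,;:_]")


def candidate_words(text, max_count=1, min_length=5):
    """Return the words that survive the three text filters."""
    counts = Counter(text.split())
    return {
        word
        for word, count in counts.items()
        if count <= max_count                    # occurs only once
        and len(word) >= min_length              # not too short
        and not NON_SMILES_CHARS.search(word)    # no excluded character
        and not word.endswith(".")               # no trailing full stop
    }


sample = ("The compound CC(=O)Oc1ccccc1C(=O)O was added to the flask, "
          "and the flask was heated.")
print(sorted(candidate_words(sample)))
# prints ['CC(=O)Oc1ccccc1C(=O)O', 'added']
```

Note that "added" survives the text filters; only the words left at
this stage need to be passed to Chem.MolFromSmiles, which is expected
to reject it.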

TEST RESULTS

I have tested my script on:
•       7900 unique SMILES for “drug-like molecules”
•       “Alice’s Adventures in Wonderland” (I have never read the book, but I
assumed that it contains no SMILES!)
•       A shuffled mixture of “Alice’s Adventures in Wonderland” and the 7900
unique SMILES

The performance is as follows:

For “Alice’s Adventures in Wonderland”:
26836 words
26835 TN
0 TP
1 FP: “*****************************************************************”
(actually a valid SMILES…)
0 FN

==> Accuracy of 0.99996, in 0:00:00.112000

For the 7900 unique SMILES of drug-like molecules:
7900 TP
0 TN
0 FP
0 FN
==> Accuracy of 0.99996, in 0:00:04.200000

For the 7900 unique SMILES of drug-like molecules shuffled within
“Alice’s Adventures in Wonderland” (26836 words; 34736 words in total):

7900 TP
26835 TN
1 FP: “*****************************************************************”
0 FN

==> Accuracy of 0.99997 in 0:00:04.949000
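
For reference, the accuracy figures for the book alone and for the
shuffled mixture can be reproduced from the counts, assuming the usual
definition (TP + TN) / (TP + TN + FP + FN), which is not stated
explicitly here. The function name below is only for illustration.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of words that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Book alone: 26835 TN and 1 FP
print(round(accuracy(tp=0, tn=26835, fp=1, fn=0), 5))     # 0.99996
# Shuffled mixture: 7900 TP, 26835 TN and 1 FP
print(round(accuracy(tp=7900, tn=26835, fp=1, fn=0), 5))  # 0.99997
```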

Then I reprocessed the text mixture above without the text filters
(feeding every word of the text directly into the RDKit function) and
got the following result:

7900 TP
26835 TN
339 FP
0 FN
==> Accuracy of 0.97 in 0:00:07.893

Therefore, as Brian pointed out, the function
Chem.MolFromSmiles(SMILES) is extremely fast at detecting invalid
SMILES, i.e. at returning None (about 240K/s on my computer). What
takes the longest is the processing of valid SMILES into valid Mol
objects (2K/s, i.e. 120 times slower).

My conclusion is that the filters are mainly useful for preventing FPs,
but that they give no noticeable gain in processing time. The function
Chem.MolFromSmiles is very quick to discard invalid SMILES but can let
through a number of FPs if it is used without text filtering.
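
A rough way to compare the two rates on another machine is sketched
below. The helper parse_rate and the two small word lists are
assumptions made for the example; the numbers will depend on the
machine and on the molecules, so no expected output is given.

```python
import time


def parse_rate(parse, words):
    """Return how many words per second `parse` handles (hypothetical helper)."""
    start = time.perf_counter()
    for word in words:
        parse(word)
    elapsed = time.perf_counter() - start
    return len(words) / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    try:
        from rdkit import Chem, rdBase
    except ImportError:
        print("RDKit is not installed; nothing to time")
    else:
        rdBase.DisableLog("rdApp.*")  # silence parse errors, as in the script
        # Illustrative inputs only, not the data set used for the figures above
        invalid = ["wonderland", "rabbit-hole", "Alice!"] * 1000
        valid = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"] * 1000
        print("invalid words/s:", round(parse_rate(Chem.MolFromSmiles, invalid)))
        print("valid SMILES/s: ", round(parse_rate(Chem.MolFromSmiles, valid)))
```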

The script is attached; comments are again welcome!

Thanks again,

Alexis

On 4 December 2016 at 02:52, Andrew Dalke <da...@dalkescientific.com> wrote:

> On Dec 2, 2016, at 5:46 PM, Brian Kelley wrote:
> > I hacked a version of RDKit's smiles parser to compute heavy atom count,
> > perhaps some version of this could be used to check smiles validity
> > without making the actual molecule.
>
> FWIW, here's my regex code for it, which makes the assumption that only
> "[H]" and anything with a "*" are not heavy.
>
> _atom_pat = re.compile(r"""
> (
>  Cl? |
>  Br? |
>  [NOSPFIbcnosp] |
>  \[[^]]*\]
> )
> """, re.X)
>
> def get_num_heavies(smiles):
>     num_atoms = 0
>     for m in _atom_pat.finditer(smiles):
>         text = m.group()
>         if text == "[H]" or "*" in text:
>             continue
>         num_atoms += 1
>     return num_atoms
>
> This turns out to be a quite handy piece of functionality.
>
>
>                                 Andrew
>                                 da...@dalkescientific.com
>
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
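
As a possible variation, Andrew's heavy-atom count could be used as a
cheap pre-filter instead of the word-length filter. The sketch below
repeats the quoted regex so that it runs on its own; the function
has_enough_heavy_atoms and its threshold are hypothetical and are not
part of the attached script.

```python
import re

# Atom pattern quoted from Andrew's message above
_atom_pat = re.compile(r"""
(
 Cl? |
 Br? |
 [NOSPFIbcnosp] |
 \[[^]]*\]
)
""", re.X)


def get_num_heavies(smiles):
    num_atoms = 0
    for m in _atom_pat.finditer(smiles):
        text = m.group()
        if text == "[H]" or "*" in text:
            continue
        num_atoms += 1
    return num_atoms


def has_enough_heavy_atoms(word, min_heavies=5):
    """Hypothetical pre-filter: cheap check before calling RDKit."""
    return get_num_heavies(word) >= min_heavies


print(get_num_heavies("CC(=O)Oc1ccccc1C(=O)O"))  # 13
print(has_enough_heavy_atoms("CCO"))             # False
print(get_num_heavies("*" * 65))                 # 0
```

The count says nothing about validity (letters in ordinary English
words are counted too), but a row of asterisks such as the FP reported
above has no heavy atoms under this definition.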

from collections import Counter
import re
from rdkit import Chem
from rdkit import rdBase
rdBase.DisableLog("rdApp.*")


def filter_out_any_word_that_occurs_more_than_n_times(dic, n):
    # Keep only the words whose count in the dictionary is at most n
    set_word = set()
    for key in dic:
        if dic[key] <= n:
            set_word.add(key)
    return set_word

def filter_out_any_word_with_less_than_n_letters(set_words, n):
    new_set_words = set()
    for each_word in set_words:
        if len(each_word) >= n:
            new_set_words.add(each_word)
    return new_set_words


def filter_out_any_word_with_non_SMILES_char(set_words):
    new_set_words = set()
    for each_word in set_words:
        if not re.search('[EJQUjkmpqvwxyz,;:_]+', each_word) and each_word[-1] != ".":
            new_set_words.add(each_word)
    return new_set_words


file_name = "C:\\Users\\PARENAL1\\Desktop\\test\\alice.txt"


with open(file_name, 'r') as infile:
    orig_list_of_words = infile.read().split()
    length_original_list = len(orig_list_of_words)
    # filter1: store in a dictionary the unique words of the document,
    # with their number of occurrences
    dict_word_count = Counter(orig_list_of_words)

    # filter2: filter out the words that occur more than once in the document
    # (I know that my SMILES do not occur more than once in each document)
    filtered_set_of_unique_words = filter_out_any_word_that_occurs_more_than_n_times(dict_word_count, 1)

    # filter3: get rid of the words that are shorter than 5 letters and that
    # could lead to FPs (such as I, CC, ...). I know that in my case the valid
    # SMILES represent molecules bigger than 5 atoms
    filtered_set_of_unique_words = filter_out_any_word_with_less_than_n_letters(filtered_set_of_unique_words, 5)

    # filter4: filter out the words that contain letters that are not in the
    # main elements of the periodic table, and some frequent English
    # punctuation signs that are never encountered in SMILES
    filtered_set_of_unique_words = filter_out_any_word_with_non_SMILES_char(filtered_set_of_unique_words)

    # Store the valid SMILES in a list
    smile_list = []
    for each_word in filtered_set_of_unique_words:
        each_word_mol = Chem.MolFromSmiles(each_word)
        if each_word_mol is not None:
            smile_list.append(each_word)

    print("number of Smiles detected = {0} from a total of {1} words".format(len(smile_list), length_original_list))
