Thank you for sharing your results, Alexis. This is indeed an interesting
problem.

I just wonder what the 339 FPs are. Are they all English words with fewer
than 6 characters? If RDKit can construct a molecule out of them, I suppose
that in theory they could be valid SMILES?

It looks like the problem with parentheses and punctuation will be difficult
to debug, since MolFromSmiles can no longer serve as a control run. Valid
SMILES affected by it will remain hidden.

Ling


On Mon, Dec 5, 2016 at 2:35 AM, Alexis Parenty <
alexis.parenty.h...@gmail.com> wrote:

> Dear All,
>
> Many thanks to everyone for your participation in that discussion. It was 
> very interesting and useful. I have written a small script that took on board 
> everyone’s input:
>
> This incorporates a few "text filters" before the RDKit function. First of
> all, I made a dictionary with all the words present in the text as keys and
> the number of times they appear in the text as values. Then I removed from
> the list of unique keys (words) all the ones that appear more than once
> (because I know that my SMILES appear only once in each document). Then I
> removed all the words that are shorter than 5 letters, because I know that
> all my structures contain more than 5 atoms and I want to remove possible
> FPs coming from “I” or “CC”, for example. Then, with regex, I removed all
> unique words that contain letters that are not in the main periodic table
> of elements, and removed the words that contain the main English
> punctuation signs that never occur in SMILES.
>
> Placed one after the other, those filters take the 26 836 words of the book
> "Alice's Adventures in Wonderland" down to 780 words (97% of the words
> filtered out).
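The attached script is not shown in this thread, so here is only a minimal sketch of how the filters described above could be chained. The function name `candidate_smiles`, the length threshold, and both character sets are my assumptions, not Alexis's actual code. In particular, I assume "letters not in the periodic table" means J and Q, and I picked a small set of punctuation signs that I believe never occur in SMILES.

```python
import re
from collections import Counter

# All three values below are assumptions made for illustration only.
MIN_LENGTH = 5                                # words shorter than this are dropped
NON_ELEMENT_LETTERS = re.compile(r"[jqJQ]")   # letters assumed absent from element symbols
ENGLISH_PUNCTUATION = re.compile(r"[,;!?\"']")  # signs assumed never to occur in SMILES


def candidate_smiles(text):
    """Return the whitespace-separated words of `text` that pass every filter."""
    counts = Counter(text.split())  # word -> number of occurrences
    candidates = []
    for word, count in counts.items():
        if count > 1:                        # a SMILES is expected only once
            continue
        if len(word) < MIN_LENGTH:           # too short to be one of the structures
            continue
        if NON_ELEMENT_LETTERS.search(word):
            continue
        if ENGLISH_PUNCTUATION.search(word):
            continue
        candidates.append(word)
    return candidates


text = "Alice was quite tired, and CC(=O)Oc1ccccc1C(=O)O was there; Alice was just I"
print(candidate_smiles(text))
```

Only the words that survive would then be passed to Chem.MolFromSmiles. Note that an ordinary word such as "rabbit" also passes these filters, so the RDKit check is still what makes the final decision.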
>
>
> TEST RESULTS
>
> I have tested my script on:
> •     7900 unique SMILES for “drug-like molecules”
> •     Alice’s Adventures in Wonderland (I never read the book but I assumed
> there are no SMILES in it!)
> •     A shuffled mixture of Alice’s Adventures in Wonderland and the 7900 unique SMILES
>
> The performance is as follows:
>
>
> For Alice’s Adventures in Wonderland:
> 26836 words
> 26835 TN
> 0 TP
> 1 FP: “*****************************************************************” 
> (actually a valid SMILES…)
> 0 FN
>
> ==> Accuracy of 0.99996, in 0:00:00.112000
>
>
>
> For 7900 unique SMILES from unique drug-like molecules:
> 7900 TP
> 0 TN
> 0 FP
> 0 FN
> ==> Accuracy of 0.99996, in 0:00:04.200000
>
>
>
>
> 7900 unique SMILES from unique drug-like molecules shuffled within the 26836
> words of ALICE'S ADVENTURES IN WONDERLAND (34736 words in total):
>
> 7900 TP
> 26835 TN
> 1 FP: “*****************************************************************”
> 0 FN
>
> ==> Accuracy of 0.99997 in 0:00:04.949000
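For readers who want to check the figures, I assume accuracy here is the usual (TP + TN) / (TP + TN + FP + FN). The helper name `accuracy` is mine. A small sketch using the counts reported above for the Alice-only run and for the shuffled mixture:

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of words classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Counts taken from the results reported in the message.
alice_only = accuracy(tp=0, tn=26835, fp=1, fn=0)
mixture = accuracy(tp=7900, tn=26835, fp=1, fn=0)

print(f"{alice_only:.5f}")  # 0.99996
print(f"{mixture:.5f}")     # 0.99997
```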
>
>
> Then I reprocessed the text mixture above without the text filters
> (directly feeding every word from the text into the RDKit function) and got
> the following result:
>
> 7900 TP
> 26835 TN
> 339 FP
> 0 FN
> ==> Accuracy of 0.97 in 0:00:07.893
>
>
> Therefore, as Brian pointed out, the function Chem.MolFromSmiles(SMILES) is
> extremely fast at detecting invalid SMILES, i.e. at returning None (about
> 240K/s on my computer). What takes the longest is the processing of valid
> SMILES into valid Mol objects (2K/s, i.e. 120 times slower).
>
> My conclusion is that the filters are mainly useful to prevent FPs from
> occurring, but there is no noticeable gain in processing time. The function
> Chem.MolFromSmiles is very quick to discard invalid SMILES but can let
> through a number of FPs if used without text filtering.
>
>
> The script is attached; comments are again welcome!
>
> Thanks again,
>
> Alexis
>
>
>
>
>
>
>
> On 4 December 2016 at 02:52, Andrew Dalke <da...@dalkescientific.com>
> wrote:
>
>> On Dec 2, 2016, at 5:46 PM, Brian Kelley wrote:
>> > I hacked a version of RDKit's smiles parser to compute heavy atom
>> > count, perhaps some version of this could be used to check smiles
>> > validity without making the actual molecule.
>>
>> FWIW, here's my regex code for it, which makes the assumption that only
>> "[H]" and anything with a "*" are not heavy.
>>
>> import re
>>
>> _atom_pat = re.compile(r"""
>> (
>>  Cl? |
>>  Br? |
>>  [NOSPFIbcnosp] |
>>  \[[^]]*\]
>> )
>> """, re.X)
>>
>> def get_num_heavies(smiles):
>>     num_atoms = 0
>>     for m in _atom_pat.finditer(smiles):
>>         text = m.group()
>>         if text == "[H]" or "*" in text:
>>             continue
>>         num_atoms += 1
>>     return num_atoms
>>
>> This turns out to be a quite handy piece of functionality.
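To make the behaviour of Andrew's function concrete, here is a self-contained copy with a few calls. The example SMILES are my own choices. As stated above, the function only tokenises atoms and assumes that "[H]" and anything containing "*" are not heavy; it does not check that the string is a valid SMILES.

```python
import re

# Atom tokens: C or Cl, B or Br, single-letter organic-subset atoms
# (aliphatic and aromatic), or any bracket atom.
_atom_pat = re.compile(r"""
(
 Cl? |
 Br? |
 [NOSPFIbcnosp] |
 \[[^]]*\]
)
""", re.X)


def get_num_heavies(smiles):
    """Count atom tokens, skipping "[H]" and anything containing "*"."""
    num_atoms = 0
    for m in _atom_pat.finditer(smiles):
        text = m.group()
        if text == "[H]" or "*" in text:
            continue
        num_atoms += 1
    return num_atoms


print(get_num_heavies("CCO"))                    # 3
print(get_num_heavies("c1ccccc1"))               # 6
print(get_num_heavies("[H]O[H]"))                # 1
print(get_num_heavies("[Na+].[Cl-]"))            # 2
print(get_num_heavies("CC(=O)Oc1ccccc1C(=O)O"))  # 13
```

Because it is purely a token count, an English word such as "Alice" also gets a non-zero count (the "c" matches), so this works as a cheap size check before parsing rather than as a validity test.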
>>
>>
>>                                 Andrew
>>                                 da...@dalkescientific.com
>>
>>
>>
>> ------------------------------------------------------------
>> ------------------
>> Check out the vibrant tech community on one of the world's most
>> engaging tech sites, SlashDot.org! http://sdm.link/slashdot
>> _______________________________________________
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>
>