Cool!  Btw, try sanitize=False.

Also, Andrew is right that you will miss parenthetical phrases, e.g.
Benzene(c1ccccc1) and the like. Just reasserting that this is a hard problem!
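
A minimal sketch of both ideas, assuming the usual RDKit import (looks_like_smiles and the parenthesis regex are just illustrative, not anything from Alexis's attached script):

import re
from rdkit import Chem

_paren = re.compile(r"\(([^()]+)\)")

def looks_like_smiles(word):
    # sanitize=False skips the expensive sanitization step; a non-None return
    # only means the string parsed, not that the molecule is chemically sane.
    if Chem.MolFromSmiles(word, sanitize=False) is not None:
        return True
    # Fall back to any parenthetical chunk, e.g. "Benzene(c1ccccc1)".
    return any(Chem.MolFromSmiles(chunk, sanitize=False) is not None
               for chunk in _paren.findall(word))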

----
Brian Kelley

> On Dec 5, 2016, at 5:35 AM, Alexis Parenty <alexis.parenty.h...@gmail.com> wrote:
> 
> Dear All, 
> Many thanks to everyone for your participation in that discussion. It was 
> very interesting and useful. I have written a small script that took on board 
> everyone’s input:
> 
> This incorporates a few "text filters" before the RDKit function:
> •     First, I build a dictionary with every word in the text as a key and the number of times it appears as the value, and remove every word that appears more than once (because I know that my SMILES appear only once in each document).
> •     Then I remove all words shorter than 5 characters, because I know that all my structures contain more than 5 atoms and I want to avoid possible FPs from words like “I” or “CC”.
> •     Finally, with a regex, I remove all remaining words that contain letters which are not element symbols from the periodic table, as well as words that contain the main English punctuation signs that never occur in SMILES.
> Applied one after the other, these filters take the 26836 words of “Alice’s Adventures in Wonderland” down to 780 words (97% of words filtered out).
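> 
> A minimal sketch of those filters (just the idea; the attached script is the real implementation, and the ALLOWED pattern below is only a rough stand-in for the element-symbol and punctuation checks):
> 
> from collections import Counter
> import re
> 
> # loose whitelist of characters that can occur in SMILES
> ALLOWED = re.compile(r"[A-Za-z0-9@+\-\[\]()\\/%=#$.:]+")
> 
> def candidate_words(text):
>     counts = Counter(text.split())
>     for word, n in counts.items():
>         if n > 1:                        # my SMILES appear only once per document
>             continue
>         if len(word) < 5:                # all my structures have more than 5 atoms
>             continue
>         if not ALLOWED.fullmatch(word):  # characters that never occur in SMILES
>             continue
>         yield word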
> 
> TEST RESULTS
> 
> I have tested my script on:
> •     7900 unique SMILES for “drug-like molecules”
> •     Alice’s Adventures in Wonderland (I have never read the book, but I assume there are no SMILES in it!)
> •     A shuffled mixture of Alice’s Adventures in Wonderland and the 7900 unique SMILES
> 
> The performance is as follows:
> 
> 
> For Alice’s Adventures in Wonderland:
> 26836 words
> 26835 TN
> 0 TP
> 1 FP: “*****************************************************************” 
> (actually a valid SMILES…)
> 0 FN
> ==> Accuracy of 0.99996, in 0:00:00.112000
> 
> 
> For the 7900 unique SMILES from unique drug-like molecules:
> 7900 TP
> 0 TN
> 0 FP
> 0 FN
> ==> Accuracy of 0.99996, in 0:00:04.200000
> 
> 
> 
> 
> For the 7900 unique SMILES from unique drug-like molecules shuffled into the 26836 words of Alice’s Adventures in Wonderland (34736 words in total):
> 
> 7900 TP
> 26835 TN
> 1 FP: “*****************************************************************”
> 0 FN
> 
> ==> Accuracy of 0.99997 in 0:00:04.949000
> 
> 
> Then I reprocessed the text mixture above without the text filters (directly feeding every word from the text into the RDKit function) and got the following result:
> 
> 7900 TP
> 26835 TN
> 339 FP
> 0 FN
> ==> Accuracy of 0.97 in 0:00:07.893
> 
> 
> Therefore, as Brian pointed out, the function Chem.MolFromSmiles() is extremely fast at detecting invalid SMILES, i.e. at returning None (about 240K/s on my computer). What takes the longest is turning valid SMILES into Mol objects (about 2K/s, i.e. 120 times slower).
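> Just to illustrate the two code paths (a trivial sketch):
> 
> from rdkit import Chem
> 
> Chem.MolFromSmiles("rabbit")      # invalid SMILES: returns None, rejected very quickly
> Chem.MolFromSmiles("c1ccccc1")    # valid SMILES: returns a Mol object, the slow path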
> 
> My conclusion is that the filters are mainly useful for preventing FPs; there is no noticeable gain in processing time. Chem.MolFromSmiles is very quick to discard invalid SMILES, but without text filtering it lets through a number of FPs.
> 
> The script is attached; comments are again welcome!
> 
> Thanks again,
> 
> Alexis
> 
>> On 4 December 2016 at 02:52, Andrew Dalke <da...@dalkescientific.com> wrote:
>> On Dec 2, 2016, at 5:46 PM, Brian Kelley wrote:
>> > I hacked a version of RDKit's SMILES parser to compute heavy atom count; perhaps some version of this could be used to check SMILES validity without making the actual molecule.
>> 
>> FWIW, here's my regex code for it, which makes the assumption that only 
>> "[H]" and anything with a "*" are not heavy.
>> 
>> import re
>> 
>> _atom_pat = re.compile(r"""
>> (
>>  Cl? |                # Cl, or aliphatic C
>>  Br? |                # Br, or aliphatic B
>>  [NOSPFIbcnosp] |     # other organic-subset atoms, aliphatic and aromatic
>>  \[[^]]*\]            # any bracket atom, e.g. [nH], [C@@H], [H], [*]
>> )
>> """, re.X)
>> 
>> def get_num_heavies(smiles):
>>     num_atoms = 0
>>     for m in _atom_pat.finditer(smiles):
>>         text = m.group()
>>         if text == "[H]" or "*" in text:
>>             continue     # skip explicit hydrogens and wildcard atoms
>>         num_atoms += 1
>>     return num_atoms
>> 
>> This turns out to be quite a handy piece of functionality.
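>> 
>> For example (a couple of quick sanity checks):
>> 
>> get_num_heavies("c1ccccc1")    # benzene -> 6
>> get_num_heavies("[H]O[H]")     # water with explicit hydrogens -> 1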
>> 
>> 
>>                                 Andrew
>>                                 da...@dalkescientific.com
>> 
>> 
>> 
> 
> <SMILES_from_english_text_parser.txt>
