Thank you for sharing your results, Alexis. This is indeed an interesting
problem.
I just wonder what the 339 FPs are. Are they all English words with fewer
than 6 characters? If RDKit can construct a molecule out of them, I suppose
in theory they could be valid SMILES?
It looks like the problems with parentheses and punctuation will be difficult
to debug, since MolFromSmiles can no longer serve as a control run: valid
SMILES rejected by the text filters will remain hidden.
Ling
On Mon, Dec 5, 2016 at 2:35 AM, Alexis Parenty <
alexis.parenty.h...@gmail.com> wrote:
> Dear All,
>
> Many thanks to everyone for participating in this discussion. It was
> very interesting and useful. I have written a small script that takes on
> board everyone’s input.
>
> It applies a few "text filters" before the RDKit function. First, I build
> a dictionary with every word in the text as a key and the number of times
> it appears in the text as the value. Then I remove every word that appears
> more than once (because I know that my SMILES appear only once in each
> document). Then I remove all words shorter than 5 characters, because I
> know that all my structures contain more than 5 atoms and I want to remove
> possible FPs such as “I” or “CC”. Then, with regex, I remove all remaining
> words that contain letters not found in the main periodic table of
> elements, as well as words that contain the main English punctuation signs
> that never occur in SMILES.
>
> Applied one after the other, these filters take the 26 836 words of the book
> "Alice's Adventures in Wonderland" down to 780 words (97% of words
> filtered out).
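A minimal sketch of the filter chain described above, for readers without the attachment. It is not the attached script: the names `ALLOWED_LETTERS`, `FORBIDDEN_PUNCT` and `candidate_words` are hypothetical, the two character sets are assumptions (organic-subset letters only, a handful of punctuation signs), and set membership is used here in place of the regex mentioned in the post.

```python
from collections import Counter

# Assumption: letters that can appear in organic-subset SMILES atoms
# (B, C, N, O, P, S, F, I, H, Cl, Br and the aromatic lowercase forms).
ALLOWED_LETTERS = set("BCFHINOPSbcnopslr")
# Assumption: English punctuation that never occurs in SMILES.
FORBIDDEN_PUNCT = set(",;!?\"'")


def candidate_words(text, min_len=5):
    """Return the words of `text` that survive all text filters."""
    counts = Counter(text.split())
    kept = []
    for word, n in counts.items():
        if n != 1:
            continue  # each SMILES is assumed to appear only once
        if len(word) < min_len:
            continue  # too short to be one of the expected structures
        if any(ch.isalpha() and ch not in ALLOWED_LETTERS for ch in word):
            continue  # contains a letter outside the allowed set
        if any(ch in FORBIDDEN_PUNCT for ch in word):
            continue  # contains punctuation not used in SMILES
        kept.append(word)
    return kept
```

Only the words returned by `candidate_words` would then be passed on to the RDKit parser.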
>
>
> TEST RESULTS
>
> I have tested my script on:
> • 7900 unique SMILES for “drug-like molecules”
> • Alice’s Adventures in Wonderland (I have never read the book, but I
> assumed it contains no SMILES!)
> • A shuffled mixture of Alice’s Adventures in Wonderland and the 7900
> unique SMILES
>
> The performance is as follows:
>
>
> For Alice’s Adventures in Wonderland:
> 26836 words
> 26835 TN
> 0 TP
> 1 FP: “*****************************************************************”
> (actually a valid SMILES…)
> 0 FN
>
> ==> Accuracy of 0.99996, in 0:00:00.112000
>
>
>
> For 7900 unique SMILES from unique drug-like molecules:
> 7900 TP
> 0 TN
> 0 FP
> 0 FN
> ==> Accuracy of 0.99996, in 0:00:04.200000
>
>
>
>
> For the 7900 unique SMILES from unique drug-like molecules shuffled within
> Alice's Adventures in Wonderland (26836 words; 34736 words in total):
>
> 7900 TP
> 26835 TN
> 1 FP: “*****************************************************************”
> 0 FN
>
> ==> Accuracy of 0.99997 in 0:00:04.949000
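The accuracy figures can be checked from the four counts. The post does not state the formula, so the sketch below assumes the usual definition, (TP + TN) / (TP + TN + FP + FN); the function name `accuracy` is hypothetical.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of words classified correctly (assumed definition)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no observations")
    return (tp + tn) / total


# Counts reported for the shuffled mixture of SMILES and book text.
print(round(accuracy(tp=7900, tn=26835, fp=1, fn=0), 5))
```

With the mixture counts (7900 TP, 26835 TN, 1 FP, 0 FN) this gives about 0.99997, matching the reported value.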
>
>
> Then I reprocessed the text mixture above without the text filters
> (feeding every word from the text directly into the RDKit function) and got
> the following result:
>
> 7900 TP
> 26835 TN
> 339 FP
> 0 FN
> ==> Accuracy of 0.97 in 0:00:07.893
>
>
> Therefore, as Brian pointed out, the function Chem.MolFromSmiles(SMILES) is
> extremely fast at detecting invalid SMILES, i.e. at returning None
> (about 240K/s on my computer). What takes the longest is the processing of
> valid SMILES into valid Mol objects (2K/s, i.e. 120 times slower).
>
> My conclusion is that the filters are mainly useful for preventing FPs,
> but they give no noticeable gain in processing time. The function
> Chem.MolFromSmiles is very quick to discard invalid SMILES but can let
> through a number of FPs if used without text filtering.
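To compare throughput on valid and invalid input, something like the following sketch could be used. The helper names `classify` and `words_per_second` are hypothetical, and the parser is passed in as an argument so the sketch does not depend on RDKit being installed; the only behaviour assumed is that the parser returns None for input it cannot parse, as described above for Chem.MolFromSmiles.

```python
import time


def classify(words, parse):
    """Split words into (accepted, rejected) using parse(word) -> object or None."""
    accepted, rejected = [], []
    for word in words:
        if parse(word) is not None:
            accepted.append(word)
        else:
            rejected.append(word)
    return accepted, rejected


def words_per_second(words, parse):
    """Rough throughput of `parse` over `words` (wall-clock, single run)."""
    start = time.perf_counter()
    for word in words:
        parse(word)
    elapsed = time.perf_counter() - start
    return len(words) / elapsed if elapsed > 0 else float("inf")
```

With RDKit available, one would pass `Chem.MolFromSmiles` as `parse` and time the list of known SMILES and the list of book words separately.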
>
>
> The script is attached; comments are again welcome!
>
> Thanks again,
>
> Alexis
>
> On 4 December 2016 at 02:52, Andrew Dalke <da...@dalkescientific.com>
> wrote:
>
>> On Dec 2, 2016, at 5:46 PM, Brian Kelley wrote:
>> > I hacked a version of RDKit's smiles parser to compute heavy atom
>> > count, perhaps some version of this could be used to check smiles validity
>> > without making the actual molecule.
>>
>> FWIW, here's my regex code for it, which makes the assumption that only
>> "[H]" and anything with a "*" are not heavy.
>>
>> import re
>>
>> _atom_pat = re.compile(r"""
>> (
>>   Cl? |             # C or Cl
>>   Br? |             # B or Br
>>   [NOSPFIbcnosp] |  # other organic-subset and aromatic atoms
>>   \[[^]]*\]         # any bracket atom
>> )
>> """, re.X)
>>
>> def get_num_heavies(smiles):
>>     num_atoms = 0
>>     for m in _atom_pat.finditer(smiles):
>>         text = m.group()
>>         # Skip explicit hydrogens and wildcard atoms.
>>         if text == "[H]" or "*" in text:
>>             continue
>>         num_atoms += 1
>>     return num_atoms
>>
>> This turns out to be a quite handy piece of functionality.
>>
>>
>> Andrew
>> da...@dalkescientific.com
>>
>>
>>
>> _______________________________________________
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>