Cool! By the way, try sanitize=False. Also, Andrew is right that you will miss parenthetical phrases, e.g. Benzene(c1ccccc1) and the like. Just reasserting that this is a hard problem!
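To illustrate the parenthetical case, here is a minimal sketch of one way to also offer the text inside a trailing parenthesis as a candidate. The helper name candidate_tokens and its punctuation handling are assumptions made for this example; they are not part of RDKit or of the attached script.

```python
def candidate_tokens(text):
    """Yield each whitespace-separated word, plus the content of a
    parenthetical that runs to the end of the word, if there is one.

    Hypothetical helper for illustration only.
    """
    for word in text.split():
        yield word
        # Ignore sentence punctuation glued to the end of the word.
        stripped = word.rstrip(".,;:")
        start = stripped.find("(")
        if start != -1 and stripped.endswith(")"):
            inner = stripped[start + 1:-1]
            if inner:
                # "Benzene(c1ccccc1)" also yields "c1ccccc1".
                yield inner


tokens = list(candidate_tokens("Benzene(c1ccccc1) and the like"))
```

Each candidate could then be passed to Chem.MolFromSmiles(token, sanitize=False), which skips RDKit's sanitization step and so only tests whether the string parses. A real SMILES that itself ends in a branch, such as CC(C)(C), will also produce a spurious inner candidate; the parser should reject those, but that is worth checking on real text.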
----
Brian Kelley

> On Dec 5, 2016, at 5:35 AM, Alexis Parenty <alexis.parenty.h...@gmail.com> wrote:
>
> Dear All,
>
> Many thanks to everyone for your participation in that discussion. It was
> very interesting and useful. I have written a small script that takes
> everyone’s input on board.
>
> It applies a few "text filters" before calling the RDKit function:
>
> 1. I build a dictionary with every word present in the text as a key and
>    the number of times it appears in the text as the value. I then remove
>    from the list of unique keys (words) all those that appear more than
>    once, because I know that my SMILES appear only once in each document.
> 2. I remove all the words that are shorter than 5 letters, because I know
>    that all my structures contain more than 5 atoms and I want to remove
>    possible FPs coming from “I” or “CC”, for example.
> 3. With a regex, I remove all unique words that contain letters that are
>    not in the main periodic table of elements, and the words that contain
>    the main English punctuation signs that never occur in SMILES.
>
> Placed one after the other, those filters take the 26 836 words of the book
> "Alice's Adventures in Wonderland" down to 780 words (97% of words
> filtered out).
>
> TEST RESULTS
>
> I have tested my script on:
> • 7900 unique SMILES for “drug-like molecules”
> • Alice’s Adventures in Wonderland (I never read the book but I assumed
>   there is no SMILES!)
> • A shuffled mixture of Alice’s Adventures in Wonderland and the 7900
>   unique SMILES
>
> The performance is as follows:
>
> For Alice’s Adventures in Wonderland:
> 26836 words
> 26835 TN
> 0 TP
> 1 FP: “*****************************************************************”
> (actually a valid SMILES…)
> 0 FN
> ==> Accuracy of 0.99996, in 0:00:00.112000
>
> For 7900 unique SMILES from unique drug-like molecules:
> 7900 TP
> 0 TN
> 0 FP
> 0 FN
> ==> Accuracy of 0.99996, in 0:00:04.200000
>
> For 7900 unique SMILES from unique drug-like molecules shuffled within
> Alice's Adventures in Wonderland (26836 words; 34736 words in total):
> 7900 TP
> 26835 TN
> 1 FP: “*****************************************************************”
> 0 FN
> ==> Accuracy of 0.99997, in 0:00:04.949000
>
> Then I reprocessed the text mixture above without the text filters
> (directly feeding every word from the text into the RDKit function) and got
> the following result:
> 7900 TP
> 26835 TN
> 339 FP
> 0 FN
> ==> Accuracy of 0.97, in 0:00:07.893
>
> Therefore, as Brian pointed out, the function Chem.MolFromSmiles(SMILES) is
> crazy fast at detecting invalid SMILES, i.e. at returning a “None” object
> (about 240K/s on my computer). What takes the longest is the processing of
> valid SMILES into valid Mol objects (2 K/s, i.e. 120 times slower).
>
> My conclusion is that the filters are mainly useful to prevent FPs from
> occurring, but there is no noticeable gain in processing time. The function
> Chem.MolFromSmiles is very quick to discard invalid SMILES but can let
> through a number of FPs if used without text filtering.
>
> The script is attached; comments are again welcome!
>
> Thanks again,
>
> Alexis
>
>> On 4 December 2016 at 02:52, Andrew Dalke <da...@dalkescientific.com> wrote:
>> On Dec 2, 2016, at 5:46 PM, Brian Kelley wrote:
>> > I hacked a version of RDKit's smiles parser to compute heavy atom count,
>> > perhaps some version of this could be used to check smiles validity
>> > without making the actual molecule.
>>
>> FWIW, here's my regex code for it, which makes the assumption that only
>> "[H]" and anything with a "*" are not heavy.
>>
>> import re
>>
>> # One atom token: C or Cl, B or Br, a one-letter organic-subset atom
>> # (aliphatic or aromatic), or any bracket atom.
>> _atom_pat = re.compile(r"""
>> (
>>  Cl? |
>>  Br? |
>>  [NOSPFIbcnosp] |
>>  \[[^]]*\]
>> )
>> """, re.X)
>>
>> def get_num_heavies(smiles):
>>     num_atoms = 0
>>     for m in _atom_pat.finditer(smiles):
>>         text = m.group()
>>         # Skip explicit hydrogens and wildcard atoms.
>>         if text == "[H]" or "*" in text:
>>             continue
>>         num_atoms += 1
>>     return num_atoms
>>
>> This turns out to be a quite handy piece of functionality.
>>
>>
>> Andrew
>> da...@dalkescientific.com
>>
>> _______________________________________________
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
> <SMILES_from_english_text_parser.txt>
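The text filters described in Alexis's message can be sketched roughly as follows. This is not the attached script: the function name smiles_candidates, the forbidden-character set, and the choice to split on whitespace are all assumptions made for illustration.

```python
from collections import Counter

# Assumptions for illustration only (not taken from the attached script):
# the letters J and Q occur in no element symbol, and these punctuation
# marks are not part of SMILES syntax.
FORBIDDEN_CHARS = set("jqJQ,;!?\"'")
MIN_LENGTH = 5


def smiles_candidates(text):
    """Return the words of `text` that survive the three text filters."""
    counts = Counter(text.split())
    return [
        word
        for word, n in counts.items()
        if n == 1                        # each SMILES appears once per document
        and len(word) >= MIN_LENGTH      # drop short words such as "I" or "CC"
        and not FORBIDDEN_CHARS.intersection(word)
    ]


sample = "the queen said CC(=O)Oc1ccccc1C(=O)O to the jury, quietly"
survivors = smiles_candidates(sample)
```

Plenty of ordinary English words still pass these filters, so the survivors would then be handed to Chem.MolFromSmiles, which returns None for strings it cannot parse.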