Re: [Rdkit-discuss] Extracting SMILES from text

Brian Kelley Fri, 02 Dec 2016 09:36:57 -0800

As I recall, this was why they started using the dictionary lookup :). The 
IUPAC system they ended up using was Roger's, from his time at OpenEye.


----
Brian Kelley

> On Dec 2, 2016, at 12:33 PM, Igor Filippov <igor.v.filip...@gmail.com> wrote:
> 
> I could be wrong, but I believe the IBM system had a preprocessing step that 
> removed all known dictionary words, which would get rid of "submarine" etc.
> I also believe this problem has been solved multiple times in the past: 
> NextMove Software comes to mind, as does ChemicalTagger 
> (http://chemicaltagger.ch.cam.ac.uk/), etc.
> 
> my 2 cents,
> Igor
> 
> 
> 
> 
>> On Fri, Dec 2, 2016 at 11:46 AM, Brian Kelley <fustiga...@gmail.com> wrote:
>> I hacked a version of RDKit's SMILES parser to compute heavy atom count; 
>> perhaps some version of this could be used to check SMILES validity without 
>> making the actual molecule.
>> 
>> From a fun historical perspective: IBM had an expert system to find IUPAC 
>> names in documents. They ended up finding things like "submarine", which was 
>> amusing. It turned out that just parsing all words with the IUPAC parser 
>> was by far the fastest and best solution. I expect the same will be true 
>> for finding SMILES.
>> 
>> It would be interesting to put the common OCR errors into the parser as well 
>> (l's and 1's are hard for instance).
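A minimal sketch of the OCR idea above, assuming the only confusion handled is l/1 (the pair named in the thread). The names `ocr_variants`, `first_valid_variant` and the `is_valid` callback are hypothetical; in practice `is_valid` could wrap `Chem.MolFromSmiles(word) is not None`.

```python
from itertools import product

# Characters that OCR commonly confuses; only l/1 comes from the thread.
OCR_CONFUSIONS = {"l": "l1", "1": "1l"}


def ocr_variants(word, max_variants=64):
    """Yield `word` first, then spellings with confusable characters swapped."""
    choices = [OCR_CONFUSIONS.get(ch, ch) for ch in word]
    for i, combo in enumerate(product(*choices)):
        if i >= max_variants:  # cap the combinatorial growth for long words
            break
        yield "".join(combo)


def first_valid_variant(word, is_valid):
    """Return the first variant accepted by `is_valid`, or None if none is."""
    for candidate in ocr_variants(word):
        if is_valid(candidate):
            return candidate
    return None
```

The original spelling is always tried first, so a word that already parses is returned unchanged.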
>> 
>> 
>>> On Fri, Dec 2, 2016 at 10:46 AM, Peter Gedeck <peter.ged...@gmail.com> 
>>> wrote:
>>> Hello Alexis,
>>> 
>>> Depending on the size of your document, you could limit the storage of 
>>> already-tested strings by word length and memoize only the shorter words. 
>>> SMILES tend to be longer, so anything above a given number of characters 
>>> has a higher probability of being a SMILES. Long words probably also 
>>> include a lot of chemical names. Those often contain commas (,), so they 
>>> are easy to remove quickly. 
>>> 
>>> Best,
>>> 
>>> Peter
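One way Peter's suggestion could look, as a rough sketch: results are cached only for short words, and anything containing a comma is rejected without parsing. The function name `make_checker`, the threshold of 8 characters and the `is_valid` callback are illustrative assumptions, and words are assumed to have had trailing punctuation stripped already.

```python
def make_checker(is_valid, max_cached_len=8):
    """Return (check, cache); `check` memoizes results only for short words."""
    cache = {}

    def check(word):
        if "," in word:  # commas do not occur in SMILES; likely a chemical name
            return False
        if len(word) <= max_cached_len:
            if word not in cache:
                cache[word] = is_valid(word)
            return cache[word]
        # Long words are rarer and more likely to be SMILES: test, do not store.
        return is_valid(word)

    return check, cache
```

The cache stays small because only short, frequently repeated words are stored in it.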
>>> 
>>> 
>>>> On Fri, Dec 2, 2016 at 5:43 AM Alexis Parenty 
>>>> <alexis.parenty.h...@gmail.com> wrote:
>>>> Dear Pavel And Greg,
>>>> 
>>>>  
>>>> 
>>>> Thanks Greg for the regexps link. I’ll use that too.
>>>> 
>>>> 
>>>> 
>>>> Pavel, I need to track which document each SMILES comes from, but I 
>>>> will indeed make a set of unique words for each document before looping. 
>>>> Thanks!
>>>> 
>>>> Best,
>>>> 
>>>> Alexis
>>>> 
>>>> 
>>>> On 2 December 2016 at 11:21, Pavel <pavel_polishc...@ukr.net> wrote:
>>>> Hi, Alexis,
>>>> 
>>>>   if you do not need to track which document each SMILES comes from, you 
>>>> can just combine all words from all documents in a list, keep only the 
>>>> unique words and test those. That way you do not need to store and check 
>>>> valid/non-valid strings. That would reduce problem complexity as well.
>>>> 
>>>> Pavel.
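The two points above (unique words, but with the source document still tracked) can be combined. The sketch below is one possible arrangement, not something proposed in the thread: it builds a word-to-documents index so each unique word is tested once. `index_words`, `smiles_by_document` and `is_valid` are hypothetical names, and `documents` is assumed to be a dict of document name to text.

```python
from collections import defaultdict


def index_words(documents):
    """Map each unique word to the set of document names it appears in."""
    word_to_docs = defaultdict(set)
    for name, text in documents.items():
        for word in set(text.split()):
            word_to_docs[word].add(name)
    return word_to_docs


def smiles_by_document(documents, is_valid):
    """Return {document name: set of words accepted by `is_valid`}."""
    hits = defaultdict(set)
    for word, names in index_words(documents).items():
        if is_valid(word):  # each unique word is tested exactly once
            for name in names:
                hits[name].add(word)
    return dict(hits)
```

With this layout no separate store of rejected words is needed, because no word is ever tested twice.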
>>>>> On 12/02/2016 11:11 AM, Greg Landrum wrote:
>>>>> An initial start on some regexps that match SMILES is here: 
>>>>> https://gist.github.com/lsauer/1312860/264ae813c2bd2c27a769d261c8c6b38da34e22fb
>>>>> 
>>>>> that may also be useful
>>>>> 
>>>>> On Fri, Dec 2, 2016 at 11:07 AM, Alexis Parenty 
>>>>> <alexis.parenty.h...@gmail.com> wrote:
>>>>> Hi Markus,
>>>>> 
>>>>> 
>>>>> Yes! I might discover novel compounds that way!! It would be 
>>>>> interesting to see what they look like…
>>>>> 
>>>>> 
>>>>> Good suggestion to also store the words that were correctly identified as 
>>>>> SMILES. I’ll add that to the script.
>>>>> 
>>>>> 
>>>>> I also like your “distribution of word” idea. I could safely skip any 
>>>>> words that occur more than 1% of the time and could try to play around 
>>>>> with the threshold to find an optimum.
>>>>> 
>>>>> 
>>>>> I will try every suggestion and time each one to see what works best. I’ll 
>>>>> keep everyone in the loop and will share the script and results.
>>>>> 
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> 
>>>>> Alexis
>>>>> 
>>>>> 
>>>>> On 2 December 2016 at 10:47, Markus Sitzmann <markus.sitzm...@gmail.com> 
>>>>> wrote:
>>>>> Hi Alexis,
>>>>> 
>>>>> you may also find some "novel" compounds with this approach :-).
>>>>> 
>>>>> Whether your set solution improves performance depends strongly on the 
>>>>> content of your text documents and how often they repeat the same words, 
>>>>> but my guess is that it will help. Probably the best way is to look at 
>>>>> the distribution of words before you feed them to RDKit. You should also 
>>>>> "memorize" the words that successfully generated a structure; it doesn't 
>>>>> make sense to parse them again.
>>>>> 
>>>>> Markus
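A sketch of the word-distribution idea, under the assumption that words making up more than 1% of all tokens (the threshold floated elsewhere in this thread) can be skipped. `rare_words`, `classify` and `rdkit_is_smiles` are hypothetical names; RDKit is imported lazily so the counting part runs without it.

```python
from collections import Counter


def rare_words(words, max_fraction=0.01):
    """Return unique words whose share of all tokens is at most max_fraction."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w for w, n in counts.items() if n / total <= max_fraction}


def classify(words, is_valid, max_fraction=0.01):
    """Test each rare word once, remembering both successes and failures."""
    return {w: is_valid(w) for w in rare_words(words, max_fraction)}


def rdkit_is_smiles(word):
    """Possible validator; Chem.MolFromSmiles returns None on a parse failure."""
    from rdkit import Chem  # lazy import: only needed when this is called
    return Chem.MolFromSmiles(word) is not None
```

A call such as `classify(text.split(), rdkit_is_smiles)` would then give one True/False entry per rare word, which covers the "memorize the successes too" point.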
>>>>> 
>>>>> On Fri, Dec 2, 2016 at 10:21 AM, Maciek Wójcikowski 
>>>>> <mac...@wojcikowski.pl> wrote:
>>>>> Hi Alexis,
>>>>> 
>>>>> You may want to use a regex to filter out strings that contain invalid 
>>>>> characters (i.e. only a small subset of atoms may appear without 
>>>>> brackets). See the "Atoms" section: 
>>>>> http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html 
>>>>> 
>>>>> The set might grow pretty quickly and become inefficient, so I'd parse all 
>>>>> strings that pass the above filter. There will still be some false 
>>>>> positives, like "CC", which may occur in text (emails especially).
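A rough sketch of such a character-level pre-filter, assuming the organic subset from the Daylight "Atoms" section (B, C, N, O, P, S, F, Cl, Br, I, plus aromatic b, c, n, o, p, s). The pattern and the name `looks_like_smiles` are illustrative; the filter only rejects impossible strings and does not check that a SMILES is actually valid.

```python
import re

SMILES_LIKE = re.compile(
    r"^(?:"
    r"\[[^\[\]\s]+\]"      # bracket atom; its contents are not checked here
    r"|Cl|Br|[BCNOPSFI]"   # aliphatic organic-subset atoms
    r"|[bcnops]"           # aromatic organic-subset atoms
    r"|[-=#$:/\\.()]"      # bond symbols, dot, branch parentheses
    r"|%\d{2}|\d"          # ring-closure numbers
    r")+$"
)


def looks_like_smiles(word):
    """Cheap test: True if `word` uses only SMILES tokens (may be invalid)."""
    return SMILES_LIKE.match(word) is not None
```

Words that pass would still go to the real parser; as noted above, short strings such as "CC" pass both steps and remain false positives.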
>>>>> 
>>>>> ----
>>>>> Pozdrawiam,  |  Best regards,
>>>>> Maciek Wójcikowski
>>>>> mac...@wojcikowski.pl
>>>>> 
>>>>> 2016-12-02 10:11 GMT+01:00 Alexis Parenty <alexis.parenty.h...@gmail.com>:
>>>>> Dear all,
>>>>> 
>>>>> 
>>>>> 
>>>>> I am looking for a way to extract SMILES scattered across many text 
>>>>> documents (thousands of documents of several pages each). 
>>>>> 
>>>>> At the moment, I am thinking of scanning each word in the text, trying to 
>>>>> make a mol object from it using Chem.MolFromSmiles(), and storing the 
>>>>> word if the call returns a mol object that is not None.
>>>>> 
>>>>> Can anyone think of a better/quicker way?
>>>>> 
>>>>> 
>>>>> Would it be worth storing in a set any word that does not return a mol 
>>>>> object from Chem.MolFromSmiles(), and excluding it from subsequent searches? 
>>>>> 
>>>>> 
>>>>> Something along those lines
>>>>> 
>>>>> 
>>>>> from rdkit import Chem
>>>>> 
>>>>> excluded_set = set()  # words already known not to parse as SMILES
>>>>> smiles_list = []
>>>>> 
>>>>> for each_word in text:  # text: an iterable of words
>>>>>     if each_word not in excluded_set:
>>>>>         each_word_mol = Chem.MolFromSmiles(each_word)
>>>>>         if each_word_mol is not None:
>>>>>             smiles_list.append(each_word)
>>>>>         else:
>>>>>             # remember the failed word itself, not the None result
>>>>>             excluded_set.add(each_word)
>>>>> 
>>>>> 
>>>>> Would searching that growing set not actually take more time than 
>>>>> blindly trying to make a mol object for every word?
>>>>>  
>>>>> 
>>>>> Any suggestion?
>>>>> 
>>>>> 
>>>>> Many thanks and regards,
>>>>> 
>>>>> 
>>>>> Alexis
>>>>> 
>>>>> 
>>>>> ------------------------------------------------------------------------------
>>>>> Check out the vibrant tech community on one of the world's most
>>>>> engaging tech sites, SlashDot.org! http://sdm.link/slashdot
>>>>> _______________________________________________
>>>>> Rdkit-discuss mailing list
>>>>> Rdkit-discuss@lists.sourceforge.net
>>>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss