I think Alexis was referring to converting actual SMILES strings found in
random text. Chemical entity recognition and name-to-structure conversion
are another story altogether, and nowadays one can quickly go a long way with
open tools such as OSCAR + OPSIN in KNIME, or with something like this:
http://chemdataextractor.org/docs/intro

George

On 2 December 2016 at 17:35, Brian Kelley <fustiga...@gmail.com> wrote:

> This was why they started using the dictionary lookup, as I recall :). The
> IUPAC system they ended up using was Roger's, from when he was at OpenEye.
>
> ----
> Brian Kelley
>
> On Dec 2, 2016, at 12:33 PM, Igor Filippov <igor.v.filip...@gmail.com>
> wrote:
>
> I could be wrong, but I believe the IBM system had a preprocessing step which
> removed all known dictionary words - which would get rid of "submarine" etc.
> I also believe this problem has been solved multiple times in the past:
> NextMove Software comes to mind, as does ChemicalTagger -
> http://chemicaltagger.ch.cam.ac.uk/ - and others.
>
> my 2 cents,
> Igor
>
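> For reference, a dictionary pre-filter along those lines is straightforward
> to sketch (the word-list file below is a placeholder; any plain-text list of
> English words would do):
>
> from rdkit import Chem
>
> # one English word per line; the path is an assumption, not a real resource
> with open("english_words.txt") as fh:
>     dictionary = {line.strip().lower() for line in fh}
>
> def extract_smiles(words):
>     hits = []
>     for word in words:
>         if word.lower() in dictionary:       # drop ordinary English words
>             continue
>         if Chem.MolFromSmiles(word) is not None:
>             hits.append(word)
>     return hits
>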
>
>
>
> On Fri, Dec 2, 2016 at 11:46 AM, Brian Kelley <fustiga...@gmail.com>
> wrote:
>
>> I hacked a version of RDKit's SMILES parser to compute the heavy atom count;
>> perhaps some version of this could be used to check SMILES validity without
>> building the actual molecule.
>>
>> From a fun historical perspective: IBM had an expert system to find
>> IUPAC names in documents. They ended up finding things like "submarine",
>> which was amusing. It turned out that just parsing all words with the
>> IUPAC parser was by far the fastest and best solution. I expect the same
>> will be true for finding SMILES.
>>
>> It would be interesting to build the common OCR errors into the parser as
>> well (l's and 1's are hard to tell apart, for instance).
>>
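>> Short of patching the parser itself, something similar can be approximated
>> from Python by turning off sanitization (a sketch only - sanitize=False
>> still builds a Mol object, but skips the expensive cleanup, so it behaves
>> like a syntax-level check):
>>
>> from rdkit import Chem
>>
>> def parses_as_smiles(word):
>>     # parser errors still return None; valence/aromaticity are not checked
>>     return Chem.MolFromSmiles(word, sanitize=False) is not None
>>
>> def heavy_atom_count(word):
>>     # without sanitization no implicit Hs are added, so GetNumAtoms()
>>     # is essentially the count of atoms written in the SMILES
>>     mol = Chem.MolFromSmiles(word, sanitize=False)
>>     return mol.GetNumAtoms() if mol is not None else 0
>>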
>>
>> On Fri, Dec 2, 2016 at 10:46 AM, Peter Gedeck <peter.ged...@gmail.com>
>> wrote:
>>
>>> Hello Alexis,
>>>
>>> Depending on the size of your documents, you could consider limiting the
>>> memoization of already-tested strings by word length and only memoize the
>>> shorter words. SMILES tend to be long, so anything above a given number of
>>> characters has a higher probability of being a SMILES. Long words are also
>>> likely to be chemical names; those often contain commas (,), so they are
>>> easy to filter out quickly.
>>>
>>> Best,
>>>
>>> Peter
>>>
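>>> Such a length-capped cache plus comma filter could be sketched like this
>>> (the 15-character cut-off is an arbitrary assumption, to be tuned on the
>>> real corpus):
>>>
>>> from rdkit import Chem
>>>
>>> MAX_MEMO_LEN = 15                  # assumed cut-off, tune to the corpus
>>> seen_non_smiles = set()            # only short rejects get memoized
>>>
>>> def looks_like_smiles(word):
>>>     if "," in word:                # commas do not occur in plain SMILES
>>>         return False
>>>     if len(word) <= MAX_MEMO_LEN and word in seen_non_smiles:
>>>         return False
>>>     if Chem.MolFromSmiles(word) is not None:
>>>         return True
>>>     if len(word) <= MAX_MEMO_LEN:
>>>         seen_non_smiles.add(word)
>>>     return False
>>>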
>>>
>>> On Fri, Dec 2, 2016 at 5:43 AM Alexis Parenty <
>>> alexis.parenty.h...@gmail.com> wrote:
>>>
>>>> Dear Pavel And Greg,
>>>>
>>>>
>>>>
>>>> Thanks Greg for the regexps link. I’ll use that too.
>>>>
>>>>
>>>> Pavel, I need to track which document each SMILES comes from, but I will
>>>> indeed make a set of unique words for each document before looping. Thanks!
>>>>
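>>>> A per-document pass could be sketched like this (just a sketch; `documents`
>>>> as a {name: raw text} dict is an assumption about how the corpus is loaded):
>>>>
>>>> from collections import defaultdict
>>>>
>>>> word_sources = defaultdict(set)             # word -> documents it occurs in
>>>> for doc_name, text in documents.items():    # assumed: {doc_name: raw_text}
>>>>     for word in set(text.split()):          # unique words per document
>>>>         word_sources[word].add(doc_name)
>>>>
>>>> # each unique word is then parsed only once, and word_sources[word]
>>>> # still records which documents it came from
>>>>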
>>>> Best,
>>>>
>>>> Alexis
>>>>
>>>> On 2 December 2016 at 11:21, Pavel <pavel_polishc...@ukr.net> wrote:
>>>>
>>>> Hi, Alexis,
>>>>
>>>>   if you do not need to track which document a SMILES comes from, you can
>>>> just combine the words from all documents into one collection, keep only
>>>> the unique words, and test those. Then you do not need to store and check
>>>> valid/invalid strings at all. That would reduce the complexity of the
>>>> problem as well.
>>>>
>>>> Pavel.
>>>> On 12/02/2016 11:11 AM, Greg Landrum wrote:
>>>>
>>>> An initial start on some regexps that match SMILES is here:
>>>> https://gist.github.com/lsauer/1312860/264ae813c2bd2c27a769d261c8c6b38da34e22fb
>>>>
>>>> That may also be useful.
>>>>
>>>> On Fri, Dec 2, 2016 at 11:07 AM, Alexis Parenty <
>>>> alexis.parenty.h...@gmail.com> wrote:
>>>>
>>>> Hi Markus,
>>>>
>>>>
>>>> Yes! I might discover novel compounds that way!! It would be interesting
>>>> to see what they look like…
>>>>
>>>>
>>>> Good suggestion to also store the words that were correctly identified
>>>> as SMILES. I’ll add that to the script.
>>>>
>>>>
>>>> I also like your "distribution of words" idea. I could safely skip any
>>>> words that occur more than 1% of the time, and could play around with the
>>>> threshold to find an optimum.
>>>>
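>>>> The frequency cut-off could be done with a Counter before any parsing
>>>> (a sketch; `all_words` and the 1% threshold are the assumptions here):
>>>>
>>>> from collections import Counter
>>>>
>>>> counts = Counter(all_words)        # all_words: every word in the corpus
>>>> total = sum(counts.values())
>>>> candidates = [w for w, c in counts.items() if c / total <= 0.01]
>>>>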
>>>>
>>>> I will try every suggestion and time each one to see which is best. I'll
>>>> keep everyone in the loop and will share the script and results.
>>>>
>>>>
>>>> Thanks,
>>>>
>>>>
>>>> Alexis
>>>>
>>>> On 2 December 2016 at 10:47, Markus Sitzmann <markus.sitzm...@gmail.com
>>>> > wrote:
>>>>
>>>> Hi Alexis,
>>>>
>>>> you may also find some "novel" compounds with this approach :-).
>>>>
>>>> Whether your tuple solution improves performance depends strongly on the
>>>> content of your text documents and how often the same words are repeated -
>>>> but my guess would be that it will help. Probably the best approach is to
>>>> look at the distribution of words before you feed them to RDKit. You should
>>>> also "memoize" the ones that successfully generated a structure; it doesn't
>>>> make sense to parse them again.
>>>>
>>>> Markus
>>>>
>>>> On Fri, Dec 2, 2016 at 10:21 AM, Maciek Wójcikowski <
>>>> mac...@wojcikowski.pl> wrote:
>>>>
>>>> Hi Alexis,
>>>>
>>>> You may want to use a regex to filter out strings containing invalid
>>>> characters (i.e. there is only a small subset of atoms that may appear
>>>> without brackets). See the "Atoms" section:
>>>> http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html
>>>>
>>>> The excluded set might grow pretty quickly and become inefficient, so I'd
>>>> just parse all strings that pass the filter above. There will still be some
>>>> false positives like "CC", which can occur in ordinary text (emails
>>>> especially).
>>>>
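>>>> A rough character-class pre-filter along those lines (just a sketch - it
>>>> only checks for allowed SMILES characters, not SMILES grammar, so it is
>>>> deliberately permissive):
>>>>
>>>> import re
>>>>
>>>> # letters, digits, and the SMILES punctuation for brackets, branches, bonds,
>>>> # ring closures, charges and chirality; anything else disqualifies the word
>>>> SMILES_CHARS = re.compile(r'^[A-Za-z0-9@+\[\]()\\/%=#$.:-]+$')
>>>>
>>>> def could_be_smiles(word):
>>>>     return bool(SMILES_CHARS.match(word))
>>>>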
>>>> ----
>>>> Pozdrawiam,  |  Best regards,
>>>> Maciek Wójcikowski
>>>> mac...@wojcikowski.pl
>>>>
>>>> 2016-12-02 10:11 GMT+01:00 Alexis Parenty <
>>>> alexis.parenty.h...@gmail.com>:
>>>>
>>>> Dear all,
>>>>
>>>>
>>>> I am looking for a way to extract SMILES scattered across many text
>>>> documents (thousands of documents of several pages each).
>>>>
>>>> At the moment, I am thinking of scanning each word of the text, trying to
>>>> make a mol object from it using Chem.MolFromSmiles(), and then storing the
>>>> word if it returns a mol object that is not None.
>>>>
>>>> Can anyone think of a better/quicker way?
>>>>
>>>>
>>>> Would it be worth storing, in a set, any word that does not return a mol
>>>> object from Chem.MolFromSmiles(), and excluding it from subsequent searches?
>>>>
>>>>
>>>> Something along these lines:
>>>>
>>>>
>>>> from rdkit import Chem
>>>>
>>>> excluded_set = set()
>>>> smiles_list = []
>>>>
>>>> for each_word in text:
>>>>     if each_word not in excluded_set:
>>>>         each_word_mol = Chem.MolFromSmiles(each_word)
>>>>         if each_word_mol is not None:
>>>>             smiles_list.append(each_word)
>>>>         else:
>>>>             excluded_set.add(each_word)   # the word itself, not the mol (None)
>>>>
>>>>
>>>> Wouldn't searching that growing set actually take more time than just
>>>> blindly trying to make a mol object from every word?
>>>>
>>>>
>>>>
>>>> Any suggestion?
>>>>
>>>>
>>>> Many thanks and regards,
>>>>
>>>>
>>>> Alexis
>>>>