Re: [Rdkit-discuss] Extracting SMILES from text

Alexis Parenty Fri, 02 Dec 2016 02:42:55 -0800

Dear Pavel and Greg,



Thanks Greg for the regexps link. I’ll use that too.
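
A minimal sketch of that kind of pre-filter, assuming a hand-written token pattern based on the organic-subset rule from the Daylight "Atoms" section rather than the regexps from the gist. The names `looks_like_smiles` and `candidate_words` are hypothetical, and the pattern only checks the SMILES alphabet, not the grammar, so anything it keeps would still need to go through Chem.MolFromSmiles():

```python
import re

# Token alternatives; an assumption for illustration, not a full SMILES grammar.
_TOKENS = [
    r"\[[^\[\]\s]+\]",  # bracket atom, e.g. [NH4+]
    r"Br|Cl",           # two-letter atoms allowed without brackets
    r"[BCNOPSFI]",      # one-letter atoms allowed without brackets
    r"[bcnops]",        # aromatic atoms
    r"[-=#$:/\\]",      # bond symbols
    r"%\d{2}|\d",       # ring-closure numbers
    r"[().*]",          # branches, disconnection dot, wildcard atom
]
SMILES_WORD = re.compile("(?:" + "|".join(_TOKENS) + ")+")


def looks_like_smiles(word):
    """Return True if the whole word is made only of SMILES-like tokens."""
    return SMILES_WORD.fullmatch(word) is not None


def candidate_words(text):
    """Split text on whitespace and keep the words worth handing to a parser."""
    return [word for word in text.split() if looks_like_smiles(word)]
```

Short words such as "CC" or "soon" still pass, so this only trims the input to the parser; it does not replace it.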


Pavel, I need to track which document each SMILES comes from, but I
will indeed build a set of unique words for each document before looping.
Thanks!
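
A rough sketch of that per-document loop, under a few assumptions: the documents are held in a dict of name to text, words are split on whitespace, and the validity check is passed in as a callable. The names `extract_smiles_by_document`, `documents` and `is_valid_smiles` are hypothetical. It also remembers the verdict for every word already tested, valid or not, so no word is parsed twice across documents:

```python
def extract_smiles_by_document(documents, is_valid_smiles):
    """Map each document name to the sorted list of valid SMILES words it contains.

    documents: dict mapping a document name to its text (assumed layout).
    is_valid_smiles: callable taking a word and returning True or False.
    """
    verdicts = {}  # word -> bool, shared across documents
    found = {}
    for name, text in documents.items():
        unique_words = set(text.split())  # de-duplicate within the document
        hits = []
        for word in unique_words:
            if word not in verdicts:
                # first time this word is seen anywhere: test it once
                verdicts[word] = is_valid_smiles(word)
            if verdicts[word]:
                hits.append(word)
        found[name] = sorted(hits)
    return found
```

With RDKit installed, the check could be `lambda w: Chem.MolFromSmiles(w) is not None` after `from rdkit import Chem`. Dictionary lookups take roughly constant time on average, so the cache should stay cheap even as it grows.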

Best,

Alexis

On 2 December 2016 at 11:21, Pavel <pavel_polishc...@ukr.net> wrote:

> Hi, Alexis,
>
>   if you do not need to track which document each SMILES comes from, you can
> simply combine the words from all documents into one list, keep only the
> unique ones and test those. That way you do not need to store and check
> valid/invalid strings. That would reduce the problem complexity as well.
>
> Pavel.
> On 12/02/2016 11:11 AM, Greg Landrum wrote:
>
> An initial start on some regexps that match SMILES is here:
> https://gist.github.com/lsauer/1312860/264ae813c2bd2c27a769d261c8c6b38da34e22fb
>
> that may also be useful
>
> On Fri, Dec 2, 2016 at 11:07 AM, Alexis Parenty <
> alexis.parenty.h...@gmail.com> wrote:
>
>> Hi Markus,
>>
>>
>> Yes! I might discover novel compounds that way!! It would be interesting to
>> see what they look like…
>>
>>
>> Good suggestion to also store the words that were correctly identified as
>> SMILES. I’ll add that to the script.
>>
>>
>> I also like your “distribution of words” idea. I could safely skip any
>> word that makes up more than 1% of all words and could play around with
>> the threshold to find an optimum.
>>
>>
>> I will try every suggestion and time each one to see which is best. I’ll
>> keep everyone in the loop and will share the script and results.
>>
>>
>> Thanks,
>>
>>
>> Alexis
>>
>> On 2 December 2016 at 10:47, Markus Sitzmann <markus.sitzm...@gmail.com>
>> wrote:
>>
>>> Hi Alexis,
>>>
>>> you may also find some "novel" compounds with this approach :-).
>>>
>>> Whether your tuple solution improves performance depends strongly on
>>> the content of your text documents and how often they repeat the same
>>> words, but my guess is that it will help. Probably the best way is to
>>> look at the distribution of words before you feed them to RDKit. You
>>> should also "memorize" the ones that successfully generated a structure;
>>> it doesn't make sense to parse them again.
>>>
>>> Markus
>>>
>>> On Fri, Dec 2, 2016 at 10:21 AM, Maciek Wójcikowski <
>>> mac...@wojcikowski.pl> wrote:
>>>
>>>> Hi Alexis,
>>>>
>>>> You may want to use a regex to filter out strings containing invalid
>>>> characters (i.e. there is only a small subset of atoms that may appear
>>>> without brackets). See the "Atoms" section:
>>>> http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html
>>>>
>>>> The set might grow pretty quickly and become inefficient, so I'd parse
>>>> all strings that pass the above filter. There will still be some false
>>>> positives like "CC", which may occur in text (emails especially).
>>>>
>>>> ----
>>>> Pozdrawiam,  |  Best regards,
>>>> Maciek Wójcikowski
>>>> mac...@wojcikowski.pl
>>>>
>>>> 2016-12-02 10:11 GMT+01:00 Alexis Parenty <
>>>> alexis.parenty.h...@gmail.com>:
>>>>
>>>>> Dear all,
>>>>>
>>>>>
>>>>> I am looking for a way to extract SMILES scattered across many text
>>>>> documents (thousands of documents of several pages each).
>>>>>
>>>>> At the moment, I am thinking of scanning each word of the text, trying
>>>>> to make a mol object from it with Chem.MolFromSmiles(), and storing the
>>>>> word if the result is not None.
>>>>>
>>>>> Can anyone think of a better/quicker way?
>>>>>
>>>>>
>>>>> Would it be worth storing in a tuple any word that does not return a mol
>>>>> object from Chem.MolFromSmiles() and excluding it from subsequent searches?
>>>>>
>>>>>
>>>>> Something along these lines:
>>>>>
>>>>>
>>>>> from rdkit import Chem
>>>>>
>>>>> excluded_set = set()  # words already known not to parse as SMILES
>>>>> smiles_list = []
>>>>>
>>>>> # `text` is assumed to be an iterable of words from one document
>>>>> for each_word in text:
>>>>>     if each_word not in excluded_set:
>>>>>         each_word_mol = Chem.MolFromSmiles(each_word)
>>>>>         if each_word_mol is not None:
>>>>>             smiles_list.append(each_word)
>>>>>         else:
>>>>>             # store the word itself, not the (None) mol object
>>>>>             excluded_set.add(each_word)
>>>>>
>>>>>
>>>>> Would searching that growing tuple not actually take more time
>>>>> than blindly trying to make a mol object for every word?
>>>>>
>>>>>
>>>>>
>>>>> Any suggestions?
>>>>>
>>>>>
>>>>> Many thanks and regards,
>>>>>
>>>>>
>>>>> Alexis
>>>>>
>>>>> ------------------------------------------------------------
>>>>> ------------------
>>>>> Check out the vibrant tech community on one of the world's most
>>>>> engaging tech sites, SlashDot.org! http://sdm.link/slashdot
>>>>> _______________________________________________
>>>>> Rdkit-discuss mailing list
>>>>> Rdkit-discuss@lists.sourceforge.net
>>>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>>>>
>>>>>

