Re: [Rdkit-discuss] Extracting SMILES from text

Markus Sitzmann Fri, 02 Dec 2016 01:48:13 -0800

Hi Alexis,

you may also find some "novel" compounds with this approach :-).


Whether your tuple solution improves performance depends strongly on the
content of your text documents and how often they repeat the same words,
but my guess is that it will help. Probably the best approach is to look at
the distribution of words before you feed them to RDKit. You should also
"memorize" the words that successfully generated a structure; it makes no
sense to parse them again.
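
A minimal sketch of that idea, under some assumptions: the function name
find_smiles, the parse argument and the shared cache dict are made up for
illustration and are not part of RDKit. The word distribution comes from
collections.Counter, and every distinct word is parsed at most once, whether
it succeeded or failed.

```python
from collections import Counter


def find_smiles(words, parse, cache=None):
    """Count the words that `parse` accepts, parsing each distinct word once.

    `parse` is any callable that returns None for invalid input, for example
    rdkit.Chem.MolFromSmiles. `cache` maps word -> bool and can be shared
    across documents so that both hits and misses are remembered.
    """
    if cache is None:
        cache = {}
    counts = Counter(words)          # word distribution of this document
    hits = Counter()
    for word, n in counts.items():
        if word not in cache:        # dict lookup, O(1) on average
            cache[word] = parse(word) is not None
        if cache[word]:
            hits[word] = n
    return hits


# Possible usage with RDKit (assumes rdkit is installed and `documents`
# is an iterable of strings):
#   from rdkit import Chem
#   cache = {}
#   for doc in documents:
#       hits = find_smiles(doc.split(), Chem.MolFromSmiles, cache)
```

Because the cache is a dict, looking a word up stays cheap even when the
cache grows large, so the lookup should cost far less than a parse attempt.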

Markus

On Fri, Dec 2, 2016 at 10:21 AM, Maciek Wójcikowski <mac...@wojcikowski.pl>
wrote:

> Hi Alexis,
>
> You may want to pre-filter with a regex that rejects strings containing
> characters that are not valid in SMILES (e.g. only a small subset of atoms
> may be written without brackets). See the "Atoms" section:
> http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html
>
> The excluded set might grow pretty quickly and become inefficient, so I'd
> simply parse all strings that pass the above filter, although there will be
> some false positives like "CC", which may occur in text (emails especially).
>
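
A rough sketch of such a pre-filter, assuming whitespace-separated tokens.
The pattern SMILES_LIKE and the helper candidate_words are illustrative
names, and the character set is a simplification of the SMILES grammar (it
does not check what is inside brackets), so it only decides which tokens are
worth handing to the parser.

```python
import re

# A token passes if it consists only of pieces that can occur in SMILES.
SMILES_LIKE = re.compile(
    r"(?:"
    r"\[[^\[\]\s]+\]"        # bracket atom; contents are not validated here
    r"|Cl|Br"                # two-letter atoms allowed without brackets
    r"|[BCNOPSFI]"           # one-letter atoms allowed without brackets
    r"|[bcnops]"             # aromatic atoms
    r"|[-=#$:/\\.()%0-9*]"   # bonds, dot, branches, ring closures, wildcard
    r")+"
)


def candidate_words(text):
    """Yield tokens of `text` that consist only of SMILES-like pieces."""
    for token in text.split():
        # Assumption: surrounding sentence punctuation is not part of the SMILES.
        token = token.strip(".,;")
        if SMILES_LIKE.fullmatch(token):
            yield token
```

Ordinary words built only from these letters (for example "CC" or "on")
still pass, so each candidate has to be confirmed with
Chem.MolFromSmiles() afterwards.
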
> ----
> Pozdrawiam,  |  Best regards,
> Maciek Wójcikowski
> mac...@wojcikowski.pl
>
> 2016-12-02 10:11 GMT+01:00 Alexis Parenty <alexis.parenty.h...@gmail.com>:
>
>> Dear all,
>>
>>
>> I am looking for a way to extract SMILES scattered in many text documents
>> (thousands of documents of several pages each).
>>
>> At the moment, I am thinking of scanning each word of the text and trying
>> to make a mol object from it using Chem.MolFromSmiles(), then storing the
>> word if the call returns a mol object that is not None.
>>
>> Can anyone think of a better/quicker way?
>>
>>
>> Would it be worth storing in a tuple any word that does not return a mol
>> object from Chem.MolFromSmiles() and excluding it from subsequent searches?
>>
>>
>> Something along these lines:
>>
>>
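>> from rdkit import Chem
>>
>> excluded_set = set()   # words already known not to parse as SMILES
>> smiles_list = []
>>
>> # assumes `text` is one document held as a single string
>> for each_word in text.split():
>>     if each_word not in excluded_set:
>>         each_word_mol = Chem.MolFromSmiles(each_word)
>>         if each_word_mol is not None:
>>             smiles_list.append(each_word)
>>         else:
>>             # store the word itself, not the None returned by the parser
>>             excluded_set.add(each_word)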
>>
>>
>> Would searching that growing tuple not actually take more time than
>> blindly trying to make a mol object for every word?
>>
>>
>>
>> Any suggestions?
>>
>>
>> Many thanks and regards,
>>
>>
>> Alexis
>>
>> ------------------------------------------------------------
>> ------------------
>> Check out the vibrant tech community on one of the world's most
>> engaging tech sites, SlashDot.org! http://sdm.link/slashdot
>> _______________________________________________
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>>
>

