Hi Maciek,

Thanks for your quick response. Excellent suggestions. I could filter out a
lot of crap that way... Maybe I could also add a filter on word length to
avoid having a lot of Ethane and Iodide false positives!


This also made me think that I could transform the text into a set to avoid
scanning all the duplicates. I can filter out quite a lot before embarking
into making mols.


Thanks,


Alexis

On 2 December 2016 at 10:21, Maciek Wójcikowski <mac...@wojcikowski.pl>
wrote:

> Hi Alexis,
>
> You may want to filter with some regex strings containing not valid
> characters (i.e. there is small subset of atoms that may be without
> brackets). See "Atoms" section: http://www.daylight.com/
> dayhtml/doc/theory/theory.smiles.html
>
> The set might grow pretty quick and may be inefficient, so I'd parse all
> strings passing above filter. Although there will be some false positives
> like "CC" which may occur in text (emails especially).
>
> ----
> Pozdrawiam,  |  Best regards,
> Maciek Wójcikowski
> mac...@wojcikowski.pl
>
> 2016-12-02 10:11 GMT+01:00 Alexis Parenty <alexis.parenty.h...@gmail.com>:
>
>> Dear all,
>>
>>
>> I am looking for a way to extract SMILES scattered in many text documents
>> (thousands documents of several pages each).
>>
>> At the moment, I am thinking to scan each words from the text and try to
>> make a mol object from them using Chem.MolFromSmiles() then store the words
>> if they return a mol object that is not None.
>>
>> Can anyone think of a better/quicker way?
>>
>>
>> Would it be worth storing in a tuple any word that do not return a mol
>> object from Chem.MolFromSmiles() and exclude them from subsequent search?
>>
>>
>> Something along those lines
>>
>>
>> excluded_set = set()
>>
>> smiles_list = []
>>
>> For each_word in text:
>>
>>     If each_word not in excluded_set:
>>
>>             each_word_mol =  Chem.MolFromSmiles(each_word)
>>
>>             if each_word_mol is not None:
>>
>>                     smiles_list.append(each_word)
>>
>>              else:
>>
>>                      excluded_set.add(each_word_mol)
>>
>>
>> Would not searching into that growing tuple take actually more time than
>> trying to blindly make a mol object for every word?
>>
>>
>>
>> Any suggestion?
>>
>>
>> Many thanks and regards,
>>
>>
>> Alexis
>>
>> ------------------------------------------------------------
>> ------------------
>> Check out the vibrant tech community on one of the world's most
>> engaging tech sites, SlashDot.org! http://sdm.link/slashdot
>> _______________________________________________
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>>
>
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most 
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Reply via email to