Hi Alexis,

You may want to pre-filter with a regex that rejects strings containing
invalid characters (i.e. only a small subset of atoms may appear without
brackets). See the "Atoms" section:
http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html

The excluded set might grow pretty quickly and become inefficient, so I'd
simply parse every string that passes the above filter. There will still be
some false positives, like "CC", which may occur in ordinary text (emails
especially).
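
A rough sketch of that two-step idea (regex pre-filter, then parse the
survivors). The pattern and the names SMILES_LIKE, candidate_smiles and
extract_smiles are made up for illustration; the token classes loosely follow
the "Atoms" section linked above and are an assumption, not a full SMILES
grammar. The parser is passed in as a callable so the filter itself does not
depend on RDKit.

```python
import re

# Assumed token classes (illustrative, deliberately permissive):
#   [ ... ]                      any bracket atom
#   Br, Cl, B C N O S P F I      atoms allowed without brackets
#   b c n o s p                  aromatic atoms allowed without brackets
#   *                            wildcard atom
#   - = # $ : / \ . ( )          bonds, dot-disconnection, branches
#   %NN or a single digit        ring-closure labels
SMILES_LIKE = re.compile(
    r"(?:\[[^\[\]\s]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|\*|[-=#$:/\\.()]|%\d{2}|\d)+"
)


def candidate_smiles(text):
    """Yield whitespace-separated words that pass the character-level filter."""
    for word in text.split():
        # Drop sentence punctuation stuck to the word; none of these
        # characters can start or end a valid SMILES anyway.
        token = word.strip(".,;:!?\"'")
        if token and SMILES_LIKE.fullmatch(token):
            yield token


def extract_smiles(text, parse):
    """Return the candidates for which parse(token) is not None."""
    return [tok for tok in candidate_smiles(text) if parse(tok) is not None]
```

With RDKit installed this would be called as
extract_smiles(text, Chem.MolFromSmiles). Only the few words that survive the
regex reach the parser, so a separate set of excluded words is probably not
needed. Short words such as "CC" still get through both steps.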

----
Pozdrawiam,  |  Best regards,
Maciek Wójcikowski
mac...@wojcikowski.pl

2016-12-02 10:11 GMT+01:00 Alexis Parenty <alexis.parenty.h...@gmail.com>:

> Dear all,
>
>
> I am looking for a way to extract SMILES scattered in many text documents
> (thousands of documents of several pages each).
>
> At the moment, I am thinking of scanning each word in the text and trying to
> make a mol object from it using Chem.MolFromSmiles(), then storing the word
> if it returns a mol object that is not None.
>
> Can anyone think of a better/quicker way?
>
>
> Would it be worth storing in a set any word that does not return a mol
> object from Chem.MolFromSmiles() and excluding it from subsequent searches?
>
>
> Something along these lines:
>
>
> from rdkit import Chem
>
> excluded_set = set()  # words already known not to parse as SMILES
> smiles_list = []
>
> for each_word in text.split():
>     if each_word not in excluded_set:
>         each_word_mol = Chem.MolFromSmiles(each_word)
>         if each_word_mol is not None:
>             smiles_list.append(each_word)
>         else:
>             # remember the failing word itself, not the None result
>             excluded_set.add(each_word)
>
>
> Would searching that growing set not actually take more time than
> blindly trying to make a mol object for every word?
>
>
>
> Any suggestions?
>
>
> Many thanks and regards,
>
>
> Alexis
>
> ------------------------------------------------------------------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, SlashDot.org! http://sdm.link/slashdot
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
>
