Hi, Alexis,
if you do not need to track which document each SMILES comes from, you
can just combine all the words from all documents into one list, keep
only the unique words, and test those. That way you would not need to
store and check lists of valid/non-valid strings. That would reduce the
complexity of the problem as well.
Pavel.
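A minimal sketch of the "unique words first" idea above, assuming the documents are already loaded as plain strings and that splitting on whitespace is an acceptable way to get words. The function name unique_words and the sample documents are made up for illustration; each resulting candidate would then be tested once, for example with Chem.MolFromSmiles().

```python
def unique_words(documents):
    """Collect every distinct whitespace-separated word across all documents."""
    words = set()
    for text in documents:
        words.update(text.split())
    return words


# Hypothetical sample input, for illustration only.
docs = [
    "ethanol is CCO and benzene is c1ccccc1",
    "CCO is ethanol",
]
candidates = unique_words(docs)
# Each candidate is now parsed at most once, however often it appears.
```

Because the result is a set, repeated words cost nothing extra at parsing time, but the link between a word and its source document is lost.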
On 12/02/2016 11:11 AM, Greg Landrum wrote:
An initial start on some regexps that match SMILES is here:
https://gist.github.com/lsauer/1312860/264ae813c2bd2c27a769d261c8c6b38da34e22fb
that may also be useful
On Fri, Dec 2, 2016 at 11:07 AM, Alexis Parenty
<alexis.parenty.h...@gmail.com>
wrote:
Hi Markus,
Yes! I might discover novel compounds that way!! It would be
interesting to see what they look like…
Good suggestion to also store the words that were correctly
identified as SMILES. I’ll add that to the script.
I also like your “distribution of words” idea. I could safely skip
any word that occurs more than 1% of the time and could play around
with the threshold to find an optimum.
I will try every suggestion and time each one to see which is best.
I’ll keep everyone in the loop and will share the script and results.
Thanks,
Alexis
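The frequency threshold discussed above could be sketched as follows. This is only an illustration: the function name rare_words and the parameter max_fraction are hypothetical, and the 1% default simply mirrors the figure mentioned in the message, not a tested optimum.

```python
from collections import Counter


def rare_words(words, max_fraction=0.01):
    """Return the distinct words whose share of all tokens is at most max_fraction."""
    counts = Counter(words)
    total = sum(counts.values())
    # Words above the threshold ("the", "of", ...) are assumed not to be SMILES.
    return {word for word, n in counts.items() if n / total <= max_fraction}
```

Only the words returned here would be handed to the SMILES parser. On a small corpus a genuine SMILES can exceed the threshold, so the cut-off needs to be checked against real data.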
On 2 December 2016 at 10:47, Markus Sitzmann
<markus.sitzm...@gmail.com> wrote:
Hi Alexis,
you may also find some "novel" compounds with this approach :-).
Whether your tuple solution improves performance depends strongly
on the content of your text documents and how often they repeat
the same words - but my guess would be that it will help. Probably
the best approach is to look at the distribution of words before
you feed them to RDKit. You should also "memorize" the words that
successfully generated a structure; it makes no sense to parse
them again.
Markus
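One way to "memorize" both outcomes, sketched under the assumption that the validity check is passed in as a callable. The names extract_smiles and is_valid are hypothetical; with RDKit the callable could be something like lambda w: Chem.MolFromSmiles(w) is not None.

```python
def extract_smiles(words, is_valid):
    """Return the distinct words accepted by is_valid, in first-seen order.

    is_valid is called at most once per distinct word; its verdict,
    valid or not, is remembered in a dict keyed by the word.
    """
    verdicts = {}  # word -> True/False
    hits = []
    for word in words:
        if word not in verdicts:
            verdicts[word] = is_valid(word)
            if verdicts[word]:
                hits.append(word)
    return hits
```

Note that this returns each accepted word once, rather than once per occurrence in the text.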
On Fri, Dec 2, 2016 at 10:21 AM, Maciek Wójcikowski
<mac...@wojcikowski.pl> wrote:
Hi Alexis,
You may want to use a regex to filter out strings that contain
invalid characters (i.e. only a small subset of atoms may be
written without brackets). See the "Atoms" section:
http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html
The exclusion set might grow pretty quickly and become
inefficient, so I'd parse every string that passes the above
filter. There will still be some false positives, though, like
"CC", which may occur in text (emails especially).
----
Pozdrawiam, | Best regards,
Maciek Wójcikowski
mac...@wojcikowski.pl
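A rough sketch of such a pre-filter, assuming a deliberately simplified token grammar: bracket atoms, the unbracketed organic-subset atoms (B, C, N, O, P, S, F, Cl, Br, I and aromatic b, c, n, o, p, s), bond symbols, branches, and ring-closure digits. The pattern is an approximation written for this example, not the regex from the gist linked elsewhere in this thread, and it only rejects obvious non-SMILES; anything that passes still has to be parsed by RDKit.

```python
import re

# Simplified SMILES tokens: [bracket atom], organic-subset atoms, wildcard,
# bonds and dot, branch parentheses, and ring closures (%nn or a single digit).
SMILES_LIKE = re.compile(
    r"(?:\[[^\[\]\s]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|\*|[-=#$:/\\.]|[()]|%\d{2}|\d)+"
)


def looks_like_smiles(word):
    """Cheap syntactic check; True only means 'worth handing to the parser'."""
    return SMILES_LIKE.fullmatch(word) is not None
```

As noted above, ordinary tokens such as "CC" or "ON" pass this filter, so it reduces the number of parser calls without removing false positives.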
2016-12-02 10:11 GMT+01:00 Alexis Parenty
<alexis.parenty.h...@gmail.com>:
Dear all,
I am looking for a way to extract SMILES scattered across
many text documents (thousands of documents of several
pages each).
At the moment, I am thinking of scanning each word of
the text and trying to make a mol object from it using
Chem.MolFromSmiles(), then storing the word if it
returns a mol object that is not None.
Can anyone think of a better/quicker way?
Would it be worth storing in a tuple any word that does
not return a mol object from Chem.MolFromSmiles() and
excluding it from subsequent searches?
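Something along these lines:
    from rdkit import Chem

    excluded_set = set()  # words already known not to parse
    smiles_list = []
    for each_word in text:  # text: an iterable of words
        if each_word not in excluded_set:
            each_word_mol = Chem.MolFromSmiles(each_word)
            if each_word_mol is not None:
                smiles_list.append(each_word)
            else:
                # store the word itself, not the (None) mol object
                excluded_set.add(each_word)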
Would searching that growing tuple not actually take
more time than blindly trying to make a mol object for
every word?
Any suggestions?
Many thanks and regards,
Alexis
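On the question of lookup cost: membership tests on a Python set are hash-based and take roughly constant time on average, whereas a tuple or list is scanned element by element. The sketch below times both with the standard timeit module; the function name lookup_times, the container size, and the repeat count are arbitrary choices for illustration, and the absolute numbers will vary by machine.

```python
import timeit


def lookup_times(n=50_000, repeats=100):
    """Time a failed membership test against a tuple and a set of n words."""
    rejected = [f"word{i}" for i in range(n)]
    as_tuple = tuple(rejected)
    as_set = set(rejected)
    probe = "not-present"  # worst case for the tuple: a full scan
    t_tuple = timeit.timeit(lambda: probe in as_tuple, number=repeats)
    t_set = timeit.timeit(lambda: probe in as_set, number=repeats)
    return t_tuple, t_set


tuple_seconds, set_seconds = lookup_times()
print(f"tuple: {tuple_seconds:.4f}s  set: {set_seconds:.6f}s")
```

Whether the exclusion set pays off overall still depends on how often rejected words repeat, since the lookup only saves time when it avoids a parser call.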
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the
world's most
engaging tech sites, SlashDot.org!
http://sdm.link/slashdot
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
<mailto:Rdkit-discuss@lists.sourceforge.net>
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
<https://lists.sourceforge.net/lists/listinfo/rdkit-discuss>