I could be wrong, but I believe the IBM system had a preprocessing step that
removed all known dictionary words - which would get rid of "submarine" etc.
I also believe this problem has been solved multiple times in the past;
NextMove Software comes to mind, as does ChemicalTagger -
http://chemicaltagger.ch.cam.ac.uk/.
my 2 cents,
Igor
On Fri, Dec 2, 2016 at 11:46 AM, Brian Kelley <fustiga...@gmail.com> wrote:
> I hacked a version of RDKit's SMILES parser to compute heavy atom count;
> perhaps some version of this could be used to check SMILES validity without
> building the actual molecule.
>
> From a fun historical perspective: IBM had an expert system to find IUPAC
> names in documents. It ended up finding things like "submarine", which
> was amusing. It turned out that simply running every word through the IUPAC
> parser was by far the fastest and best solution. I expect the same will be
> true for finding SMILES.
>
> It would be interesting to build the common OCR errors into the parser as
> well (l's and 1's are hard to distinguish, for instance).
>
>
> On Fri, Dec 2, 2016 at 10:46 AM, Peter Gedeck <peter.ged...@gmail.com>
> wrote:
>
>> Hello Alexis,
>>
>> Depending on the size of your documents, you could consider limiting the
>> memoization by word length and only memoizing shorter words. SMILES tend to
>> be longer, so everything above a given number of characters has a higher
>> probability of being a SMILES. Long words are also likely to be chemical
>> names; those often contain commas (,), so they are easy to remove quickly.
>>
>> Best,
>>
>> Peter
>>
>>
>> On Fri, Dec 2, 2016 at 5:43 AM Alexis Parenty <
>> alexis.parenty.h...@gmail.com> wrote:
>>
>>> Dear Pavel And Greg,
>>>
>>>
>>>
>>> Thanks Greg for the regexps link. I’ll use that too.
>>>
>>>
>>> Pavel, I need to track which document each SMILES comes from, but I will
>>> indeed make a set of unique words for each document before looping.
>>> Thanks!
>>>
>>> Best,
>>>
>>> Alexis
>>>
>>> On 2 December 2016 at 11:21, Pavel <pavel_polishc...@ukr.net> wrote:
>>>
>>> Hi, Alexis,
>>>
>>> if you do not need to track which document a SMILES comes from, you may
>>> just combine the words from all documents into one list, keep only the
>>> unique words, and test those. Then you would not need to store and check
>>> valid/invalid strings. That would reduce the complexity of the problem as
>>> well.
>>>
>>> Pavel.
>>> On 12/02/2016 11:11 AM, Greg Landrum wrote:
>>>
>>> An initial start on some regexps that match SMILES is here:
>>> https://gist.github.com/lsauer/1312860/264ae813c2bd2c27a769d261c8c6b38da34e22fb
>>>
>>> That may also be useful.
>>>
>>> On Fri, Dec 2, 2016 at 11:07 AM, Alexis Parenty <
>>> alexis.parenty.h...@gmail.com> wrote:
>>>
>>> Hi Markus,
>>>
>>>
>>> Yes! I might discover novel compounds that way!! It would be interesting
>>> to see what they look like…
>>>
>>>
>>> Good suggestion to also store the words that were correctly identified
>>> as SMILES. I’ll add that to the script.
>>>
>>>
>>> I also like your "distribution of words" idea. I could safely skip any
>>> word that occurs more than 1% of the time and could play around with the
>>> threshold to find an optimum.
>>>
>>>
>>> I will try every suggestion and time each one to see what is best. I'll
>>> keep everyone in the loop and will share the script and results.
>>>
>>>
>>> Thanks,
>>>
>>>
>>> Alexis
>>>
>>> On 2 December 2016 at 10:47, Markus Sitzmann <markus.sitzm...@gmail.com>
>>> wrote:
>>>
>>> Hi Alexis,
>>>
>>> you may also find some "novel" compounds with this approach :-).
>>>
>>> Whether your exclusion-set solution improves performance depends strongly
>>> on the content of your text documents and how often they repeat the same
>>> words - but my guess is that it will help. Probably the best approach is to
>>> look at the distribution of words before you feed them to RDKit. You should
>>> also "memorize" the words that successfully generated a structure; it
>>> doesn't make sense to parse them again.
>>>
>>> Markus
>>>
>>> On Fri, Dec 2, 2016 at 10:21 AM, Maciek Wójcikowski <
>>> mac...@wojcikowski.pl> wrote:
>>>
>>> Hi Alexis,
>>>
>>> You may want to use a regex to filter out strings containing invalid
>>> characters (note that only a small subset of atoms may be written without
>>> brackets). See the "Atoms" section here:
>>> http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html
>>>
>>> The exclusion set might grow pretty quickly and become inefficient, so I'd
>>> parse all strings passing the above filter. There will still be some false
>>> positives, like "CC", which may occur in text (emails especially).
>>>
>>> ----
>>> Pozdrawiam, | Best regards,
>>> Maciek Wójcikowski
>>> mac...@wojcikowski.pl
>>>
>>> 2016-12-02 10:11 GMT+01:00 Alexis Parenty <alexis.parenty.h...@gmail.com
>>> >:
>>>
>>> Dear all,
>>>
>>>
>>> I am looking for a way to extract SMILES strings scattered across many
>>> text documents (thousands of documents of several pages each).
>>>
>>> At the moment, I am thinking of scanning each word of the text, trying to
>>> make a mol object from it using Chem.MolFromSmiles(), and then storing the
>>> words that return a mol object that is not None.
>>>
>>> Can anyone think of a better/quicker way?
>>>
>>>
>>> Would it be worth storing in a set any word that does not return a mol
>>> object from Chem.MolFromSmiles(), and excluding those words from subsequent
>>> searches?
>>>
>>>
>>> Something along these lines:
>>>
>>>
>>> excluded_set = set()
>>> smiles_list = []
>>>
>>> for each_word in text:
>>>     if each_word not in excluded_set:
>>>         each_word_mol = Chem.MolFromSmiles(each_word)
>>>         if each_word_mol is not None:
>>>             smiles_list.append(each_word)
>>>         else:
>>>             excluded_set.add(each_word)  # the word itself, not the mol object
>>>
>>>
>>> Would searching that growing set actually take more time than blindly
>>> trying to make a mol object for every word?
>>>
>>>
>>>
>>> Any suggestion?
>>>
>>>
>>> Many thanks and regards,
>>>
>>>
>>> Alexis
>>>
>>> ------------------------------------------------------------------------------
>>> Check out the vibrant tech community on one of the world's most
>>> engaging tech sites, SlashDot.org! http://sdm.link/slashdot
>>> _______________________________________________
>>> Rdkit-discuss mailing list
>>> Rdkit-discuss@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>>
>>
>