George,
  My point was that actually parsing the words as IUPAC names/SMILES is 
surprisingly effective compared to an AI- or rule-based system. Without 
sanitization, RDKit parses roughly 60,000 SMILES per second on my laptop. It 
is much faster still when not constructing molecules, but I don't have that 
number handy.

I expect it to be faster yet when rejecting non-SMILES input, so it should 
be sufficient for document scanning, I think.
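For concreteness, a minimal sketch of that kind of cheap validity check (the `looks_like_smiles` helper is my own illustration, not RDKit API; the throughput figure above is from my laptop, not something this snippet asserts):

```python
from rdkit import Chem
from rdkit import RDLogger

# Silence the parse-error log spam you get when feeding ordinary words in.
RDLogger.DisableLog("rdApp.error")

def looks_like_smiles(word):
    """Cheap validity check: try to parse without sanitization.

    Skipping sanitization avoids the expensive aromaticity/valence work,
    which is what makes the high throughput possible.
    """
    return Chem.MolFromSmiles(word, sanitize=False) is not None
```

Note that `sanitize=False` means chemically nonsensical but syntactically valid strings will still pass; for document scanning that trade-off is usually acceptable.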

----
Brian Kelley

> On Dec 2, 2016, at 1:28 PM, George Papadatos <gpapada...@gmail.com> wrote:
> 
> I think Alexis was referring to converting actual SMILES strings found in 
> random text. Chemical entity recognition and name-to-structure conversion is 
> another story altogether; nowadays one can quickly go a long way with open 
> tools such as OSCAR + OPSIN in KNIME, or with something like this: 
> http://chemdataextractor.org/docs/intro
> 
> George
> 
>> On 2 December 2016 at 17:35, Brian Kelley <fustiga...@gmail.com> wrote:
>> That was why they started using the dictionary lookup, as I recall :). The 
>> IUPAC system they ended up using was Roger's, from his time at OpenEye.
>> 
>> ----
>> Brian Kelley
>> 
>>> On Dec 2, 2016, at 12:33 PM, Igor Filippov <igor.v.filip...@gmail.com> 
>>> wrote:
>>> 
>>> I could be wrong, but I believe the IBM system had a preprocessing step 
>>> that removed all known dictionary words - which would get rid of 
>>> "submarine" etc.
>>> I also believe this problem has been solved multiple times in the past: 
>>> NextMove Software comes to mind, as does ChemicalTagger - 
>>> http://chemicaltagger.ch.cam.ac.uk/.
>>> 
>>> my 2 cents,
>>> Igor
>>> 
>>>> On Fri, Dec 2, 2016 at 11:46 AM, Brian Kelley <fustiga...@gmail.com> wrote:
>>>> I hacked a version of RDKit's SMILES parser to compute heavy-atom count; 
>>>> perhaps some version of this could be used to check SMILES validity 
>>>> without making the actual molecule.
>>>> 
>>>> From a fun historical perspective: IBM had an expert system to find IUPAC 
>>>> names in documents. It ended up flagging things like "submarine", which 
>>>> was amusing. It turned out that simply running every word through the 
>>>> IUPAC parser was by far the fastest and best solution. I expect the same 
>>>> will be true for finding SMILES.
>>>> 
>>>> It would be interesting to build the common OCR errors into the parser as 
>>>> well (l's and 1's are hard to tell apart, for instance).
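One way to fold common OCR confusions in without modifying the parser itself is to generate single-swap variants of words that fail to parse and try those too; a sketch (the confusion table and helper name are illustrative, not exhaustive):

```python
# Common OCR confusions (illustrative, not exhaustive): the glyphs on the
# left are often misread as the ones on the right in scanned text.
OCR_VARIANTS = {
    "l": "1",   # lowercase L vs digit one (ring-closure digits)
    "1": "l",
    "O": "0",   # capital O vs zero
    "0": "O",
}

def ocr_candidates(word, max_variants=16):
    """Yield the word itself, then single-character OCR-swap variants."""
    yield word
    count = 0
    for i, ch in enumerate(word):
        swap = OCR_VARIANTS.get(ch)
        if swap is not None:
            count += 1
            if count > max_variants:
                return
            yield word[:i] + swap + word[i + 1:]
```

For example, a scan that rendered benzene as "clccccc1" would yield "c1ccccc1" among its variants, which a SMILES parser would then accept.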
>>>> 
>>>> 
>>>>> On Fri, Dec 2, 2016 at 10:46 AM, Peter Gedeck <peter.ged...@gmail.com> 
>>>>> wrote:
>>>>> Hello Alexis,
>>>>> 
>>>>> Depending on the size of your documents, you could consider limiting the 
>>>>> cache of already-tested strings by word length and only memoizing shorter 
>>>>> words. SMILES tend to be long, so anything above a given number of 
>>>>> characters has a higher probability of being a SMILES. Long words 
>>>>> probably also include many chemical names; those often contain commas 
>>>>> (,), so they are easy to filter out quickly.
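A sketch of that length-gated cache plus the comma check (the six-character cutoff and the helper names are arbitrary illustrations, to be tuned per corpus):

```python
# Length-gated negative cache: only short words are memoized, since short
# words dominate ordinary prose and repeat most often.
MEMO_MAX_LEN = 6          # illustrative cutoff, tune per corpus
seen_non_smiles = set()

def should_test(word):
    """Cheap pre-checks before handing a word to the SMILES parser."""
    if "," in word:               # commas suggest chemical names, not SMILES
        return False
    if len(word) <= MEMO_MAX_LEN and word in seen_non_smiles:
        return False              # already failed once, skip it
    return True

def remember_failure(word):
    """Record a parse failure, but only for short (frequently repeated) words."""
    if len(word) <= MEMO_MAX_LEN:
        seen_non_smiles.add(word)
```

The idea is that the cache stays small because long, rarely repeated tokens never enter it.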
>>>>> 
>>>>> Best,
>>>>> 
>>>>> Peter
>>>>> 
>>>>> 
>>>>>> On Fri, Dec 2, 2016 at 5:43 AM Alexis Parenty 
>>>>>> <alexis.parenty.h...@gmail.com> wrote:
>>>>>> Dear Pavel And Greg,
>>>>>> 
>>>>>>  
>>>>>> 
>>>>>> Thanks Greg for the regexps link. I’ll use that too.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Pavel, I need to track which document the SMILES come from, but I will 
>>>>>> indeed make a set of unique words for each document before looping. 
>>>>>> Thanks!
>>>>>> 
>>>>>> Best,
>>>>>> 
>>>>>> Alexis
>>>>>> 
>>>>>> 
>>>>>> On 2 December 2016 at 11:21, Pavel <pavel_polishc...@ukr.net> wrote:
>>>>>> Hi, Alexis,
>>>>>> 
>>>>>>   if you do not need to track which document a SMILES comes from, you 
>>>>>> may just combine the words from all documents into one list, take only 
>>>>>> the unique words, and test those. Then you would not need to store and 
>>>>>> check valid/invalid strings yourself, which also reduces the problem's 
>>>>>> complexity.
>>>>>> 
>>>>>> Pavel.
>>>>>>> On 12/02/2016 11:11 AM, Greg Landrum wrote:
>>>>>>> An initial attempt at some regexps that match SMILES, which may also 
>>>>>>> be useful, is here: 
>>>>>>> https://gist.github.com/lsauer/1312860/264ae813c2bd2c27a769d261c8c6b38da34e22fb
>>>>>>> 
>>>>>>> On Fri, Dec 2, 2016 at 11:07 AM, Alexis Parenty 
>>>>>>> <alexis.parenty.h...@gmail.com> wrote:
>>>>>>> Hi Markus,
>>>>>>> 
>>>>>>> 
>>>>>>> Yes! I might discover novel compounds that way!! It would be 
>>>>>>> interesting to see what they look like…
>>>>>>> 
>>>>>>> 
>>>>>>> Good suggestion to also store the words that were correctly identified 
>>>>>>> as SMILES. I’ll add that to the script.
>>>>>>> 
>>>>>>> 
>>>>>>> I also like your "distribution of words" idea. I could safely skip any 
>>>>>>> word that occurs more than 1% of the time, and I could play around 
>>>>>>> with the threshold to find an optimum.
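The 1% frequency cutoff mentioned above could be sketched with `collections.Counter` (the helper name and threshold handling are illustrative):

```python
from collections import Counter

def rare_words(words, max_fraction=0.01):
    """Keep only words occurring in at most max_fraction of the tokens.

    Very frequent tokens are almost certainly ordinary language, not SMILES,
    so they can be skipped before any parsing is attempted.
    """
    counts = Counter(words)
    total = len(words)
    return [w for w in counts if counts[w] / total <= max_fraction]
```

Whether 1% is the right threshold would need tuning against the actual corpus, as suggested above.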
>>>>>>> 
>>>>>>> 
>>>>>>> I will try every suggestion and time each one to see what works best. 
>>>>>>> I'll keep everyone in the loop and will share the script and results.
>>>>>>> 
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> 
>>>>>>> Alexis
>>>>>>> 
>>>>>>> 
>>>>>>> On 2 December 2016 at 10:47, Markus Sitzmann 
>>>>>>> <markus.sitzm...@gmail.com> wrote:
>>>>>>> Hi Alexis,
>>>>>>> 
>>>>>>> you may also find some "novel" compounds with this approach :-).
>>>>>>> 
>>>>>>> Whether your tuple solution improves performance depends strongly on 
>>>>>>> the content of your text documents and how often they repeat the same 
>>>>>>> words - but my guess is that it will help. Probably the best approach 
>>>>>>> is to look at the distribution of words before you feed them to RDKit. 
>>>>>>> You should also "memoize" the words that successfully generated a 
>>>>>>> structure; it doesn't make sense to parse them again.
>>>>>>> 
>>>>>>> Markus
>>>>>>> 
>>>>>>> On Fri, Dec 2, 2016 at 10:21 AM, Maciek Wójcikowski 
>>>>>>> <mac...@wojcikowski.pl> wrote:
>>>>>>> Hi Alexis,
>>>>>>> 
>>>>>>> You may want to filter out, with a regex, strings containing invalid 
>>>>>>> characters (note that only a small subset of atoms may be written 
>>>>>>> without brackets). See the "Atoms" section: 
>>>>>>> http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html 
>>>>>>> 
>>>>>>> The set might grow pretty quickly and become inefficient, so I'd parse 
>>>>>>> all strings that pass the above filter. There will still be some false 
>>>>>>> positives like "CC", which may occur in ordinary text (emails 
>>>>>>> especially).
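A minimal character-class prefilter along those lines might look like this (the exact character class is a permissive guess meant only to reject obvious prose quickly, not a complete SMILES grammar):

```python
import re

# Permissive check: every character must be one that can plausibly appear
# in a SMILES string (organic-subset atoms, brackets, bonds, ring digits,
# branching, charges, stereo marks). This rejects ordinary prose fast but
# still lets false positives like "CC" through.
SMILES_CHARS = re.compile(r"^[A-Za-z0-9@+\-\[\]\(\)\\/%=#$.:*]+$")

def could_be_smiles(word):
    return bool(SMILES_CHARS.match(word))
```

Words passing this filter would still be handed to the real parser; the regex only cuts down how many parse attempts are needed.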
>>>>>>> 
>>>>>>> ----
>>>>>>> Pozdrawiam,  |  Best regards,
>>>>>>> Maciek Wójcikowski
>>>>>>> mac...@wojcikowski.pl
>>>>>>> 
>>>>>>> 2016-12-02 10:11 GMT+01:00 Alexis Parenty 
>>>>>>> <alexis.parenty.h...@gmail.com>:
>>>>>>> Dear all,
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> I am looking for a way to extract SMILES scattered across many text 
>>>>>>> documents (thousands of documents, several pages each).
>>>>>>> 
>>>>>>> At the moment, I am thinking of scanning each word of the text, trying 
>>>>>>> to make a mol object from it using Chem.MolFromSmiles(), and storing 
>>>>>>> the words that return a mol object that is not None.
>>>>>>> 
>>>>>>> Can anyone think of a better/quicker way?
>>>>>>> 
>>>>>>> 
>>>>>>> Would it be worth storing in a set any word that does not return a mol 
>>>>>>> object from Chem.MolFromSmiles(), and excluding those words from 
>>>>>>> subsequent searches?
>>>>>>> 
>>>>>>> Something along these lines:
>>>>>>> 
>>>>>>> 
>>>>>>> excluded_set = set()
>>>>>>> smiles_list = []
>>>>>>> 
>>>>>>> for each_word in text:
>>>>>>>     if each_word not in excluded_set:
>>>>>>>         each_word_mol = Chem.MolFromSmiles(each_word)
>>>>>>>         if each_word_mol is not None:
>>>>>>>             smiles_list.append(each_word)
>>>>>>>         else:
>>>>>>>             excluded_set.add(each_word)
>>>>>>> 
>>>>>>> 
>>>>>>> Would searching that growing set actually take more time than just 
>>>>>>> blindly trying to make a mol object for every word?
>>>>>>>  
>>>>>>> 
>>>>>>> Any suggestion?
>>>>>>> 
>>>>>>> 
>>>>>>> Many thanks and regards,
>>>>>>> 
>>>>>>> 
>>>>>>> Alexis
>>>>>>> 
>>>>>>> 
>>>>>>> ------------------------------------------------------------------------------
>>>>>>> Check out the vibrant tech community on one of the world's most
>>>>>>> engaging tech sites, SlashDot.org! http://sdm.link/slashdot
>>>>>>> _______________________________________________
>>>>>>> Rdkit-discuss mailing list
>>>>>>> Rdkit-discuss@lists.sourceforge.net
>>>>>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 